The Policy Change Index for Outbreak (PCI-Outbreak) measures the
severity of an epidemic outbreak in China, such as the coronavirus
disease 2019 (COVID-19), using the 2003 severe acute respiratory
syndrome (SARS) as the benchmark. The higher the indicator, the larger
the scale of the outbreak.
Figure: PCI-Outbreak for COVID-19 and official statistics in
China (Jan 21 to Sep 15, 2020)
The figure above plots the PCI-Outbreak for COVID-19 in China, in
comparison to the number of unresolved diagnosed cases officially
confirmed by the Chinese government.
How severe was COVID-19 in China, really? It is widely suspected that
the country’s official numbers of diagnosed cases understate the extent
of the outbreak, and even the official statistics themselves are
incoherent. On February 13, 2020, the Chinese authorities confirmed more
than 15,000 new cases nationwide—40 times the previous day’s
number—because of a change in counting criteria. Just two months after
that, they revised the death toll for the city of Wuhan, the epicenter,
upward by 50 percent, citing unspecified omissions.
The PCI-Outbreak uses a deep learning method to measure the severity
of COVID-19 in China, not through the Chinese government’s official
numbers but through how state-controlled media talked about the
outbreak. The algorithm is trained on SARS-episode articles in the
People’s Daily, the official newspaper of the Communist Party
of China, to understand the tone and tenor of the SARS-episode narrative
as the epidemic cycle evolved. The algorithm then assesses future
outbreaks’ severity against the SARS benchmark.
The PCI-Outbreak is built on the idea that words can speak louder
than (some) numbers. While it may be simple to release false statistics,
it is more difficult to conceal the truth when the government has to
address a crisis, such as a severe infectious disease, at length in the
media. Take the beginning of COVID-19 for example: When the Chinese
government announced a lockdown of Wuhan, a city with a population of 11
million, and its neighboring cities and discussed the necessity of doing
so in state media, the Chinese authorities had confirmed a total of
fewer than 600 cases of the novel coronavirus across the country.
Therefore, changes in words during the outbreak may provide us with a
clearer picture of the severity than official numbers do.
To detect how the narrative evolves, we build a deep learning
algorithm to “read” People’s Daily articles published during
the SARS episode. Using the Bidirectional Encoder Representations from
Transformers (BERT), a natural language model developed by Google, the
PCI-Outbreak algorithm takes a two-step process:
It first learns to classify whether a piece of text is related to
the SARS outbreak.
Conditionally on the text’s being SARS-related, the algorithm
then learns to infer when in the SARS news cycle the text was
Because the 2003 SARS outbreak in China went through a typical
epidemic cycle — starting in November 2002, peaking in May 2003, and
being contained in July 2003 — the timing of publication is effectively
a proxy for the spread of the disease.
Once the algorithm is trained, we deploy the program to People’s
Daily articles during the COVID-19 episode. Because the algorithm
has learned from the past episode, it would make two predictions on the
current data: whether each new piece of text is related to an epidemic
and, if so, when the text would have been published in the SARS
timeline. The second prediction will likely be factually incorrect —
classified as from 2003 but actually from 2020. But this error is
exactly what we aim for: Each date in the COVID-19 outbreak timeline is
cast back to the SARS timeline and, therefore, results in a metric of
severity in the SARS epidemic cycle and how the crisis is truly
perceived by the Chinese authorities.
The figure shown above contrasts the severity measured by the
PCI-Outbreak and China’s official count of diagnosed cases. We make
The Chinese government’s official numbers of diagnosed cases are
consistent with the PCI-Outbreak measure until the peak of the COVID-19
outbreak in China—both measures indicate that cases peaked in February
After the peak, the official numbers and the PCI-Outbreak
diverge. The PCI-Outbreak trends downward, but at a much slower rate
than the official numbers, suggesting that the Chinese government may
have exaggerated the speed at which the virus was contained
The PCI-Outbreak stays elevated from June to September, while the
official numbers have stayed low since April. This discrepancy suggests
that recent outbreaks in Beijing and Xinjiang may have been more severe
than what the official numbers indicate.
Note: More details of the project can be found in our research
paper. We have also released the source code of the project on