The Policy Change Index for Outbreak (PCI-Outbreak) measures the severity of an epidemic outbreak in China, such as the coronavirus disease 2019 (COVID-19), using the 2003 severe acute respiratory syndrome (SARS) as the benchmark. The higher the indicator, the larger the scale of the outbreak.

Figure: PCI-Outbreak and official COVID cases in China (Jan 21 to Jul 9, 2020)

The figure above plots the PCI-Outbreak for COVID-19 in China against the number of active (still infected) cases officially confirmed by the Chinese government.


How severe was the COVID-19 outbreak in China, really? It is widely suspected that China’s official numbers understate the extent of the outbreak, and the official numbers themselves are inconsistent. On February 13, 2020, the Chinese authorities confirmed over 15 thousand new cases in the country, 40 times the previous day’s number, due to a change in counting criteria. On April 17, they revised the death toll for Wuhan, the epicenter, upward by 50%, citing earlier omissions.

The PCI-Outbreak uses a machine learning method to gauge the true scale of COVID-19 in China, not through its official numbers but through how its state-controlled media covered the outbreak. The algorithm is trained on SARS-era articles in the People’s Daily, the official newspaper of the Communist Party of China, to learn how the narrative waxed and waned as the epidemic cycle evolved. The algorithm then assesses the severity of later outbreaks against the SARS benchmark.


The PCI-Outbreak is built on the idea that words can be more accurate than (some) numbers. While it may be trivial to release false statistics outright, it is more difficult to conceal the truth when the government has to address a public health crisis at length in national media. Take the beginning of COVID-19 as an example: when the Chinese government announced the lockdown of Wuhan, a city of 11 million people, and warned of a nationwide spread of the virus, the authorities had confirmed fewer than 600 cases across the entire country. Changes in language, therefore, may provide a clearer picture of the severity than the questionable official numbers.

To detect how the narrative evolves, we build a deep learning algorithm to “read” People’s Daily articles published during the SARS outbreak. Using Bidirectional Encoder Representations from Transformers (BERT), a language model developed by Google, the PCI-Outbreak algorithm follows a two-step process:

  1. It first learns to classify whether a piece of text is related to the SARS outbreak.

  2. Conditional on the text being SARS-related, the algorithm then learns to infer at what point in the course of the outbreak the text was published.

Because the 2003 SARS outbreak in China went through a typical epidemic cycle (starting in November 2002, peaking in May 2003, and disappearing in July 2003), the timing of publication is effectively a proxy for the spread of the disease.
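As a minimal sketch of how the two training targets described above could be constructed, the Python snippet below labels each article for step 1 (SARS-related or not) and, for SARS-related articles only, for step 2 (position in the epidemic cycle). The `Article` class, the `sars_related` annotation, the exact window boundaries, and the choice to encode timing as a fraction between 0 and 1 are all illustrative assumptions rather than details of the project; the BERT model itself is not shown.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Assumed window boundaries: the months come from the text above,
# but the exact days are placeholders chosen for illustration.
CYCLE_START = date(2002, 11, 1)
CYCLE_END = date(2003, 7, 31)


@dataclass
class Article:
    """Hypothetical container for one piece of People's Daily text."""
    text: str
    published: date
    sars_related: bool  # assumed annotation; how it is obtained is not specified


def step1_label(article: Article) -> int:
    """Step 1 target: 1 if the text is SARS-related, 0 otherwise."""
    return 1 if article.sars_related else 0


def step2_label(article: Article) -> Optional[float]:
    """Step 2 target: position of the publication date within the cycle.

    Returns a fraction in [0, 1] (0 = start, 1 = end of the assumed window),
    or None for texts that are not SARS-related, since step 2 is only
    defined conditional on step 1.
    """
    if not article.sars_related:
        return None
    span_days = (CYCLE_END - CYCLE_START).days
    offset_days = (article.published - CYCLE_START).days
    # Clamp so that dates slightly outside the assumed window stay in [0, 1].
    return min(max(offset_days / span_days, 0.0), 1.0)
```

Encoding the date as a single fraction is only one possible choice; the timing could equally be treated as a set of discrete periods to classify.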

Once the algorithm is trained, we apply it to People’s Daily articles from 2020 that are relevant to the COVID-19 outbreak. Because the algorithm has learned from the past episode, it makes two predictions on the new data: whether each piece of text is related to an epidemic and, if so, when the text was published. The second prediction will be incorrect in a literal sense, since the text is placed in the SARS period when it was actually published in 2020. But this error is exactly what we aim for: each date in the COVID-19 timeline is cast back onto the SARS timeline, which yields a measure of severity relative to the SARS epidemic cycle and of how the Chinese authorities truly perceive the crisis.
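To illustrate the cast-back step, the sketch below groups model outputs by their actual 2020 publication date and summarizes where in the SARS timeline the epidemic-related texts of each day were placed. It assumes the model's outputs are already available as tuples; the function name `daily_cast_back`, the encoding of the predicted position as a fraction between 0 and 1, and the use of a simple mean are all assumptions made for illustration, and the mapping from timeline position to the published index is not shown.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# One prediction per piece of text (hypothetical format):
# (actual 2020 publication date, predicted epidemic-related?,
#  predicted position in the SARS timeline as a fraction in [0, 1])
Prediction = Tuple[date, bool, float]


def daily_cast_back(predictions: Iterable[Prediction]) -> Dict[date, float]:
    """Average predicted SARS-timeline position for each 2020 date.

    Texts predicted as unrelated to an epidemic are skipped, mirroring the
    conditional structure of the two predictions. Dates with no
    epidemic-related text do not appear in the result.
    """
    by_day: Dict[date, List[float]] = defaultdict(list)
    for published, is_epidemic, position in predictions:
        if is_epidemic:
            by_day[published].append(position)
    return {day: mean(positions) for day, positions in sorted(by_day.items())}


# Made-up values, purely to show the shape of the input.
example = [
    (date(2020, 1, 21), True, 0.25),
    (date(2020, 1, 21), True, 0.75),
    (date(2020, 1, 21), False, 0.90),  # ignored: not epidemic-related
]
cast_back = daily_cast_back(example)  # {date(2020, 1, 21): 0.5}
```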

Main Results

The figure shown above contrasts the severity measured by the PCI-Outbreak with China’s official number of daily new cases. Three observations follow:

Note: More details of the project can be found in our forthcoming research paper. We have also released the source code of the project on GitHub.