Estimating the real number of COVID-19 Infections: How to go beyond official numbers.

21 Apr 2020

If you are like me, you probably check the coronavirus case tally every day and watch the numbers increase at a frightening rate: 1,000,000 cases worldwide in early April, and already over 2,000,000 today, just 2 weeks later. However, we can all agree on one thing: that number is not correct. It only reflects how many people tested positive, which leads us to the question: how many cases are there in reality?

In this article, we will investigate together how the data available on the coronavirus can be used to estimate the diagnosis efficiency of each country. This will reveal large disparities that can be linked to the political decisions, healthcare system or testing capacity of each country. We will also see that many countries with similar policies fall within the same range of values.

With this article, we aim to illustrate in a pedagogical way how logical reasoning and data analysis work together to extract insights from raw data. The output numbers shown here should however under no circumstances be taken as the truth, as they rely on a fair number of assumptions.


So, how can we reconstruct the missing data? 

It all starts from the following two hypotheses:

  • The death count is more accurate than the infection count
  • The fatality rate of the virus should not differ significantly from country to country.

There are of course several factors that influence this rate, namely the age distribution of the population, lifestyle (obesity), the quality of healthcare… but these factors are unlikely to change it by more than a factor of 2 or 3 (going from 2% in South Korea to maybe 4 or 6% in some countries?).

However, we actually see countries with a fatality rate around 30%, 15 times higher than South Korea’s! So what accounts for most of the difference? This is what we will explore, but let’s first look at how we can accurately calculate the observed fatality rate:


Delay between diagnosis and death

To measure the fatality rate more accurately, we need to assign each death to the day the patient was diagnosed. Indeed, the time between diagnosis and death, coupled with the exponential growth of cases, produces a fatality rate that is not constant in time and that underestimates reality:

Figure 1: Diagnosed cases, registered fatalities and fatality ratio for Belgium as per the real date registered at the WHO
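
The effect can be reproduced on synthetic data with a few lines of Python. This is only a sketch: the logistic case curve, the 20% rate among diagnosed cases, the 9-day delay and all function names are assumptions made up for the illustration, not values fitted to Belgium or any other country.

```python
import math

# Synthetic illustration only: every number here is an assumption, not real data.
TRUE_RATE = 0.20   # assumed fatality rate among diagnosed cases
DELAY = 9          # assumed days between diagnosis and death

def cumulative_cases(day):
    """Cumulative diagnosed cases: fast growth at first, then a plateau (logistic curve)."""
    return 50_000 / (1 + math.exp(-0.2 * (day - 30)))

def cumulative_deaths(day):
    """Deaths counted by `day` come from the cases diagnosed DELAY days earlier."""
    return TRUE_RATE * cumulative_cases(day - DELAY)

def naive_rate(day):
    """Deaths to date divided by cases to date, with no time correction."""
    return cumulative_deaths(day) / cumulative_cases(day)

for day in (20, 40, 60):
    print(f"day {day}: naive fatality rate = {naive_rate(day):.1%}")
```

In this toy model the naive rate starts well below the assumed 20% while cases grow quickly, and only approaches it once growth has slowed down.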


Because the data is not known for each patient, we will have to use an average duration to shift the curve in time. Let’s keep in mind that this average duration will also differ from country to country, depending on the testing capabilities and quality / availability of healthcare.

Figure 2. Delay between diagnosis and death illustrated for Italy and Switzerland


Looking at the countries most affected by the virus, we observe delays typically ranging between 7 and 11 days. We thus decided to apply an average shift of 9 days to every country when calculating the corrected fatality rate.
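
The delays above were read off the curves by eye. One hypothetical way to automate that step, which is not the method used for this article, is to try several shifts and keep the one for which the fatality ratio is the most stable over time. The sketch below does this on synthetic data in which a 9-day delay is deliberately hidden; the curve shape, the 15% rate and the function names are all assumptions.

```python
import math
from statistics import mean, pstdev

# Synthetic data only: the curve shape, the 15% rate and the hidden 9-day
# delay are assumptions chosen so that the expected answer is known.
HIDDEN_DELAY = 9
cases = [50_000 / (1 + math.exp(-0.2 * (day - 30))) for day in range(60)]
deaths = [0.0] * HIDDEN_DELAY + [0.15 * c for c in cases[:-HIDDEN_DELAY]]

def ratio_spread(cases, deaths, shift, start=20):
    """Relative spread of deaths[t] / cases[t - shift] from day `start` onwards."""
    ratios = [deaths[t] / cases[t - shift] for t in range(start, len(cases))]
    return pstdev(ratios) / mean(ratios)

def best_shift(cases, deaths, candidates=range(0, 15)):
    """Candidate shift for which the fatality ratio is the most stable over time."""
    return min(candidates, key=lambda s: ratio_spread(cases, deaths, s))

print(best_shift(cases, deaths))
```

On this synthetic series the search recovers the 9-day delay built into the data. Real series are noisy and reporting practices change over time, so the result would need a visual sanity check.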

Figure 3: Diagnosed cases, registered fatalities and fatality ratio for Belgium with days of death shifted 9 days earlier than they were actually registered at the WHO.
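
A minimal sketch of this time correction is given below, assuming the data comes as two cumulative daily series of equal length. The function name and the example numbers are made up for the illustration, and a 2-day shift is used instead of 9 only to keep the lists short.

```python
def shifted_fatality_rate(cum_cases, cum_deaths, shift=9):
    """Fatality rate per diagnosis day: deaths registered `shift` days later
    divided by the cases known on that day. Both inputs are cumulative daily
    series of the same length; days without cases yield None."""
    rates = []
    for day in range(len(cum_cases) - shift):
        cases_that_day = cum_cases[day]
        deaths_later = cum_deaths[day + shift]
        rates.append(deaths_later / cases_that_day if cases_that_day else None)
    return rates

# Made-up numbers for illustration, with a 2-day shift to keep the lists short.
example_cases = [100, 200, 400, 800, 1600]
example_deaths = [0, 5, 10, 20, 40]

naive = [d / c for c, d in zip(example_cases, example_deaths)]
corrected = shifted_fatality_rate(example_cases, example_deaths, shift=2)
print(naive)
print(corrected)
```

The corrected series is shorter than the input by `shift` days, because the most recent diagnoses do not yet have their matching deaths.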


Before this time correction, the apparent fatality rate for Belgium was rising, going from 5% to 10% between 31 March and 9 April. After the time correction, it appears constant over that period, but at a rate of 20%!

At this point, it is important to note that this does not mean 20% of infected people die, as not every infected person is diagnosed. This apparently high rate is due to the very low testing capacity in Belgium, with most tests performed on people requiring hospitalisation. We may also note that Belgium decided to include in the death count any suspected COVID-19 death in retirement homes regardless of diagnosis, increasing this rate even further.


A small parenthesis on South Korea

Let’s now dive a bit into South Korea’s strategy to fight the virus. The fatality rate there is amongst the lowest of all countries, at only 2%. Why is that?

South Korea has performed very aggressive testing, using apps on every citizen’s phone to track their movements. This means that from the moment a positive case is detected, the government can rapidly identify everybody who crossed paths with the infected person and warn them so that they can isolate themselves and get tested. This is how South Korea managed to contain the virus without needing a lockdown, saving both lives and its economy. And while 100% diagnosis is never achieved, it is reasonable to assume that this strategy allowed them to diagnose a high share of cases, probably close to 100%.


How can we then estimate the diagnosis efficiency for each country?

With everything we have just seen, we can make a few assumptions and extrapolate what percentage of infected people are actually diagnosed in each country. The three assumptions are:

  1. The real fatality rate is not significantly different from one country to another
  2. The delay between diagnosis and death averages a period of 9 days
  3. South Korea achieved near 100% diagnosis.
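
One plausible reading of these assumptions is the following. If the real fatality rate is the same everywhere and South Korea diagnoses nearly every case, then a country’s share of diagnosed infections is roughly South Korea’s observed rate divided by that country’s time-corrected rate. The sketch below is only an illustration of that arithmetic: the function names are hypothetical, and the exact formula behind the article’s chart may differ.

```python
REFERENCE_RATE = 0.02  # South Korea's observed fatality rate, taken as the benchmark

def diagnosis_rate(observed_rate, reference_rate=REFERENCE_RATE):
    """Estimated share of infections that get diagnosed.
    Values above 1 mean the observed rate is below the benchmark."""
    return reference_rate / observed_rate

def estimated_real_cases(diagnosed_cases, observed_rate, reference_rate=REFERENCE_RATE):
    """Diagnosed cases scaled up by the estimated diagnosis rate."""
    return diagnosed_cases / diagnosis_rate(observed_rate, reference_rate)

# Belgium's time-corrected rate quoted above is about 20%.
print(f"{diagnosis_rate(0.20):.0%}")
# Hypothetical count of 1,000 diagnosed cases, for illustration only.
print(round(estimated_real_cases(1_000, 0.20)))
```

With the rounded figures of 2% and 20%, the estimated diagnosis rate comes out at about 10%, so every diagnosed case would stand for roughly ten real infections.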


From the observed fatality rate and the three assumptions above, we can estimate the real number of cases per country and what proportion of these real cases has been diagnosed. The results are shown in the interactive chart below:



We observe fluctuations and evolutions over time:

  • Let’s go back to the example of Belgium. The diagnosis rate was of the order of 40% at the beginning of the crisis. However, as the number of cases increased, the country did not manage to scale up its testing capacity and ended up missing a lot of cases. We now estimate that only 9% of infected people are being diagnosed. This figure is similar to that of many other countries that also decided to test only at hospitalisation (France, Italy, Spain, Netherlands…).
  • Let’s look at another example: in mid-March, Luxembourg decided to extend testing to everybody who wanted to be tested. We can clearly see the diagnostic coverage increasing thanks to these measures.
  • Finally, we observe some countries with diagnosis rates above 100%. This could mean that South Korea was not the right benchmark and that these countries actually achieve better diagnosis results. There is however no strong evidence that they apply better diagnosis methods. It more likely indicates that these countries fail to count all deaths (we will let readers decide for themselves whether this is done on purpose to fake the numbers, or simply for lack of means).

We have thus managed to link data to events and understand the situation. This kind of reasoning applies to a wide range of fields (politics, public health, economics… or the performance of your business) and provides insights that help make better decisions.

While the trends shown here hold, please note that because of the high number of unknowns surrounding COVID-19, the numbers are only rough approximations and should not be re-used anywhere out of context.