The above graph has been making its rounds on the internet over the last week. It highlights the distribution of COVID-19 cases in Malaysia. As simplistic as they seem, graphs such as these can help identify high-risk populations and direct control activities such as population screening. All visualisations have caveats, and the plot above is no different. Epidemiological intuition in interpreting such graphs can give us some insight into the bad, the nit-picky and the good of such visuals. We explore these themes below.
First and foremost, the issue with the above bar plot is that it uses raw case counts to identify the groups with the most positive tests. In epidemiology (and common sense), the frequency of any state should be presented as a ratio, using the same kind of denominator for every group being compared. One such metric is the incidence ratio, which is the number of new cases in a particular group divided by the population of that group. This gives us the true proportion of positive cases in a particular group of people. As can be seen below, the incidence ratio gives a far clearer picture of test-positive numbers by age group. This picture is also far more consistent with what we know (as of now) about patterns of symptoms by age group, as clinical manifestations appear more often in older groups (Mizumoto, Kagaya, Zarebski, & Chowell, 2020). This in turn appears to directly mediate the yield of positive screening tests (Wang et al., 2020). Benchmarking against age-related data from other similarly affected countries will likely provide a clearer picture of this.
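To make the difference between counts and incidence concrete, here is a minimal sketch in Python. The age bands and all numbers are hypothetical, chosen only to illustrate the arithmetic; they are not Malaysian case or census figures.

```python
# Hypothetical figures for illustration only; not real case or census data.
cases = {"0-19": 40, "20-39": 300, "40-59": 250, "60+": 180}
population = {"0-19": 1_000_000, "20-39": 1_200_000, "40-59": 700_000, "60+": 300_000}


def incidence_per_100k(cases, population):
    """New cases in each group divided by that group's population, per 100,000."""
    return {group: cases[group] * 100_000 / population[group] for group in cases}


incidence = incidence_per_100k(cases, population)

# The group with the most cases is not necessarily the group at highest risk.
by_count = max(cases, key=cases.get)              # "20-39"
by_incidence = max(incidence, key=incidence.get)  # "60+"

for group, value in incidence.items():
    print(f"{group}: {cases[group]} cases, {value:.1f} per 100,000")
```

In this made-up example the 20-39 group has the most cases, but the 60+ group has the highest incidence because its population is much smaller.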
Interval classes. The choice of interval classes is heavily subjective and is generally left to the best judgement of the data analyst, epidemiologist, statistician, scientist, etc. However, the use of ‘non-standard’ age groups makes the development of composite ratios, such as the incidence ratio above, very challenging. Age-group data from public datasets such as the census can be rendered useless for comparison, because the mid-year populations in these datasets use standard intervals of “0-4, 5-9, 10-14, etc.” whilst the graph above uses interval classes of “1-5, 6-10, 11-15, etc.”. The difference may not look like much, but it makes any further use difficult, and renders graphs such as the one above a mere approximation that is susceptible to error.
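One way to see why the mismatch forces an approximation is to try re-binning census populations into the shifted age bands. The sketch below assumes people are spread evenly across single years of age within each census band, which is a simplifying assumption and not how real populations behave. The function names and the population figures are hypothetical.

```python
def parse_band(label):
    """Turn a label such as '0-4' into inclusive integer bounds (0, 4)."""
    low, high = label.split("-")
    return int(low), int(high)


def rebin_population(census, target_bands):
    """Approximate the population of each target band from census bands.

    Assumes an even spread across single years of age within each census
    band, so the result is only an estimate.
    """
    result = {}
    for target in target_bands:
        t_low, t_high = parse_band(target)
        total = 0.0
        for band, pop in census.items():
            c_low, c_high = parse_band(band)
            # Number of single years of age shared by the two bands.
            overlap = min(t_high, c_high) - max(t_low, c_low) + 1
            if overlap > 0:
                total += pop * overlap / (c_high - c_low + 1)
        result[target] = total
    return result


# Hypothetical census counts, for illustration only.
census = {"0-4": 500, "5-9": 400, "10-14": 300}
approx = rebin_population(census, ["1-5", "6-10"])
print(approx)
```

Each shifted band borrows four fifths of one census band and one fifth of the next, so any denominator built this way inherits the error of the even-spread assumption.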
However, despite the critique, the graph did highlight something very important. This is illustrated in the graph below: a comparison of population counts by age group and positive tests by age group.
Here we visualise something very interesting: the population distribution (in pink) gives clues about populations with particularly low rates of positive tests. There are two possibilities for this hidden population:
- They have not been tested.
- They have tested negative.
The implications of either possibility are important. If this population has not been tested, then the low cumulative positive test rates observed (~3% over the last 4 days; the average remains ~7%) are a signal for the surveillance system to increase its screening sensitivity. One possible action is for the screening machinery to focus on individuals within these age groups instead.
The implications of the second are more complicated. If many in this group have tested negative, some of those results could be false negatives, which in turn would signal (perhaps) the need to reformulate the current screening protocol. That is of course not an easy task, considering the current lack of scalable alternatives. False negative rates have been informally reported to be as high as 30% (Lanese, 2020). However, with no systematic analysis of RT-PCR as a screening tool, we may be, for the present moment, incapable of detecting this “hidden population”.
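Telling the two possibilities apart would need test counts by age group, which the graph does not provide. If such data were available, a rough sketch of the comparison could look like the following. The group names, the numbers and the `screening_summary` function are all hypothetical.

```python
def screening_summary(population, tested, positive):
    """For each group, report testing coverage and test positivity.

    coverage   = tests performed / population of the group
    positivity = positive tests / tests performed (None if nobody was tested)
    """
    summary = {}
    for group in population:
        coverage = tested[group] / population[group]
        positivity = positive[group] / tested[group] if tested[group] else None
        summary[group] = {"coverage": coverage, "positivity": positivity}
    return summary


# Hypothetical numbers: both groups show few positives, for different reasons.
population = {"group_a": 1000, "group_b": 1000}
tested = {"group_a": 10, "group_b": 400}
positive = {"group_a": 1, "group_b": 4}

summary = screening_summary(population, tested, positive)
for group, stats in summary.items():
    print(group, stats)
```

In this made-up example, group_a has few positives because hardly anyone was tested, while group_b was tested widely and mostly tested negative. Only the second pattern raises the question of false negatives.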
It is important to note that the trail of data ended at one simple graph, with no connectors. It is therefore near impossible to extrapolate possible improvements from such limited data. The take-home message is succinct: in this war we fight together, never has open data been more relevant than today.
Lanese, N. (2020). Even if you test negative for COVID-19, assume you have it, experts say. Live Science.
Mizumoto, K., Kagaya, K., Zarebski, A., & Chowell, G. (2020). Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship, Yokohama, Japan, 2020. Eurosurveillance, 25(10), 2000180. https://doi.org/10.2807/1560-7917.ES.2020.25.10.2000180
Wang, W., Xu, Y., Gao, R., Lu, R., Han, K., Wu, G., & Tan, W. (2020). Detection of SARS-CoV-2 in Different Types of Clinical Specimens. JAMA – Journal of the American Medical Association. https://doi.org/10.1001/jama.2020.3786
RT-PCR: reverse transcriptase polymerase chain reaction.