@article{angelopoulosidentifying, title={On Identifying and Mitigating Bias in the Estimation of the COVID-19 Case Fatality Rate}, author={Angelopoulos, Anastasios Nikolas and Pathak, Reese and Varma, Rohit and Jordan, Michael I}, journal={Harvard Data Science Review}, year={2020} }
The case fatality rate quantifies how dangerous COVID-19 is, and how risk of death varies with strata like geography, age, and race. Current estimates of the COVID-19 case fatality rate (CFR) are biased for dozens of reasons, from under-testing of asymptomatic cases to government misreporting. We provide a careful and comprehensive overview of these biases and show how statistical thinking and modeling can combat such problems. Most importantly, data quality is key to unbiased CFR estimation. We show that a relatively small dataset collected via careful contact tracing would enable simple and potentially more accurate CFR estimation.
The CFR is a measure of disease severity. Furthermore, the relative CFR (the ratio of CFRs between two subpopulations) is a useful target for data-informed resource-allocation protocols because it measures relative risk. In other words, the CFR tells us how drastic our response needs to be; the relative CFR helps us allocate scarce resources to populations that have a higher risk of death.
Although the CFR is defined as the proportion of infections that are fatal, we cannot expect that dividing the number of deaths by the number of cases will give us a good estimate of the CFR. The problem is that both the numerator (#deaths) and the denominator (#infections) of this fraction are uncertain for systematic reasons rooted in the way data is collected. For this reason, we call that estimator the naive estimator.
In short, because the data is biased, we are losing at least 99.8% of our sample efficiency. There's a well-known "butterfly effect" in statistics: a tiny correlation between your sampling method and the quantity you're estimating can have huge, destructive effects on your estimator. Even assuming a tiny 0.005 correlation between the population we test and the population infected, testing 10,000 people for SARS-CoV-2 is equivalent to testing 20 individuals randomly. For estimating the fatality rate, the situation is even worse, since we have many reasons to believe that severe cases are preferentially diagnosed and reported. In the words of Xiao-Li Meng, "compensating for [data] quality with quantity is a doomed game." In our HDSR article, we show that in order for estimates built on this data to be trustworthy, the sampling process must be very nearly independent of infection and fatality status.
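To get a feel for the arithmetic behind the "10,000 is equivalent to 20" claim, here is a sketch of one version of Meng's effective-sample-size formula, n_eff ≈ (f / (1 − f)) / ρ² with sampling fraction f = n/N. The population size N below is my own hypothetical choice made to illustrate the effect; it is not a figure from the article.

```python
def effective_sample_size(n, N, rho):
    # One version of Meng's effective-sample-size approximation: a biased
    # sample of n people from a population of N, whose inclusion has data
    # defect correlation rho with the quantity of interest, carries roughly
    # the information of a simple random sample of size n_eff.
    f = n / N  # sampling fraction
    return (f / (1 - f)) / rho ** 2

# N = 20 million is a hypothetical population size chosen purely for
# illustration; it is not a number from the article.
n_eff = effective_sample_size(n=10_000, N=20_000_000, rho=0.005)
```

Under these assumed numbers, n_eff comes out to roughly 20: almost all of the nominal sample size is wasted.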
The primary source of COVID-19 data is population surveillance: county-level aggregate statistics reported by medical providers who diagnose patients on-site. Usually, somebody feels sick and goes to a hospital, where they get tested and diagnosed. The hospital reports the number of cases, deaths, and sometimes recoveries to local authorities, who usually release the data on a weekly basis. Of course, this is an idealized model; in reality, there are many differences in data collection between nations, local governments, and even hospitals.
Dozens of biases are induced by this method of surveillance, falling into roughly five categories: under-ascertainment of mild cases, time lags, interventions, group characteristics (e.g., age, sex, race), and imperfect reporting and attribution. An extensive (but not exhaustive) discussion of the magnitude and direction of these biases is in our article. Without mincing words: this data is extremely low quality. The vast majority of people who get COVID-19 go undiagnosed; symptoms and deaths are misattributed; data reported by governments is often (and perhaps purposefully) incorrect; cases are defined inconsistently across countries; and there are many time lags (for example, cases are counted as 'diagnosed' before they are 'fatal', leading to a downward bias in the CFR while the number of cases is growing over time). Figure 1 has a graphical model describing these many relationships; see the paper for a detailed explanation of the biases that occur across each edge.
Correcting for biases is sometimes possible using outside data sources, but it can result in a worse estimator overall due to partial bias cancellation. This is easier to see through an example than to explain abstractly. Assume the true CFR is some value p, and that two biases act on the naive estimator in opposite directions: under-ascertainment of mild cases pulls it above p, while time lags (deaths not yet observed) pull it below p. If the two effects roughly cancel, the naive estimate sits near p; correcting only the time-lag bias then removes the downward pull, leaves the upward one intact, and yields an estimate farther from p than the one we started with.
The mathematical form of the naive estimator is simply the number of reported deaths divided by the number of confirmed cases; each of the biases above distorts its numerator, its denominator, or both.
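Here is a toy simulation of that partial-cancellation effect. Every number in it (the diagnosis probabilities, the lag factor, the true CFR) is an illustrative assumption of mine, not an estimate from the article:

```python
import random

random.seed(0)
true_cfr = 0.01           # hypothetical true CFR
n_infected = 200_000      # hypothetical infected population

diagnosed = deaths_observed = 0
for _ in range(n_infected):
    fatal = random.random() < true_cfr
    # Bias 1 (upward): severe cases are preferentially diagnosed
    if random.random() < (0.9 if fatal else 0.2):
        diagnosed += 1
        # Bias 2 (downward): by reporting time, only 25% of the
        # eventually-fatal diagnosed cases have actually died
        if fatal and random.random() < 0.25:
            deaths_observed += 1

naive = deaths_observed / diagnosed
# "Correcting" only the time-lag bias (scaling deaths by 1/0.25)
# removes the downward pull but leaves the ascertainment bias intact,
# landing farther from true_cfr than the uncorrected estimate.
lag_corrected = (deaths_observed / 0.25) / diagnosed
```

With these particular assumptions the two biases nearly cancel, so the naive estimate sits close to 1% while the lag-corrected one lands around 4%: the "corrected" estimator is worse.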
In our article, we outline a testing procedure that helps fix some of the above dataset biases. If we collect data properly, we think even simple estimators, like the naive one, can be accurate.
This protocol is meant to eliminate the covariance between fatality and diagnosis: if patients commit to testing before they develop symptoms, disease severity cannot influence who gets diagnosed. There may still be issues with participants dropping out of the study; if that proves to be a problem in practice, it can be mitigated by a combination of incentives (payments) and post-stratification.
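As a sketch of what post-stratification means here: if dropouts skew the study cohort's age mix, per-stratum estimates can be reweighted by known population shares. All numbers below are hypothetical:

```python
# Known population age shares (hypothetical)
population_share = {"<40": 0.55, "40-70": 0.35, ">70": 0.10}

# (deaths, cases) observed per stratum in a hypothetical study cohort,
# whose age mix need not match the population's
observed = {"<40": (2, 1000), "40-70": (9, 600), ">70": (24, 400)}

# Post-stratified CFR: weight each stratum's fatality rate by its
# population share rather than by its share of the (skewed) cohort
cfr = sum(share * observed[s][0] / observed[s][1]
          for s, share in population_share.items())
```

The reweighted estimate is driven by the population's composition, so differential dropout across strata no longer biases it (assuming dropout is unrelated to severity within each stratum).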
Figure 2 represents an idealized version of this study. In the best-case scenario, there is no covariance between death and diagnosis. In that case, a comparatively small random sample is enough to estimate the CFR accurately.
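For a rough sense of scale (my own back-of-the-envelope calculation, not a figure from the article), the classical sample-size formula for estimating a proportion under simple random sampling gives:

```python
import math

def srs_sample_size(p, margin, z=1.96):
    # Classical formula: n >= z^2 * p * (1 - p) / margin^2 yields a
    # ~95% confidence interval of half-width `margin` around a
    # proportion near p, under simple random sampling.
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# e.g., pinning down a CFR near 1% to within +/- 0.2 percentage points
# (both numbers are hypothetical choices)
n = srs_sample_size(p=0.01, margin=0.002)
```

Under these assumed inputs, a random sample on the order of ten thousand infections would suffice: modest next to the millions of biased surveillance records.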
This strategy mostly resolves what we believe are the largest sources of bias in CFR estimation -- under-ascertainment of mild cases and time lags. However, there is still plenty of room for improvement, such as understanding the dependency of the CFR on age, sex, and race. (In other words, the CFR is a random quantity itself, depending on the population being sampled.) Distinctions between the CFRs of these strata may be quite small, requiring a great deal of high-quality data to detect.
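To see why between-strata comparisons demand so much data, here is a textbook Katz log-scale confidence interval for the relative CFR of two strata; the counts are hypothetical:

```python
import math

def relative_cfr_ci(d1, n1, d2, n2, z=1.96):
    # Ratio of two fatality proportions with a Katz log-scale ~95% CI,
    # assuming independent binomial samples (a standard approximation).
    rr = (d1 / n1) / (d2 / n2)
    se = math.sqrt(1 / d1 - 1 / n1 + 1 / d2 - 1 / n2)
    return rr, rr * math.exp(-z * se), rr * math.exp(z * se)

# Hypothetical counts: 24 deaths / 400 cases in an older stratum
# versus 2 deaths / 1000 cases in a younger one
rr, lo, hi = relative_cfr_ci(24, 400, 2, 1000)
```

Even with hundreds of cases per stratum, the interval around the point estimate spans more than an order of magnitude; resolving smaller between-strata differences requires far larger samples.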
I'd like to re-emphasize a point here: collecting data as above will make even simple estimators trustworthy, while no amount of statistical sophistication can fully compensate for biased data.
In our academic article, we also provide some thoughts on how to use time-series data and outside information to correct for time lags and relative reporting rates; that work builds heavily on one of Nick Reich's papers. However, as I claimed earlier, even fancy estimators cannot overcome fundamental problems with data collection. I'll defer discussion of that estimator, and the results we got from it, to the article; it's best parsed by experts looking for a perspective on how to perform these estimations properly and honestly. If you're reading this and think, "that's me," then I'd love to hear your thoughts.
CFR estimation is clearly a difficult problem — but with proper data collection and estimation guided by data scientists, I still believe that we can get a useful CFR estimate. This will help guide public policy decisions about this urgent and ongoing pandemic.
The first public release of this work was on March 19, 2020, and the last update to this page was on June 22, 2020.