TDWI Articles

Challenges of Modeling COVID-19

Machine learning algorithms and data acceleration platforms offer new models for timely analysis of pandemic data.

Sophisticated algorithms can enable healthcare providers to understand the spread of contagious diseases. They can predict the surge in incoming patients that might cause shortages, achieve an optimal patient-to-staff ratio, identify individuals with a higher risk of developing chronic conditions, provide personalized treatments and predict deterioration of health conditions. For example, at the University of Pennsylvania, doctors leverage a predictive analytics tool that helps identify patients who might fall victim to severe sepsis or septic shock 12 hours before the onset of the condition.

COVID-19 presents unique challenges when trying to predict the spread of the virus. Not all countries are fully transparent about infection and fatality rates, and there are different rules and assumptions regarding whether a fatality should be attributed to COVID-19 if the patient has underlying conditions. Hospitals in the U.S. have even been accused of purposely inflating the number of COVID-19 fatalities to receive higher government reimbursements. Also, with health resources often taxed to the limit, collecting data isn't the highest priority, resulting in missing or inaccurate information.

However, as imperfect as COVID-19 data might be, shooting in the dark isn't an option. Mathematical models remain our best tool for understanding the pandemic and formulating policies that protect public health.

Epidemiological Versus Machine Learning Models

The main purpose of coronavirus models is to predict the number of beds and the amount of protective gear and equipment that will be needed to treat infected patients. Other goals include better understanding how the virus spreads in order to determine the most effective social distancing policies, and helping hospitals plan how to resume necessary services while treating an influx of COVID-19 patients.

There are two types of models that are being used to track COVID-19: epidemiological and machine learning models, and each type has its own data challenges.

Epidemiological models segment populations based on specific categories (such as susceptible to infection, infectious, or fully recovered) and track how people move through each of these phases. These models take into account disease-related factors such as mode of transmission, latent period, infectious period, susceptibility, and resistance, as well as social, cultural, demographic, economic, and geographic factors.
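The compartment idea can be illustrated with the classic SIR (susceptible, infectious, recovered) model. The sketch below is a minimal version that ignores the latent period and every social or demographic factor; the function name `simulate_sir` and the parameter values (`beta`, `gamma`, the population size) are illustrative assumptions, not figures from any real COVID-19 model.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with forward Euler steps.

    beta  = transmission rate per day (assumed value, for illustration only)
    gamma = recovery rate per day, i.e. 1 / infectious period in days
    Returns a list of (susceptible, infectious, recovered), one entry per day.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    steps_per_day = round(1 / dt)
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # People leave "susceptible" in proportion to contacts with infectious people.
            new_infections = beta * s * i / population * dt
            # People leave "infectious" at a constant per-person recovery rate.
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


if __name__ == "__main__":
    history = simulate_sir(population=1_000_000, initial_infected=10,
                           beta=0.3, gamma=0.1, days=180)
    peak_day, (_, peak_infectious, _) = max(enumerate(history),
                                            key=lambda item: item[1][1])
    print(f"Peak of about {peak_infectious:,.0f} infectious people on day {peak_day}")
```

Production models typically add more compartments (for example, an exposed-but-not-yet-infectious group) and fit the rates to observed data rather than fixing them by hand.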

The challenge is that every one of these factors is highly dynamic. Take, for example, transmission rate. External factors such as how strictly quarantines are enforced, the level of social distancing dictated, how intensely movements and contacts are tracked, and the willingness of people to voluntarily abide by social distancing recommendations, all have a huge impact on how quickly COVID-19 spreads. As a result, each of these factors needs to be quantified, fed into the model, and regularly updated.
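To make the effect of a changing transmission rate concrete, the sketch below lets the rate switch on the day a policy starts and compares the resulting peak of infectious people. It is a simplified illustration: `transmission_rate`, `peak_infectious`, the multiplier values, and the day-30 policy date are all hypothetical, and a real model would estimate these quantities from data and revise them regularly.

```python
def transmission_rate(day, baseline_beta, interventions):
    """Return the transmission rate in force on a given day.

    interventions is a list of (start_day, multiplier) pairs. The multiplier of
    the most recent intervention that has started is applied to the baseline.
    """
    beta = baseline_beta
    for start_day, multiplier in sorted(interventions):
        if day >= start_day:
            beta = baseline_beta * multiplier
    return beta


def peak_infectious(population, initial_infected, baseline_beta, gamma, days,
                    interventions):
    """Run a daily-step SIR simulation and return the peak infectious count."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    peak = i
    for day in range(days):
        beta = transmission_rate(day, baseline_beta, interventions)
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        peak = max(peak, i)
    return peak


if __name__ == "__main__":
    common = dict(population=1_000_000, initial_infected=10,
                  baseline_beta=0.3, gamma=0.1, days=300)
    no_policy = peak_infectious(interventions=[], **common)
    # Assumed scenario: distancing halves the transmission rate from day 30.
    distancing = peak_infectious(interventions=[(30, 0.5)], **common)
    print(f"Peak without intervention: {no_policy:,.0f}")
    print(f"Peak with distancing from day 30: {distancing:,.0f}")
```

Because the schedule is just data, updating the model when enforcement or compliance changes means appending a new `(start_day, multiplier)` pair rather than rewriting the simulation.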

Another approach for tracking the virus is to use machine learning, which allows algorithms to learn and improve without being explicitly programmed. Machine learning automates the discovery of patterns in very large data sets, and the resulting models improve their own performance as more data becomes available. The challenges for machine learning models are finding useful COVID-19-related data sets and transforming them into a consumable format.

New data points are continuously added, tested to see if they improve forecasts, and then tuned for each geography. For example, population size, population density, age distribution, smoking rates, economic indicators, and nationwide lockdown dates were initially considered the most relevant data sets. Later, data on tests conducted by country was found to have a high impact, so it was added.
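The add-and-test loop described above resembles greedy forward feature selection: a candidate data set is kept only if it lowers forecast error on held-out data. The sketch below is one way to express that idea using a plain least-squares model. The function names, the `min_gain` threshold, and the tiny synthetic data set are all assumptions made for illustration, not the method or data used by any particular COVID-19 model.

```python
def fit_least_squares(rows, targets):
    """Fit y = w0 + w1*x1 + ... by solving the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    n = len(design[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [[sum(r[a] * r[b] for r in design) for b in range(n)]
           + [sum(r[a] * y for r, y in zip(design, targets))]
           for a in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(col + 1, n):
            factor = aug[k][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[k][j] -= factor * aug[col][j]
    # Back substitution.
    weights = [0.0] * n
    for row in range(n - 1, -1, -1):
        rest = sum(aug[row][j] * weights[j] for j in range(row + 1, n))
        weights[row] = (aug[row][n] - rest) / aug[row][row]
    return weights


def holdout_error(data, targets, feature_names, split):
    """Train on the first `split` records; return mean absolute error on the rest."""
    rows = [[record[name] for name in feature_names] for record in data]
    weights = fit_least_squares(rows[:split], targets[:split])
    errors = []
    for row, y in zip(rows[split:], targets[split:]):
        prediction = weights[0] + sum(w * x for w, x in zip(weights[1:], row))
        errors.append(abs(prediction - y))
    return sum(errors) / len(errors)


def forward_select(data, targets, candidates, split, min_gain=1e-6):
    """Keep each candidate feature only if it reduces held-out error by min_gain."""
    selected = []
    best_error = holdout_error(data, targets, selected, split)
    for name in candidates:
        error = holdout_error(data, targets, selected + [name], split)
        if best_error - error > min_gain:
            selected.append(name)
            best_error = error
    return selected, best_error


if __name__ == "__main__":
    # Synthetic records: the target depends on "density" only; "unrelated" is a distractor.
    unrelated_values = [3, 1, 4, 1, 5, 9, 2, 6]
    data = [{"density": float(d), "unrelated": float(u)}
            for d, u in zip(range(1, 9), unrelated_values)]
    targets = [2.0 * record["density"] + 1.0 for record in data]
    selected, error = forward_select(data, targets, ["density", "unrelated"], split=6)
    print("Selected features:", selected)
    print("Held-out error:", error)
```

Tuning for each geography would amount to running the same loop separately on each region's records, so that a data set can be kept in one region and dropped in another.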

Taming the Data Tsunami

Part of the success of models used to predict the spread of the virus depends on the quantity of data. In general, the more relevant data a model ingests, the more accurate it can be, and there is a lot of data that needs to be fed into the model. Take, for example, the quantity of data needed to trace all the contacts of people confirmed infected or suspected of being infected by the virus.
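A rough way to see why contact tracing generates so much data is to treat contacts as a graph and walk outward from confirmed cases. The sketch below uses a breadth-first search over a hypothetical in-memory `contact_graph`; the names and the tiny example graph are assumptions for illustration, and a real tracing system would read contacts from a database rather than a dictionary.

```python
from collections import deque


def trace_contacts(contact_graph, confirmed_cases, max_depth):
    """Breadth-first search outward from confirmed cases.

    contact_graph maps each person to the people they were in contact with.
    Returns {person: number of contact hops from the nearest confirmed case},
    limited to people within max_depth hops.
    """
    depth = {person: 0 for person in confirmed_cases}
    queue = deque(confirmed_cases)
    while queue:
        person = queue.popleft()
        if depth[person] == max_depth:
            continue  # Do not expand beyond the requested number of hops.
        for contact in contact_graph.get(person, ()):
            if contact not in depth:
                depth[contact] = depth[person] + 1
                queue.append(contact)
    return depth


if __name__ == "__main__":
    # Hypothetical contact records, for illustration only.
    contact_graph = {
        "ana": ["ben", "cy"],
        "ben": ["ana", "dee"],
        "cy": ["ana"],
        "dee": ["ben", "eli"],
        "eli": ["dee"],
    }
    traced = trace_contacts(contact_graph, ["ana"], max_depth=2)
    for person, hops in sorted(traced.items(), key=lambda item: item[1]):
        print(f"{person}: {hops} hop(s) from a confirmed case")
```

If each person has many contacts, the number of people reached can grow several-fold with every additional hop, which is why tracing even a modest number of cases can produce a very large number of records.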

The challenge is that the amount of data required can overwhelm standard data processing systems, resulting in models that either take too long to run or sacrifice quality by reading in less data because of limited computing power.

There is technology that can enable models to ingest higher volumes of data and run more quickly. A data analytics acceleration platform can speed up model run time and scale to read in more data while providing a single unified view to enable more data scientists to access the data. This platform eliminates the need for pre-aggregation or pre-modeling of data, which streamlines data preparation. To reduce model run time, GPU-powered servers can deliver faster performance at a fraction of the cost of competing CPU-only solutions. For example, query time can be reduced from days to hours and hours to minutes with the ability to add even more data without increasing the query time.

Data analytics is critical for tracking the spread of COVID-19. For both epidemiological and machine learning models, organizations need to integrate data quickly and reliably from both internal and external sources. Smarter architectures designed to scale and run models more quickly can lead to better and faster insights to enable healthcare organizations to be better prepared to treat patients and reduce the spread of COVID-19.

About the Author

Razi Shoshani is the CTO and founder of SQream, where he is responsible for SQream’s next-generation technology innovation. You can reach Razi via email or LinkedIn.

