Statistical Tool Developed at Columbia Will Track Covid-19 Data as New York Emerges From Lockdown

During a recent press conference, New York Gov. Andrew M. Cuomo referred to Samir Bhatt, the state’s advisor on Covid-19 data. His team of researchers will collect data about testing, hospitalizations, and transmission rates to quantify geographic variation in the spread of coronavirus as New York’s sectors and counties emerge from lockdown.

“I think Dr. Bhatt deserves all our thanks because they really helped us all through this to date,” Gov. Cuomo said. “I want to thank him very much for taking the time to advise us, not just on how we constructed our model to date, but what happens going forward as we increase the economic activity and we start to see numbers change.”

Bhatt, a senior lecturer in geostatistics at Imperial College London, is quick to point out that he relies on a team, the scientific community, and statistical and computational tools to support this work. Among those tools is Stan, a statistical programming language created by a team led by Andrew Gelman, a Columbia University professor of statistics and political science and a Data Science Institute member.

Created in 2012 by a team of statisticians, computer scientists, and applied researchers, including Gelman, computational linguist Bob Carpenter, and computer scientist Matt Hoffman, Stan performs Bayesian inference, which is a statistical method for combining information from multiple sources. The open-source code is continually tweaked by developers from around the world, and the program has tens of thousands of users.

Epidemiologists have models of how disease spreads, but to understand how many lives are at risk or when hospitals will be overrun with patients, a model needs to produce numbers. Using Stan along with a modern statistical workflow, epidemiologists can build a quantitative version of a conceptual model, and also quantify for decision-makers how much trust should be placed in the model.
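To make the idea of "producing numbers" along with a measure of trust concrete, here is a minimal sketch in Python. It is not Stan code and not the Imperial College model: it uses a textbook Beta-Binomial update rather than the sampling methods Stan relies on, and every number in it (the prior counts, the 100 tests, the 12 positives) is hypothetical, chosen only for illustration.

```python
import random

def posterior_beta(prior_a, prior_b, positives, tests):
    """Combine a Beta(prior_a, prior_b) prior with binomial test data.

    The result is the Beta posterior for the underlying positivity rate.
    """
    return prior_a + positives, prior_b + (tests - positives)

def summarize(a, b, draws=20000, seed=0):
    """Return the posterior mean and an approximate 95% interval.

    The interval is estimated by drawing samples from Beta(a, b).
    """
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    mean = a / (a + b)
    lower = samples[int(0.025 * draws)]
    upper = samples[int(0.975 * draws)]
    return mean, lower, upper

# Hypothetical inputs: a prior belief worth 40 earlier tests with 2 positives,
# combined with a new batch of 100 tests, 12 of them positive.
a, b = posterior_beta(2, 38, positives=12, tests=100)
mean, lower, upper = summarize(a, b)
print(f"estimated rate: {mean:.3f}, 95% interval: ({lower:.3f}, {upper:.3f})")
```

The single estimate is the "number" a decision-maker would see, and the width of the interval is one simple way of expressing how much trust to place in it. Stan applies the same logic to models far too complex for a closed-form update like this one.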

“Stan will help us approach the Covid crisis from a data-driven perspective and to reach a scientific consensus on the coronavirus,” Bhatt says. “And that knowledge can then be used by Gov. Cuomo to make the most informed decisions.”

Bhatt says it would take the Imperial College team months or even years to write and debug a program that fits a statistical model sophisticated enough to contend with the ever-changing coronavirus data. Using Stan allows the team to focus on the science while letting “Stan get the numbers right without bugs or inference problems.”

Those who develop statistical tools are sometimes overlooked, but Bhatt gives Gelman and the Stan team credit for combining state-of-the-art computation with sound statistical ideas to build a program that allows scientists and researchers to do their jobs more effectively.

“Stan is the best tool for making inferences under uncertainty, and the ability to make inferences is the most important aspect of tracking the virus,” Bhatt says. Other software systems also contend with uncertainty, he adds, but Stan is built to quantify that uncertainty as accurately as possible. And it tells you if your results make sense, “so you can’t cheat yourself.”

— Robert Florida, Data Science Institute
