Researchers often seek to predict outcomes, detect distinct subgroups within their data, or estimate causal treatment effects. But in empirical analyses, outcomes with many observed zeros can complicate these tasks. For example, when we measure viral load in a patient, loads below an instrument's limit of detection register as zero, even if a small amount of virus is present. In ecology, we often measure outcomes such as rainfall, yet many days bring no rain at all. In economics, we may model income over time and observe many zeros during spells of unemployment. Data distributions that exhibit zero-inflation and skewness, rather than a clean bell curve, pose obstacles even to basics such as comparing adjusted average costs between two medical treatment groups. “In these settings, standard regression models are untenable as they assume away these complexities,” comments biostatistics PhD candidate Arman Oganisian. “We need highly flexible, data-adaptive modeling.”
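To make the pathology concrete, here is a minimal simulation sketch (in Python; not the authors' code, and all numbers are invented) of a zero-inflated, right-skewed cost outcome: a point mass at exactly zero mixed with a log-normal distribution for positive costs.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 10_000

# Structural zeros: with probability 0.3 the cost is exactly zero
is_zero = rng.random(n) < 0.3

# Positive costs are right-skewed (log-normal), not bell-shaped
positive_cost = rng.lognormal(mean=8.0, sigma=1.2, size=n)

cost = np.where(is_zero, 0.0, positive_cost)

print(f"share of exact zeros: {np.mean(cost == 0):.2f}")
print(f"mean: {cost.mean():.0f}, median: {np.median(cost):.0f}")
# The mean far exceeds the median: a single Gaussian error model
# would misrepresent both the spike at zero and the heavy right tail.
```

A histogram of `cost` would show the characteristic shape: a tall spike at zero next to a long right tail, which is exactly the shape a normal-theory regression assumes away.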
Mr. Oganisian recently led development of a method that offers researchers faced with such outcomes a more robust way of doing regression analyses. In the resulting paper, he, Nandita Mitra, PhD, and Jason Roy, PhD, present a multi-purpose Bayesian nonparametric model for continuous, zero-inflated outcomes, which predicts structural zeros, captures skewness, and clusters patients with similar joint-data distributions. “The impact of this work is really to expand our statistical toolkit—to offer researchers faced with these skewed, zero-inflated outcomes a new way of doing regression and clustering analyses that doesn't make any of the naive assumptions of standard models and can adapt to the complexities of their data,” says Oganisian.
In an application, the researchers analyzed Medicare costs for inpatient hospitalization of endometrial cancer patients treated with either chemotherapy or radiation therapy, as reflected in the SEER-Medicare database. Here, the zeros arose because some patients never had an inpatient hospitalization. “It’s important to note, these zeros may be driven by the treatment itself: Maybe treatment A has fewer adverse events that lead to emergency room visits, and so we see more zero costs,” comments Mr. Oganisian. “We can't just ignore these zeros—we need to allow them to be functions of treatment and covariates.”
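The principle that zeros should depend on treatment and covariates can be illustrated with a simple two-part decomposition, a much cruder, non-Bayesian cousin of the paper's nonparametric model: the mean cost in each arm factors as the probability of incurring any cost times the mean cost among those who incur one. The data and all parameter values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
treat = rng.integers(0, 2, size=n)  # A = 0 or 1; labels are illustrative

# Treatment shifts BOTH the chance of any hospitalization cost
# and the size of the cost when one occurs.
p_zero = np.where(treat == 1, 0.40, 0.25)  # more zero-cost patients under A=1
mu_log = np.where(treat == 1, 9.0, 8.6)    # but larger positive costs under A=1

has_cost = rng.random(n) >= p_zero
cost = np.where(has_cost, rng.lognormal(mu_log, 1.0), 0.0)

# Two-part decomposition: E[Y | A] = P(Y > 0 | A) * E[Y | Y > 0, A]
for a in (0, 1):
    arm = treat == a
    p_pos = np.mean(cost[arm] > 0)
    mean_pos = cost[arm][cost[arm] > 0].mean()
    print(f"A={a}: P(Y>0)={p_pos:.2f}, E[Y|Y>0]={mean_pos:,.0f}, "
          f"E[Y]={p_pos * mean_pos:,.0f}")
```

Ignoring the zeros (or modeling them as if they were small positive costs) would conflate the two channels; the decomposition lets each arm's zero probability and positive-cost distribution differ, which is the behavior the quote describes.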
The researchers demonstrate that the model can be coherently incorporated into a standardization procedure for computing causal effect estimates that hold up even under such data pathologies. Uncertainty at every level of the model flows through to the causal effect estimates of interest.
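The core of a standardization (g-computation) procedure can be sketched in a few lines. The sketch below substitutes simple stratified sample means for the paper's Bayesian nonparametric model, and every variable and number is invented; the key move is the same, however: estimate each covariate stratum's mean outcome under each treatment, then average over the covariate distribution rather than comparing the treated and untreated groups directly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Binary covariate X (e.g., advanced-stage disease) confounds treatment choice
x = rng.random(n) < 0.4
p_treat = np.where(x, 0.7, 0.3)   # sicker patients receive A=1 more often
a = rng.random(n) < p_treat

# Zero-inflated outcome: both the zero probability and the positive
# cost depend on treatment and covariate (all coefficients invented)
p_zero = 0.5 - 0.15 * a - 0.10 * x
pos = rng.lognormal(8.0 + 0.3 * a + 0.4 * x, 1.0)
y = np.where(rng.random(n) < p_zero, 0.0, pos)

# Standardization over the empirical covariate distribution:
# E_hat[Y(a)] = sum over x of E_hat[Y | A=a, X=x] * P_hat(X=x)
def standardized_mean(a_val):
    total = 0.0
    for x_val in (False, True):
        stratum = x == x_val
        cell = stratum & (a == a_val)
        total += y[cell].mean() * stratum.mean()
    return total

effect = standardized_mean(True) - standardized_mean(False)
naive = y[a].mean() - y[~a].mean()
print(f"standardized effect: {effect:,.0f}  (naive difference: {naive:,.0f})")
```

Because sicker patients both cost more and receive treatment more often, the naive difference in arm means overstates the effect; standardization removes that confounding. In the paper's Bayesian version, repeating this computation across posterior draws of the outcome model is what propagates model uncertainty into the effect estimate.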
The team published an open-source, peer-reviewed software package that implements the new method in R, along with a companion website offering documentation, installation instructions, and several examples. In an additional working paper, they extended the method to compare not just the cost but the cost-effectiveness of two cancer treatments, while adjusting for baseline differences between the treatment groups. “For cancer, effectiveness is often measured as increased survival time. If a drug is more costly but also prolongs survival, we would want to capture that,” says Mr. Oganisian. The team therefore extended the cost model by adding a model for the survival-time component. The second paper provides a framework for identifying subgroups of patients in which cost-effectiveness differs (e.g., treatment A may be more cost-effective among Hispanic females, and less cost-effective among males over the age of 50).
“In this era of increasing medical expenditures, it is critical to have robust and theoretically sound statistical methods for health care cost analyses, which improve upon current methods used by economists and health policy makers,” commented Dr. Mitra. “In addition, it is important to make these methods accessible for easy application. I am incredibly proud of Arman’s rigorous methodology and his well-documented, easily implementable tools.”