Statistics That Tell Whole Truth? It's As Easy As ABC

It's said that statistics don't lie, but they often don't tell the whole truth, either.

A Cornell statistics expert has come up with a method he believes can boost statistical power and significantly reduce bias - vital for research involving outcomes that differ by socioeconomics, race, sex and other variables.

Dan Kowal, M.S. '15, Ph.D. '17, associate professor of statistics and data science in the College of Agriculture and Life Sciences, has devised a method he calls "ABCs" - abundance-based constraints - designed to make it easier to study how the factors that affect health and life outcomes vary among different groups of people. Typically, adding this kind of complexity to a statistical model changes the estimates from simpler models.

Not so with ABCs.

"With this method, we can learn about subpopulation effects and heterogeneities without the statistical downsides that we usually expect," said Kowal, the author of "Facilitating Heterogeneous Effect Estimation via Statistically Efficient Categorical Modifiers," published March 10 in the Journal of the American Statistical Association.

Kowal said this work began when he was at Rice University, helping environmental epidemiologists at Rice and elsewhere study how social and environmental stressors affect human health.

"They were studying the impacts of these exposures, how those adverse effects combine and accumulate on childhood health and educational outcomes," he said.

They needed a statistical method that would yield both overall effects and effects for specific groups, including racial and socioeconomic groups. But accounting for different subpopulation effects in a statistical model can change the overall effects, often inflating their standard errors (a measure of a statistic's precision) and introducing group biases.

Most methods of statistical analysis - including reference group encoding, the most widely used - require encoding categorical variables numerically, which can introduce bias when subgroups are compared to a "reference" group. In comparisons by race, for instance, the reference group is typically white - and implicitly treated as "normal" - and the model compares all other groups to it. That's where bias can creep in.

For example: In the paper, Kowal estimates how fourth-grade reading scores vary by both racial residential isolation (RI) and the mother's race. Using standard statistical approaches, the output presents the RI effect for the white, or reference, group as if it is the overall effect, which could lead researchers to view the effect as small.
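
The pattern described above can be reproduced on synthetic data. The sketch below is only an illustration, not Kowal's code or data: the group labels, group sizes and slopes are invented, and the model is an ordinary least-squares fit with group-specific intercepts and slopes under reference group encoding. It shows that the coefficient reported for the exposure alone is simply the slope for the reference group.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical groups, sizes and true slopes (illustrative values only).
sizes = {"A": 600, "B": 300, "C": 100}
true_slope = {"A": -0.1, "B": -1.0, "C": -0.5}

x_parts, y_parts, g_parts = [], [], []
for label, n in sizes.items():
    x_g = rng.normal(size=n)
    y_g = 1.0 + true_slope[label] * x_g + rng.normal(scale=0.5, size=n)
    x_parts.append(x_g)
    y_parts.append(y_g)
    g_parts.append(np.full(n, label))

x = np.concatenate(x_parts)
y = np.concatenate(y_parts)
g = np.concatenate(g_parts)

# Reference group encoding with "A" as the reference:
# columns = intercept, x, 1[B], 1[C], x*1[B], x*1[C]
is_b = (g == "B").astype(float)
is_c = (g == "C").astype(float)
X = np.column_stack([np.ones_like(x), x, is_b, is_c, x * is_b, x * is_c])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Slopes from fitting each group on its own, for comparison.
slope_ref = np.polyfit(x[g == "A"], y[g == "A"], 1)[0]
slope_b = np.polyfit(x[g == "B"], y[g == "B"], 1)[0]

print("coefficient on x in the full model:", coef[1])
print("slope for reference group A alone: ", slope_ref)
print("slope for group B (coef[1] + coef[4]):", coef[1] + coef[4], "vs", slope_b)
```

In this setup the "main" coefficient on x says nothing about groups B and C; a reader who takes it as the overall effect sees only the reference group's slope.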

This is misleading, Kowal said.

"The effect doesn't appear to be significant, so researchers would incorrectly think that racial residential isolation is unimportant," he said. "ABCs makes it clear that, in fact, the overall RI effect is highly significant, and negative, and that the RI effect for Black students is significantly worse than average."

Kowal's ABC method proposes a scheme for estimating both overall and subgroup effects, without having to pick a reference group. Kowal shows that, with ABCs, expanding a statistical model to include subgroup effects actually leaves the overall effects unchanged and can enhance their statistical power.

"Consider those race-specific effects of racial residential isolation," he said. "A natural overall effect is the average of those effects, weighted by subgroup abundance. That's what ABCs do. So the interpretation is clean and doesn't prioritize any single group.

"That this approach also has statistical efficiency properties is a wonderful surprise," Kowal said.

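The abundance-weighted average Kowal describes can be written in a few lines. This is a minimal sketch with made-up numbers: the group counts and group-specific slopes are hypothetical, and the code shows only the weighting idea, not the estimation procedure from the paper.

```python
import numpy as np

# Hypothetical group counts and group-specific effects (illustrative only).
counts = np.array([600, 300, 100])
group_effects = np.array([-0.1, -1.0, -0.5])

# Abundance = each group's share of the sample.
abundance = counts / counts.sum()

# Overall effect: average of the group effects, weighted by abundance.
overall_effect = float(abundance @ group_effects)

# Each group's deviation from the overall effect.
deviations = group_effects - overall_effect

# The abundance-weighted deviations sum to zero, so no single group
# plays the role of a reference.
weighted_sum = float(abundance @ deviations)

print("overall effect:", overall_effect)
print("group deviations:", deviations)
print("abundance-weighted sum of deviations:", weighted_sum)
```

With these numbers the overall effect is about -0.41, and every group, including the largest, is described by how far it sits from that average.
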
The primary contribution, Kowal said, is "raising the alarm" about perhaps the most fundamental and widely used statistical method - linear regression - which is a foundational topic in every statistics and data science curriculum and a method of choice in science, industry and medicine.

"Linear regression is vital for learning how effects vary by subpopulations," Kowal said, "but working with variables like race has to be done carefully. ABCs were designed with that in mind. I wanted a statistical tool that can find the important differences among subpopulations, but without losing sight of the big picture."

Support for this work came from the National Institute of Environmental Health Sciences, part of the National Institutes of Health, and the National Science Foundation.
