Position: PhD Candidate
Current Institution: University of Washington
Abstract: Principles and Interactive Systems for Authoring Valid Statistical Analyses
Statistical models play a critical role in how people evaluate data and make decisions. Policy makers rely on models to track disease, inform health recommendations, and allocate resources. Scientists use models to develop, evaluate, and compare theories. Faulty statistical models can lead to spurious estimations of disease spread, findings that do not generalize or reproduce, and a misinformed public. The challenge in developing accurate statistical models lies not in a lack of access to mathematical tools, of which there are many (e.g., R, Python, SPSS), but in accurately applying them to answer key analysis questions. To identify the barriers to accurate statistical analysis, we observed data scientists. From these observations, we developed the theory of hypothesis formalization. The theory states that, in order for analysts to translate their motivating questions into statistical modeling programs, they must refine their domain theory and iterate on modeling implementations under constraints of data and statistical knowledge. However, there is a mismatch between the interfaces existing statistical tools provide and the needs of analysts during this process, especially for analysts who have domain knowledge but lack deep statistical expertise (e.g., many researchers). To address this need, we developed Tea, a high-level language that uses constraint solving to automate statistical test selection, and Tisane, an interactive system for authoring generalized linear models. Both systems leverage the key insight that analysts have implicit knowledge about their domain and data that can be used to infer valid statistical models. We found that Tea and Tisane both catch and avoid common analysis mistakes that threaten the validity of findings. Researchers also report that Tisane helps them focus on their analysis goals and assumptions.
By conducting empirical research and developing tools that draw on techniques from human-computer interaction and programming languages, we can empower researchers to author valid statistical analyses.
Eunice Jun is a PhD student at the University of Washington, where she is advised by Jeffrey Heer and Rene Just. Her research is at the intersection of human-computer interaction, programming languages, and applied statistics. Her research mission is to make authoring valid statistical analyses easier for domain experts who are not also statisticians. To do this, she develops theories about how people analyze data and builds novel domain-specific languages and interactive systems to address key challenges. Collaborations with industry data scientists and researchers in public health, computer science, and psychology inspire her work. She has received a National Science Foundation (NSF) Graduate Research Fellowship.