{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2022,5,16]],"date-time":"2022-05-16T07:50:49Z","timestamp":1652687449201},"reference-count":87,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["ACM Trans. Comput.-Hum. Interact."],"published-print":{"date-parts":[[2022,2,28]]},"abstract":"Data analysis requires translating higher level questions and hypotheses into computable statistical models. We present a mixed-methods study aimed at identifying the steps, considerations, and challenges involved in operationalizing hypotheses into statistical models, a process we refer to as hypothesis formalization. In a formative content analysis of 50 research papers, we find that researchers highlight decomposing a hypothesis into sub-hypotheses, selecting proxy variables, and formulating statistical models based on data collection design as key steps. In a lab study, we find that analysts fixated on implementation and shaped their analyses to fit familiar approaches, even if sub-optimal. In an analysis of software tools, we find that tools provide inconsistent, low-level abstractions that may limit the statistical models analysts use to formalize hypotheses. 
Based on these observations, we characterize hypothesis formalization as a dual-search process balancing conceptual and statistical considerations constrained by data and computation and discuss implications for future tools.","DOI":"10.1145\/3476980","type":"journal-article","created":{"date-parts":[[2022,1,7]],"date-time":"2022-01-07T14:50:16Z","timestamp":1641567016000},"page":"1-28","source":"Crossref","is-referenced-by-count":2,"title":["Hypothesis Formalization: Empirical Findings, Software Limitations, and Design Implications"],"prefix":"10.1145","volume":"29","author":[{"ORCID":"http:\/\/orcid.org\/0000-0002-4050-4284","authenticated-orcid":false,"given":"Eunice","family":"Jun","sequence":"first","affiliation":[{"name":"University of Washington, Seattle, WA"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-6119-0544","authenticated-orcid":false,"given":"Melissa","family":"Birchfield","sequence":"additional","affiliation":[{"name":"University of Washington, Seattle, WA"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-2521-5126","authenticated-orcid":false,"given":"Nicole","family":"De Moura","sequence":"additional","affiliation":[{"name":"Eastlake High School, Seattle, WA"}]},{"given":"Jeffrey","family":"Heer","sequence":"additional","affiliation":[{"name":"University of Washington, Seattle, WA"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-5982-275X","authenticated-orcid":false,"given":"Ren\u00e9","family":"Just","sequence":"additional","affiliation":[{"name":"University of Washington, Seattle, WA"}]}],"member":"320","reference":[{"issue":"1","key":"e_1_3_2_2_2","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1109\/TVCG.2018.2865040","article-title":"Futzing and moseying: Interviews with professional data analysts on exploration practices","volume":"25","author":"Alspaugh Sara","year":"2018","journal-title":"IEEE Transactions on Visualization and Computer Graphics"},{"key":"e_1_3_2_3_2","article-title":"Fitting linear mixed-effects models using 
lme4","author":"Bates Douglas","year":"2014","journal-title":"arXiv:1406.5823"},{"key":"e_1_3_2_4_2","article-title":"Package \u2018lme4\u2019","author":"Bates Douglas","year":"2019","journal-title":"CRAN"},{"key":"e_1_3_2_5_2","article-title":"Characterizing exploratory visual analysis: A literature review and evaluation of analytic provenance in tableau","author":"Battle Leilani","year":"2019","journal-title":"Computer Graphics Forum 38, 3"},{"key":"e_1_3_2_6_2","volume-title":"Graphics and Graphic Information Processing","author":"Bertin Jacques","year":"2011"},{"key":"e_1_3_2_7_2","doi-asserted-by":"crossref","unstructured":"Michael Betancourt. 2020. Towards a Principled Bayesian Workflow. Psychological Methods 26 1 (2020) 103-126. Retrieved from https:\/\/betanalpha.github.io\/assets\/case_studies\/principled_bayesian_workflow.html.","DOI":"10.1037\/met0000275"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.5555\/3322706.3322734"},{"key":"e_1_3_2_9_2","article-title":"Top 15 python libraries for data science in 2017","author":"Bobriakov Igor","year":"2017","journal-title":"ActiveWizards in Medium"},{"key":"e_1_3_2_10_2","article-title":"Top 20 python libraries for data science in 2018","author":"Bobriakov Igor","year":"2018","journal-title":"ActiveWizards in Medium"},{"key":"e_1_3_2_11_2","unstructured":"Leo Breiman Adele Cutler Andy Liaw and Matthew Wiener. 2018. Package \u201crandomForest\u201d. (2018). 
Retrieved September 16 2020 from https:\/\/cran.r-project.org\/web\/packages\/randomForest\/randomForest.pdf."},{"issue":"2","key":"e_1_3_2_12_2","doi-asserted-by":"crossref","first-page":"378","DOI":"10.32614\/RJ-2017-066","article-title":"glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling","volume":"9","author":"Brooks Mollie E.","year":"2017","journal-title":"The R Journal"},{"key":"e_1_3_2_13_2","article-title":"API design for machine learning software: Experiences from the scikit-learn project","author":"Buitinck Lars","year":"2013","journal-title":"arXiv:1309.0238"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.18637\/jss.v080.i01"},{"key":"e_1_3_2_15_2","article-title":"Stan: A probabilistic programming language","volume":"76","author":"Carpenter Bob","year":"2017","journal-title":"Journal of Statistical Software"},{"key":"e_1_3_2_16_2","unstructured":"Robert Carver Michelle Everson John Gabrosek Nicholas Horton Robin Lock Megan Mocko Allan Rossman Ginger Holmes Roswell Paul Velleman Jeffrey Witmer and Beverly Wood. 2016. Guidelines for assessment and instruction in statistics education (GAISE) college report 2016. AMSTAT (2016)."},{"key":"e_1_3_2_17_2","unstructured":"Yunshun Chen Aaron T. L. Lun Davis J. McCarthy Matthew E. Ritchie Belinda Phipson Yifang Hu Xiaobei Zhou Mark D. Robinson and Gordon K. Smyth. 2020. Empirical analysis of digital gene expression data in R (v3.30.3). (2020). Retrieved September 16 2020 from https:\/\/bioconductor.org\/packages\/release\/bioc\/html\/edgeR.html."},{"key":"e_1_3_2_18_2","unstructured":"Yunshun Chen David McCarthy Matthew Ritchie Mark Robinson and Gordon Smyth. 2020. edgeR: Differential analysis of sequence read count data. (2020). 
Retrieved September 16 2020 from https:\/\/bioconductor.org\/packages\/release\/bioc\/vignettes\/edgeR\/inst\/doc\/edgeRUsersGuide.pdf."},{"key":"e_1_3_2_19_2","article-title":"Keras","author":"Chollet Fran\u00e7ois","year":"2015","journal-title":"Retrieved from https:\/\/keras.io"},{"key":"e_1_3_2_20_2","doi-asserted-by":"crossref","unstructured":"Alexander Eiselmayer Chatchavan Wacharamanotham Michel Beaudouin-Lafon and Wendy Mackay. 2019. Touchstone2: An interactive environment for exploring trade-offs in HCI experiment design. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. (2019).","DOI":"10.1145\/3290605.3300447"},{"key":"e_1_3_2_21_2","first-page":"73","volume-title":"Proceedings of the 2012 20th IEEE International Conference on Program Comprehension","author":"Feigenspan Janet","year":"2012"},{"key":"e_1_3_2_22_2","unstructured":"Jerome Friedman Trevor Hastie Rob Tibshirani Balasubramanian Narasimhan Kenneth Tay Noah Simon and Junyang Qian. 2020. Package \u201cglmnet\u201d. (2020). 
Retrieved September 16 2020 from https:\/\/cran.r-project.org\/web\/packages\/glmnet\/index.html."},{"issue":"2","key":"e_1_3_2_23_2","doi-asserted-by":"crossref","first-page":"389","DOI":"10.1111\/rssa.12378","article-title":"Visualization in Bayesian workflow","volume":"182","author":"Gabry Jonah","year":"2019","journal-title":"Journal of the Royal Statistical Society: Series A (Statistics in Society)"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1201\/b16018"},{"key":"e_1_3_2_25_2","article-title":"The garden of forking paths: Why multiple comparisons can be a problem, even when there is no \u201cfishing expedition\u201d or \u201cp-hacking\u201d and the research hypothesis was posited ahead of time","author":"Gelman Andrew","year":"2013","journal-title":"Department of Statistics, Columbia University"},{"key":"e_1_3_2_26_2","article-title":"Bayesian workflow","author":"Gelman Andrew","year":"2020","journal-title":"arXiv:2011.01808"},{"key":"e_1_3_2_27_2","unstructured":"LLC. GraphPad Software. 2020. GraphPad prism 8 user guide. (2020). Retrieved from https:\/\/www.graphpad.com\/guides\/prism\/8\/user-guide\/index.htm."},{"key":"e_1_3_2_28_2","article-title":"Quick list of useful R packages","author":"Grolemund Garrett","year":"2019","journal-title":"R Studio Support"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1111\/insr.12028"},{"key":"e_1_3_2_30_2","unstructured":"Jarrod Hadfield. 2020. Package \u201cMCMCglmm\u201d. (2020). Retrieved September 16 2020 from https:\/\/cran.r-project.org\/web\/packages\/MCMCglmm\/MCMCglmm.pdf."},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.18637\/jss.v033.i02"},{"key":"e_1_3_2_32_2","unstructured":"Trevor Hastie and Junyang Qian. 2014. Glmnet vignette. (2014). 
Retrieved September 16 2020 from https:\/\/web.stanford.edu\/hastie\/glmnet\/glmnet_alpha.html."},{"key":"e_1_3_2_33_2","doi-asserted-by":"crossref","first-page":"281","volume-title":"Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology","author":"Hempel Brian","year":"2019","DOI":"10.1145\/3332165.3347925"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/3729.001.0001"},{"issue":"3","key":"e_1_3_2_35_2","doi-asserted-by":"crossref","first-page":"188","DOI":"10.1002\/wics.162","article-title":"JMP statistical discovery software","volume":"3","author":"Jones Bradley","year":"2011","journal-title":"Wiley Interdisciplinary Reviews: Computational Statistics"},{"key":"e_1_3_2_36_2","unstructured":"Eric Jones Travis Oliphant and Pearu Peterson. 2001\u20132020. SciPy: Open source scientific tools for Python. Retrieved August 6 2020 from http:\/\/www.scipy.org\/."},{"key":"e_1_3_2_37_2","unstructured":"Eric Jones Travis Oliphant and Pearu Peterson. 2001\u20132020. Statistical functions (scipy.stats). Retrieved September 14 2020 from https:\/\/docs.scipy.org\/doc\/scipy\/reference\/stats.html."},{"key":"e_1_3_2_38_2","unstructured":"Eric Jones Travis Oliphant and Pearu Peterson. 2001\u20132020. Optimization and root finding (scipy.optimize). 
Retrieved September 14 2020 from https:\/\/docs.scipy.org\/doc\/scipy\/reference\/optimize.html."},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3332165.3347940"},{"key":"e_1_3_2_40_2","first-page":"1","volume-title":"Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems","author":"Kale Alex","year":"2019"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2012.219"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1207\/s15327957pspr0203_4"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1207\/s15516709cog1201_1"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1037\/0033-2909.125.5.524"},{"key":"e_1_3_2_45_2","first-page":"118","volume-title":"Expertise out of Context","author":"Klein Gary","year":"2007"},{"key":"e_1_3_2_46_2","article-title":"ANOVA\u2019s three types of estimating sums of squares: Don\u2019t make the wrong choice!","author":"Korstanje Joos","year":"2019","journal-title":"Towards Data Science, Medium"},{"key":"e_1_3_2_47_2","unstructured":"Max Kuhn Davis Vaughan and RStudio. 2020. parsnip: A Common API to Modeling and Analysis Functions. Retrieved from https:\/\/parsnip.tidymodels.org\/."},{"key":"e_1_3_2_48_2","volume-title":"Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles.","author":"Kuhn Max","year":"2020"},{"issue":"3","key":"e_1_3_2_49_2","first-page":"141","article-title":"Robust modeling in cognitive science","volume":"2","author":"Lee Michael D.","year":"2019","journal-title":"Computational Brain and Behavior"},{"issue":"1","key":"e_1_3_2_50_2","first-page":"66","article-title":"Understanding the role of alternatives in data analysis practices","volume":"26","author":"Liu Jiali","year":"2019","journal-title":"IEEE Transactions on Visualization and Computer Graphics"},{"key":"e_1_3_2_51_2","doi-asserted-by":"crossref","unstructured":"Yang Liu Tim Althoff and Jeffrey Heer. 
2020. Paths explored paths omitted paths obscured: Decision points & selective reporting in end-to-end data analysis. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1\u201314.","DOI":"10.1145\/3313831.3376533"},{"key":"e_1_3_2_52_2","unstructured":"StataCorp LLC. 2020. Language syntax. Retrieved September 16 2020 from https:\/\/www.stata.com\/manuals13\/u11.pdf."},{"key":"e_1_3_2_53_2","unstructured":"StataCorp LLC. 2020. Stata 16 Documentation. Retrieved from https:\/\/www.stata.com\/features\/documentation\/."},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1023\/A:1008929526011"},{"key":"e_1_3_2_55_2","unstructured":"Arni Magnusson Hans Skaug Anders Nielsen Casper Berg Kasper Kristensen Martin Maechler Koen van Bentham Ben Bolker Nafis Sadat Daniel L\u00fcdecke Russ Lenth Joseph O\u2019Brien and Mollie Brooks. 2020. Package \u201cglmmTMB\u201d. (2020). Retrieved September 16 2020 from https:\/\/cran.r-project.org\/web\/packages\/glmmTMB\/index.html."},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1201\/9780429029608"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.5555\/1095704"},{"issue":"12","key":"e_1_3_2_58_2","doi-asserted-by":"crossref","first-page":"1696","DOI":"10.1177\/0956797619879441","article-title":"Development of holistic episodic recollection","volume":"30","author":"Ngo Chi T.","year":"2019","journal-title":"Psychological Science"},{"key":"e_1_3_2_59_2","doi-asserted-by":"crossref","unstructured":"Donald A. Norman. 1986. User centered system design: New perspectives on human-computer interaction. CRC Press.","DOI":"10.1201\/b15703"},{"key":"e_1_3_2_60_2","unstructured":"University of Amsterdam. 2020. JASP: A Fresh Way to do Statistics. 
Retrieved September 16 2020 from https:\/\/jasp-stats.org\/."},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.5555\/1953048.2078195"},{"key":"e_1_3_2_62_2","unstructured":"Josef Perktold Skipper Seabold Jonathan Taylor and statsmodels developers. 2020. Statsmodels v0.10.2 reference guide. (2020). Retrieved April 1 2021 from https:\/\/www.statsmodels.org\/stable."},{"key":"e_1_3_2_63_2","first-page":"171","article-title":"Statistical thinking: One statistician\u2019s perspective","author":"Pfannkuch M.","year":"1997","journal-title":"Research Papers on Stochastics Education"},{"issue":"2","key":"e_1_3_2_64_2","first-page":"132","article-title":"Statistical thinking and statistical practice: Themes gleaned from professional statisticians","volume":"15","author":"Pfannkuch Maxine","year":"2000","journal-title":"Statistical Science"},{"key":"e_1_3_2_65_2","unstructured":"Jos\u00e9 Pinheiro Douglas Bates Saikat DebRoy Deepayan Sarkar EISPACK authors Siem Heisterkamp Bert Van Willigen and R-core. 2020. Package \u201cnlme\u201d. (2020). Retrieved September 16 2020 from https:\/\/cran.r-project.org\/web\/packages\/nlme\/nlme.pdf."},{"key":"e_1_3_2_66_2","first-page":"2","volume-title":"Proceedings of International Conference on Intelligence Analysis","volume":"5","author":"Pirolli Peter","year":"2005"},{"key":"e_1_3_2_67_2","article-title":"Top python libraries used in data science","author":"Prabhu Tanu N.","year":"2019","journal-title":"Towards Data Science, Medium"},{"key":"e_1_3_2_68_2","unstructured":"Brian Ripley Bill Venables Douglas M. Bates Kurt Hornik Albrecht Gebhardt and David Firth. 2020. Package \u201cMASS\u201d. (2020). Retrieved September 16 2020 from https:\/\/cran.r-project.org\/web\/packages\/MASS\/MASS.pdf."},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1145\/169059.169209"},{"key":"e_1_3_2_70_2","doi-asserted-by":"publisher","DOI":"10.7717\/peerj-cs.55"},{"key":"e_1_3_2_71_2","unstructured":"SAS. 2020. JMP. 
Retrieved September 16 2020 from https:\/\/www.jmp.com\/en_us\/home.html."},{"key":"e_1_3_2_72_2","first-page":"106","volume-title":"Proceedings of the 17th Annual Conference of the Cognitive Science Society","author":"Schunn Christian D.","year":"1995"},{"key":"e_1_3_2_73_2","first-page":"25","volume-title":"Proceedings of the 18th Annual Conference of the Cognitive Science Society: July 12\u201315, 1996, University of California, San Diego","volume":"18","author":"Schunn Christian D.","year":"1996"},{"key":"e_1_3_2_74_2","unstructured":"scikit-learn developers. 2020. Scikit-learn v0.23.2 documentation. (2020). Retrieved November 20 2020 from https:\/\/scikit-learn.org\/stable\/."},{"key":"e_1_3_2_75_2","first-page":"61","volume-title":"Proceedings of the 9th Python in Science Conference","volume":"57","author":"Seabold Skipper","year":"2010"},{"key":"e_1_3_2_76_2","unstructured":"IBM SPSS. [n.d.]. SPSS Software. Retrieved August 18 2020 from https:\/\/www.ibm.com\/analytics\/spss-statistics-software."},{"key":"e_1_3_2_77_2","doi-asserted-by":"crossref","unstructured":"Stata. [n.d.]. Stata Software. Retrieved September 14 2020 from https:\/\/www.stata.com\/.","DOI":"10.4324\/9781003149286-3"},{"key":"e_1_3_2_78_2","unstructured":"Michael Suh. 2014. Higher Education Gender & Work Dataset. Retrieved September 16 2020 from https:\/\/www.pewsocialtrends.org\/category\/datasets\/?download=20041."},{"key":"e_1_3_2_79_2","article-title":"Package \u2018stats\u2019 v4.1.0","author":"Team R Core","year":"2020","journal-title":"CRAN"},{"key":"e_1_3_2_80_2","unstructured":"Inc. The MathWorks. 2020. Matlab. Retrieved from https:\/\/www.mathworks.com\/."},{"key":"e_1_3_2_81_2","unstructured":"Inc. The MathWorks. 2020. Statistics and machine learning toolbox. (2020). 
Retrieved from https:\/\/www.mathworks.com\/help\/stats\/index.html."},{"key":"e_1_3_2_82_2","doi-asserted-by":"publisher","DOI":"10.1145\/3242587.3242663"},{"key":"e_1_3_2_83_2","doi-asserted-by":"crossref","first-page":"2693","volume-title":"Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems","author":"Wacharamanotham Chat","year":"2015","DOI":"10.1145\/2702123.2702347"},{"issue":"1","key":"e_1_3_2_84_2","first-page":"1","article-title":"Tidy data","volume":"59","author":"Wickham Hadley","year":"2014","journal-title":"Journal of Statistical Software"},{"key":"e_1_3_2_85_2","doi-asserted-by":"publisher","DOI":"10.21105\/joss.01686"},{"issue":"3","key":"e_1_3_2_86_2","doi-asserted-by":"crossref","first-page":"223","DOI":"10.1111\/j.1751-5823.1999.tb00442.x","article-title":"Statistical thinking in empirical enquiry","volume":"67","author":"Wild Chris J.","year":"1999","journal-title":"International Statistical Review"},{"key":"e_1_3_2_87_2","article-title":"Goals, process, and challenges of exploratory data analysis: An interview study","author":"Wongsuphasawat Kanit","year":"2019","journal-title":"arXiv:1911.00568"},{"issue":"8","key":"e_1_3_2_88_2","doi-asserted-by":"crossref","first-page":"3920","DOI":"10.1073\/pnas.1901326117","article-title":"Veridical data science","volume":"117","author":"Yu Bin","year":"2020","journal-title":"Proceedings of the National Academy of Sciences"}],"container-title":["ACM Transactions on Computer-Human 
Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3476980","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,1,7]],"date-time":"2022-01-07T14:55:25Z","timestamp":1641567325000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3476980"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,2,28]]},"references-count":87,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,2,28]]}},"alternative-id":["10.1145\/3476980"],"URL":"http:\/\/dx.doi.org\/10.1145\/3476980","relation":{},"ISSN":["1073-0516","1557-7325"],"issn-type":[{"value":"1073-0516","type":"print"},{"value":"1557-7325","type":"electronic"}],"subject":["Human-Computer Interaction"],"published":{"date-parts":[[2022,2,28]]}}}