{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,19]],"date-time":"2026-06-19T23:56:06Z","timestamp":1781913366033,"version":"3.54.5"},"reference-count":77,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2022,1,16]],"date-time":"2022-01-16T00:00:00Z","timestamp":1642291200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Comput.-Hum. Interact."],"published-print":{"date-parts":[[2022,4,30]]},"abstract":"<jats:p>Computational notebooks allow data scientists to express their ideas through a combination of code and documentation. However, data scientists often pay attention only to the code, and neglect creating or updating their documentation during quick iterations. Inspired by human documentation practices learned from 80 highly-voted Kaggle notebooks, we design and implement Themisto, an automated documentation generation system to explore how human-centered AI systems can support human data scientists in the machine learning code documentation scenario. Themisto facilitates the creation of documentation via three approaches: a deep-learning-based approach to generate documentation for source code, a query-based approach to retrieve online API documentation for source code, and a user prompt approach to nudge users to write documentation. 
We evaluated Themisto in a within-subjects experiment with 24 data science practitioners, and found that automated documentation generation techniques reduced the time for writing documentation, reminded participants to document code they would have ignored, and improved participants\u2019 satisfaction with their computational notebook.<\/jats:p>","DOI":"10.1145\/3489465","type":"journal-article","created":{"date-parts":[[2022,1,16]],"date-time":"2022-01-16T08:26:51Z","timestamp":1642321611000},"page":"1-33","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":74,"title":["Documentation Matters: Human-Centered AI System to Assist Data Science Code Documentation in Computational Notebooks"],"prefix":"10.1145","volume":"29","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8724-4662","authenticated-orcid":false,"given":"April Yi","family":"Wang","sequence":"first","affiliation":[{"name":"University of Michigan, Ann Arbor, MI, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dakuo","family":"Wang","sequence":"additional","affiliation":[{"name":"IBM Research, Cambridge, MA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jaimie","family":"Drozdal","sequence":"additional","affiliation":[{"name":"Rensselaer Polytechnic Institute, Troy, NY, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Michael","family":"Muller","sequence":"additional","affiliation":[{"name":"IBM Research, Cambridge, MA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Soya","family":"Park","sequence":"additional","affiliation":[{"name":"MIT CSAIL, Cambridge, MA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Justin D.","family":"Weisz","sequence":"additional","affiliation":[{"name":"IBM Research, Cambridge, MA, 
USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Xuye","family":"Liu","sequence":"additional","affiliation":[{"name":"Rensselaer Polytechnic Institute, Troy, NY, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lingfei","family":"Wu","sequence":"additional","affiliation":[{"name":"IBM Research, Cambridge, MA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Casey","family":"Dugan","sequence":"additional","affiliation":[{"name":"IBM Research, Cambridge, MA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2022,1,16]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"crossref","unstructured":"Rajas Agashe Srinivasan Iyer and Luke Zettlemoyer. 2019. Juice: A large scale distantly supervised dataset for open domain context-based code generation. arXiv:1910.02216. Retrieved from https:\/\/arxiv.org\/abs\/1910.02216.","DOI":"10.18653\/v1\/D19-1546"},{"key":"e_1_3_3_3_2","unstructured":"Uri Alon Shaked Brody Omer Levy and Eran Yahav. 2018. code2seq: Generating sequences from structured representations of code. arXiv:1808.01400. 
Retrieved from https:\/\/arxiv.org\/abs\/1808.01400."},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1147\/JRD.2019.2942288"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3210713.3210745"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300234"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3313831.3376729"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/1085313.1085331"},{"key":"e_1_3_3_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/HICSS.2009.343"},{"key":"e_1_3_3_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3377325.3377501"},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPC.2013.6613829"},{"issue":"8","key":"e_1_3_3_12_2","first-page":"9","article-title":"Energy generation prediction: Lessons learned from the use of kaggle in machine learning course","volume":"7","author":"Fernandez-Bes Jesus","year":"2016","unstructured":"Jesus Fernandez-Bes, Jer\u00f3nimo Arenas-Garc\u00eda, and Jes\u00fas Cid-Sueiro. 2016. Energy generation prediction: Lessons learned from the use of kaggle in machine learning course. Group 7, 8 (2016), 9.","journal-title":"Group"},{"key":"e_1_3_3_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/2460999.2461003"},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458723"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10606-018-9333-1"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300500"},{"key":"e_1_3_3_17_2","unstructured":"Sarah Holland Ahmed Hosny Sarah Newman Joshua Joseph and Kasia Chmielinski. 2018. The dataset nutrition label: A framework to drive higher data quality standards. arXiv:1805.03677. 
Retrieved from https:\/\/arxiv.org\/abs\/1805.03677."},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3290607.3312897"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/302979.303030"},{"key":"e_1_3_3_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3196321.3196334"},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1195"},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-1195"},{"key":"e_1_3_3_23_2","volume-title":"JupyterLab: The Next Generation of the Jupyter Notebook","author":"Jupyter Project","year":"2016","unstructured":"Project Jupyter. 2016. JupyterLab: The Next Generation of the Jupyter Notebook. Retrieved 01 September, 2021 from https:\/\/blog.jupyter.org\/jupyterlab-the-next-generation-of-the-jupyter-notebook-5c949dabea3."},{"key":"e_1_3_3_24_2","unstructured":"Project Jupyter. 2015. Project Jupyter: Computational Narratives as the Engine of Collaborative Data Science. Retrieved September 15 2019 from https:\/\/blog.jupyter.org\/project-jupyter-computational-narratives-as-the-engine-of-collaborative-data-science-2b5fb94c3c58."},{"key":"e_1_3_3_25_2","doi-asserted-by":"publisher","DOI":"10.1023\/B:LIDA.0000048322.42751.ca"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300322"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/VLHCC.2017.8103446"},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3173748"},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3507473.3507479"},{"key":"e_1_3_3_30_2","doi-asserted-by":"crossref","unstructured":"Markus Konkol Daniel N\u00fcst and Laura Goulier. 2020. Publishing computational research\u2013a review of infrastructures for reproducible and transparent scholarly communication. arXiv:2001.00484. 
Retrieved from https:\/\/arxiv.org\/abs\/2001.00484.","DOI":"10.5194\/egusphere-egu2020-17013"},{"key":"e_1_3_3_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3476052"},{"key":"e_1_3_3_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/VL\/HCC50065.2020.9127201"},{"key":"e_1_3_3_33_2","doi-asserted-by":"crossref","unstructured":"Alexander LeClair Sakib Haque Lingfei Wu and Collin McMillan. 2020. Improved code summarization via a graph neural network. arXiv:2004.02843. Retrieved from https:\/\/arxiv.org\/abs\/2004.02843.","DOI":"10.1145\/3387904.3389268"},{"issue":"1","key":"e_1_3_3_34_2","first-page":"66","article-title":"Understanding the role of alternatives in data analysis practices","volume":"26","author":"Liu Jiali","year":"2019","unstructured":"Jiali Liu, Nadia Boukhelifa, and James R. Eagan. 2019. Understanding the role of alternatives in data analysis practices. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2019), 66\u201376.","journal-title":"IEEE Transactions on Visualization and Computer Graphics"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i04.5926"},{"key":"e_1_3_3_36_2","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: Findings","author":"Liu Xuye","year":"2021","unstructured":"Xuye Liu, Dakuo Wang, April Yi Wang, Yufang Hou, and Lingfei Wu. 2021. HAConvGNN: Hierarchical attention based convolutional graph neural network for code documentation generation in jupyter notebooks. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: Findings."},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3313831.3376739"},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41559-017-0160"},{"key":"e_1_3_3_39_2","doi-asserted-by":"crossref","unstructured":"Minh-Thang Luong Hieu Pham and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. 
arXiv:1508.04025. Retrieved from https:\/\/arxiv.org\/abs\/1508.04025.","DOI":"10.18653\/v1\/D15-1166"},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2013.12"},{"issue":"4","key":"e_1_3_3_41_2","first-page":"34","article-title":"How human\u2013computer\u2018Superminds\u2019 are redefining the future of work","volume":"59","author":"Malone Thomas W.","year":"2018","unstructured":"Thomas W. Malone. 2018. How human\u2013computer\u2018Superminds\u2019 are redefining the future of work. MIT Sloan Management Review 59, 4 (2018), 34\u201341.","journal-title":"MIT Sloan Management Review"},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3287560.3287596"},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300356"},{"key":"e_1_3_3_44_2","unstructured":"Michael Muller April Yi Wang Steven I. Ross Justin D. Weisz Mayank Agarwal Kartik Talamadupula Stephanie Houde Fernando Martinez John Richards Jaimie Drozdal Xuye Liu David Piorkowski and Dakuo Wang. 2021. 
How data scientists improve generated code documentation in jupyter notebooks."},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/2207676.2208664"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE.2009.5070533"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.3115\/1073083.1073135"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/3468.844354"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3266037.3266098"},{"key":"e_1_3_3_50_2","doi-asserted-by":"publisher","DOI":"10.1038\/d41586-018-07196-1"},{"key":"e_1_3_3_51_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-40593-3_13"},{"key":"e_1_3_3_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3449205"},{"key":"e_1_3_3_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3300107"},{"key":"e_1_3_3_54_2","doi-asserted-by":"crossref","unstructured":"Marco Tulio Ribeiro Tongshuang Wu Carlos Guestrin and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with checklist. arXiv:2005.04118. Retrieved from https:\/\/arxiv.org\/abs\/2005.04118.","DOI":"10.18653\/v1\/2020.acl-main.442"},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.5555\/2337223.2337254"},{"key":"e_1_3_3_56_2","doi-asserted-by":"crossref","unstructured":"Adam Rule Amanda Birmingham Cristal Zuniga Ilkay Altintas Shih-Cheng Huang Rob Knight Niema Moshiri Mai H. Nguyen Sara Brin Rosenthal Fernando P\u00e9rez and Peter W. Rose. 2019. Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks.","DOI":"10.1371\/journal.pcbi.1007007"},{"key":"e_1_3_3_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3274419"},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3173606"},{"key":"e_1_3_3_59_2","doi-asserted-by":"crossref","unstructured":"Jeffrey Saltz Kevin Crowston and Ivan Shamshurin. 2017. 
Comparing data science project management methodologies via a controlled experiment. In Proceedings of the Hawaii International Conference on System Sciences.","DOI":"10.24251\/HICSS.2017.120"},{"key":"e_1_3_3_60_2","volume-title":"Proceedings of the 17th International Semantic Web Conference","author":"Samuel Sheeba","year":"2018","unstructured":"Sheeba Samuel and Birgitta K\u00f6nig-Ries. 2018. ProvBook: Provenance-based semantic enrichment of interactive notebooks for reproducibility. In Proceedings of the 17th International Semantic Web Conference."},{"key":"e_1_3_3_61_2","unstructured":"Sheeba Samuel and Birgitta K\u00f6nig-Ries. 2020. ReproduceMeGit: A visualization tool for analyzing reproducibility of jupyter notebooks. arXiv:2006.12110. Retrieved from https:\/\/arxiv.org\/abs\/2006.12110."},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.im.2019.103174"},{"key":"e_1_3_3_63_2","doi-asserted-by":"publisher","DOI":"10.5555\/1987434.1987473"},{"key":"e_1_3_3_64_2","doi-asserted-by":"publisher","DOI":"10.1080\/10447318.2020.1741118"},{"key":"e_1_3_3_65_2","doi-asserted-by":"publisher","DOI":"10.1145\/1858996.1859006"},{"key":"e_1_3_3_66_2","doi-asserted-by":"publisher","DOI":"10.1145\/3313831.3376740"},{"key":"e_1_3_3_67_2","doi-asserted-by":"publisher","DOI":"10.1145\/2818052.2874352"},{"key":"e_1_3_3_68_2","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445526"},{"key":"e_1_3_3_69_2","unstructured":"Dakuo Wang Q. Vera Liao Yunfeng Zhang Udayan Khurana Horst Samulowitz Soya Park Michael Muller and Lisa Amini. 2021. How much automation does a data scientist want?. arXiv:2101.03970. 
Retrieved from https:\/\/arxiv.org\/abs\/2101.03970."},{"key":"e_1_3_3_70_2","doi-asserted-by":"publisher","DOI":"10.1145\/3379336.3381474"},{"key":"e_1_3_3_71_2","doi-asserted-by":"publisher","DOI":"10.1145\/3359313"},{"key":"e_1_3_3_72_2","doi-asserted-by":"publisher","DOI":"10.1145\/3377325.3377538"},{"key":"e_1_3_3_73_2","doi-asserted-by":"publisher","DOI":"10.1145\/3301275.3302290"},{"key":"e_1_3_3_74_2","doi-asserted-by":"publisher","DOI":"10.1109\/VDS48975.2019.8973385"},{"key":"e_1_3_3_75_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVCG.2018.2864836"},{"key":"e_1_3_3_76_2","unstructured":"Kun Xu Lingfei Wu Zhiguo Wang Yansong Feng Michael Witbrock and Vadim Sheinin. 2018. Graph2seq: Graph to sequence learning with attention-based neural networks. arXiv:1804.00823. Retrieved from https:\/\/arxiv.org\/abs\/1804.00823."},{"key":"e_1_3_3_77_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.compedu.2020.104059"},{"key":"e_1_3_3_78_2","doi-asserted-by":"publisher","DOI":"10.1145\/3392826"}],"container-title":["ACM Transactions on Computer-Human 
Interaction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3489465","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3489465","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:18:39Z","timestamp":1750191519000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3489465"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,1,16]]},"references-count":77,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,4,30]]}},"alternative-id":["10.1145\/3489465"],"URL":"https:\/\/doi.org\/10.1145\/3489465","relation":{},"ISSN":["1073-0516","1557-7325"],"issn-type":[{"value":"1073-0516","type":"print"},{"value":"1557-7325","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,1,16]]},"assertion":[{"value":"2021-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-09-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-01-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}