{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,4]],"date-time":"2026-06-04T11:09:37Z","timestamp":1780571377231,"version":"3.54.1"},"reference-count":96,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2022,11,19]],"date-time":"2022-11-19T00:00:00Z","timestamp":1668816000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,11,19]],"date-time":"2022-11-19T00:00:00Z","timestamp":1668816000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001711","name":"Schweizerischer Nationalfonds zur F\u00f6rderung der Wissenschaftlichen Forschung","doi-asserted-by":"publisher","id":[{"id":"10.13039\/501100001711","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100006447","name":"University of Zurich","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100006447","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Empir Software Eng"],"published-print":{"date-parts":[[2023,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, we need further empirical evidence to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, inform tooling support. To obtain a deeper understanding of the iterative and explorative nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. 
We focused on the extent to which data scientists transition between different types of data science activities, or <jats:italic>steps<\/jats:italic> (such as data preprocessing and modelling), as well as the frequency and co-occurrence of such transitions. For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using the first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns. We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigate the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted data set and (b) the data set labelled by experts, finding an F1-score of about 71% for the 10-class data science step prediction problem.<\/jats:p>","DOI":"10.1007\/s10664-022-10229-z","type":"journal-article","created":{"date-parts":[[2022,11,19]],"date-time":"2022-11-19T09:03:59Z","timestamp":1668848639000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":19,"title":["Workflow analysis of data science code in public GitHub 
repositories"],"prefix":"10.1007","volume":"28","author":[{"given":"Dhivyabharathi","family":"Ramasamy","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Cristina","family":"Sarasua","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Alberto","family":"Bacchelli","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Abraham","family":"Bernstein","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2022,11,19]]},"reference":[{"key":"10229_CR1","doi-asserted-by":"crossref","unstructured":"Aggarwal C, Bouneffouf D, Samulowitz H, Buesser B, Hoang T, Khurana U, Liu S, Pedapati T, Ram P, Rawat A, Wistuba M, Gray A (2019) How can ai automate end-to-end data science?arXiv:1910.14436","DOI":"10.1109\/IJCNN48605.2020.9207453"},{"key":"10229_CR2","doi-asserted-by":"publisher","DOI":"10.1201\/9780429258589","volume-title":"Practical statistics for medical research","author":"DG Altman","year":"1990","unstructured":"Altman DG (1990) Practical statistics for medical research. CRC press, Florida"},{"key":"10229_CR3","doi-asserted-by":"crossref","unstructured":"Aragon C, Hutto C, Echenique A, Fiore-Gartland B, Huang Y, Kim J, Neff G, Xing W, Bayer J (2016) Developing a research agenda for human-centered data science. In: Proceedings of the 19th ACM conference on computer supported cooperative work and social computing companion, pp 529\u2013535","DOI":"10.1145\/2818052.2855518"},{"key":"10229_CR4","doi-asserted-by":"crossref","unstructured":"Bacchelli A, Dal Sasso T, D\u2019Ambros M, Lanza M (2012) Content classification of development emails","DOI":"10.1109\/ICSE.2012.6227177"},{"key":"10229_CR5","unstructured":"Barstad V, Goodwin M, Gj\u00f8s\u00e6ter T (2014) Predicting source code quality with static analysis and machine learning. 
In: Norsk IKT-konferanse for forskning og utdanning"},{"key":"10229_CR6","unstructured":"Bennett KP, Erickson JS, de Los Santos H, Norris S, Patton E, Sheehan J, McGuinness DL (2016) Data analytics as data: a semantic workflow approach. In: Proc of artificial intelligence for data science workshop at neural information processing systems (NIPS), Barcelona, Spain"},{"issue":"1","key":"10229_CR7","doi-asserted-by":"publisher","first-page":"26","DOI":"10.1145\/1656274.1656280","volume":"11","author":"MR Berthold","year":"2009","unstructured":"Berthold MR, Cebron N, Dill F, Gabriel TR, K\u00f6tter T, Meinl T, Ohl P, Thiel K, Wiswedel B (2009) Knime-the konstanz information miner: version 2.0 and beyond. ACM SIGKDD Explor Newsl 11(1):26\u201331","journal-title":"ACM SIGKDD Explor Newsl"},{"key":"10229_CR8","volume-title":"AntiPatterns: refactoring software, architectures and projects in crisis","author":"WH Brown","year":"1998","unstructured":"Brown WH, Malveau RC, McCormick HWS, Mowbray TJ (1998) AntiPatterns: refactoring software, architectures and projects in crisis. Wiley, New Jersey"},{"key":"10229_CR9","unstructured":"Carvalho LA, Wang R, Gil Y, Garijo D (2017) Niw: converting notebooks into workflows to capture dataflow and provenance. In: K-CAP workshops, pp 12\u201316"},{"key":"10229_CR10","doi-asserted-by":"crossref","unstructured":"Carvalho LAM, Garijo D, Medeiros CB, Gil Y (2018) Semantic software metadata for workflow exploration and evolution. In: 2018 IEEE 14th International Conference on e-Science (e-Science), IEEE, pp 431\u2013441","DOI":"10.1109\/eScience.2018.00132"},{"key":"10229_CR11","unstructured":"Chan DK, Leung KR (1997) A workflow vista of the software process. In: Database and expert systems applications. 
8th international conference, DEXA\u201997 Proceedings, IEEE, pp 62\u201367"},{"key":"10229_CR12","doi-asserted-by":"crossref","unstructured":"Chattopadhyay S, Prasad I, Henley AZ, Sarma A, Barik T (2020) What\u2019s wrong with computational notebooks? pain points, needs, and design opportunities. In: Proceedings of the 2020 CHI conference on human factors in computing systems, pp 1\u201312","DOI":"10.1145\/3313831.3376729"},{"key":"10229_CR13","doi-asserted-by":"publisher","first-page":"37","DOI":"10.1177\/001316446002000104","volume":"20","author":"J Cohen","year":"1960","unstructured":"Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20:37\u201346","journal-title":"Educ Psychol Meas"},{"issue":"11","key":"10229_CR14","doi-asserted-by":"publisher","first-page":"684","DOI":"10.1016\/j.sysarc.2006.06.012","volume":"52","author":"A Colombo","year":"2006","unstructured":"Colombo A, Damiani E, Gianini G (2006) Discovering the software process by means of stochastic workflow analysis. J Syst Archit 52(11):684\u2013692","journal-title":"J Syst Archit"},{"key":"10229_CR15","unstructured":"Desmond Y (2020) Structuring jupyter notebooks for fast and iterative machine learning experiments. https:\/\/towardsdatascience.com\/, Accessed on 01 Jan 2021"},{"key":"10229_CR16","doi-asserted-by":"publisher","unstructured":"Dong H, Zhou S, Guo JL, K\u00e4stner C (2021) Splitting, renaming, removing: a study of common cleaning activities in jupyter notebooks. In: 2021 36th IEEE\/ACM international conference on automated software engineering workshops (ASEW), pp 114\u2013119. https:\/\/doi.org\/10.1109\/ASEW52652.2021.00032","DOI":"10.1109\/ASEW52652.2021.00032"},{"key":"10229_CR17","unstructured":"Drori I, Krishnamurthy Y, Rampin R, Lourenco RdP, Ono JP, Cho K, Silva C, Freire J (2021) Alphad3m: machine learning pipeline synthesis. 
arXiv:211102508"},{"key":"10229_CR18","volume-title":"Refactoring: improving the design of existing code","author":"M Fowler","year":"2018","unstructured":"Fowler M (2018) Refactoring: improving the design of existing code. Addison-Wesley Professional, Boston"},{"key":"10229_CR19","doi-asserted-by":"publisher","first-page":"338","DOI":"10.1016\/j.future.2013.09.018","volume":"36","author":"D Garijo","year":"2013","unstructured":"Garijo D, Alper P, Belhajjame K, Corcho O, Gil Y, Goble C (2013a) Common motifs in scientific workflows: an empirical analysis. Future Gener Comput Syst 36:338\u2013351. https:\/\/doi.org\/10.1016\/j.future.2013.09.018","journal-title":"Future Gener Comput Syst"},{"key":"10229_CR20","doi-asserted-by":"crossref","unstructured":"Garijo D, Corcho O, Gil Y (2013b) Detecting common scientific workflow fragments using templates and execution provenance. In: Proceedings of the seventh international conference on Knowledge capture, pp 33\u201340","DOI":"10.1145\/2479832.2479848"},{"key":"10229_CR21","unstructured":"Gelman A, Loken E (2013) The garden of forking paths: why multiple comparisons can be a problem, even when there is no \u201cfishing expedition\u201d or \u201cp-hacking\u201d and the research hypothesis was posited ahead of time. Dep Stat Columbia Univ 348"},{"issue":"1","key":"10229_CR22","doi-asserted-by":"publisher","first-page":"62","DOI":"10.1109\/MIS.2010.9","volume":"26","author":"Y Gil","year":"2010","unstructured":"Gil Y, Ratnakar V, Kim J, Gonzalez-Calero P, Groth P, Moody J, Deelman E (2010) Wings: intelligent workflow-based design of computational experiments. IEEE Intell Syst 26(1):62\u201372","journal-title":"IEEE Intell Syst"},{"key":"10229_CR23","unstructured":"Guo PJ, Seltzer M (2012) Burrito: wrapping your lab notebook in computational infrastructure. In: Proceedings of the 4th USENIX conference on theory and practice of provenance, TaPP\u201912. 
USENIX Association, USA, p 7"},{"key":"10229_CR24","doi-asserted-by":"publisher","unstructured":"Head A, Hohman F, Barik T, Drucker SM, DeLine R (2019) Managing messes in computational notebooks. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI \u201919. https:\/\/doi.org\/10.1145\/3290605.3300500. Association for Computing Machinery, New York, pp 1\u201312","DOI":"10.1145\/3290605.3300500"},{"key":"10229_CR25","doi-asserted-by":"crossref","unstructured":"Heffetz Y, Vainshtein R, Katz G, Rokach L (2020) Deepline: automl tool for pipelines generation using deep reinforcement learning and hierarchical actions filtering. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pp 2103\u20132113","DOI":"10.1145\/3394486.3403261"},{"key":"10229_CR26","doi-asserted-by":"crossref","unstructured":"Hern\u00e1ndez-Orallo J, Vold K (2019) Ai extenders: The ethical and societal implications of humans cognitively extended by ai. In: Proceedings of the 2019 AAAI\/ACM Conference on AI, Ethics, and Society, pp 507\u2013513","DOI":"10.1145\/3306618.3314238"},{"key":"10229_CR27","doi-asserted-by":"publisher","DOI":"10.1201\/b16023","volume-title":"RapidMiner: data mining use cases and business analytics applications","author":"M Hofmann","year":"2016","unstructured":"Hofmann M, Klinkenberg R (2016) RapidMiner: data mining use cases and business analytics applications. CRC Press, Florida"},{"key":"10229_CR28","doi-asserted-by":"crossref","unstructured":"Hohman F, Wongsuphasawat K, Kery MB, Patel K (2020) Understanding and visualizing data iteration in machine learning. In: Proceedings of the 2020 CHI conference on human factors in computing systems, pp 1\u201313","DOI":"10.1145\/3313831.3376177"},{"key":"10229_CR29","unstructured":"Jupyter P (2015) Project jupyter: computational narratives as the engine of collaborative data science. 
https:\/\/blog.jupyter.org\/"},{"key":"10229_CR30","doi-asserted-by":"crossref","unstructured":"K\u00e4ll\u00e9n M, Wrigstad T (2020) Jupyter notebooks on github: characteristics and code clones. arXiv:200710146","DOI":"10.22152\/programming-journal.org\/2021\/5\/15"},{"issue":"12","key":"10229_CR31","doi-asserted-by":"publisher","first-page":"2917","DOI":"10.1109\/TVCG.2012.219","volume":"18","author":"S Kandel","year":"2012","unstructured":"Kandel S, Paepcke A, Hellerstein JM, Heer J (2012a) Enterprise data analysis and visualization: an interview study. IEEE Trans Vis Comput Graph 18 (12):2917\u20132926. https:\/\/doi.org\/10.1109\/TVCG.2012.219","journal-title":"IEEE Trans Vis Comput Graph"},{"issue":"12","key":"10229_CR32","doi-asserted-by":"publisher","first-page":"2917","DOI":"10.1109\/TVCG.2012.219","volume":"18","author":"S Kandel","year":"2012","unstructured":"Kandel S, Paepcke A, Hellerstein JM, Heer J (2012b) Enterprise data analysis and visualization: an interview study. IEEE Trans Vis Comput Graph 18 (12):2917\u20132926","journal-title":"IEEE Trans Vis Comput Graph"},{"key":"10229_CR33","doi-asserted-by":"crossref","unstructured":"Keith B, Vega V (2016) Process mining applications in software engineering. In: International conference on software process improvement, Springer, pp 47\u201356","DOI":"10.1007\/978-3-319-48523-2_5"},{"key":"10229_CR34","doi-asserted-by":"publisher","unstructured":"Kery MB, Horvath A, Myers B (2017) Variolite: supporting exploratory programming by data scientists. In: Proceedings of the 2017 CHI conference on human factors in computing systems, CHI \u201917. https:\/\/doi.org\/10.1145\/3025453.3025626. Association for Computing Machinery, New York, pp 1265\u20131276","DOI":"10.1145\/3025453.3025626"},{"key":"10229_CR35","doi-asserted-by":"publisher","unstructured":"Kery MB, Radensky M, Arya M, John BE, Myers BA (2018) The story in the notebook: exploratory data science using a literate programming tool. 
In: Proceedings of the 2018 CHI conference on human factors in computing systems, CHI \u201918. https:\/\/doi.org\/10.1145\/3173574.3173748. Association for Computing Machinery, New York, pp 1\u201311","DOI":"10.1145\/3173574.3173748"},{"key":"10229_CR36","doi-asserted-by":"publisher","unstructured":"Kery MB, John BE, O\u2019Flaherty P, Horvath A, Myers BA (2019) Towards effective foraging by data scientists to find past analysis choices. In: Proceedings of the 2019 CHI conference on human factors in computing systems, CHI \u201919. https:\/\/doi.org\/10.1145\/3290605.3300322. Association for Computing Machinery, New York, pp 1\u201313","DOI":"10.1145\/3290605.3300322"},{"key":"10229_CR37","doi-asserted-by":"publisher","unstructured":"Kim M, Zimmermann T, DeLine R, Begel A (2016) The emerging role of data scientists on software development teams. In: Proceedings of the 38th international conference on software engineering, ICSE \u201916. https:\/\/doi.org\/10.1145\/2884781.2884783. Association for Computing Machinery, New York, pp 96\u2013107","DOI":"10.1145\/2884781.2884783"},{"key":"10229_CR38","doi-asserted-by":"crossref","unstructured":"Knab P, Pinzger M, Bernstein A (2006) Predicting defect densities in source code files with decision tree learners. In: Proceedings of the 2006 international workshop on Mining software repositories, pp 119\u2013125","DOI":"10.1145\/1137983.1138012"},{"key":"10229_CR39","doi-asserted-by":"publisher","unstructured":"Koenzen AP, Ernst NA, Storey MAD (2020) Code duplication and reuse in jupyter notebooks. In: 2020 IEEE symposium on visual languages and human-centric computing (VL\/HCC), pp 1\u20139. https:\/\/doi.org\/10.1109\/VL\/HCC50065.2020.9127202","DOI":"10.1109\/VL\/HCC50065.2020.9127202"},{"key":"10229_CR40","doi-asserted-by":"crossref","unstructured":"Kr\u00e4mer JP, Karrer T, Kurz J, Wittenhagen M, Borchers J (2013) How tools in ides shape developers\u2019 navigation behavior. 
In: Proceedings of the SIGCHI conference on human factors in computing systems, pp 3073\u20133082","DOI":"10.1145\/2470654.2466419"},{"key":"10229_CR41","doi-asserted-by":"publisher","unstructured":"Kross S, Guo PJ (2019) Practitioners teaching data science in industry and academia: Expectations, workflows, and challenges. In: Proceedings of the 2019 CHI conference on human factors in computing systems, CHI \u201919. https:\/\/doi.org\/10.1145\/3290605.3300493. ACM, New York, pp 263:1\u2013263:14","DOI":"10.1145\/3290605.3300493"},{"key":"10229_CR42","doi-asserted-by":"crossref","unstructured":"Kun P, Mulder I, Kortuem G (2018) Design enquiry through data: appropriating a data science workflow for the design process. In: Proceedings of the 32nd international BCS human computer interaction conference, vol 32. pp 1\u201312","DOI":"10.14236\/ewic\/HCI2018.32"},{"key":"10229_CR43","doi-asserted-by":"crossref","unstructured":"LaToza TD, Myers BA (2010) Hard-to-answer questions about code. In: Evaluation and usability of programming languages and tools, pp 1\u20136","DOI":"10.1145\/1937117.1937125"},{"key":"10229_CR44","unstructured":"Lee A, Xin D, Lee D, Parameswaran A (2020) Demystifying a dark art: understanding real-world machine learning model development. arXiv:200501520"},{"key":"10229_CR45","doi-asserted-by":"publisher","first-page":"703","DOI":"10.1038\/nmeth.3968","volume":"13","author":"J Lever","year":"2016","unstructured":"Lever J, Krzywinski M, Altman NS (2016) Points of significance: Model selection and overfitting. Nat Methods 13:703\u2013704","journal-title":"Nat Methods"},{"issue":"4","key":"10229_CR46","doi-asserted-by":"publisher","first-page":"457","DOI":"10.1007\/s10723-015-9329-8","volume":"13","author":"J Liu","year":"2015","unstructured":"Liu J, Pacitti E, Valduriez P, Mattoso M (2015) A survey of data-intensive scientific workflow management. 
J Grid Comput 13(4):457\u2013493","journal-title":"J Grid Comput"},{"key":"10229_CR47","first-page":"66","volume":"26","author":"J Liu","year":"2020","unstructured":"Liu J, Boukhelifa N, Eagan JR (2020) Understanding the role of alternatives in data analysis practices. IEEE Trans Vis Comput Graph 26:66\u201376","journal-title":"IEEE Trans Vis Comput Graph"},{"key":"10229_CR48","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3361118","volume":"3","author":"Y Mao","year":"2019","unstructured":"Mao Y, Wang D, Muller MJ, Varshney KR, Baldini I, Dugan C, Mojsilovic A (2019) How data scientists work together with domain experts in scientific collaborations. Proc ACM Human-Comput Interact 3:1\u201323","journal-title":"Proc ACM Human-Comput Interact"},{"key":"10229_CR49","doi-asserted-by":"crossref","unstructured":"McCormick E, De Volder K (2004) Jquery: finding your way through tangled code. In: Companion to the 19th annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications, pp 9\u201310","DOI":"10.1145\/1028664.1028670"},{"key":"10229_CR50","doi-asserted-by":"crossref","unstructured":"Meena HK, Saha I, Mondal KK, Prabhakar T (2005) An approach to workflow modeling and analysis. In: Proceedings of the 2005 OOPSLA workshop on Eclipse technology eXchange, pp 85\u201389","DOI":"10.1145\/1117696.1117714"},{"key":"10229_CR51","doi-asserted-by":"publisher","DOI":"10.1016\/B978-0-12-804206-9.00001-5","volume-title":"Perspectives on data science for software engineering","author":"T Menzies","year":"2016","unstructured":"Menzies T, Williams L, Zimmermann T (2016) Perspectives on data science for software engineering. Morgan Kaufmann, Burlington"},{"key":"10229_CR52","unstructured":"Microsoft (2020) What is the team data science process?. https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/team-data-science-process\/overview, Accessed 1 Jan 2021"},{"key":"10229_CR53","volume-title":"The quant crunch. 
How the demand for data science skills is disrupting the job market","author":"S Miller","year":"2017","unstructured":"Miller S, Hughes D (2017) The quant crunch. How the demand for data science skills is disrupting the job market. Burning Glass Technologies, Boston"},{"key":"10229_CR54","doi-asserted-by":"crossref","unstructured":"Missier P, Soiland-Reyes S, Owen S, Tan W, Nenadic A, Dunlop I, Williams A, Oinn T, Goble C (2010) Taverna, reloaded. In: International conference on scientific and statistical database management, Springer, pp 471\u2013481","DOI":"10.1007\/978-3-642-13818-8_33"},{"issue":"11","key":"10229_CR55","doi-asserted-by":"publisher","first-page":"1905","DOI":"10.1080\/00140139408964957","volume":"37","author":"BM Muir","year":"1994","unstructured":"Muir BM (1994) Trust in automation: part i. theoretical issues in the study of trust and human intervention in automated systems. Ergonomics 37 (11):1905\u20131922","journal-title":"Ergonomics"},{"key":"10229_CR56","doi-asserted-by":"crossref","unstructured":"Muller M, Feinberg M, George T, Jackson SJ, John BE, Kery MB, Passi S (2019a) Human-centered study of data science work practices. In: Extended abstracts of the 2019 CHI conference on human factors in computing systems, pp 1\u20138","DOI":"10.1145\/3290607.3299018"},{"key":"10229_CR57","doi-asserted-by":"publisher","unstructured":"Muller M, Lange I, Wang D, Piorkowski D, Tsay J, Liao QV, Dugan C, Erickson T (2019b) How data science workers work with data: discovery, capture, curation, design, creation. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI \u201919. https:\/\/doi.org\/10.1145\/3290605.3300356. 
Association for Computing Machinery, New York, pp 1\u201315","DOI":"10.1145\/3290605.3300356"},{"key":"10229_CR58","doi-asserted-by":"publisher","DOI":"10.1201\/b11509","volume-title":"Antipatterns: managing software organizations and people","author":"CJ Neill","year":"2011","unstructured":"Neill CJ, Laplante PA, DeFranco JF (2011) Antipatterns: managing software organizations and people. CRC Press, Florida"},{"issue":"1241","key":"10229_CR59","doi-asserted-by":"publisher","first-page":"585","DOI":"10.1098\/rstb.1990.0101","volume":"327","author":"DA Norman","year":"1990","unstructured":"Norman DA (1990) The \u2018problem\u2019 with automation: inappropriate feedback and interaction, not \u2018over-automation\u2019. Philos Trans R Soc Lond B Biol Sci 327(1241):585\u2013593","journal-title":"Philos Trans R Soc Lond B Biol Sci"},{"key":"10229_CR60","doi-asserted-by":"crossref","unstructured":"Olabarriaga S, Pierantoni G, Taffoni G, Sciacca E, Jaghoori M, Korkhov V, Castelli G, Vuerli C, Becciani U, Carley E, et al. (2014) Scientific workflow management\u2013for whom?. In: 2014 IEEE 10th international conference on e-Science, vol 1. IEEE, pp 298-305","DOI":"10.1109\/eScience.2014.8"},{"key":"10229_CR61","volume-title":"Doing data science. Straight talk from the frontline","author":"C O\u2019Neil","year":"2013","unstructured":"O\u2019Neil C, Schutt R (2013) Doing data science. Straight talk from the frontline. O\u2019Reilly Media Inc., California"},{"issue":"3","key":"10229_CR62","doi-asserted-by":"publisher","first-page":"286","DOI":"10.1109\/3468.844354","volume":"30","author":"R Parasuraman","year":"2000","unstructured":"Parasuraman R, Sheridan TB, Wickens CD (2000) A model for types and levels of human interaction with automation. 
IEEE Trans Syst Man Cybernet Part A Syst Hum 30(3):286\u2013297","journal-title":"IEEE Trans Syst Man Cybernet Part A Syst Hum"},{"key":"10229_CR63","doi-asserted-by":"crossref","unstructured":"Park LA, Read J (2018) A blended metric for multi-label optimisation and evaluation. In: Joint european conference on machine learning and knowledge discovery in databases, Springer, pp 719\u2013734","DOI":"10.1007\/978-3-030-10925-7_44"},{"key":"10229_CR64","doi-asserted-by":"crossref","unstructured":"Pascarella L, Bacchelli A (2017) Classifying code comments in java open-source software systems. In: 2017 IEEE\/ACM 14th international conference on mining software repositories, MSR, IEEE, pp 227\u2013237","DOI":"10.1109\/MSR.2017.63"},{"key":"10229_CR65","doi-asserted-by":"publisher","first-page":"22","DOI":"10.1016\/j.jss.2018.12.001","volume":"150","author":"L Pascarella","year":"2019","unstructured":"Pascarella L, Palomba F, Bacchelli A (2019) Fine-grained just-in-time defect prediction. J Syst Softw 150:22\u201336","journal-title":"J Syst Softw"},{"issue":"CSCW","key":"10229_CR66","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3274405","volume":"2","author":"S Passi","year":"2018","unstructured":"Passi S, Jackson SJ (2018) Trust in data science: collaboration, translation, and accountability in corporate data science projects. Proc ACM Human-Comput Interact 2(CSCW):1\u201328","journal-title":"Proc ACM Human-Comput Interact"},{"issue":"6","key":"10229_CR67","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1147\/JRD.2017.2736278","volume":"61","author":"E Patterson","year":"2017","unstructured":"Patterson E, McBurney R, Schmidt H, Baldini I, Mojsilovi\u0107 A, Varshney KR (2017) Dataflow representation of data analyses: toward a platform for collaborative data science. 
IBM J Res Dev 61(6):9\u20131","journal-title":"IBM J Res Dev"},{"key":"10229_CR68","unstructured":"Pellin BN (2000) Using classification techniques to determine source code authorship White Paper, Department of Computer Science, University of Wisconsin"},{"key":"10229_CR69","doi-asserted-by":"publisher","unstructured":"Pimentel JaF, Murta L, Braganholo V, Freire J (2019) A large-scale study about quality and reproducibility of jupyter notebooks. In: Proceedings of the 16th international conference on mining software repositories, IEEE Press, MSR \u201919, p 507\u2013517. https:\/\/doi.org\/10.1109\/MSR.2019.00077","DOI":"10.1109\/MSR.2019.00077"},{"key":"10229_CR70","unstructured":"PriceWaterhouseCoopers (2017) Investing in america\u2019s data science and analytics talent: a case for action. In: Business-higher education forum report"},{"issue":"4","key":"10229_CR71","doi-asserted-by":"publisher","first-page":"503","DOI":"10.1007\/s10664-011-9195-3","volume":"17","author":"F Rahman","year":"2012","unstructured":"Rahman F, Bird C, Devanbu P (2012) Clones: what is that smell? Empir Softw Eng 17(4):503\u2013530","journal-title":"Empir Softw Eng"},{"issue":"12","key":"10229_CR72","doi-asserted-by":"publisher","first-page":"889","DOI":"10.1109\/TSE.2004.101","volume":"30","author":"MP Robillard","year":"2004","unstructured":"Robillard MP, Coelho W, Murphy GC (2004) How effective developers investigate source code: An exploratory study. IEEE Trans Softw Eng 30 (12):889\u2013903","journal-title":"IEEE Trans Softw Eng"},{"issue":"115","key":"10229_CR73","first-page":"64","volume":"541","author":"CK Roy","year":"2007","unstructured":"Roy CK, Cordy JR (2007) A survey on software clone detection research. 
Queen\u2019s School Comput TR 541(115):64\u201368","journal-title":"Queen\u2019s School Comput TR"},{"key":"10229_CR74","doi-asserted-by":"crossref","unstructured":"Rubin V, G\u00fcnther CW, Van Der Aalst WM, Kindler E, Van Dongen BF, Sch\u00e4fer W (2007) Process mining framework for software processes. In: International conference on software process, Springer, pp 169\u2013181","DOI":"10.1007\/978-3-540-72426-1_15"},{"key":"10229_CR75","doi-asserted-by":"publisher","unstructured":"Rule A, Tabard A, Hollan JD (2018) Exploration and explanation in computational notebooks. In: Proceedings of the 2018 CHI conference on human factors in computing systems, CHI \u201918. https:\/\/doi.org\/10.1145\/3173574.3173606. Association for Computing Machinery, New York, pp 1\u20132","DOI":"10.1145\/3173574.3173606"},{"key":"10229_CR76","unstructured":"Schweinsberg M, Feldman M, Staub N, van den Akker OR, van Aert RC, Van Assen MA, Liu Y, Althoff T, Heer J, Kale A, et al. (2021) Same data, different conclusions: Radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis. Organizational Behavior and Human Decision Processes"},{"key":"10229_CR77","doi-asserted-by":"crossref","unstructured":"Smith B, Mizell D, Gilbert J, Shah V (2005) Towards a timed markov process model of software development. In: Proceedings of the second international workshop on Software engineering for high performance computing system applications, pp 65\u201367","DOI":"10.1145\/1145319.1145338"},{"key":"10229_CR78","doi-asserted-by":"crossref","unstructured":"Souza R, Azevedo LG, Louren\u00e7o V, Soares E, Thiago R, Brand\u00e3o R, Civitarese D, Brazil EV, Moreno M, Valduriez P, Mattoso M, Cerqueira R, Netto MAS (2020) Workflow provenance in the lifecycle of scientific machine learning","DOI":"10.1002\/cpe.6544"},{"key":"10229_CR79","unstructured":"Springboard (2016) The data science process. 
https:\/\/www.kdnuggets.com\/2016\/03\/data-science-process.html, Accessed 1 Jan 2021"},{"issue":"4","key":"10229_CR80","doi-asserted-by":"publisher","first-page":"470","DOI":"10.1109\/TSE.2009.15","volume":"35","author":"MA Storey","year":"2009","unstructured":"Storey MA, Ryall J, Singer J, Myers D, Cheng LT, Muller M (2009) How software developers use tagging to support reminding and refinding. IEEE Trans Softw Eng 35(4):470\u2013483","journal-title":"IEEE Trans Softw Eng"},{"key":"10229_CR81","doi-asserted-by":"crossref","unstructured":"Svyatkovskiy A, Zhao Y, Fu S, Sundaresan N (2019) Pythia: ai-assisted code completion system. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp 2727\u20132735","DOI":"10.1145\/3292500.3330699"},{"key":"10229_CR82","doi-asserted-by":"publisher","unstructured":"Titov S, Golubev Y, Bryksin T (2022) Resplit: improving the structure of jupyter notebooks by re-splitting their cells. In: 2022 IEEE international conference on software analysis, evolution and reengineering (SANER), pp 492\u2013496. https:\/\/doi.org\/10.1109\/SANER53432.2022.00066","DOI":"10.1109\/SANER53432.2022.00066"},{"key":"10229_CR83","unstructured":"Trcka N, Aalst V, Sidorova N (2008) Analyzing control-flow and data-flow in workflow processes in a unified way. Computer science report"},{"key":"10229_CR84","unstructured":"Tsoumakas G, Vlahavas I (2007) Random k -labelsets: an ensemble method for multilabel classification. In: ECML"},{"key":"10229_CR85","unstructured":"UCSD C (2021) Introduction to big data - steps in the data science process. coursera (university of california san diego). 
https:\/\/www.coursera.org\/lecture\/big-data-introduction\/steps-in-the-data-science-process-Fonq2, Accessed 1 Jan 2021"},{"key":"10229_CR86","doi-asserted-by":"crossref","unstructured":"Ugurel S, Krovetz R, Giles CL (2002) What\u2019s the code? automatic classification of source code archives. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp 632\u2013638","DOI":"10.1145\/775047.775141"},{"key":"10229_CR87","doi-asserted-by":"crossref","unstructured":"Vassiliadis P, Simitsis A, Baikousi E (2009) A taxonomy of etl activities. In: Proceedings of the ACM twelfth international workshop on Data warehousing and OLAP, pp 25\u201332","DOI":"10.1145\/1651291.1651297"},{"key":"10229_CR88","doi-asserted-by":"publisher","unstructured":"Wang D, Weisz JD, Muller M, Ram P, Geyer W, Dugan C, Tausczik Y, Samulowitz H, Gray A (2019a) Human-ai collaboration in data science. In: Proceedings of the ACM on human-computer interaction 3(CSCW):1\u201324. https:\/\/doi.org\/10.1145\/3359313","DOI":"10.1145\/3359313"},{"key":"10229_CR89","unstructured":"Wang D, Liao QV, Zhang Y, Khurana U, Samulowitz H, Park S, Muller MJ, Amini L (2021a) How much automation does a data scientist want? ArXiv:2101.03970"},{"key":"10229_CR90","doi-asserted-by":"crossref","unstructured":"Wang J, Li L, Zeller A (2019b) Better code, better sharing: on the need of analyzing jupyter notebooks","DOI":"10.1145\/3377816.3381724"},{"key":"10229_CR91","doi-asserted-by":"crossref","unstructured":"Wang J, Li L, Zeller A (2021b) Restoring execution environments of jupyter notebooks. 
In: 2021 IEEE\/ACM 43rd international conference on software engineering, ICSE, IEEE, pp 1622\u20131633","DOI":"10.1109\/ICSE43902.2021.00144"},{"key":"10229_CR92","unstructured":"Watson A, Bateman S, Ray S (2019) Pysnippet: Accelerating exploratory data analysis in jupyter notebook through facilitated access to example code. In: EDBT\/ICDT Workshops"},{"key":"10229_CR93","unstructured":"Zevin S, Holzem C (2017) Machine learning based source code classification using syntax oriented features. arXiv:170307638"},{"key":"10229_CR94","unstructured":"Zhang AX, Muller M, Wang D (2020a) How do data science workers collaborate? roles, workflows, and tools. 2001.06684"},{"key":"10229_CR95","unstructured":"Zhang G, Merrill MA, Liu Y, Heer J, Althoff T (2020b) Coral: code representation learning with weakly-supervised transformers for analyzing data analysis. arXiv:200812828"},{"issue":"2","key":"10229_CR96","doi-asserted-by":"publisher","first-page":"153","DOI":"10.1631\/FITEE.1700053","volume":"18","author":"NN Zheng","year":"2017","unstructured":"Zheng NN, Liu ZY, Ren PJ, Ma YQ, Chen ST, Yu Sy, Xue JR, Chen BD, Wang FY (2017) Hybrid-augmented intelligence: collaboration and cognition. 
Front Inf Technol Electr Eng 18(2):153\u2013179","journal-title":"Front Inf Technol Electr Eng"}],"container-title":["Empirical Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10664-022-10229-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10664-022-10229-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10664-022-10229-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,10]],"date-time":"2023-01-10T03:16:21Z","timestamp":1673320581000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10664-022-10229-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,11,19]]},"references-count":96,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,1]]}},"alternative-id":["10229"],"URL":"https:\/\/doi.org\/10.1007\/s10664-022-10229-z","relation":{},"ISSN":["1382-3256","1573-7616"],"issn-type":[{"value":"1382-3256","type":"print"},{"value":"1573-7616","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,11,19]]},"assertion":[{"value":"8 August 2022","order":1,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 November 2022","order":2,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare they have no financial or non-financial interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"7"}}