{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,12]],"date-time":"2026-02-12T09:29:25Z","timestamp":1770888565766,"version":"3.50.1"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2022,7,29]],"date-time":"2022-07-29T00:00:00Z","timestamp":1659052800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["SIGMOD Rec."],"published-print":{"date-parts":[[2022,7,29]]},"abstract":"<jats:p>The recent success of machine learning (ML) has led to an explosive growth of systems and applications built by an ever-growing community of system builders and data science (DS) practitioners. This quickly shifting panorama, however, is challenging for system builders and practitioners alike to follow. In this paper, we set out to capture this panorama through a wide-angle lens, performing the largest analysis of DS projects to date, focusing on questions that can advance our understanding of the field and determine investments. Specifically, we download and analyze (a) over 8M notebooks publicly available on GitHub and (b) over 2M enterprise ML pipelines developed within Microsoft. Our analysis includes coarse-grained statistical characterizations, fine-grained analysis of libraries and pipelines, and comparative studies across datasets and time. 
We report a large number of measurements for our readers to interpret and draw actionable conclusions on (a) what system builders should focus on to better serve practitioners and (b) what technologies practitioners should rely on.<\/jats:p>","DOI":"10.1145\/3552490.3552496","type":"journal-article","created":{"date-parts":[[2022,7,29]],"date-time":"2022-07-29T18:51:29Z","timestamp":1659120689000},"page":"30-37","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":24,"title":["Data Science Through the Looking Glass"],"prefix":"10.1145","volume":"51","author":[{"given":"Fotis","family":"Psallidas","sequence":"first","affiliation":[]},{"given":"Yiwen","family":"Zhu","sequence":"additional","affiliation":[]},{"given":"Bojan","family":"Karlas","sequence":"additional","affiliation":[]},{"given":"Jordan","family":"Henkel","sequence":"additional","affiliation":[]},{"given":"Matteo","family":"Interlandi","sequence":"additional","affiliation":[]},{"given":"Subru","family":"Krishnan","sequence":"additional","affiliation":[]},{"given":"Brian","family":"Kroth","sequence":"additional","affiliation":[]},{"given":"Venkatesh","family":"Emani","sequence":"additional","affiliation":[]},{"given":"Wentao","family":"Wu","sequence":"additional","affiliation":[]},{"given":"Ce","family":"Zhang","sequence":"additional","affiliation":[]},{"given":"Markus","family":"Weimer","sequence":"additional","affiliation":[]},{"given":"Avrilia","family":"Floratou","sequence":"additional","affiliation":[]},{"given":"Carlo","family":"Curino","sequence":"additional","affiliation":[]},{"given":"Konstantinos","family":"Karanasos","sequence":"additional","affiliation":[]}],"member":"320","published-online":{"date-parts":[[2022,7,29]]},"reference":[{"key":"e_1_2_1_1_1","first-page":"265","volume-title":"12th USENIX Symposium on Operating Systems Design and Implementation (OSDI)","author":"Abadi M.","year":"2016","unstructured":"M. Abadi , P. Barham , J. 
Chen , Z. Chen , A. Davis , J. Dean , M. Devin , S. Ghemawat , G. Irving , M. Isard , : A system for large-scale machine learning . In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI) , pages 265 -- 283 , 2016 . M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 265--283, 2016."},{"key":"e_1_2_1_2_1","volume-title":"Conference on Innovative Data Systems Research (CIDR)","author":"Agrawal A.","year":"2019","unstructured":"A. Agrawal , R. Chatterjee , C. Curino , A. Floratou , N. Gowdal , M. Interlandi , A. Jindal , K. Karanasos , S. Krishnan , B. Kroth , J. Leeka , K. Park , H. Patel , O. Poppe , F. Psallidas , R. Ramakrishnan , A. Roy , K. Saur , R. Sen , M. Weimer , T. Wright , and Y. Zhu . Cloudy with high chance of DBMS: A 10-year prediction for Enterprise-Grade ML . In Conference on Innovative Data Systems Research (CIDR) , 2019 . A. Agrawal, R. Chatterjee, C. Curino, A. Floratou, N. Gowdal, M. Interlandi, A. Jindal, K. Karanasos, S. Krishnan, B. Kroth, J. Leeka, K. Park, H. Patel, O. Poppe, F. Psallidas, R. Ramakrishnan, A. Roy, K. Saur, R. Sen, M. Weimer, T. Wright, and Y. Zhu. Cloudy with high chance of DBMS: A 10-year prediction for Enterprise-Grade ML. In Conference on Innovative Data Systems Research (CIDR), 2019."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330667"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE-SEIP.2019.00042"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.5555\/1795822"},{"key":"e_1_2_1_6_1","first-page":"11073","author":"Bommarito E.","year":"1907","unstructured":"E. Bommarito and M. Bommarito . An Empirical Analysis of the Python Package Index (PyPI). CoRR, abs\/ 1907 . 11073 , 2019. E. Bommarito and M. Bommarito. 
An Empirical Analysis of the Python Package Index (PyPI). CoRR, abs\/1907.11073, 2019.","journal-title":"CoRR, abs\/"},{"key":"e_1_2_1_7_1","unstructured":"Caffe. https:\/\/caffe.berkeleyvision.org.  Caffe. https:\/\/caffe.berkeleyvision.org."},{"key":"e_1_2_1_8_1","unstructured":"David Halter etal https:\/\/parso.readthedocs.io.  David Halter et al. https:\/\/parso.readthedocs.io."},{"key":"e_1_2_1_9_1","first-page":"1","volume":"21","author":"Decan A.","year":"2016","unstructured":"A. Decan , T. Mens , and M. Claes . On the Topology of Package Dependency Networks: A Comparison of Three Programming Language Ecosystems. In ECSAW, pages 21 : 1 -- 21 :4, 2016 . A. Decan, T. Mens, and M. Claes. On the Topology of Package Dependency Networks: A Comparison of Three Programming Language Ecosystems. In ECSAW, pages 21:1--21:4, 2016.","journal-title":"In ECSAW, pages"},{"key":"e_1_2_1_10_1","unstructured":"Gensim. https:\/\/radimrehurek.com\/gensim.  Gensim. https:\/\/radimrehurek.com\/gensim."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/VLHCC.2016.7739680"},{"key":"e_1_2_1_12_1","volume-title":"The State of Data Science & Machine Learning","year":"2019","unstructured":"Kaggle. The State of Data Science & Machine Learning , 2019 . https:\/\/www.kaggle.com\/kaggle-survey-2019. Kaggle. The State of Data Science & Machine Learning, 2019. https:\/\/www.kaggle.com\/kaggle-survey-2019."},{"key":"e_1_2_1_13_1","volume-title":"Conference on Innovative Data Systems Research (CIDR)","author":"Karanasos K.","year":"2019","unstructured":"K. Karanasos , M. Interlandi , D. Xin , F. Psallidas , R. Sen , K. Park , I. Popivanov , S. Nakandal , S. Krishnan , M. Weimer , Y. Yu , R. Ramakrishnan , and C. Curino . Extending relational query processing with ml inference . In Conference on Innovative Data Systems Research (CIDR) , 2019 . K. Karanasos, M. Interlandi, D. Xin, F. Psallidas, R. Sen, K. Park, I. Popivanov, S. Nakandal, S. Krishnan, M. Weimer, Y. Yu, R. 
Ramakrishnan, and C. Curino. Extending relational query processing with ml inference. In Conference on Innovative Data Systems Research (CIDR), 2019."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3025453.3025626"},{"key":"e_1_2_1_15_1","unstructured":"Matplotlib. https:\/\/matplotlib.org.  Matplotlib. https:\/\/matplotlib.org."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300356"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394486.3403205"},{"key":"e_1_2_1_18_1","unstructured":"NumPy. https:\/\/numpy.org.  NumPy. https:\/\/numpy.org."},{"key":"e_1_2_1_19_1","unstructured":"OpenCV. https:\/\/opencv.org.  OpenCV. https:\/\/opencv.org."},{"key":"e_1_2_1_20_1","unstructured":"Pandas. https:\/\/pandas.pydata.org.  Pandas. https:\/\/pandas.pydata.org."},{"key":"e_1_2_1_21_1","unstructured":"Pillow. https:\/\/pillow.readthedocs.io.  Pillow. https:\/\/pillow.readthedocs.io."},{"key":"e_1_2_1_22_1","unstructured":"Plotly. https:\/\/plotly.com\/.  Plotly. https:\/\/plotly.com\/."},{"key":"e_1_2_1_23_1","volume-title":"Data science through the looking glass and what we found there. arXiv preprint arXiv:1912.09536","author":"Psallidas F.","year":"2019","unstructured":"F. Psallidas , Y. Zhu , B. Karlas , M. Interlandi , A. Floratou , K. Karanasos , W. Wu , C. Zhang , S. Krishnan , C. Curino , Data science through the looking glass and what we found there. arXiv preprint arXiv:1912.09536 , 2019 . F. Psallidas, Y. Zhu, B. Karlas, M. Interlandi, A. Floratou, K. Karanasos, W. Wu, C. Zhang, S. Krishnan, C. Curino, et al. Data science through the looking glass and what we found there. arXiv preprint arXiv:1912.09536, 2019."},{"key":"e_1_2_1_24_1","unstructured":"PyTorch. https:\/\/pytorch.org.  PyTorch. https:\/\/pytorch.org."},{"key":"e_1_2_1_25_1","unstructured":"Requests. https:\/\/requests.readthedocs.io.  Requests. 
https:\/\/requests.readthedocs.io."},{"key":"e_1_2_1_26_1","volume-title":"Data from: Exploration and Explanation in Computational Notebooks","author":"Rule A.","year":"2017","unstructured":"A. Rule , A. Tabard , and J. D. Hollan . Data from: Exploration and Explanation in Computational Notebooks . UC San Diego Library Digital Collections , 2017 . https:\/\/doi.org\/10.6075\/J0JW8C39. 10.6075\/J0JW8C39 A. Rule, A. Tabard, and J. D. Hollan. Data from: Exploration and Explanation in Computational Notebooks. UC San Diego Library Digital Collections, 2017. https:\/\/doi.org\/10.6075\/J0JW8C39."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173574.3173606"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445518"},{"key":"e_1_2_1_29_1","volume-title":"NIPS","author":"Schelter S.","year":"2017","unstructured":"S. Schelter , J.-H. B\u00a8ose , J. Kirschnick , T. Klein , and S. Seufert . Automatically tracking metadata and provenance of machine learning experiments . In NIPS , 2017 . S. Schelter, J.-H. B\u00a8ose, J. Kirschnick, T. Klein, and S. Seufert. Automatically tracking metadata and provenance of machine learning experiments. In NIPS, 2017."},{"key":"e_1_2_1_30_1","unstructured":"Scikit-Learn. https:\/\/scikit-learn.org\/stable\/modules\/generated\/ sklearn.pipeline.Pipeline.html\/.  Scikit-Learn. https:\/\/scikit-learn.org\/stable\/modules\/generated\/ sklearn.pipeline.Pipeline.html\/."},{"key":"e_1_2_1_31_1","unstructured":"SciPy. https:\/\/www.scipy.org.  SciPy. https:\/\/www.scipy.org."},{"key":"e_1_2_1_32_1","unstructured":"Scikit-Learn. https:\/\/scikit-learn.org.  Scikit-Learn. https:\/\/scikit-learn.org."},{"key":"e_1_2_1_33_1","unstructured":"SQLAlchemy. https:\/\/www.sqlalchemy.org.  SQLAlchemy. https:\/\/www.sqlalchemy.org."},{"key":"e_1_2_1_34_1","unstructured":"Tensorflow. https:\/\/www.tensorflow.org.  Tensorflow. https:\/\/www.tensorflow.org."},{"key":"e_1_2_1_35_1","unstructured":"Theano. 
http:\/\/deeplearning.net\/software\/theano.  Theano. http:\/\/deeplearning.net\/software\/theano."},{"key":"e_1_2_1_36_1","unstructured":"Tqdm. https:\/\/tqdm.github.io\/.  Tqdm. https:\/\/tqdm.github.io\/."},{"key":"e_1_2_1_37_1","unstructured":"XGBoost. https:\/\/xgboost.readthedocs.io.  XGBoost. https:\/\/xgboost.readthedocs.io."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389738"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3392826"}],"container-title":["ACM SIGMOD Record"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3552490.3552496","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3552490.3552496","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T17:45:12Z","timestamp":1750268712000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3552490.3552496"}},"subtitle":["Analysis of Millions of GitHub Notebooks and ML.NET Pipelines"],"short-title":[],"issued":{"date-parts":[[2022,7,29]]},"references-count":39,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,7,29]]}},"alternative-id":["10.1145\/3552490.3552496"],"URL":"https:\/\/doi.org\/10.1145\/3552490.3552496","relation":{},"ISSN":["0163-5808"],"issn-type":[{"value":"0163-5808","type":"print"}],"subject":[],"published":{"date-parts":[[2022,7,29]]},"assertion":[{"value":"2022-07-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}