{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,14]],"date-time":"2026-01-14T15:28:00Z","timestamp":1768404480263,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":54,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,5,16]],"date-time":"2022-05-16T00:00:00Z","timestamp":1652659200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,5,16]]},"DOI":"10.1145\/3522664.3528621","type":"proceedings-article","created":{"date-parts":[[2022,10,17]],"date-time":"2022-10-17T16:30:14Z","timestamp":1666024214000},"page":"205-216","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":22,"title":["Data smells in public datasets"],"prefix":"10.1145","author":[{"given":"Arumoy","family":"Shome","sequence":"first","affiliation":[{"name":"Delft University of Technology, Netherlands"}]},{"given":"Lu\u00eds","family":"Cruz","sequence":"additional","affiliation":[{"name":"Delft University of Technology, Netherlands"}]},{"given":"Arie","family":"van Deursen","sequence":"additional","affiliation":[{"name":"Delft University of Technology, Netherlands"}]}],"member":"320","published-online":{"date-parts":[[2022,10,17]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"2018. Mortgage algorithms perpetuate racial bias in lending study finds. https:\/\/news.berkeley.edu\/story_jump\/mortgage-algorithms-perpetuate-racial-bias-in-lending-study-finds\/ Accessed on [2022-01-11 Tue]."},{"key":"e_1_3_2_1_2_1","first-page":"1258","article-title":"An Insider Data Leakage Detection Using One-Hot Encoding","volume":"23","author":"Al-Shehari Taher","year":"2021","unstructured":"Taher Al-Shehari and Rakan A Alsowail. 2021. An Insider Data Leakage Detection Using One-Hot Encoding, Synthetic Minority Oversampling and Machine Learning Techniques. Entropy 23, 10 (2021), 1258.","journal-title":"Synthetic Minority Oversampling and Machine Learning Techniques. Entropy"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDSE.2016.7823957"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE-SEIP.2019.00042"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/SEAA.2018.00018"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10664-014-9313-0"},{"key":"e_1_3_2_1_7_1","volume-title":"Automated Data Validation in Machine Learning Systems. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.[Google Scholar]","author":"Biessmann Felix","year":"2021","unstructured":"Felix Biessmann, Jacek Golebiowski, Tammo Rukat, Dustin Lange, and Philipp Schmidt. 2021. Automated Data Validation in Machine Learning Systems. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering.[Google Scholar] (2021)."},{"key":"e_1_3_2_1_8_1","volume-title":"Helena Holmstr\u00f6m Olsson, and Ivica Crnkovic","author":"Bosch Jan","year":"2021","unstructured":"Jan Bosch, Helena Holmstr\u00f6m Olsson, and Ivica Crnkovic. 2021. Engineering AI systems: A research agenda. In Artificial Intelligence Paradigms for Smart Cyber-Physical Systems. IGI Global, 1--19."},{"key":"e_1_3_2_1_9_1","unstructured":"Eric Breck Neoklis Polyzotis Sudip Roy Steven Whang and Martin Zinkevich. 2019. Data Validation for Machine Learning. In MLSys."},{"key":"e_1_3_2_1_10_1","unstructured":"Tom B Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2912574"},{"key":"e_1_3_2_1_12_1","volume-title":"The Atlas of AI","author":"Crawford Kate","unstructured":"Kate Crawford. 2021. The Atlas of AI. Yale University Press."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPC.2019.00025"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-020-0219-9"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.infsof.2021.106737"},{"key":"e_1_3_2_1_16_1","unstructured":"Martin Fowler. 2006. CodeSmell. https:\/\/martinfowler.com\/bliki\/CodeSmell.html Accessed on [2022-01-04 Tue]."},{"key":"e_1_3_2_1_17_1","volume-title":"Refactoring: improving the design of existing code","author":"Fowler Martin","unstructured":"Martin Fowler. 2018. Refactoring: improving the design of existing code. Addison-Wesley Professional."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10664-014-9351-7"},{"key":"e_1_3_2_1_19_1","unstructured":"Refactoring Guru. [n.d.]. Catalog of Refactoring. https:\/\/refactoring.guru\/refactoring\/catalog Accessed on [2022-01-04 Tue]."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10664-021-09993-1"},{"key":"e_1_3_2_1_21_1","unstructured":"Will Douglas Heaven. 2020. Predictive policing algorithms are racist. They need to be dismantled. https:\/\/www.technologyreview.com\/2020\/07\/17\/1005396\/predictive-policing-algorithms-racist-dismantled-machine-learning-bias-criminal-justice\/ Accessed on [2022-01-11 Tue]."},{"key":"e_1_3_2_1_22_1","volume-title":"Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE) 25","author":"Hellerstein Joseph M","year":"2008","unstructured":"Joseph M Hellerstein. 2008. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE) 25 (2008)."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3442188.3445918"},{"key":"e_1_3_2_1_24_1","volume-title":"NIPS MLSys Workshop.","author":"Hynes Nick","year":"2017","unstructured":"Nick Hynes, D Sculley, and Michael Terry. 2017. The data linter: Lightweight, automated sanity checking for ml data sets. In NIPS MLSys Workshop."},{"key":"e_1_3_2_1_25_1","unstructured":"IBM. [n.d.]. IBM SPSS Modeler CRISP-DM Guide. https:\/\/www.ibm.com\/docs\/en\/spss-modeler\/SaaS?topic=guide-data-understanding Accessed on [2022-03-15 Tue]."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSE.2017.2754374"},{"key":"e_1_3_2_1_27_1","volume-title":"Boost-clean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299","author":"Krishnan Sanjay","year":"2017","unstructured":"Sanjay Krishnan, Michael J Franklin, Ken Goldberg, and Eugene Wu. 2017. Boost-clean: Automated error detection and repair for machine learning. arXiv preprint arXiv:1711.01299 (2017)."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939502.2939511"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3461702.3462599"},{"key":"e_1_3_2_1_30_1","unstructured":"Brianna Lifshitz. 2021. Racism is systematic in artificial intelligence systems too. https:\/\/georgetownsecuritystudiesreview.org\/2021\/05\/06\/racism-is-systemic-in-artificial-intelligence-systems-too\/ Accessed on [2022-01-11 Tue]."},{"key":"e_1_3_2_1_31_1","volume-title":"Trustworthy ai: A computational perspective. arXiv preprint arXiv:2107.06641","author":"Liu Haochen","year":"2021","unstructured":"Haochen Liu, Yiqi Wang, Wenqi Fan, Xiaorui Liu, Yaxin Li, Shaili Jain, Yunhao Liu, Anil K Jain, and Jiliang Tang. 2021. Trustworthy ai: A computational perspective. arXiv preprint arXiv:2107.06641 (2021)."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE-SEIP52600.2021.00034"},{"key":"e_1_3_2_1_34_1","volume-title":"Meelis Kull, Nicolas Lachiche, Marea Jose Ramirez Quintana, and Peter A Flach.","author":"Mart\u00ednez-Plumed Fernando","year":"2019","unstructured":"Fernando Mart\u00ednez-Plumed, Lidia Contreras-Ochando, Cesar Ferri, Jos\u00e9 Hern\u00e1ndez Orallo, Meelis Kull, Nicolas Lachiche, Marea Jose Ramirez Quintana, and Peter A Flach. 2019. CRISP-DM twenty years later: From data mining processes to data science trajectories. IEEE Transactions on Knowledge and Data Engineering (2019)."},{"key":"e_1_3_2_1_35_1","unstructured":"Natalia Mesa. 2021. Can the criminal justice system's artificial intelligence ever be truly fair? https:\/\/massivesci.com\/articles\/machine-learning-compas-racism-policing-fairness\/ Accessed on [2022-01-11 Tue]."},{"key":"e_1_3_2_1_36_1","unstructured":"Jennifer Miller. 2020. Is an Algorithm Less Racist Than a Loan Officer? https:\/\/www.nytimes.com\/2020\/09\/18\/business\/digital-mortgages.html Accessed on [2022-01-11 Tue]."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3379597.3387467"},{"key":"e_1_3_2_1_38_1","unstructured":"Andrew Ng. 2021. A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. https:\/\/youtu.be\/06-AZXmwHjo Accessed on [2022-01-17 Mon]."},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3316615.3316730"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/UCC.2014.57"},{"key":"e_1_3_2_1_41_1","volume-title":"Weaknesses and Potential Next Steps. In 2021 IEEE International Conference on Big Data (Big Data). IEEE, 2337--2344","author":"Saltz Jeffrey S","year":"2021","unstructured":"Jeffrey S Saltz. 2021. CRISP-DM for Data Science: Strengths, Weaknesses and Potential Next Steps. In 2021 IEEE International Conference on Big Data (Big Data). IEEE, 2337--2344."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3411764.3445518"},{"key":"e_1_3_2_1_43_1","unstructured":"Danilo Sato Arif Wilder and Christoph Windheuser. 2019. Continuous Delivery for Machine Learning. https:\/\/martinfowler.com\/articles\/cd4ml.html"},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3229867"},{"key":"e_1_3_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2021.01.199"},{"key":"e_1_3_2_1_46_1","volume-title":"Hidden technical debt in machine learning systems. Advances in neural information processing systems 28","author":"Sculley David","year":"2015","unstructured":"David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. 2015. Hidden technical debt in machine learning systems. Advances in neural information processing systems 28 (2015), 2503--2511."},{"key":"e_1_3_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183519.3183529"},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSME.2018.00010"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/2970276.2970340"},{"key":"e_1_3_2_1_50_1","volume-title":"Bug Tracking Process Smells In Practice. In 2022 IEEE\/ACM 44th International Conference on Software Engineering (ICSE). IEEE.","author":"Tuna Erdem","year":"2022","unstructured":"Erdem Tuna, Vladimir Kovalenko, and Eray T\u00fcz\u00fcn. 2022. Bug Tracking Process Smells In Practice. In 2022 IEEE\/ACM 44th International Conference on Software Engineering (ICSE). IEEE."},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISESE.2005.1541819"},{"key":"e_1_3_2_1_52_1","volume-title":"How does machine learning change software development practices? IEEE Transactions on Software Engineering","author":"Wan Zhiyuan","year":"2019","unstructured":"Zhiyuan Wan, Xin Xia, David Lo, and Gail C Murphy. 2019. How does machine learning change software development practices? IEEE Transactions on Software Engineering (2019)."},{"key":"e_1_3_2_1_53_1","unstructured":"Mark Weber Mikhail Yurochkin Sherif Botros and Vanio Markov. 2020. Black Loans Matter: Fighting Bias for AI Fairness in Lending. https:\/\/mitibmwatsonailab.mit.edu\/research\/blog\/black-loans-matter-fighting-bias-for-ai-fairness-in-lending\/ Accessed on [2022-01-11 Tue]."},{"key":"e_1_3_2_1_54_1","volume-title":"Machine learning testing: Survey, landscapes and horizons","author":"Zhang Jie M","year":"2020","unstructured":"Jie M Zhang, Mark Harman, Lei Ma, and Yang Liu. 2020. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering (2020)."},{"key":"e_1_3_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1613\/jair.1.11854"}],"event":{"name":"CAIN '22: 1st Conference on AI Engineering - Software Engineering for AI","location":"Pittsburgh Pennsylvania","acronym":"CAIN '22","sponsor":["SIGSOFT ACM Special Interest Group on Software Engineering","IEEE TCSC IEEE Technical Committee on Scalable Computing"]},"container-title":["Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3522664.3528621","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3522664.3528621","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:34Z","timestamp":1750183774000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3522664.3528621"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,5,16]]},"references-count":54,"alternative-id":["10.1145\/3522664.3528621","10.1145\/3522664"],"URL":"https:\/\/doi.org\/10.1145\/3522664.3528621","relation":{},"subject":[],"published":{"date-parts":[[2022,5,16]]},"assertion":[{"value":"2022-10-17","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}