{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,28]],"date-time":"2026-03-28T10:38:28Z","timestamp":1774694308102,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":18,"publisher":"ACM","license":[{"start":{"date-parts":[[2020,8,20]],"date-time":"2020-08-20T00:00:00Z","timestamp":1597881600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2020,8,23]]},"DOI":"10.1145\/3394486.3406477","type":"proceedings-article","created":{"date-parts":[[2020,8,20]],"date-time":"2020-08-20T23:17:27Z","timestamp":1597965447000},"page":"3561-3562","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":223,"title":["Overview and Importance of Data Quality for Machine Learning Tasks"],"prefix":"10.1145","author":[{"given":"Abhinav","family":"Jain","sequence":"first","affiliation":[{"name":"IBM Research India, New Delhi, India"}]},{"given":"Hima","family":"Patel","sequence":"additional","affiliation":[{"name":"IBM Research India, Bengaluru, India"}]},{"given":"Lokesh","family":"Nagalapatti","sequence":"additional","affiliation":[{"name":"IBM Research India, Bengaluru, India"}]},{"given":"Nitin","family":"Gupta","sequence":"additional","affiliation":[{"name":"IBM Research India, New Delhi, India"}]},{"given":"Sameep","family":"Mehta","sequence":"additional","affiliation":[{"name":"IBM Research India, Bengaluru, India"}]},{"given":"Shanmukha","family":"Guttula","sequence":"additional","affiliation":[{"name":"IBM Research India, Bengaluru, India"}]},{"given":"Shashank","family":"Mujumdar","sequence":"additional","affiliation":[{"name":"IBM Research India, New Delhi, 
India"}]},{"given":"Shazia","family":"Afzal","sequence":"additional","affiliation":[{"name":"IBM Research India, New Delhi, India"}]},{"given":"Ruhi","family":"Sharma Mittal","sequence":"additional","affiliation":[{"name":"IBM Research India, Bengaluru, India"}]},{"given":"Vitobha","family":"Munigala","sequence":"additional","affiliation":[{"name":"IBM Research India, Bengaluru, India"}]}],"member":"320","published-online":{"date-parts":[[2020,8,20]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313602"},{"key":"e_1_3_2_1_2_1","volume-title":"Data Validation for Machine Learning. In Conference on Systems and Machine Learning (SysML).","author":"Breck Eric","year":"2019","unstructured":"Eric Breck, Neoklis Polyzotis, Sudip Roy, Steven Whang, and Martin Zinkevich. 2019. Data Validation for Machine Learning. In Conference on Systems and Machine Learning (SysML)."},{"key":"e_1_3_2_1_3_1","volume-title":"Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks. arXiv preprint arXiv:1811.01910","author":"Collins Edward","year":"2018","unstructured":"Edward Collins, Nikolai Rozanov, and Bingbing Zhang. 2018. Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks. arXiv preprint arXiv:1811.01910 (2018)."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2006.132"},{"key":"e_1_3_2_1_5_1","volume-title":"Advances in Artificial Intelligence, Atefeh Farzindar and Vlado Ke\u0161elj (Eds.)","author":"Denil Misha","unstructured":"Misha Denil and Thomas Trappenberg. 2010. Overlap versus Imbalance. In Advances in Artificial Intelligence, Atefeh Farzindar and Vlado Ke\u0161elj (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 220--231."},{"key":"e_1_3_2_1_6_1","volume-title":"Proceedings of the 34th International Conference on Machine Learning-Volume 70","author":"Devlin Jacob","year":"2017","unstructured":"Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. 2017. Robustfill: Neural program learning under noisy i\/o. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. 990--998."},{"key":"e_1_3_2_1_7_1","volume-title":"Data shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868","author":"Ghorbani Amirata","year":"2019","unstructured":"Amirata Ghorbani and James Zou. 2019. Data shapley: Equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868 (2019)."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1925844.1926423"},{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313415"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611974973.55"},{"key":"e_1_3_2_1_11_1","volume-title":"BTW 2019--Workshopband","author":"Kiefer Cornelia","year":"2019","unstructured":"Cornelia Kiefer. 2019. Quality indicators for text data. BTW 2019--Workshopband (2019)."},{"key":"e_1_3_2_1_12_1","volume-title":"Outlier Detection for Improved Data Quality and Diversity in Dialog Systems. arXiv preprint arXiv:1904.03122","author":"Larson Stefan","year":"2019","unstructured":"Stefan Larson, Anish Mahendran, Andrew Lee, Jonathan K Kummerfeld, Parker Hill, Michael A Laurenzano, Johann Hauswald, Lingjia Tang, and Jason Mars. 2019. Outlier Detection for Improved Data Quality and Diversity in Dialog Systems. arXiv preprint arXiv:1904.03122 (2019)."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3290605.3300356"},{"key":"e_1_3_2_1_14_1","volume-title":"Confident Learning: Estimating Uncertainty in Dataset Labels. arXiv preprint arXiv:1911.00068","author":"Northcutt Curtis G","year":"2019","unstructured":"Curtis G Northcutt, Lu Jiang, and Isaac L Chuang. 2019. Confident Learning: Estimating Uncertainty in Dataset Labels. arXiv preprint arXiv:1911.00068 (2019)."},{"key":"e_1_3_2_1_15_1","volume-title":"Pedro Rangel Henriques, and Helena Galhardas","author":"Oliveira Paulo","year":"2005","unstructured":"Paulo Oliveira, F\u00e1tima Rodrigues, Pedro Rangel Henriques, and Helena Galhardas. 2005. A Taxonomy of Data Quality Problems. 
Journal of Data and Information Quality - JDIQ (01 2005)."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1268"},{"key":"e_1_3_2_1_17_1","unstructured":"Erhard Rahm and Hong Hai Do. 2000. Data cleaning: Problems and current approaches. (2000)."},{"key":"e_1_3_2_1_18_1","volume-title":"Data Valuation using Reinforcement Learning. arXiv preprint arXiv:1909.11671","author":"Yoon Jinsung","year":"2019","unstructured":"Jinsung Yoon, Sercan O Arik, and Tomas Pfister. 2019. Data Valuation using Reinforcement Learning. arXiv preprint arXiv:1909.11671 (2019)."}],"event":{"name":"KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining","location":"Virtual Event CA USA","acronym":"KDD '20","sponsor":["SIGMOD ACM Special Interest Group on Management of Data","SIGKDD ACM Special Interest Group on Knowledge Discovery in Data"]},"container-title":["Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394486.3406477","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3394486.3406477","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T21:31:30Z","timestamp":1750195890000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394486.3406477"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,8,20]]},"references-count":18,"alternative-id":["10.1145\/3394486.3406477","10.1145\/3394486"],"URL":"https:\/\/doi.org\/10.1145\/3394486.3406477","relation":{},"subject":[],"published":{"date-parts":[[2020,8,20]]},"assertion":[{"value":"2020-08-20","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}