{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,5]],"date-time":"2026-06-05T15:16:44Z","timestamp":1780672604351,"version":"3.54.1"},"reference-count":150,"publisher":"Association for Computing Machinery (ACM)","issue":"9","license":[{"start":{"date-parts":[[2025,4,4]],"date-time":"2025-04-04T00:00:00Z","timestamp":1743724800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Ohio State University\u2019s STEM Education Faculty Startup Award","award":["GR129725"],"award-info":[{"award-number":["GR129725"]}]},{"name":"U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research","award":["DE-AC02-05CH11231"],"award-info":[{"award-number":["DE-AC02-05CH11231"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:p>\n            Artificial Intelligence (AI) applications critically depend on data. Poor-quality data produces inaccurate and ineffective AI models that may lead to incorrect or unsafe use. Evaluation of data readiness is a crucial step in improving the quality and appropriateness of data usage for AI. R&amp;D efforts have been spent on improving data quality. However, standardized metrics for evaluating data readiness for use in AI training are still evolving. In this study, we perform a comprehensive survey of metrics used to verify data readiness for AI training. This survey examines more than 140 papers published by ACM Digital Library, IEEE Xplore, journals such as\n            <jats:italic>Nature, Springer,<\/jats:italic>\n            and\n            <jats:italic>Science Direct,<\/jats:italic>\n            and online articles published by prominent AI experts. 
This survey aims to propose a taxonomy of data readiness for AI (DRAI) metrics for structured and unstructured datasets. We anticipate that this taxonomy will lead to new standards for DRAI metrics that would be used for enhancing the quality, accuracy, and fairness of AI training and inference.\n          <\/jats:p>","DOI":"10.1145\/3722214","type":"journal-article","created":{"date-parts":[[2025,3,7]],"date-time":"2025-03-07T11:24:51Z","timestamp":1741346691000},"page":"1-39","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":23,"title":["Data Readiness for AI: A 360-Degree Survey"],"prefix":"10.1145","volume":"57","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-8516-0215","authenticated-orcid":false,"given":"Kaveen","family":"Hiniduma","sequence":"first","affiliation":[{"name":"Computer Science and Engineering, The Ohio State University, Columbus, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3048-3448","authenticated-orcid":false,"given":"Suren","family":"Byna","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, The Ohio State University, Columbus, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3915-1135","authenticated-orcid":false,"given":"Jean Luca","family":"Bez","sequence":"additional","affiliation":[{"name":"Lawrence Berkeley National Laboratory, Berkeley, United States"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,4,4]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"OPTICOM GmbH. 2008. PEVQ\u2014The standard for perceptual evaluation of video quality. Retrieved from http:\/\/www.pevq.com\/pevq.html"},{"key":"e_1_3_1_3_2","unstructured":"GO FAIR Initiative. 2024. FAIR Principles. 
Retrieved from https:\/\/www.go-fair.org\/fair-principles\/"},{"key":"e_1_3_1_4_2","unstructured":"Aindo. n.d. Privacy Score. Retrieved from https:\/\/docs.aindo.com\/evaluation\/privacy\/"},{"key":"e_1_3_1_5_2","unstructured":"International Telecommunication Union (ITU). n.d. BS.1387: Method for objective measurements of perceived audio quality. Retrieved from https:\/\/www.itu.int\/rec\/R-REC-BS.1387\/en"},{"key":"e_1_3_1_6_2","unstructured":"Readable. n.d. The Gunning Fog Index. Retrieved from https:\/\/readable.com\/readability\/gunning-fog-index\/"},{"key":"e_1_3_1_7_2","unstructured":"Sarnoff Corporation. n.d. JNDmetrix Technology. Retrieved from http:\/\/www.sarnoff.com\/products_services\/video_vision\/jndmetrix\/"},{"key":"e_1_3_1_8_2","unstructured":"Kaggle. n.d. Kaggle platform. Retrieved from https:\/\/www.kaggle.com\/"},{"key":"e_1_3_1_9_2","unstructured":"MathWorks. n.d. PSNR (Peak Signal-to-Noise Ratio). Retrieved from https:\/\/www.mathworks.com\/help\/vision\/ref\/psnr.html"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jspi.2018.07.005"},{"key":"e_1_3_1_11_2","volume-title":"IEEE International Conference on Smart Data Services (SMDS\u201920)","author":"Afzal S.","year":"2020","unstructured":"S. Afzal, C. Rajmohan, M. Kesarwani, S. Mehta, and H. Patel. 2020. Data readiness report. In IEEE International Conference on Smart Data Services (SMDS\u201920)."},{"key":"e_1_3_1_12_2","unstructured":"Telm AI. 2023. Demystifying data quality\u2019s impact on large language models. Retrieved from https:\/\/www.telm.ai\/blog\/demystifying-data-qualitys-impact-on-large-language-models\/"},{"key":"e_1_3_1_13_2","volume-title":"Learning from Imbalanced Data Sets","author":"Alberto Francisco","year":"2018","unstructured":"Francisco Alberto, Salvador Garc\u00eda, Mikel Galar, Ronaldo Prati, Bartosz Krawczyk, and Francisco Herrera. 2018. Learning from Imbalanced Data Sets. 
Springer."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jocm.2018.07.002"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CSCI51800.2020.00249"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3552433"},{"key":"e_1_3_1_17_2","volume-title":"International Conference on Advances in Neural Information Processing Systems (NIPS\u201917)","author":"Bachem Olivier","year":"2017","unstructured":"Olivier Bachem, Mario Lucic, and Andreas Krause. 2017. Practical coreset constructions for machine learning. In International Conference on Advances in Neural Information Processing Systems (NIPS\u201917)."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-33173-5"},{"key":"e_1_3_1_19_2","first-page":"115","article-title":"A perceptual speech quality measure based on a psychoacoustic sound representation","volume":"42","author":"Beerends J.","year":"1994","unstructured":"J. Beerends and J. Stemerdink. 1994. A perceptual speech quality measure based on a psychoacoustic sound representation. J. Audio Eng. Soc. 42 (Dec.1994), 115\u2013123.","journal-title":"J. Audio Eng. Soc."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/SECCOM.2007.4550303"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11837-016-2001-3"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/1891879.1891881"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.5555\/944919.944937"},{"key":"e_1_3_1_24_2","unstructured":"Netflix Technology Blog. 2017. Toward a practical perceptual video quality metric. 
Retrieved from https:\/\/netflixtechblog.com\/toward-a-practical-perceptual-video-quality-metric-653f208b9652"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3190578"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10044-016-0583-6"},{"key":"e_1_3_1_27_2","first-page":"93","volume-title":"ACM SIGMOD International Conference on Management of Data","author":"Breunig Markus M.","year":"2000","unstructured":"Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and J\u00f6rg Sander. 2000. LOF: Identifying density-based local outliers. In ACM SIGMOD International Conference on Management of Data. 93\u2013104."},{"key":"e_1_3_1_28_2","unstructured":"Nicholas Carlini Matthew Jagielski Chiyuan Zhang Nicolas Papernot Andreas Terzis and Florian Tramer. 2022. The privacy onion effect: Memorization is relative. arxiv:2206.10469 [cs.LG]"},{"key":"e_1_3_1_29_2","unstructured":"L. Elisa Celis Vijay Keswani and Nisheeth K. Vishnoi. 2020. Data preprocessing to mitigate bias: A maximum entropy based approach. arxiv:1906.02164 [cs.LG]"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2007.901820"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cels.2019.09.011"},{"key":"e_1_3_1_32_2","unstructured":"Cleanlab. 2024. Elevating data quality: The crucial role of proper data annotation. Retrieved from https:\/\/cleanlab.ai\/blog\/learn\/data-annotation\/"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1177\/001316446002000104"},{"key":"e_1_3_1_34_2","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1037\/h0076540","article-title":"A computer readability formula designed for machine scoring.","volume":"60","author":"Coleman Meri","year":"1975","unstructured":"Meri Coleman and Ta Lin Liau. 1975. A computer readability formula designed for machine scoring. J. Appl. Psychol. 60 (1975), 283\u2013284.","journal-title":"J. Appl. 
Psychol."},{"key":"e_1_3_1_35_2","volume-title":"Residuals and Influence in Regression","author":"Cook R. Dennis","year":"1982","unstructured":"R. Dennis Cook and Sanford Weisberg. 1982. Residuals and Influence in Regression. Chapman & Hall."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2006.132"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611970104"},{"key":"e_1_3_1_38_2","volume-title":"Statistics and Data Analysis in Geology","author":"Davis John C.","year":"1986","unstructured":"John C. Davis and Robert J. Sampson. 1986. Statistics and Data Analysis in Geology. Vol. 646. Wiley, New York."},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2021.3131402"},{"key":"e_1_3_1_40_2","unstructured":"IBM Developer. 2021. IBM Data Quality AI Toolkit. Retrieved from https:\/\/developer.ibm.com\/learningpaths\/data-quality-ai-toolkit\/overview\/"},{"key":"e_1_3_1_41_2","unstructured":"Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. Retrieved from http:\/\/archive.ics.uci.edu\/ml"},{"key":"e_1_3_1_42_2","volume-title":"Pattern Classification","author":"Duda Richard O.","year":"2012","unstructured":"Richard O. Duda, Peter E. Hart, and David G. Stork. 2012. Pattern Classification. John Wiley & Sons."},{"key":"e_1_3_1_43_2","unstructured":"Vasisht Duddu Sebastian Szyller and N. Asokan. 2022. SHAPr: An efficient and versatile membership privacy risk metric for machine learning. arxiv:2112.02230 [cs.CR]"},{"key":"e_1_3_1_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2007.250581"},{"key":"e_1_3_1_45_2","unstructured":"Nitin Gupta Hima Patel Subhro Choudhury Arun Iyer and Diptikalyan Saha. 2021. Data quality toolkit: Automatic assessment of data quality and remediation for machine learning datasets. arXiv preprint arXiv:2108.05935 [cs.LG]. 
Retrieved from https:\/\/arxiv.org\/abs\/2108.05935"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","unstructured":"Nikil Ravi Pranshu Chaturvedi E. A. Huerta Zhengchun Liu Ryan Chard Aristana Scourtas K. J. Schmidt and Kyle Chard. 2022. FAIR principles for AI models with a practical application for accelerated high energy diffraction microscopy. Scientific Data 9 1 (Nov. 2022) Article 712. DOI:10.1038\/s41597-022-01712-9","DOI":"10.1038\/s41597-022-01712-9"},{"key":"e_1_3_1_47_2","unstructured":"Rachel K. E. Bellamy Kuntal Dey Michael Hind Samuel C. Hoffman Stephanie Houde Kalapriya Kannan Pranay Lohia Jacquelyn Martino Sameep Mehta Aleksandra Mojsilovic et\u00a0al. 2018. AI Fairness 360: An extensible toolkit for detecting understanding and mitigating unwanted algorithmic bias. arXiv preprint arXiv:1810.01943 [cs.AI]. Retrieved from https:\/\/arxiv.org\/abs\/1810.01943"},{"key":"e_1_3_1_48_2","unstructured":"Shreyas Cholia Charuleka Varadharajan and Deb Agarwal. 2024. ESS-DIVE overview: A scalable user-focused repository for earth and environmental science data. Scientific Data Division Lawrence Berkeley National Laboratory Berkeley CA. Retrieved from https:\/\/ess-dive.lbl.gov\/"},{"key":"e_1_3_1_49_2","unstructured":"Yang Lu Yiu-Ming Cheung and Yuan Yan Tang. 2019. Bayes imbalance impact index: A measure of class imbalanced dataset for classification problem. arXiv preprint arXiv:1901.10173 [cs.LG]. Retrieved from https:\/\/arxiv.org\/abs\/1901.10173"},{"key":"e_1_3_1_50_2","volume-title":"Regulation (EU) 2016\/679 of the European Parliament and of the Council","author":"Parliament European","year":"2016","unstructured":"European Parliament and Council of the European Union. 2016. Regulation (EU) 2016\/679 of the European Parliament and of the Council. Retrieved from https:\/\/data.europa.eu\/eli\/reg\/2016\/679\/oj"},{"key":"e_1_3_1_51_2","unstructured":"FAIRassist.org. n.d.. FAIRassist.Org. 
Retrieved from https:\/\/fairassist.org"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","unstructured":"Michael Feldman Sorelle A. Friedler John Moeller Carlos Scheidegger and Suresh Venkatasubramanian. 2015. Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD\u201915). Association for Computing Machinery New York NY USA 259\u2013268. DOI:10.1145\/2783258.2783311","DOI":"10.1145\/2783258.2783311"},{"key":"e_1_3_1_53_2","volume-title":"The Art of Readable Writing","author":"Flesch Rudolf","year":"1986","unstructured":"Rudolf Flesch. 1986. The Art of Readable Writing. MacMillan."},{"key":"e_1_3_1_54_2","first-page":"17","article-title":"An extensive empirical study of feature selection metrics for text classification","volume":"3","author":"Forman George","year":"2003","unstructured":"George Forman. 2003. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3 (Mar.2003), 17 pages.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_1_55_2","unstructured":"Amirata Ghorbani and James Zou. 2019. Data Shapley: Equitable valuation of data for machine learning. arxiv:1904.02868 [stat.ML]"},{"key":"e_1_3_1_56_2","article-title":"Variability and mutability: Contribution to the study of statistical distribution and relations","author":"Gini C.","year":"1912","unstructured":"C. Gini. 1912. Variability and mutability: Contribution to the study of statistical distribution and relations. Studi Economico-Giuridici della R (1912).","journal-title":"Studi Economico-Giuridici della R"},{"key":"e_1_3_1_57_2","first-page":"235","volume-title":"FLAIRS","author":"Hall Mark A.","year":"1999","unstructured":"Mark A. Hall and Lloyd A. Smith. 1999. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper. In FLAIRS. 
235\u2013239."},{"key":"e_1_3_1_58_2","volume-title":"Neural Networks and Learning Machines (3rd ed.)","author":"Haykin Simon S.","year":"2009","unstructured":"Simon S. Haykin. 2009. Neural Networks and Learning Machines (3rd ed.). Pearson Education, Upper Saddle River, NJ."},{"key":"e_1_3_1_59_2","first-page":"507","volume-title":"International Conference on Advances in Neural Information Processing Systems (NIPS\u201905)","author":"He Xiaofei","year":"2005","unstructured":"Xiaofei He, Deng Cai, and Partha Niyogi. 2005. Laplacian score for feature selection. In International Conference on Advances in Neural Information Processing Systems (NIPS\u201905). 507\u2013514."},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.dss.2015.02.009"},{"key":"e_1_3_1_61_2","unstructured":"Martin Heusel Hubert Ramsauer Thomas Unterthiner Bernhard Nessler and Sepp Hochreiter. 2018. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. arxiv:1706.08500 [cs.LG]"},{"key":"e_1_3_1_62_2","doi-asserted-by":"crossref","unstructured":"Kaveen Hiniduma Suren Byna and Jean Luca Bez. 2024. AI Data Readiness Inspector (AIDRIN) for Quantitative Assessment of Data Readiness for AI. In Proceedings of the 36th International Conference on Scientific and Statistical Database Management (SSDBM\u201924). Association for Computing Machinery New York NY USA.","DOI":"10.1145\/3676288.3676296"},{"key":"e_1_3_1_63_2","unstructured":"Sarah Holland Ahmed Hosny Sarah Newman Joshua Joseph and Kasia Chmielinski. 2018. The dataset nutrition label: A framework to drive higher data quality standards. arxiv:arXiv:1805.03677 [cs.DB]"},{"issue":"13","key":"e_1_3_1_64_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1049\/el:20080522","article-title":"Scope of validity of PSNR in image\/video quality assessment","volume":"44","author":"Huynh-Thu Q.","year":"2008","unstructured":"Q. Huynh-Thu and M. Ghanbari. 2008. 
Scope of validity of PSNR in image\/video quality assessment. Electron. Lett. 44, 13 (June 192008), 1\u20132.","journal-title":"Electron. Lett."},{"key":"e_1_3_1_65_2","unstructured":"Helen Hwang. 2022. New AI readiness report reveals insights into ML lifecycle. Retrieved from https:\/\/www.datacenterknowledge.com\/machine-learning\/new-ai-readiness-report-reveals-insights-ml-lifecycle"},{"key":"e_1_3_1_66_2","unstructured":"Espire Infolabs. 2024. Outlier detection redefined: A deep dive into AI\u2019s impact: Espire blog. Retrieved from https:\/\/www.espire.com\/blog\/posts\/outlier-detection-redefined-a-deep-dive-into-ai-impact"},{"key":"e_1_3_1_67_2","unstructured":"Informatica. 2024. Data Quality Metrics & Measures\u2014All You Need to Know. https:\/\/www.informatica.com\/resources\/articles\/data-quality-metrics-and-measures.html"},{"key":"e_1_3_1_68_2","volume-title":"ITU-T Recommendation P.808: Subjective Evaluation of Speech Quality with a Crowdsourcing Approach","author":"Union International Telecommunication","year":"2018","unstructured":"International Telecommunication Union. 2018. ITU-T Recommendation P.808: Subjective Evaluation of Speech Quality with a Crowdsourcing Approach. Technical Report. International Telecommunication Union, Geneva."},{"key":"e_1_3_1_69_2","volume-title":"6th International Congress on Acoustics.","author":"Itakura F.","year":"1968","unstructured":"F. Itakura and S. Saito. 1968. Analysis synthesis telephony based on the maximum likelihood method. In 6th International Congress on Acoustics."},{"key":"e_1_3_1_70_2","volume-title":"Unimatch: A Record Linkage System: User\u2019s Manual","author":"Jaro M. A.","year":"1976","unstructured":"M. A. Jaro. 1976. Unimatch: A Record Linkage System: User\u2019s Manual. Technical Report. US Bureau of the Census, Washington, D.C."},{"key":"e_1_3_1_71_2","volume-title":"Digital Coding of Waveforms: Principles and Applications to Speech and Video","author":"Jayant N. 
C.","year":"1984","unstructured":"N. C. Jayant and P. Noll. 1984. Digital Coding of Waveforms: Principles and Applications to Speech and Video. Prentice Hall, NJ, USA."},{"key":"e_1_3_1_72_2","unstructured":"Matthew B. Jones and Peter Slaughter. 2019. Retrieved from https:\/\/www.dataone.org\/uploads\/dataonewebinar_jonesslaughter_fairmetadata_190514.pdf"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1002\/sam.11583"},{"key":"e_1_3_1_74_2","doi-asserted-by":"publisher","DOI":"10.3389\/fdata.2021.693674"},{"key":"e_1_3_1_75_2","unstructured":"M. Kaiser Mathias Klier and Bernd Heinrich. 1970. How to measure data quality? A metric-based approach. Retrieved from https:\/\/www.semanticscholar.org\/paper\/How-to-Measure-Data-Quality-A-Metric-Based-Approach-Kaiser-Klier\/afcdf53c5a88f3320c861ad3f09f28237b6744cb"},{"key":"e_1_3_1_76_2","unstructured":"Sergey Kastryulin Dzhamil Zakirov and Denis Prokopenko. 2019. PyTorch image quality: Metrics and measure for image quality assessment. Retrieved from https:\/\/github.com\/photosynthesis-team\/piq"},{"key":"e_1_3_1_77_2","unstructured":"Martin Kemka. 2019. Learning Amazon Sagemaker. Retrieved from https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/clarify-data-bias-metric-cddl.html"},{"key":"e_1_3_1_78_2","volume-title":"International Conference on Machine Learning (ICML\u201917)","author":"Koh Pang Wei","year":"2017","unstructured":"Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In International Conference on Machine Learning (ICML\u201917)."},{"key":"e_1_3_1_79_2","doi-asserted-by":"publisher","DOI":"10.1148\/ryai.2019190177"},{"key":"e_1_3_1_80_2","volume-title":"NEURIPS Workshop for Data Centric AI","author":"Lavitas Liliya","year":"2021","unstructured":"Liliya Lavitas, Olivia Redfield, Allen Lee, Daniel Fletcher, Matthias Eck, and Sunil Janardhanan. 2021. Annotation quality framework\u2014Accuracy, credibility, and consistency. 
In NEURIPS Workshop for Data Centric AI."},{"issue":"4","key":"e_1_3_1_81_2","first-page":"845","article-title":"Binary codes capable of correcting deletions, insertions and reversals","volume":"163","author":"Levenshtein V. I.","year":"1965","unstructured":"V. I. Levenshtein. 1965. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR 163, 4 (1965), 845\u2013848.","journal-title":"Doklady Akademii Nauk SSSR"},{"key":"e_1_3_1_82_2","doi-asserted-by":"publisher","DOI":"10.3115\/1075527.1075574"},{"key":"e_1_3_1_83_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jesp.2013.03.013"},{"key":"e_1_3_1_84_2","doi-asserted-by":"publisher","DOI":"10.1145\/3136625"},{"key":"e_1_3_1_85_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00009"},{"key":"e_1_3_1_86_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2005.848345"},{"key":"e_1_3_1_87_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jvcir.2011.01.005"},{"key":"e_1_3_1_88_2","first-page":"388","volume-title":"International Conference on Tools with Artificial Intelligence (ICTAI\u201995)","author":"Liu Huan","year":"1995","unstructured":"Huan Liu and Rudy Setiono. 1995. Chi2: Feature selection and discretization of numeric attributes. In International Conference on Tools with Artificial Intelligence (ICTAI\u201995). 388\u2013391."},{"key":"e_1_3_1_89_2","doi-asserted-by":"publisher","unstructured":"Sijia Liu Parikshit Ram Deepak Vijaykeerthy Djallel Bouneffouf Gregory Bramble Horst Samulowitz Dakuo Wang Andrew Conn and Alexander Gray. 2020. An ADMM-based framework for autoML pipeline configuration. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI\u201920). AAAI Press 4892\u20134899. DOI:10.1609\/aaai.v34i04.5926","DOI":"10.1609\/aaai.v34i04.5926"},{"key":"e_1_3_1_90_2","unstructured":"Luc Longpr\u00e9 Vladik Kreinovich and Thongchai Dumrongpokaphan. 2017. Entropy as a measure of average loss of privacy. 
Thai Journal of Mathematics 15 Special Issue (2017) 7\u201315."},{"key":"e_1_3_1_91_2","doi-asserted-by":"publisher","DOI":"10.1147\/rd.14.0309"},{"key":"e_1_3_1_92_2","unstructured":"Chris Markham. 2024. How AI can uncover data outliers and patterns in patient behavior. Retrieved from https:\/\/www.saama.com\/how-ai-can-uncover-data-outliers-and-patterns-in-patient-behavior\/"},{"key":"e_1_3_1_93_2","volume-title":"International Conference on Image Processing","author":"Marziliano Pina","year":"2002","unstructured":"Pina Marziliano, Frederic Dufaux, Stefan Winkler, and Touradj Ebrahimi. 2002. A no-reference perceptual blur metric. In International Conference on Image Processing."},{"key":"e_1_3_1_94_2","volume-title":"An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD)","author":"McCarthy Philip M.","year":"2005","unstructured":"Philip M. McCarthy. 2005. An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD). Ph. D. Dissertation. The University of Memphis."},{"key":"e_1_3_1_95_2","doi-asserted-by":"publisher","DOI":"10.3758\/BRM.42.2.381"},{"key":"e_1_3_1_96_2","doi-asserted-by":"publisher","unstructured":"David Mimno Hanna M. Wallach Edmund Talley Miriam Leenders and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP\u201911). Association for Computational Linguistics 262\u2013272. DOI:10.5555\/2145432.2145462","DOI":"10.5555\/2145432.2145462"},{"key":"e_1_3_1_97_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2012.2214050"},{"key":"e_1_3_1_98_2","first-page":"267","volume-title":"2nd International Conference on Knowledge Discovery and Data Mining (KDD\u201996)","author":"Monge A. E.","year":"1996","unstructured":"A. E. Monge and C. P. Elkan. 1996. 
The field matching problem: Algorithms and applications. In 2nd International Conference on Knowledge Discovery and Data Mining (KDD\u201996). 267\u2013270."},{"key":"e_1_3_1_99_2","doi-asserted-by":"publisher","unstructured":"David Newman Jey Han Lau Karl Grieser and Timothy Baldwin. 2010. Evaluating Topic Models for Digital Libraries. In Proceedings of the 10th Annual Joint Conference on Digital Libraries (JCDL\u201910). Association for Computing Machinery New York NY USA 215\u2013224. DOI:10.1145\/1816123.1816156","DOI":"10.1145\/1816123.1816156"},{"key":"e_1_3_1_100_2","volume-title":"AAAI Conference on Artificial Intelligence","author":"Nie Feiping","year":"2008","unstructured":"Feiping Nie, Shiming Xiang, Yangqing Jia, Changshui Zhang, and Shuicheng Yan. 2008. Trace ratio criterion for feature selection. In AAAI Conference on Artificial Intelligence."},{"key":"e_1_3_1_101_2","doi-asserted-by":"publisher","unstructured":"E. Ntoutsi P. Fafalios I. Gkatzia V. Tsoumakas I. Vlahavas and G. Mentzas. 2020. Bias in data-driven artificial intelligence systems \u2013 An introductory survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10 3 (2020) e1356. DOI:10.1002\/widm.1356","DOI":"10.1002\/widm.1356"},{"key":"e_1_3_1_102_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.compbiomed.2010.12.006"},{"key":"e_1_3_1_103_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2017.08.002"},{"key":"e_1_3_1_104_2","doi-asserted-by":"publisher","unstructured":"Orestis Papakyriakopoulos Simon Hegelich Juan Carlos Medina Serrano and Fabienne Marco. 2020. Bias in word embeddings. In Proceedings of the 2020 Conference on Fairness Accountability and Transparency (FAT\u201920)*. Association for Computing Machinery New York NY USA 446\u2013457. DOI:10.1145\/3351095.3372843","DOI":"10.1145\/3351095.3372843"},{"key":"e_1_3_1_105_2","doi-asserted-by":"publisher","unstructured":"Ronald K. Pearson. 2006. The problem of disguised missing data. SIGKDD Explor. 
Newsl. 8 1 (June 2006) 83\u201392. 10.1145\/1147234.1147247","DOI":"10.1145\/1147234.1147247"},{"key":"e_1_3_1_106_2","doi-asserted-by":"publisher","DOI":"10.1145\/505248.506010"},{"key":"e_1_3_1_107_2","first-page":"504","volume-title":"IEEE Symposium on Computer Intelligence and Data Mining","author":"Pokrajac Dragoljub","year":"2007","unstructured":"Dragoljub Pokrajac, Aleksandar Lazarevic, and Longin Jan Latecki. 2007. Incremental local outlier detection for data streams. In IEEE Symposium on Computer Intelligence and Data Mining. 504\u2013515."},{"key":"e_1_3_1_108_2","doi-asserted-by":"publisher","unstructured":"Maria Priestley Fionnt\u00e1n O\u2019donnell and Elena Simperl. 2023. A Survey of Data Quality Requirements That Matter in ML Development Pipelines. J. Data and Information Quality 15 2 Article 11 (June 2023) 39 pages. 10.1145\/3592616","DOI":"10.1145\/3592616"},{"key":"e_1_3_1_109_2","doi-asserted-by":"publisher","unstructured":"Shahzad Qaiser and Ramsha Ali. 2018. Text mining: Use of TF-IDF to examine the relevance of words to documents. International Journal of Computer Applications 181 1 (Jul 2018) 25\u201329. DOI:10.5120\/ijca2018917395","DOI":"10.5120\/ijca2018917395"},{"key":"e_1_3_1_110_2","doi-asserted-by":"publisher","DOI":"10.1186\/s12859-023-05156-9"},{"key":"e_1_3_1_111_2","unstructured":"Juan Ramos. 2003. Using TF-IDF to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning. Citeseer 29\u201348."},{"key":"e_1_3_1_112_2","doi-asserted-by":"crossref","unstructured":"Pedro Reviriego Javier Conde Elena Merino-G\u00f3mez Gonzalo Mart\u00ednez and Jos\u00e9 Alberto Hern\u00e1ndez. 2023. Playing with words: Comparing the vocabulary and lexical richness of ChatGPT and humans. 
arxiv:2308.07462 [cs.CL]","DOI":"10.1016\/j.mlwa.2024.100602"},{"key":"e_1_3_1_113_2","volume-title":"22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","author":"Ribeiro Marco Tulio","year":"2016","unstructured":"Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. \u201cWhy should I trust you?\u201d Explaining the predictions of any classifier. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining."},{"key":"e_1_3_1_114_2","doi-asserted-by":"publisher","unstructured":"Antony W. Rix John G. Beerends Michael P. Hollier and Andries P. Hekstra. 2001. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP\u201901). IEEE Vol. 2. 749\u2013752. DOI:10.1109\/ICASSP.2001.941023","DOI":"10.1109\/ICASSP.2001.941023"},{"key":"e_1_3_1_115_2","doi-asserted-by":"publisher","unstructured":"Marko Robnik-Sikonja and Igor Kononenko. 2003. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning 53 1\/2 (October 2003) 23\u201369. DOI:10.1023\/A:1025667309714","DOI":"10.1023\/A:1025667309714"},{"key":"e_1_3_1_116_2","doi-asserted-by":"publisher","unstructured":"Philippe Rocca-Serra Wei Gu Vassilios Ioannidis Tooba Abbassi-Daloii Salvador Capella-Gutierrez Ishwar Chandramouliswaran Andrea Splendiani Tony Burdett Robert T. Giessmann David Henderson et\u00a0al. 2023. The FAIR Cookbook\u2014The essential resource for and by FAIR doers. Scientific Data 10 (2023). 
DOI:10.1038\/s41597-023-02166-3","DOI":"10.1038\/s41597-023-02166-3"},{"key":"e_1_3_1_117_2","doi-asserted-by":"publisher","DOI":"10.1145\/2684822.2685324"},{"key":"e_1_3_1_118_2","doi-asserted-by":"publisher","DOI":"10.1080\/00401706.1983.10487848"},{"issue":"2","key":"e_1_3_1_119_2","doi-asserted-by":"crossref","first-page":"e1236","DOI":"10.1002\/widm.1236","article-title":"Anomaly detection by robust statistics","volume":"8","author":"Rousseeuw Peter J.","year":"2018","unstructured":"Peter J. Rousseeuw and Mia Hubert. 2018. Anomaly detection by robust statistics. WIREs Data Min. Knowl. Discov. 8, 2 (Mar.2018), e1236.","journal-title":"WIREs Data Min. Knowl. Discov."},{"key":"e_1_3_1_120_2","unstructured":"R. C. Russell. 1922. Index. Retrieved from http:\/\/patft.uspto.gov\/netahtml\/srchnum.htm"},{"key":"e_1_3_1_121_2","doi-asserted-by":"publisher","DOI":"10.1148\/ryai.2019190015"},{"key":"e_1_3_1_122_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2020.05.032"},{"key":"e_1_3_1_123_2","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3229867"},{"key":"e_1_3_1_124_2","unstructured":"Ron Schmelzer. 2019. The Achilles\u2019 heel of AI. Retrieved from https:\/\/www.forbes.com\/sites\/cognitiveworld\/2019\/03\/07\/the-achilles-heel-of-ai\/?sh=20e53e4d7be7"},{"key":"e_1_3_1_125_2","doi-asserted-by":"publisher","unstructured":"Nima Shahbazi Yin Lin Abolfazl Asudeh and H. V. Jagadish. 2023. Representation Bias in data: A survey on identification and resolution techniques. ACM Comput. Surv. 55 13s Article 293 (December 2023) 39 pages. 10.1145\/3588433","DOI":"10.1145\/3588433"},{"key":"e_1_3_1_126_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2005.859378"},{"key":"e_1_3_1_127_2","doi-asserted-by":"publisher","DOI":"10.1109\/BigData50022.2020.9378296"},{"key":"e_1_3_1_128_2","doi-asserted-by":"publisher","unstructured":"Fatimah Sidi Noraini Ibrahim and Aziz Deraman. 2012. Data Quality: A Survey of Data Quality Dimensions. 
In Proceedings of the 2012 International Conference on Information Retrieval & Knowledge Management (InfrKM). IEEE Kuala Lumpur Malaysia 300\u2013304. DOI:10.1109\/InfRKM.2012.6204995","DOI":"10.1109\/InfRKM.2012.6204995"},{"key":"e_1_3_1_129_2","unstructured":"Simha. 2021. Understanding TF-IDF for machine learning. Retrieved from https:\/\/www.capitalone.com\/tech\/machine-learning\/understanding-tf-idf\/"},{"key":"e_1_3_1_130_2","unstructured":"Alessandro Simonetta Andrea Trenta Maria Cristina Paoletti and Antonio Vetr\u00f2. 2021. Metrics for Identifying Bias in Datasets. In Proceedings of the International Conference of Yearly Reports on Informatics Mathematics and Engineering (ICYRIME 2021) July 9 2021 Online 10-17. CEUR Workshop Proceedings Vol. 3118."},{"key":"e_1_3_1_131_2","doi-asserted-by":"publisher","DOI":"10.1038\/163688a0"},{"key":"e_1_3_1_132_2","first-page":"2615","volume-title":"30th USENIX Security Symposium (USENIX Security\u201921)","author":"Song Liwei","year":"2021","unstructured":"Liwei Song and Prateek Mittal. 2021. Systematic evaluation of privacy risks of machine learning models. In 30th USENIX Security Symposium (USENIX Security\u201921). USENIX Association, 2615\u20132632. Retrieved from https:\/\/www.usenix.org\/conference\/usenixsecurity21\/presentation\/song"},{"key":"e_1_3_1_133_2","doi-asserted-by":"publisher","DOI":"10.1108\/eb026526"},{"key":"e_1_3_1_134_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2010.5495701"},{"key":"e_1_3_1_135_2","doi-asserted-by":"publisher","DOI":"10.5749\/j.ctttv2st"},{"key":"e_1_3_1_136_2","doi-asserted-by":"publisher","DOI":"10.1109\/TECHPOS.2009.5412098"},{"key":"e_1_3_1_137_2","doi-asserted-by":"publisher","unstructured":"Dinusha Vatsalan Khin Than Win and Blanca Rodr\u00edguez-Briones. 2022. Privacy risk quantification in education data using Markov model. British Journal of Educational Technology 53 4 (2022) 804\u2013821. 
DOI:10.1111\/bjet.13223","DOI":"10.1111\/bjet.13223"},{"key":"e_1_3_1_138_2","unstructured":"Tuan L. Vo Thu Nguyen Hugo L. Hammer Michael A. Riegler and Pal Halvorsen. 2024. Explainability of machine learning models under missing data. arxiv:2407.00411 [cs.LG]"},{"key":"e_1_3_1_139_2","doi-asserted-by":"publisher","DOI":"10.1145\/3168389"},{"key":"e_1_3_1_140_2","unstructured":"Jiachen T. Wang and Ruoxi Jia. 2023. Data Banzhaf: A robust data valuation framework for machine learning. arxiv:2205.15466 [cs.LG]"},{"key":"e_1_3_1_141_2","doi-asserted-by":"publisher","DOI":"10.1109\/97.995823"},{"key":"e_1_3_1_142_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2003.819861"},{"key":"e_1_3_1_143_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACSSC.2003.1292216"},{"issue":"4","key":"e_1_3_1_144_2","doi-asserted-by":"crossref","first-page":"367","DOI":"10.1016\/0001-8708(76)90202-4","article-title":"Some biological sequence metrics","volume":"20","author":"Waterman M. S.","year":"1976","unstructured":"M. S. Waterman, T. F. Smith, and W. A. Beyer. 1976. Some biological sequence metrics. Advan. Math. 20, 4 (1976), 367\u2013387.","journal-title":"Advan. Math."},{"key":"e_1_3_1_145_2","doi-asserted-by":"crossref","unstructured":"Mark D. Wilkinson Susanna-Assunta Sansone Erik Schultes Peter Doorn Luiz Olavo Bonino da Silva Santos and Michel Dumontier. 2018. A design framework and exemplar metrics for fairness. Retrieved from https:\/\/www.nature.com\/articles\/sdata2018118","DOI":"10.1101\/225490"},{"key":"e_1_3_1_146_2","unstructured":"Alex Woodie. 2020. Data prep still dominates data scientists\u2019 time survey finds. 
Retrieved from https:\/\/www.datanami.com\/2020\/07\/06\/data-prep-still-dominates-data-scientists-time-survey-finds\/"},{"key":"e_1_3_1_147_2","doi-asserted-by":"publisher","DOI":"10.1109\/tip.2013.2293423"},{"key":"e_1_3_1_148_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICISAT54145.2021.9678209"},{"key":"e_1_3_1_149_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2011.2109730"},{"key":"e_1_3_1_150_2","doi-asserted-by":"crossref","first-page":"1151","DOI":"10.1145\/1273496.1273641","volume-title":"International Conference on Machine Learning (ICML\u201907)","author":"Zhao Zheng","year":"2007","unstructured":"Zheng Zhao and Huan Liu. 2007. Spectral feature selection for supervised and unsupervised learning. In International Conference on Machine Learning (ICML\u201907). 1151\u20131157."},{"key":"e_1_3_1_151_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patrec.2018.09.012"}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3722214","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3722214","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T18:43:51Z","timestamp":1750272231000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3722214"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,4]]},"references-count":150,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,9,30]]}},"alternative-id":["10.1145\/3722214"],"URL":"https:\/\/doi.org\/10.1145\/3722214","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,4,4]]},"assertion":[{"value":"2023-12-13","order":0,"name":"received","label":"Rece
ived","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-02-24","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-04-04","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}