{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T08:38:40Z","timestamp":1774600720051,"version":"3.50.1"},"reference-count":69,"publisher":"China Science Publishing & Media Ltd.","issue":"2","license":[{"start":{"date-parts":[[2024,7,11]],"date-time":"2024-07-11T00:00:00Z","timestamp":1720656000000},"content-version":"vor","delay-in-days":192,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,5,1]]},"abstract":"<jats:title>ABSTRACT<\/jats:title>\n               <jats:p>The rapid evolution of Large Language Models (LLMs) highlights the necessity for ethical considerations and data integrity in AI development, particularly emphasizing the role of FAIR (Findable, Accessible, Interoperable, Reusable) data principles. While these principles are crucial for ethical data stewardship, their specific application in the context of LLM training data remains an under-explored area. This research gap is the focus of our study, which begins with an examination of existing literature to underline the importance of FAIR principles in managing data for LLM training. Building upon this, we propose a novel framework designed to integrate FAIR principles into the LLM development lifecycle. A contribution of our work is the development of a comprehensive checklist intended to guide researchers and developers in applying FAIR data principles consistently across the model development process. The utility and effectiveness of our framework are validated through a case study on creating a FAIR-compliant dataset aimed at detecting and mitigating biases in LLMs. 
We present this framework to the community as a tool to foster the creation of technologically advanced, ethically grounded, and socially responsible AI models.<\/jats:p>","DOI":"10.1162\/dint_a_00255","type":"journal-article","created":{"date-parts":[[2024,7,11]],"date-time":"2024-07-11T20:01:04Z","timestamp":1720728064000},"page":"559-585","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":18,"title":["FAIR Enough: Develop and Assess a FAIR-Compliant Dataset for Large Language Model Training?"],"prefix":"10.3724","volume":"6","author":[{"given":"Shaina","family":"Raza","sequence":"first","affiliation":[{"name":"Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shardul","family":"Ghuge","sequence":"additional","affiliation":[{"name":"Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Chen","family":"Ding","sequence":"additional","affiliation":[{"name":"Toronto Metropolitan University, Toronto, Ontario, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Elham","family":"Dolatabadi","sequence":"additional","affiliation":[{"name":"Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada"},{"name":"York University, Toronto, Ontario, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Deval","family":"Pandya","sequence":"additional","affiliation":[{"name":"Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"2026","published-online":{"date-parts":[[2024,5,1]]},"reference":[{"key":"2024071120004867800_ref1","doi-asserted-by":"crossref","first-page":"423","DOI":"10.1162\/tacl_a_00324","article-title":"How can we know what language models 
know?","volume":"8","author":"Jiang","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2024071120004867800_ref2","volume-title":"A survey of large language models","author":"Zhao","year":"2023"},{"key":"2024071120004867800_ref3","volume-title":"Large Language Model (LLM) Trends","author":"TrendFeedr","year":"2024"},{"key":"2024071120004867800_ref4","doi-asserted-by":"crossref","first-page":"610-623","DOI":"10.1145\/3442188.3445922","article-title":"On the dangers of stochastic parrots: Can language models be too big?","volume-title":"Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency","author":"Bender","year":"2021"},{"key":"2024071120004867800_ref5","volume-title":"Aligning large language models with human: A survey","author":"Wang","year":"2023"},{"key":"2024071120004867800_ref6","volume-title":"A survey on evaluation of large language models","author":"Chang","year":"2023"},{"issue":"2","key":"2024071120004867800_ref7","doi-asserted-by":"crossref","first-page":"177","DOI":"10.2218\/ijdc.v12i2.567","article-title":"Are the fair data principles fair?","volume":"12","author":"Dunning","year":"1970","journal-title":"International Journal of digital curation"},{"issue":"7","key":"2024071120004867800_ref8","doi-asserted-by":"crossref","first-page":"931","DOI":"10.1038\/s41431-018-0160-0","article-title":"The fair guiding principles for data stewardship: fair enough?","volume":"26","author":"Boeckhout","year":"2018","journal-title":"European Journal of Human Genetics"},{"issue":"4","key":"2024071120004867800_ref9","doi-asserted-by":"crossref","first-page":"933","DOI":"10.1016\/j.drudis.2019.01.008","article-title":"Implementation and relevance of FAIR data principles in biopharmaceutical r& d","volume":"24","author":"Wise","year":"2019","journal-title":"Drug Discovery 
Today"},{"key":"2024071120004867800_ref10","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1016\/j.procs.2022.10.179","article-title":"Implementing fair work flows along the research lifecycle","volume":"211","author":"Chen","year":"2022","journal-title":"Procedia Computer Science"},{"key":"2024071120004867800_ref11","doi-asserted-by":"crossref","first-page":"227","DOI":"10.1145\/3514094.3534187","article-title":"Responsible ai systems: Who are the stakeholders?","volume-title":"Proceedings of the 2022 AAAI\/ACM Conference on AI","author":"Deshpande","year":"2022"},{"key":"2024071120004867800_ref12","volume-title":"Home","author":"Ethics","year":"2024"},{"key":"2024071120004867800_ref13","doi-asserted-by":"crossref","first-page":"112965","DOI":"10.1016\/j.marpolbul.2021.112965","article-title":"Data quality and fair principles applied to marine litter data in europe","volume":"168","author":"Partescano","year":"2021","journal-title":"Marine Pollution Bulletin"},{"issue":"1","key":"2024071120004867800_ref14","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/sdata.2016.18","article-title":"The fair guiding principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Scientific Data"},{"key":"2024071120004867800_ref15","first-page":"469","article-title":"Assessing fair data principles against the 5-star open data principles","volume-title":"The Semantic Web: ESWC 2018 Satellite Events: ESWC 2018 Satellite Events, Heraklion, Crete, Greece, June 3-7, 2018, Revised Selected Papers 15","author":"Hasnain","year":"2018"},{"issue":"1-2","key":"2024071120004867800_ref16","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1162\/dint_r_00024","article-title":"FAIR principles: Interpretations and implementation considerations","volume":"2","author":"Jacobsen","year":"2020","journal-title":"Data 
Intelligence"},{"key":"2024071120004867800_ref17","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2021.naacl-main.295","article-title":"Beyond fair pay: Ethical implications of nlp crowdsourcing","volume-title":"North American Chapter of the Association for Computational Linguistics","author":"Shmueli","year":"2021"},{"issue":"1","key":"2024071120004867800_ref18","doi-asserted-by":"crossref","first-page":"7913","DOI":"10.1038\/s41467-023-43713-1","article-title":"Augmenting interpretable models with large language models during training","volume":"14","author":"Singh","year":"2023","journal-title":"Nature Communications"},{"key":"2024071120004867800_ref19","volume-title":"AI and the everything in the whole wide world benchmark","author":"Raji","year":"2021"},{"issue":"9","key":"2024071120004867800_ref20","doi-asserted-by":"crossref","first-page":"389","DOI":"10.1038\/s42256-019-0088-2","article-title":"The global landscape of ai ethics guidelines","volume":"1","author":"Jobin","year":"2019","journal-title":"Nature Machine Intelligence"},{"issue":"164-167","key":"2024071120004867800_ref21","doi-asserted-by":"crossref","DOI":"10.3233\/SHTI230452","article-title":"Desiderata for the data governance and FAIR principles adoption in health data hubs","volume":"305","author":"Alvarez-Romero","year":"2023","journal-title":"Study in Health Technology and Informatics."},{"issue":"2","key":"2024071120004867800_ref22","doi-asserted-by":"crossref","first-page":"22505","DOI":"10.2196\/22505","article-title":"Initiatives, concepts, and implementation practices of FAIR (findable, accessible, interoperable, and reusable) data principles in health data stewardship practice: protocol for a scoping review","volume":"10","author":"Inau","year":"2021","journal-title":"JMIR Research Protocols"},{"key":"2024071120004867800_ref23","first-page":"14","article-title":"Opportunities for improving data sharing and FAIR data practices to advance global mental 
health","volume":"10","author":"Sadeh","year":"2023","journal-title":"Cambridge Prisms: Global Mental Health"},{"key":"2024071120004867800_ref24","article-title":"Data management plan for healthcare: Following FAIR principles and addressing cybersecurity aspects. a systematic review using instructgpt","volume":"2023-04","author":"Stanciu","year":"2023","journal-title":"medRxiv"},{"key":"2024071120004867800_ref25","doi-asserted-by":"crossref","DOI":"10.3389\/fpubh.2023.1214766","article-title":"Challenges in mapping european rare disease databases, relevant for ml-based screening technologies in terms of organizational, fair and legal principles: scoping review","volume":"11","author":"Raycheva","year":"2023","journal-title":"Frontiers in Public Health"},{"issue":"3","key":"2024071120004867800_ref26","doi-asserted-by":"crossref","first-page":"936","DOI":"10.1093\/bib\/bbz044","article-title":"Implementing the FAIR Data Principles in precision oncology: review of supporting initiatives","volume":"21","author":"Vesteghem","year":"2019","journal-title":"Briefings in Bioinformatics"},{"key":"2024071120004867800_ref27","volume-title":"Fair principles for data and ai models in high energy physics research and education","author":"Dungkek","year":"2022"},{"key":"2024071120004867800_ref28","doi-asserted-by":"crossref","first-page":"45013","DOI":"10.2196\/45013","article-title":"Initiatives, concepts, and implementation practices of the findable, accessible, interoperable, and reusable data principles in health data stewardship: Scoping review","volume":"25","author":"Inau","year":"2023","journal-title":"Journal of Medical Internet Research"},{"key":"2024071120004867800_ref29","doi-asserted-by":"crossref","DOI":"10.5772\/intechopen.110248","article-title":"FAIR data model for chemical substances: Development challenges, management strategies, and applications","volume-title":"Data Integrity and Data 
Governance","author":"Jeliazkova","year":"2023"},{"key":"2024071120004867800_ref30","doi-asserted-by":"crossref","first-page":"160018","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR guiding principles for scientific data management and stewardship","volume":"3","author":"Axton","year":"2016","journal-title":"Scientific Data"},{"issue":"1","key":"2024071120004867800_ref31","first-page":"1","article-title":"Supporting FAIR data principles with fedora","volume":"28","author":"Wilcox","year":"2018","journal-title":"LIBER Quarterly: The Journal of the Association of European Research Libraries"},{"issue":"1","key":"2024071120004867800_ref32","doi-asserted-by":"crossref","DOI":"10.1038\/s41597-023-02298-6","article-title":"FAIR for AI: An interdisciplinary and international community building perspective","volume":"10","author":"Huerta","year":"2023","journal-title":"Scientific Data"},{"key":"2024071120004867800_ref33","doi-asserted-by":"crossref","DOI":"10.21203\/rs.3.rs-3092538\/v1","article-title":"A goal-oriented method for fairification planning","volume-title":"CEUR Workshop Proceedings","author":"Bernab\u00e9","year":"2023"},{"key":"2024071120004867800_ref34","volume-title":"Ai fairness: from principles to practice","author":"Bateni","year":"2022"},{"key":"2024071120004867800_ref35","first-page":"192","article-title":"An ecosystem approach to ethical ai and data use: experimental reflections","volume-title":"2020 IEEE\/ITU International Conference on Artificial Intelligence for Good (AI4G)","author":"Findlay","year":"2020"},{"key":"2024071120004867800_ref36","first-page":"11894","volume-title":"Towards a conceptual model for the fair digital object framework","author":"Santos","year":"2023"},{"key":"2024071120004867800_ref37","doi-asserted-by":"crossref","DOI":"10.56367\/OAG-039-10749","article-title":"The fair principles: Trusting in fair data repositories","volume-title":"Open Access 
Government","author":"G\u00f6tz","year":"2023"},{"key":"2024071120004867800_ref38","doi-asserted-by":"crossref","DOI":"10.5206\/EXFO3999","article-title":"The fair principles and research data management","volume-title":"Research Data Management in the Canadian Context","author":"Wang","year":"2023"},{"issue":"1","key":"2024071120004867800_ref39","doi-asserted-by":"crossref","first-page":"37","DOI":"10.3233\/DS-190026","article-title":"Towards fair principles for research software","volume":"3","author":"Lamprecht","year":"2020","journal-title":"Data Science"},{"issue":"1-2","key":"2024071120004867800_ref40","doi-asserted-by":"crossref","first-page":"238","DOI":"10.1162\/dint_a_00046","article-title":"Go FAIR Brazil: a challenge for brazilian data science","volume":"2","author":"Sales","year":"2020","journal-title":"Data Intelligence"},{"key":"2024071120004867800_ref41","first-page":"270","article-title":"FAIR data points supporting big data interoperability","volume-title":"Enterprise Interoperability in the Digitized and Networked Factory of the Future","author":"Silva Santos","year":"2016"},{"key":"2024071120004867800_ref42","doi-asserted-by":"crossref","first-page":"105","DOI":"10.1093\/database\/bax105","article-title":"Fair principles and the iedb: short-term improvements and a long-term vision of obo-foundry mediated machine-actionable interoperability","volume":"2018","author":"Vita","year":"2018","journal-title":"Database"},{"key":"2024071120004867800_ref43","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13326-017-0169-2","article-title":"The extensible ontology development (xod) principles and tool implementation to support ontology interoperability","volume":"9","author":"He","year":"2018","journal-title":"Journal of biomedical semantics"},{"issue":"1","key":"2024071120004867800_ref44","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/sdata.2018.118","article-title":"A design framework and exemplar metrics for 
fairness","volume":"5","author":"Wilkinson","year":"2018","journal-title":"Scientific data"},{"key":"2024071120004867800_ref45","first-page":"23","article-title":"Ready, set, go fair: Accelerating convergence to an internet of fair data and services","volume":"19","author":"Schultes","year":"2018","journal-title":"DAMDID\/RCDL"},{"key":"2024071120004867800_ref46","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1145\/2372251.2372280","article-title":"A study of reusability, complexity, and reuse design principles","volume-title":"Proceedings of the ACM-IEEE International Symposium on Empirical Software Engineering and Measurement","author":"Anguswamy","year":"2012"},{"key":"2024071120004867800_ref47","doi-asserted-by":"crossref","first-page":"444","DOI":"10.1109\/Cluster48925.2021.00053","article-title":"Reusability first: Toward fair work flows","volume-title":"2021 IEEE International Conference on Cluster Computing (CLUSTER)","author":"Wolf","year":"2021"},{"issue":"1","key":"2024071120004867800_ref48","doi-asserted-by":"crossref","first-page":"8591","DOI":"10.1038\/s41598-023-35482-0","article-title":"Constructing a disease database and using natural language processing to capture and standardize free text clinical information","volume":"13","author":"Raza","year":"2023","journal-title":"Scientific Reports"},{"key":"2024071120004867800_ref49","article-title":"Human-in-the-Loop Machine Learning: Active Learning and Annotation for Human-centered AI","volume-title":"Simon and Schuster","author":"Monarch","year":"2021"},{"key":"2024071120004867800_ref50","volume-title":"The rise and potential of large language model based agents: A survey","author":"Xi","year":"2023"},{"issue":"12","key":"2024071120004867800_ref51","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3571730","article-title":"Survey of hallucination in natural language generation","volume":"55","author":"Ji","year":"2023","journal-title":"ACM Computing 
Surveys"},{"key":"2024071120004867800_ref52","article-title":"Dbias: detecting biases and ensuring fairness in news articles","volume":"1-21","author":"Raza","year":"2022","journal-title":"International Journal of Data Science and Analytics"},{"key":"2024071120004867800_ref53","doi-asserted-by":"crossref","DOI":"10.1609\/aaaiss.v1i1.27493","volume-title":"Fairness in machine learning meets with equity in healthcare","author":"Raza","year":"2023"},{"key":"2024071120004867800_ref54","doi-asserted-by":"crossref","first-page":"720","DOI":"10.1145\/3583780.3614949","article-title":"Large language models as zero-shot conversational recommenders","volume-title":"Proceedings of the 32nd ACM International Conference on Information and Knowledge Management","author":"He","year":"2023"},{"issue":"10","key":"2024071120004867800_ref55","doi-asserted-by":"crossref","first-page":"28","DOI":"10.1080\/15265161.2023.2233356","article-title":"Autogen: A personalized large language model for academic enhancement\u2014ethics and proof of principle","volume":"23","author":"Porsdam Mann","year":"2023","journal-title":"The American Journal of Bioethics"},{"issue":"11","key":"2024071120004867800_ref56","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3567592","article-title":"Neural machine translation for low-resource languages: A survey","volume":"55","author":"Ranathunga","year":"2023","journal-title":"ACM Computing Surveys"},{"key":"2024071120004867800_ref57","first-page":"1405","article-title":"Towards efficient post-training quantization of pre-trained language models","volume":"35","author":"Bai","year":"2022","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"3","key":"2024071120004867800_ref58","first-page":"1356","article-title":"Bias in data-driven artificial intelligence systems\u2014an introductory survey","volume":"10","author":"Ntoutsi","year":"2020","journal-title":"Wiley Interdisciplinary Reviews: Data Mining and Knowledge 
Discovery"},{"key":"2024071120004867800_ref59","doi-asserted-by":"crossref","first-page":"121542","DOI":"10.1016\/j.eswa.2023.121542","article-title":"Nbias: A natural language processing framework for bias identification in text","volume":"237","author":"Raza","year":"2024","journal-title":"Expert Systems with Applications"},{"key":"2024071120004867800_ref60","first-page":"2021","article-title":"Stereoset: Measuring stereotypical bias in pretrained language models","volume-title":"ACL-IJCNLPth Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing","author":"Nadeem","year":"2021"},{"key":"2024071120004867800_ref61","volume-title":"Redditbias: A real-world resource for bias evaluation and debiasing of conversational language models","author":"Barikeri","year":"2021"},{"issue":"4","key":"2024071120004867800_ref62","doi-asserted-by":"crossref","first-page":"335","DOI":"10.1007\/s41060-021-00302-z","article-title":"Fake news detection based on news content and social contexts: a transformer-based approach","volume":"13","author":"Raza","year":"2022","journal-title":"International Journal of Data Science and Analytics"},{"key":"2024071120004867800_ref63","first-page":"622","article-title":"On measuring social biases in sentence encoders","volume":"2019","author":"May","year":"2019","journal-title":"NAACL HLT"},{"issue":"30","key":"2024071120004867800_ref64","doi-asserted-by":"crossref","first-page":"2305016120","DOI":"10.1073\/pnas.2305016120","article-title":"Chatgpt outperforms crowd workers for text-annotation tasks","volume":"120","author":"Gilardi","year":"2023","journal-title":"Proceedings of the National Academy of Sciences"},{"key":"2024071120004867800_ref65","volume-title":"Llama 2: Open foundation and fine-tuned chat models","author":"Touvron","year":"2023"},{"key":"2024071120004867800_ref66","volume-title":"Creative Commons Attribution-NonCommercial 4. 
0 International License","author":"Creative Commons","year":"2023"},{"key":"2024071120004867800_ref67","volume-title":"Explainability for large language models: A survey","author":"Zhao","year":"2023"},{"key":"2024071120004867800_ref68","doi-asserted-by":"crossref","first-page":"366","DOI":"10.1145\/3627106.3627196","article-title":"Can large language models provide security & privacy advice?measuring the ability of llms to refute misconceptions","volume-title":"Proceedings of the 39th Annual Computer Security Applications Conference","author":"Chen","year":"2023"},{"key":"2024071120004867800_ref69","doi-asserted-by":"crossref","first-page":"1377","DOI":"10.1162\/tacl_a_00608","article-title":"How abstract is linguistic generalization in large language models? experiments with argument structure","volume":"11","author":"Wilson","year":"2023","journal-title":"Transactions of the Association for Computational Linguistics"}],"container-title":["Data Intelligence"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/dint\/article-pdf\/6\/2\/559\/2458950\/dint_a_00255.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/dint\/article-pdf\/6\/2\/559\/2458950\/dint_a_00255.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,4]],"date-time":"2025-03-04T23:07:28Z","timestamp":1741129648000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/dint\/article\/6\/2\/559\/123375\/FAIR-Enough-Develop-and-Assess-a-FAIR-Compliant"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":69,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,5,1]]}},"URL":"https:\/\/doi.org\/10.1162\/dint_a_00255","relation":{},"ISSN":["2641-435X"],"issn-type":[{"value":"2641-435X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2
024]]},"published":{"date-parts":[[2024]]}}}