{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T18:31:05Z","timestamp":1777055465310,"version":"3.51.4"},"reference-count":56,"publisher":"IOP Publishing","issue":"2","license":[{"start":{"date-parts":[[2025,4,10]],"date-time":"2025-04-10T00:00:00Z","timestamp":1744243200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"},{"start":{"date-parts":[[2025,4,10]],"date-time":"2025-04-10T00:00:00Z","timestamp":1744243200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/iopscience.iop.org\/info\/page\/text-and-data-mining"}],"funder":[{"DOI":"10.13039\/501100001703","name":"\u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne","doi-asserted-by":"crossref","award":["large-scale Solutions4Sustainability demonstrator"],"award-info":[{"award-number":["large-scale Solutions4Sustainability demonstrator"]}],"id":[{"id":"10.13039\/501100001703","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100023650","name":"NCCR Catalysis","doi-asserted-by":"crossref","award":["225147"],"award-info":[{"award-number":["225147"]}],"id":[{"id":"10.13039\/501100023650","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["iopscience.iop.org"],"crossmark-restriction":false},"short-container-title":["Mach. Learn.: Sci. Technol."],"published-print":{"date-parts":[[2025,6,30]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Retrieval-Augmented Generation (RAG) is a widely used strategy in Large-Language Models (LLMs) to extrapolate beyond the inherent pre-trained knowledge. Hence, RAG is crucial when working in data-sparse fields such as Chemistry. The evaluation of RAG systems is commonly conducted using specialized datasets. However, existing datasets, typically in the form of scientific Question-Answer-Context (QAC) triplets or QA pairs, are often limited in size due to the labor-intensive nature of manual curation or require further quality assessment when generated through automated processes. This highlights a critical need for large, high-quality datasets tailored to scientific applications. We introduce ChemLit-QA, a comprehensive, expert-validated, open-source dataset comprising over 1,000 entries specifically designed for chemistry. Our approach involves the initial generation and filtering of a QAC dataset using an automated framework based on GPT-4 Turbo, followed by rigorous evaluation by chemistry experts. Additionally, we provide two supplementary datasets: ChemLit-QA-neg focused on negative data, and ChemLit-QA-multi focused on multihop reasoning tasks for LLMs, which complement the main dataset on hallucination detection and more reasoning-intensive tasks.<\/jats:p>","DOI":"10.1088\/2632-2153\/adc2d6","type":"journal-article","created":{"date-parts":[[2025,3,19]],"date-time":"2025-03-19T22:58:51Z","timestamp":1742425131000},"page":"020601","update-policy":"https:\/\/doi.org\/10.1088\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["ChemLit-QA: a human evaluated dataset for chemistry RAG tasks"],"prefix":"10.1088","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3772-6927","authenticated-orcid":true,"given":"Geemi P","family":"Wellawatte","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2289-8945","authenticated-orcid":true,"given":"Huixuan","family":"Guo","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-0665-1839","authenticated-orcid":true,"given":"Magdalena","family":"Lederbauer","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Anna","family":"Borisova","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Matthew","family":"Hart","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Marta","family":"Brucka","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3046-6576","authenticated-orcid":false,"given":"Philippe","family":"Schwaller","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"266","published-online":{"date-parts":[[2025,4,10]]},"reference":[{"key":"mlstadc2d6bib1","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1038\/s43246-024-00449-9","article-title":"Accelerating materials language processing with large language models","volume":"5","author":"Choi","year":"2024","journal-title":"Commun. Mater."},{"key":"mlstadc2d6bib2","doi-asserted-by":"publisher","first-page":"570","DOI":"10.1038\/s41586-023-06792-0","article-title":"Autonomous chemical research with large language models","volume":"624","author":"Boiko","year":"2023","journal-title":"Nature"},{"key":"mlstadc2d6bib3","doi-asserted-by":"publisher","first-page":"161","DOI":"10.1038\/s42256-023-00788-1","article-title":"Leveraging large language models for predictive chemistry","volume":"6","author":"Maik Jablonka","year":"2024","journal-title":"Nat. Mach. Intell."},{"key":"mlstadc2d6bib4","doi-asserted-by":"publisher","first-page":"1123","DOI":"10.1126\/science.ade2574","article-title":"Evolutionary-scale prediction of atomic-level protein structure with a language model","volume":"379","author":"Lin","year":"2023","journal-title":"Science"},{"key":"mlstadc2d6bib5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s42256-024-00832-8","article-title":"Augmenting large language models with chemistry tools","volume":"6","author":"Bran","year":"2024","journal-title":"Nat. Mach. Intell."},{"key":"mlstadc2d6bib6","article-title":"Are llms ready for real-world materials discovery?","author":"Miret","year":"2024"},{"key":"mlstadc2d6bib7","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3571730","article-title":"Survey of hallucination in natural language generation","volume":"55","author":"Ji","year":"2023","journal-title":"ACM Comput. Surv."},{"key":"mlstadc2d6bib8","doi-asserted-by":"publisher","first-page":"625","DOI":"10.1038\/s41586-024-07421-0","article-title":"Detecting hallucinations in large language models using semantic entropy","volume":"630","author":"Farquhar","year":"2024","journal-title":"Nature"},{"key":"mlstadc2d6bib9","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2020.acl-main.173","article-title":"On faithfulness and factuality in abstractive summarization","author":"Maynez","year":"2020"},{"key":"mlstadc2d6bib10","doi-asserted-by":"publisher","first-page":"120","DOI":"10.1038\/s41746-023-00873-0","article-title":"The imperative for regulatory oversight of large language models (or generative AI) in healthcare","volume":"6","author":"Mesk\u00f3","year":"2023","journal-title":"NPJ Dig. Med."},{"key":"mlstadc2d6bib11","article-title":"Retrieval-augmented generation for large language models: a survey","author":"Gao","year":"2023"},{"key":"mlstadc2d6bib12","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2021.findings-emnlp.320","article-title":"Retrieval augmentation reduces hallucination in conversation","author":"Shuster","year":"2021"},{"key":"mlstadc2d6bib13","doi-asserted-by":"publisher","DOI":"10.1056\/AIoa2300068","article-title":"Almanac-retrieval-augmented language models for clinical medicine","volume":"1","author":"Zakka","year":"2024","journal-title":"NEJM AI"},{"key":"mlstadc2d6bib14","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/P18-1031","article-title":"universal language model fine-tuning for text classification","author":"Howard","year":"2018"},{"key":"mlstadc2d6bib15","doi-asserted-by":"publisher","first-page":"10600","DOI":"10.1039\/D4SC00924J","article-title":"Fine-tuning large language models for chemical text mining","volume":"15","author":"Zhang","year":"2024","journal-title":"Chem. Sci."},{"key":"mlstadc2d6bib16","article-title":"Fantastically ordered prompts and where to find them: overcoming few-shot prompt order sensitivity","author":"Lu","year":"2021"},{"key":"mlstadc2d6bib17","article-title":"Language models are few-shot learners","author":"Brown","year":"2020"},{"key":"mlstadc2d6bib18","first-page":"pp 11763","article-title":"Lift: language-interfaced fine-tuning for non-language machine learning tasks","volume":"vol 35","author":"Dinh","year":"2022"},{"key":"mlstadc2d6bib19","article-title":"Bayesian optimization of catalysts with in-context learning","author":"Caldas Ramos","year":"2023"},{"key":"mlstadc2d6bib20","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/D16-1264","article-title":"Squad: 100,000+ questions for machine comprehension of text","author":"Rajpurkar","year":"2016"},{"key":"mlstadc2d6bib21","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/D18-1259","article-title":"Hotpotqa: a dataset for diverse, explainable multi-hop question answering","author":"Yang","year":"2018"},{"key":"mlstadc2d6bib22","article-title":"Paperqa: retrieval-augmented generative agent for scientific research","author":"L\u00e1la","year":"2023"},{"key":"mlstadc2d6bib23","article-title":"Multihop-rag: benchmarking retrieval-augmented generation for multi-hop queries","author":"Tang","year":"2024"},{"key":"mlstadc2d6bib24","first-page":"pp 17754","article-title":"Benchmarking large language models in retrieval-augmented generation","volume":"vol 38","author":"Chen","year":"2023"},{"key":"mlstadc2d6bib25","article-title":"Generative ai for synthetic data generation: methods, challenges and the future","author":"Guo","year":"2024"},{"key":"mlstadc2d6bib26","article-title":"Lost in the middle: how language models use long contexts","author":"Liu","year":"2023"},{"key":"mlstadc2d6bib27","doi-asserted-by":"publisher","first-page":"546","DOI":"10.1162\/tacl_a_00563","article-title":"Understanding and detecting hallucinations in neural machine translation via model introspection","volume":"11","author":"Xu","year":"2023","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"mlstadc2d6bib28","first-page":"pp 6565","article-title":"Towards understanding and mitigating social biases in language models","author":"Pu Liang","year":"2021"},{"key":"mlstadc2d6bib29","first-page":"pp 12","article-title":"Gender bias and stereotypes in large language models","author":"Kotek","year":"2023"},{"key":"mlstadc2d6bib30","article-title":"Lab-bench: measuring capabilities of language models for biology research","author":"Laurent","year":"2024"},{"key":"mlstadc2d6bib31","doi-asserted-by":"publisher","first-page":"289","DOI":"10.1007\/s00799-022-00329-y","article-title":"Scienceqa: a novel resource for question answering on scholarly articles","volume":"23","author":"Saikh","year":"2022","journal-title":"Int. J. Dig. Libraries"},{"key":"mlstadc2d6bib32","article-title":"Are large language models superhuman chemists?","author":"Mirza","year":"2024"},{"key":"mlstadc2d6bib33","article-title":"Chain-of-thought prompting elicits reasoning in large language models","author":"Wei","year":"2023"},{"key":"mlstadc2d6bib34","author":"Meta Meta-llama-3","year":"2024"},{"key":"mlstadc2d6bib35","author":"Open AI Hello gpt -4o","year":"2024"},{"key":"mlstadc2d6bib36","article-title":"Sciqag: a framework for auto-generated scientific question answering dataset with fine-grained evaluation","author":"Wan","year":"2024"},{"key":"mlstadc2d6bib37","article-title":"The fineweb datasets: decanting the web for the finest text data at scale","author":"Penedo","year":"2024"},{"key":"mlstadc2d6bib38","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/W17-4413","article-title":"Crowdsourcing multiple choice science questions","author":"Welbl","year":"2017"},{"key":"mlstadc2d6bib39","first-page":"pp 2567","article-title":"Pubmedqa: A dataset for biomedical research question answering","author":"Jin","year":"2019"},{"key":"mlstadc2d6bib40","article-title":"Mistral 7b","author":"Jiang","year":"2023"},{"key":"mlstadc2d6bib41","article-title":"Gpt-4 turbo in the openai api","author":"Open AI","year":"2024"},{"key":"mlstadc2d6bib42","article-title":"Models","author":"Open AI","year":"2023"},{"key":"mlstadc2d6bib43","first-page":"3","article-title":"Claude","author":"Anthropic","year":"2023"},{"key":"mlstadc2d6bib44","article-title":"Llama 2: open foundation and fine-tuned chat models","author":"Touvron","year":"2023"},{"key":"mlstadc2d6bib45","article-title":"Phi-2: the surprising power of small language models","author":"Abdin","year":"2023"},{"key":"mlstadc2d6bib46","first-page":"hi-3","article-title":"Microsoft","author":"","year":"2024"},{"key":"mlstadc2d6bib47","article-title":"Google deepmind gemini","author":"","year":"2024"},{"key":"mlstadc2d6bib48","article-title":"Gemma: open models based on gemini research and technology","author":"Gemma Team","year":"2024"},{"key":"mlstadc2d6bib49","article-title":"Gpt-4o mini: advancing cost-efficient intelligence","author":"Open AI","year":"2024"},{"key":"mlstadc2d6bib50","article-title":"Qwen2.5-vl technical report","author":"Bai","year":"2025"},{"key":"mlstadc2d6bib51","article-title":"Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning","author":"DeepSeek","year":"2025"},{"key":"mlstadc2d6bib52","first-page":"pp 9459","article-title":"Retrieval-augmented generation for knowledge-intensive NLP tasks","volume":"vol 33","author":"Lewis","year":"2020"},{"key":"mlstadc2d6bib53","doi-asserted-by":"crossref","DOI":"10.1109\/MIPR62202.2024.00031","article-title":"Blended rag: improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers","author":"Sawarkar","year":"2024"},{"key":"mlstadc2d6bib54","article-title":"Enhancing large language model performance to answer questions and extract information more accurately","author":"Zhang","year":"2024"},{"key":"mlstadc2d6bib55","doi-asserted-by":"publisher","first-page":"100","DOI":"10.1186\/s40537-024-00963-0","article-title":"Tc-llama 2: fine-tuning LLM for technology and commercialization applications","volume":"11","author":"Yeom","year":"2024","journal-title":"J. Big Data"},{"key":"mlstadc2d6bib56","article-title":"Cyberseceval 2: a wide-ranging cybersecurity evaluation suite for large language models","author":"Bhatt","year":"2024"}],"container-title":["Machine Learning: Science and Technology"],"original-title":[],"link":[{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adc2d6","content-type":"text\/html","content-version":"am","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adc2d6\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adc2d6","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adc2d6\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adc2d6\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adc2d6\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adc2d6\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"similarity-checking"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adc2d6\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,4,25]],"date-time":"2025-04-25T11:39:01Z","timestamp":1745581141000},"score":1,"resource":{"primary":{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/adc2d6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,10]]},"references-count":56,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2025,4,10]]},"published-print":{"date-parts":[[2025,6,30]]}},"URL":"https:\/\/doi.org\/10.1088\/2632-2153\/adc2d6","relation":{},"ISSN":["2632-2153"],"issn-type":[{"value":"2632-2153","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,4,10]]},"assertion":[{"value":"ChemLit-QA: a human evaluated dataset for chemistry RAG tasks","name":"article_title","label":"Article Title"},{"value":"Machine Learning: Science and Technology","name":"journal_title","label":"Journal Title"},{"value":"paper","name":"article_type","label":"Article Type"},{"value":"\u00a9 2025 The Author(s). Published by IOP Publishing Ltd","name":"copyright_information","label":"Copyright Information"},{"value":"2025-01-13","name":"date_received","label":"Date Received","group":{"name":"publication_dates","label":"Publication dates"}},{"value":"2025-03-19","name":"date_accepted","label":"Date Accepted","group":{"name":"publication_dates","label":"Publication dates"}},{"value":"2025-04-10","name":"date_epub","label":"Online publication date","group":{"name":"publication_dates","label":"Publication dates"}}]}}