{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,27]],"date-time":"2025-11-27T05:24:22Z","timestamp":1764221062986,"version":"3.46.0"},"reference-count":38,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,11,25]],"date-time":"2025-11-25T00:00:00Z","timestamp":1764028800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100012190","name":"Ministry of Science and Higher Education of the Russian Federation","doi-asserted-by":"publisher","award":["075\u201015\u20102024\u2010534"],"award-info":[{"award-number":["075\u201015\u20102024\u2010534"]}],"id":[{"id":"10.13039\/501100012190","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>The rapid growth of the body of literature on heavy metal hyperaccumulation in plants has created a critical bottleneck in data synthesis. Manual curation is slow, labor-intensive, and not scalable. To address this issue, we developed an artificial intelligence pipeline that automatically transforms unstructured scientific papers, including text, tables, and figures, into a structured knowledge database. Our system recovers numerical data and extracts key experimental parameters, such as plant species, metal types, concentrations, and growing conditions. This enables on-demand dataset generation. We validated our pipeline by replicating a recently published, manually curated dataset that required seven months of expert effort. Our tool achieved comparable accuracy in minutes per article. We implemented a dual-validation strategy combining standard extraction metrics with a qualitative \u201cLLM-as-a-Judge\u201d fact-checking layer to assess contextual correctness. This revealed that high extraction performance does not guarantee factual reliability, underscoring the necessity of semantic validation in scientific knowledge extraction. The resulting open, reproducible framework accelerates evidence synthesis, supports trend analysis (e.g., metal\u2013plant co-occurrence networks), and provides a scalable solution for data-driven environmental research.<\/jats:p>","DOI":"10.3390\/make7040152","type":"journal-article","created":{"date-parts":[[2025,11,25]],"date-time":"2025-11-25T15:29:47Z","timestamp":1764084587000},"page":"152","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["LLM-Based Pipeline for Structured Knowledge Extraction from Scientific Literature on Heavy Metal Hyperaccumulation"],"prefix":"10.3390","volume":"7","author":[{"given":"Kiril","family":"Makrinsky","sequence":"first","affiliation":[{"name":"Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, 31\/4 Leninskiy pr., 119071 Moscow, Russia"}]},{"given":"Valery","family":"Shendrikov","sequence":"additional","affiliation":[{"name":"Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, 31\/4 Leninskiy pr., 119071 Moscow, Russia"}]},{"given":"Anna","family":"Makhonko","sequence":"additional","affiliation":[{"name":"Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, 31\/4 Leninskiy pr., 119071 Moscow, Russia"}]},{"given":"Dmitry","family":"Merkushkin","sequence":"additional","affiliation":[{"name":"Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, 31\/4 Leninskiy pr., 119071 Moscow, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9581-2233","authenticated-orcid":false,"given":"Oleg V.","family":"Batishchev","sequence":"additional","affiliation":[{"name":"Frumkin Institute of Physical Chemistry and Electrochemistry, Russian Academy of Sciences, 31\/4 Leninskiy pr., 119071 Moscow, Russia"}]}],"member":"1968","published-online":{"date-parts":[[2025,11,25]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Sabir, M., Baltr\u0117nait\u0117-Gedien\u0117, E., Ditta, A., Ullah, H., Kanwal, A., Ullah, S., and Faraj, T.K. (2022). Bioaccumulation of Heavy Metals in a Soil\u2013Plant System from an Open Dumpsite and the Associated Health Risks through Multiple Routes. Sustainability, 14.","DOI":"10.3390\/su142013223"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"131039","DOI":"10.1016\/j.jhazmat.2023.131039","article-title":"Comprehensive Mechanisms of Heavy Metal Toxicity in Plants, Detoxification, and Remediation","volume":"450","author":"Ghuge","year":"2023","journal-title":"J. Hazard. Mater."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Khan, I.U., Qi, S.-S., Gul, F., Manan, S., Rono, J.K., Naz, M., Shi, X.-N., Zhang, H., Dai, Z.-C., and Du, D.-L. (2023). A Green Approach Used for Heavy Metals \u2018Phytoremediation\u2019 Via Invasive Plant Species to Mitigate Environmental Pollution: A Review. Plants, 12.","DOI":"10.3390\/plants12040725"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"134788","DOI":"10.1016\/j.chemosphere.2022.134788","article-title":"Phytoremediation of Heavy Metals in Soil and Water: An Eco-Friendly, Sustainable and Multidisciplinary Approach","volume":"303","author":"Bhat","year":"2022","journal-title":"Chemosphere"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"119035","DOI":"10.1016\/j.envpol.2022.119035","article-title":"A Review on Bioremediation Approach for Heavy Metal Detoxification and Accumulation in Plants","volume":"301","author":"Yaashikaa","year":"2022","journal-title":"Environ. Pollut."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"e28357","DOI":"10.1016\/j.heliyon.2024.e28357","article-title":"Sources, Effects and Present Perspectives of Heavy Metals Contamination: Soil, Plants and Human Food Chain","volume":"10","author":"Angon","year":"2024","journal-title":"Heliyon"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"02010","DOI":"10.1051\/e3sconf\/202346302010","article-title":"Features of Phytoextraction of Rare Earth Elements by a Complex of Plants and Microorganisms from Technogenically Polluted Wastewater of Mining Enterprises","volume":"463","author":"Tokhtar","year":"2023","journal-title":"E3S Web Conf."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"173169","DOI":"10.1016\/j.scitotenv.2024.173169","article-title":"Enhancing Remediation Efficiency of Hyperaccumulators through Earthworm Addition: Evidence from a Pot Study on Cadmium-Contaminated Soil","volume":"934","author":"Zhang","year":"2024","journal-title":"Sci. Total Environ."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"119971","DOI":"10.1016\/j.jenvman.2023.119971","article-title":"Farmland Phytoremediation in Bibliometric Analysis","volume":"351","author":"Wang","year":"2024","journal-title":"J. Environ. Manag."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Beltagy, I., Lo, K., and Cohan, A. (2019). SciBERT: A Pretrained Language Model for Scientific Text. arXiv.","DOI":"10.18653\/v1\/D19-1371"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"194","DOI":"10.1038\/s41524-025-01674-7","article-title":"Agent-Based Multimodal Information Extraction for Nanomaterials","volume":"11","author":"Odobesku","year":"2025","journal-title":"NPJ Comput. Mater."},{"key":"ref_12","unstructured":"Peng, R., Liu, K., Yang, P., Yuan, Z., and Li, S. (2023). Embedding-Based Retrieval with LLM for Effective Agriculture Information Extracting from Unstructured Data. arXiv."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Patiny, L., and Godin, G. (2023). Automatic Extraction of FAIR Data from Publications Using LLM. ChemRxiv.","DOI":"10.26434\/chemrxiv-2023-05v1b-v2"},{"key":"ref_14","unstructured":"Biswas, A., and Talukdar, W. (2024). Robustness of Structured Data Extraction from In-Plane Rotated Documents Using Multi-Modal Large Language Models (LLM). arXiv."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"e70073","DOI":"10.1002\/cl2.70073","article-title":"SciDaSynth: Interactive Structured Data Extraction from Scientific Literature with Large Language Model","volume":"21","author":"Wang","year":"2025","journal-title":"Campbell Syst. Rev."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Dagli, M.M., Ghenbot, Y., Ahmad, H.S., Chauhan, D., Turlip, R., Wang, P., Welch, W.C., Ozturk, A.K., and Yoon, J.W. (2024). Development and Validation of a Novel AI Framework Using NLP with LLM Integration for Relevant Clinical Data Extraction through Automated Chart Review. Sci. Rep., 14.","DOI":"10.1038\/s41598-024-77535-y"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"625","DOI":"10.1038\/s41586-024-07421-0","article-title":"Detecting Hallucinations in Large Language Models Using Semantic Entropy","volume":"630","author":"Farquhar","year":"2024","journal-title":"Nature"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Tian, S., Jin, Q., Yeganova, L., Lai, P.-T., Zhu, Q., Chen, X., Yang, Y., Chen, Q., Kim, W., and Comeau, D.C. (2023). Opportunities and Challenges for ChatGPT and Large Language Models in Biomedicine and Health. Brief. Bioinform., 25.","DOI":"10.1093\/bib\/bbad493"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"100207","DOI":"10.1016\/j.gloepi.2025.100207","article-title":"An AI Assistant for Critically Assessing and Synthesizing Clusters of Journal Articles","volume":"10","author":"Cox","year":"2025","journal-title":"Glob. Epidemiol."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Li, H., Chen, X., Xu, Z., Li, D., Hu, N., Teng, F., Li, Y., Qiu, L., Zhang, C.J., and Qing, L. (August, January 27). Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models. Proceedings of the Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria.","DOI":"10.18653\/v1\/2025.findings-acl.1026"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1418","DOI":"10.1038\/s41467-024-45563-x","article-title":"Structured Information Extraction from Scientific Text with Large Language Models","volume":"15","author":"Dagdelen","year":"2024","journal-title":"Nat. Commun."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"905","DOI":"10.1038\/s41597-025-05239-7","article-title":"Remediating Toxic Elements with Sunflower, Hemp, Castor Bean, & Bamboo: An Open Dataset of Harmonized Variables","volume":"12","author":"Ha","year":"2025","journal-title":"Sci. Data"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"129904","DOI":"10.1016\/j.jhazmat.2022.129904","article-title":"Modeling Phytoremediation of Heavy Metal Contaminated Soils through Machine Learning","volume":"441","author":"Shi","year":"2023","journal-title":"J. Hazard. Mater."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"100065","DOI":"10.1016\/j.soilad.2025.100065","article-title":"Large Language Models and the Future of Soil Health: Bridging Knowledge Gaps through Scalable Semantic Intelligence","volume":"4","author":"Wu","year":"2025","journal-title":"Soil Adv."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"e70011","DOI":"10.1002\/aps3.70011","article-title":"Using Large Language Models to Extract Plant Functional Traits from Unstructured Text","volume":"13","author":"Domazetoski","year":"2025","journal-title":"Appl. Plant Sci."},{"key":"ref_26","unstructured":"(2025, October 02). Lmstudio-Community\/Qwen3-4B-Instruct-2507-GGUF\u00b7Hugging Face. Available online: https:\/\/huggingface.co\/lmstudio-community\/Qwen3-4B-Instruct-2507-GGUF."},{"key":"ref_27","unstructured":"(2025, October 02). Lmstudio-Community\/Gpt-Oss-120b-GGUF\u00b7Hugging Face. Available online: https:\/\/huggingface.co\/lmstudio-community\/gpt-oss-120b-GGUF."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Foppiano, L., Lambard, G., Amagasa, T., and Ishii, M. (2024). Mining Experimental Data from Materials Science Literature with Large Language Models: An Evaluation Study. arXiv.","DOI":"10.1080\/27660400.2024.2356506"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Hu, Y., Keloth, V.K., Raja, K., Chen, Y., and Xu, H. (2023). Towards Precise PICO Extraction from Abstracts of Randomized Controlled Trials Using a Section-Specific Learning Approach. Bioinformatics, 39.","DOI":"10.1093\/bioinformatics\/btad542"},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"baaa011","DOI":"10.1093\/database\/baaa011","article-title":"MMHub, a Database for the Mulberry Metabolome","volume":"2020","author":"Li","year":"2020","journal-title":"Database"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Kropf, S., Uciteli, A., Schierle, K., Kr\u00fccken, P., Denecke, K., and Herre, H. (2018). Querying Archetype-Based EHRs by Search Ontology-Based XPath Engineering. J. Biomed. Semant., 9.","DOI":"10.1186\/s13326-018-0180-2"},{"key":"ref_32","first-page":"46595","article-title":"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena","volume":"36","author":"Zheng","year":"2023","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_33","unstructured":"Li, H., Dong, Q., Chen, J., Su, H., Zhou, Y., Ai, Q., Ye, Z., and Liu, Y. (2024). LLMs-as-Judges: A Comprehensive Survey on LLM-Based Evaluation Methods. arXiv."},{"key":"ref_34","unstructured":"Rahmani, H.A., Yilmaz, E., Craswell, N., Mitra, B., Thomas, P., Clarke, C.L.A., Aliannejadi, M., Siro, C., and Faggioli, G. (2024). LLMJudge: LLMs for Relevance Judgments. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Saad-Falcon, J., Khattab, O., Potts, C., and Zaharia, M. (2024, January 16\u201321). ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico.","DOI":"10.18653\/v1\/2024.naacl-long.20"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"122","DOI":"10.1186\/1752-153X-6-122","article-title":"Nutritive Quality of Romanian Hemp Varieties (Cannabis sativa L.) with Special Focus on Oil and Metal Contents of Seeds","volume":"6","author":"Mihoc","year":"2012","journal-title":"Chem. Cent. J."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"1770","DOI":"10.1080\/01904167.2021.1881553","article-title":"Nitrogen Fertilizer Ameliorate the Remedial Capacity of Industrial Hemp (Cannabis sativa L.) Grown in Lead Contaminated Soil","volume":"44","author":"Deng","year":"2021","journal-title":"J. Plant Nutr."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Zielonka, D., Szulc, W., Skowro\u0144ska, M., Rutkowska, B., and Russel, S. (2020). Hemp-Based Phytoaccumulation of Heavy Metals from Municipal Sewage Sludge and Phosphogypsum Under Field Conditions. Agronomy, 10.","DOI":"10.3390\/agronomy10060907"}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/7\/4\/152\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,27]],"date-time":"2025-11-27T05:16:59Z","timestamp":1764220619000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/7\/4\/152"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,25]]},"references-count":38,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["make7040152"],"URL":"https:\/\/doi.org\/10.3390\/make7040152","relation":{},"ISSN":["2504-4990"],"issn-type":[{"type":"electronic","value":"2504-4990"}],"subject":[],"published":{"date-parts":[[2025,11,25]]}}}