{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,14]],"date-time":"2026-04-14T00:19:09Z","timestamp":1776125949822,"version":"3.50.1"},"publisher-location":"Cham","reference-count":31,"publisher":"Springer Nature Switzerland","isbn-type":[{"value":"9783032061355","type":"print"},{"value":"9783032061362","type":"electronic"}],"license":[{"start":{"date-parts":[[2025,9,27]],"date-time":"2025-09-27T00:00:00Z","timestamp":1758931200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,9,27]],"date-time":"2025-09-27T00:00:00Z","timestamp":1758931200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Large, open datasets can accelerate ecological research, particularly by enabling researchers to develop new insights by reusing datasets from multiple sources. However, to find the most suitable datasets to combine and integrate, researchers must navigate diverse ecological and environmental data provider platforms with varying metadata availability and standards. To overcome this obstacle, we have developed a large language model (LLM)-based metadata harvester that flexibly extracts metadata from any dataset\u2019s landing page, and converts these to a user-defined, unified format using existing metadata standards. We validate that our tool is able to extract both structured and unstructured metadata with equal accuracy, aided by our LLM post-processing protocol. Furthermore, we utilise LLMs to identify links between datasets, both by calculating embedding similarity and by unifying the formats of extracted metadata to enable rule-based processing. Our tool, which flexibly links the metadata of different datasets, can therefore be used for ontology creation or graph-based queries, for example, to find relevant ecological and environmental datasets in a virtual research environment.<\/jats:p>","DOI":"10.1007\/978-3-032-06136-2_32","type":"book-chapter","created":{"date-parts":[[2025,9,26]],"date-time":"2025-09-26T18:38:19Z","timestamp":1758911899000},"page":"338-352","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Flexible Metadata Harvesting for\u00a0Ecology Using Large Language Models"],"prefix":"10.1007","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-6518-2744","authenticated-orcid":false,"given":"Zehao","family":"Lu","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5490-1785","authenticated-orcid":false,"given":"Thijs L.","family":"van der Plas","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0898-0942","authenticated-orcid":false,"given":"Parinaz","family":"Rashidi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7274-6755","authenticated-orcid":false,"given":"W Daniel","family":"Kissling","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2764-0078","authenticated-orcid":false,"given":"Ioannis N.","family":"Athanasiadis","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2025,9,27]]},"reference":[{"key":"32_CR1","first-page":"82133","volume":"37","author":"M Akhtar","year":"2024","unstructured":"Akhtar, M., et al.: Croissant: a metadata format for ml-ready datasets. Adv. Neural. Inf. Process. Syst. 37, 82133\u201382148 (2024)","journal-title":"Adv. Neural. Inf. Process. Syst."},{"key":"32_CR2","doi-asserted-by":"crossref","unstructured":"Caufield, J.H., et\u00a0al.: Structured prompt interrogation and recursive extraction of semantics (spires): a method for populating knowledge bases using zero-shot learning. Bioinformatics 40(3), btae104 (2024)","DOI":"10.1093\/bioinformatics\/btae104"},{"issue":"3","key":"32_CR3","doi-asserted-by":"publisher","first-page":"420","DOI":"10.1038\/s41559-017-0458-2","volume":"2","author":"A Culina","year":"2018","unstructured":"Culina, A., Baglioni, M., Crowther, T.W., Visser, M.E., Woutersen-Windhouwer, S., Manghi, P.: Navigating the unfolding open data landscape in ecology and evolution. Nat. Ecol. Evol. 2(3), 420\u2013426 (2018)","journal-title":"Nat. Ecol. Evol."},{"issue":"8","key":"32_CR4","doi-asserted-by":"publisher","first-page":"563","DOI":"10.1093\/biosci\/biy068","volume":"68","author":"SS Farley","year":"2018","unstructured":"Farley, S.S., Dawson, A., Goring, S.J., Williams, J.W.: Situating ecology as a big-data science: current advances, challenges, and solutions. Bioscience 68(8), 563\u2013576 (2018)","journal-title":"Bioscience"},{"key":"32_CR5","unstructured":"Google: Gemini 2.5 flash preview model card (2025). https:\/\/storage.googleapis.com\/model-cards\/documents\/gemini-2.5-flash-preview.pdf. Accessed 12 Jun 2025"},{"key":"32_CR6","doi-asserted-by":"crossref","unstructured":"Gregory, K., Groth, P., Scharnhorst, A., Wyatt, S.: Lost or found? discovering data needed for research. Harvard Data Sci. Rev. 2(2) (2020)","DOI":"10.1162\/99608f92.e38165eb"},{"key":"32_CR7","unstructured":"Guo, Z., Xia, L., Yu, Y., Ao, T., Huang, C.: LightRAG: simple and fast retrieval-augmented generation. arXiv preprint arXiv:2410.05779 (2025)"},{"issue":"3","key":"32_CR8","doi-asserted-by":"publisher","first-page":"156","DOI":"10.1890\/120103","volume":"11","author":"SE Hampton","year":"2013","unstructured":"Hampton, S.E., et al.: Big data and the future of ecology. Front. Ecol. Environ. 11(3), 156\u2013162 (2013)","journal-title":"Front. Ecol. Environ."},{"key":"32_CR9","doi-asserted-by":"crossref","unstructured":"Hayashi, T., Sakaji, H., Dai, J., Goebel, R.: Metadata-based data exploration with retrieval-augmented generation for large language models. In: International Conference on Big Data, pp. 6574\u20136583. IEEE (2024)","DOI":"10.1109\/BigData62323.2024.10826055"},{"issue":"1","key":"32_CR10","doi-asserted-by":"publisher","first-page":"407","DOI":"10.1186\/s12859-019-3002-3","volume":"20","author":"RC Jackson","year":"2019","unstructured":"Jackson, R.C., Balhoff, J.P., Douglass, E., Harris, N.L., Mungall, C.J., Overton, J.A.: Robot: a tool for automating ontology workflows. BMC Bioinform. 20(1), 407 (2019)","journal-title":"BMC Bioinform."},{"key":"32_CR11","doi-asserted-by":"crossref","unstructured":"Johnston, A., et al.: North American bird declines are greatest where species are most abundant. Science 388(6746), 532\u2013537 (2025)","DOI":"10.1126\/science.adn4381"},{"issue":"8062","key":"32_CR12","doi-asserted-by":"publisher","first-page":"395","DOI":"10.1038\/s41586-025-08752-2","volume":"641","author":"F Keck","year":"2025","unstructured":"Keck, F., Peller, T., Alther, R., Barouillet, C., Blackman, R., Capo, E., et al.: The global human impact on biodiversity. Nature 641(8062), 395\u2013400 (2025)","journal-title":"Nature"},{"issue":"10","key":"32_CR13","doi-asserted-by":"publisher","first-page":"1531","DOI":"10.1038\/s41559-018-0667-3","volume":"2","author":"WD Kissling","year":"2018","unstructured":"Kissling, W.D., Walls, R., Bowser, A., Jones, M.O., Kattge, J., Agosti, D., et al.: Towards global data products of essential biodiversity variables on species traits. Nat. Ecol. Evol. 2(10), 1531\u20131540 (2018)","journal-title":"Nat. Ecol. Evol."},{"issue":"6694","key":"32_CR14","doi-asserted-by":"publisher","first-page":"453","DOI":"10.1126\/science.adj6598","volume":"384","author":"PF Langhammer","year":"2024","unstructured":"Langhammer, P.F., et al.: The positive impact of conservation action. Science 384(6694), 453\u2013458 (2024)","journal-title":"Science"},{"key":"32_CR15","unstructured":"Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74\u201381 (2004)"},{"issue":"3","key":"32_CR16","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0246099","volume":"16","author":"F L\u00f6ffler","year":"2021","unstructured":"L\u00f6ffler, F., Wesp, V., K\u00f6nig-Ries, B., Klan, F.: Dataset search in biodiversity research: do metadata in data repositories reflect scholarly information needs? PLoS ONE 16(3), e0246099 (2021)","journal-title":"PLoS ONE"},{"key":"32_CR17","doi-asserted-by":"crossref","unstructured":"MacKenzie, D.I., Nichols, J.D., Royle, J.A., Pollock, K.H., Bailey, L.L., Hines, J.E.: Chapter 3 - fundamental principals of statistical inference. In: Occupancy Estimation and Modeling, 2nd edn., pp. 71\u2013111. Academic Press, Boston (2018)","DOI":"10.1016\/B978-0-12-407197-1.00004-1"},{"key":"32_CR18","doi-asserted-by":"crossref","unstructured":"Nathan, R., et\u00a0al.: Big-data approaches lead to an increased understanding of the ecology of animal movement. Science 375(6582), eabg1780 (2022)","DOI":"10.1126\/science.abg1780"},{"key":"32_CR19","doi-asserted-by":"crossref","unstructured":"Noy, N., Burgess, M., Brickley, D.: Google dataset search: Building a search engine for datasets in an open web ecosystem. In: Proceedings of the 2019 World Wide Web Conference, pp. 1365\u20131375 (2019)","DOI":"10.1145\/3308558.3313685"},{"key":"32_CR20","unstructured":"OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)"},{"issue":"1","key":"32_CR21","doi-asserted-by":"publisher","first-page":"457","DOI":"10.1038\/s41597-023-02364-z","volume":"10","author":"EA Papoutsoglou","year":"2023","unstructured":"Papoutsoglou, E.A., Athanasiadis, I.N., Visser, R.G.F., Finkers, R.: The benefits and struggles of fair data: the case of reusing plant phenotyping data. Sci. Data 10(1), 457 (2023)","journal-title":"Sci. Data"},{"key":"32_CR22","doi-asserted-by":"crossref","unstructured":"Van der Plas, T.L., Alexander, D.G., Pocock, M.J.: Monitoring protected areas by integrating machine learning, remote sensing and citizen science. Ecol. Solut. Eviden. 6(2), e70040 (2025)","DOI":"10.1002\/2688-8319.70040"},{"key":"32_CR23","doi-asserted-by":"crossref","unstructured":"Rafiq, K., Beery, S., Palmer, M.S., Harchaoui, Z., Abrahms, B.: Generative AI as a tool to accelerate the field of ecology. Nature Ecol. Evol. 9, 378\u2013385 (2025)","DOI":"10.1038\/s41559-024-02623-1"},{"key":"32_CR24","doi-asserted-by":"crossref","unstructured":"Reimers, N., Gurevych, I.: Sentence-Bert: sentence embeddings using SIAMESE Bert-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2019)","DOI":"10.18653\/v1\/D19-1410"},{"key":"32_CR25","doi-asserted-by":"crossref","unstructured":"Reynolds, S.A., et al.: The potential for AI to revolutionize conservation: a horizon scan. Trends Ecol. Evol. 40(2), 191\u2013207 (2025)","DOI":"10.1016\/j.tree.2024.11.013"},{"issue":"1","key":"32_CR26","doi-asserted-by":"publisher","first-page":"2003","DOI":"10.1038\/s41467-020-15870-0","volume":"11","author":"RK Runting","year":"2020","unstructured":"Runting, R.K., Phinn, S., Xie, Z., Venter, O., Watson, J.E.: Opportunities for big data in conservation and sustainability. Nat. Commun. 11(1), 2003 (2020)","journal-title":"Nat. Commun."},{"key":"32_CR27","unstructured":"Sundaram, S.S., Musen, M.A.: Making metadata more fair using large language models. arXiv preprint arXiv:2307.13085 (2023)"},{"key":"32_CR28","unstructured":"Wang, D.Y.B., Shen, Z., Mishra, S.S., Xu, Z., Teng, Y., Ding, H.: Slot: structuring the output of large language models. arXiv preprint arXiv:2505.04016 (2025)"},{"key":"32_CR29","doi-asserted-by":"crossref","unstructured":"Watanabe, Y., Ito, K., Matsubara, S.: Capabilities and challenges of LLMs in metadata extraction from scholarly papers. In: International Conference on Asian Digital Libraries, pp. 280\u2013287. Springer (2025)","DOI":"10.1007\/978-981-96-0865-2_23"},{"issue":"1","key":"32_CR30","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/sdata.2016.18","volume":"3","author":"MD Wilkinson","year":"2016","unstructured":"Wilkinson, M.D., et al.: The fair guiding principles for scientific data management and stewardship. Sci. Data 3(1), 1\u20139 (2016)","journal-title":"Sci. Data"},{"key":"32_CR31","unstructured":"Zhang, S., Wu, M., Zhang, X.: Utilising a large language model to annotate subject metadata: a case study in an Australian national research data catalogue. arXiv preprint arXiv:2310.11318 (2023)"}],"container-title":["Communications in Computer and Information Science","New Trends in Theory and Practice of Digital Libraries"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-032-06136-2_32","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,26]],"date-time":"2025-09-26T18:38:23Z","timestamp":1758911903000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/978-3-032-06136-2_32"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,27]]},"ISBN":["9783032061355","9783032061362"],"references-count":31,"URL":"https:\/\/doi.org\/10.1007\/978-3-032-06136-2_32","relation":{},"ISSN":["1865-0929","1865-0937"],"issn-type":[{"value":"1865-0929","type":"print"},{"value":"1865-0937","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,27]]},"assertion":[{"value":"27 September 2025","order":1,"name":"first_online","label":"First Online","group":{"name":"ChapterHistory","label":"Chapter History"}},{"value":"Conceptualization: all authors. Data curation: PR and TvdP. Formal analysis: ZL and TvdP. Funding acquisition: IA and WDK. Methodology: ZL and TvdP. Software: ZL and TvdP. Supervision: IA and WDK. Validation: ZL and TvdP. Visualization: TvdP. Writing \u2013 original draft: TvdP. Writing \u2013 review & editing: all authors.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Author Contributions."}},{"value":"The authors have no competing interests to declare that are relevant to the content of this article.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Disclosure of Interests"}},{"value":"TPDL","order":1,"name":"conference_acronym","label":"Conference Acronym","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"International Conference on Theory and Practice of Digital Libraries","order":2,"name":"conference_name","label":"Conference Name","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Tampere","order":3,"name":"conference_city","label":"Conference City","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"Finland","order":4,"name":"conference_country","label":"Conference Country","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"2025","order":5,"name":"conference_year","label":"Conference Year","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"23 September 2025","order":7,"name":"conference_start_date","label":"Conference Start Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"26 September 2025","order":8,"name":"conference_end_date","label":"Conference End Date","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"29","order":9,"name":"conference_number","label":"Conference Number","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"tpdl2025","order":10,"name":"conference_id","label":"Conference ID","group":{"name":"ConferenceInfo","label":"Conference Information"}},{"value":"https:\/\/tpdl2025.github.io\/","order":11,"name":"conference_url","label":"Conference URL","group":{"name":"ConferenceInfo","label":"Conference Information"}}]}}