{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,10]],"date-time":"2026-01-10T03:48:21Z","timestamp":1768016901324,"version":"3.49.0"},"reference-count":0,"publisher":"IOS Press","isbn-type":[{"value":"9781643686080","type":"electronic"}],"license":[{"start":{"date-parts":[[2025,8,7]],"date-time":"2025-08-07T00:00:00Z","timestamp":1754524800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2025,8,7]]},"abstract":"<jats:p>The objective is to create an automated knowledge extraction tool for cancer research that builds high-quality academic corpora for LLM fine-tuning while investigating its effectiveness in interleukin-6 and bladder cancer domains. To address the current gap in knowledge retrieval techniques for cancer research data collection, we propose KnowledgePipeline, a novel automated tool that incorporates diverse aspects of academic papers and metadata. Our tool integrates content, co-citations, and co-authorship networks to construct domain-specific academic corpora suitable for fine-tuning LLMs. We leverage two LLMs (GPTJ-6.7B and Galactica30B) trained on domain-specific question-answer pairs from the refined data. The system\u2019s evaluation focuses on both the quality of extracted knowledge and the performance of fine-tuned models in open-ended question-answering tasks. We see that KnowledgePipeline offers a scalable, automated framework for domain-specific knowledge retrieval and fine-tuned applications in cancer research, advancing literature discovery and addressing critical biomedical challenges. It achieved high relevance scores of 68% for IL-6 and 74.5% for bladder cancer, with a fine-tuned Galactica-30B model demonstrating promising capabilities.<\/jats:p>","DOI":"10.3233\/shti251039","type":"book-chapter","created":{"date-parts":[[2025,8,7]],"date-time":"2025-08-07T11:38:57Z","timestamp":1754566737000},"source":"Crossref","is-referenced-by-count":1,"title":["Efficient Training Corpus Retrieval for Large Language Model Fine Tuning: A Case Study in Cancer"],"prefix":"10.3233","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5015-2665","authenticated-orcid":false,"given":"Avisha","family":"Das","sequence":"first","affiliation":[{"name":"Mayo Clinic Arizona, Phoenix, AZ, USA"}]},{"given":"Chiamaka","family":"Diala","sequence":"additional","affiliation":[{"name":"University of Texas Health Science Center, Houston, TX, USA"}]},{"given":"Guocai","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Texas Health Science Center, Houston, TX, USA"}]},{"given":"Zhao","family":"Li","sequence":"additional","affiliation":[{"name":"University of Texas Health Science Center, Houston, TX, USA"}]},{"given":"Rongbin","family":"Li","sequence":"additional","affiliation":[{"name":"University of Texas Health Science Center, Houston, TX, USA"}]},{"given":"Omer","family":"Anjum","sequence":"additional","affiliation":[{"name":"University of Texas Health Science Center, Houston, TX, USA"}]},{"given":"W. Jim","family":"Zheng","sequence":"additional","affiliation":[{"name":"University of Texas Health Science Center, Houston, TX, USA"}]}],"member":"7437","container-title":["Studies in Health Technology and Informatics","MEDINFO 2025 \u2014 Healthcare Smart \u00d7 Medicine Deep"],"original-title":[],"link":[{"URL":"https:\/\/ebooks.iospress.nl\/pdf\/doi\/10.3233\/SHTI251039","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,7]],"date-time":"2025-08-07T11:38:57Z","timestamp":1754566737000},"score":1,"resource":{"primary":{"URL":"https:\/\/ebooks.iospress.nl\/doi\/10.3233\/SHTI251039"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,7]]},"ISBN":["9781643686080"],"references-count":0,"URL":"https:\/\/doi.org\/10.3233\/shti251039","relation":{},"ISSN":["0926-9630","1879-8365"],"issn-type":[{"value":"0926-9630","type":"print"},{"value":"1879-8365","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,8,7]]}}}