{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,19]],"date-time":"2026-02-19T02:14:42Z","timestamp":1771467282468,"version":"3.50.1"},"reference-count":60,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,2,3]],"date-time":"2025-02-03T00:00:00Z","timestamp":1738540800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,2,3]],"date-time":"2025-02-03T00:00:00Z","timestamp":1738540800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"eTRANSAFE","award":["77365"],"award-info":[{"award-number":["77365"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Over the last few decades the pharmaceutical industry has generated a vast corpus of knowledge on the safety and efficacy of drugs. Much of this information is contained in toxicology reports, which summarise the results of animal studies designed to analyse the effects of the tested compound, including unintended pharmacological and toxic effects, known as treatment-related findings. Despite the potential of this knowledge, the fact that most of this relevant information is only available as unstructured text with variable degrees of digitisation has hampered its systematic access, use and exploitation. Text mining technologies have the ability to automatically extract, analyse and aggregate such information, providing valuable new insights into the drug discovery and development process. 
In the context of the eTRANSAFE project, we present PretoxTM (Preclinical Toxicology Text Mining), the first system specifically designed to detect, extract, organise and visualise treatment-related findings from toxicology reports. The PretoxTM tool comprises three main components: PretoxTM Corpus, PretoxTM Pipeline and PretoxTM Web App. The PretoxTM Corpus is a gold standard corpus of preclinical treatment-related findings annotated by toxicology experts. This corpus was used to develop, train and validate the PretoxTM Pipeline, which extracts treatment-related findings from preclinical study reports. The extracted information is then presented for expert visualisation and validation in the PretoxTM Web App.<\/jats:p>\n          <jats:p>\n            <jats:bold>Scientific Contribution<\/jats:bold>\n          <\/jats:p>\n          <jats:p>While text mining solutions have been widely used in the clinical domain to identify adverse drug reactions from various sources, no similar systems exist for identifying adverse events in animal models during preclinical testing. PretoxTM fills this gap by efficiently extracting treatment-related findings from preclinical toxicology reports. 
This provides a valuable resource for toxicology research, enhancing the efficiency of safety evaluations, saving time, and leading to more effective decision-making in the drug development process.<\/jats:p>","DOI":"10.1186\/s13321-024-00925-x","type":"journal-article","created":{"date-parts":[[2025,2,3]],"date-time":"2025-02-03T12:31:56Z","timestamp":1738585916000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["PretoxTM: a text mining system for extracting treatment-related findings from preclinical toxicology reports"],"prefix":"10.1186","volume":"17","author":[{"given":"Javier","family":"Corvi","sequence":"first","affiliation":[]},{"given":"Nicol\u00e1s","family":"D\u00edaz-Roussel","sequence":"additional","affiliation":[]},{"given":"Jos\u00e9 M.","family":"Fern\u00e1ndez","sequence":"additional","affiliation":[]},{"given":"Francesco","family":"Ronzano","sequence":"additional","affiliation":[]},{"given":"Emilio","family":"Centeno","sequence":"additional","affiliation":[]},{"given":"Pablo","family":"Accuosto","sequence":"additional","affiliation":[]},{"given":"Celine","family":"Ibrahim","sequence":"additional","affiliation":[]},{"given":"Shoji","family":"Asakura","sequence":"additional","affiliation":[]},{"given":"Frank","family":"Bringezu","sequence":"additional","affiliation":[]},{"given":"Mirjam","family":"Fr\u00f6hlicher","sequence":"additional","affiliation":[]},{"given":"Annika","family":"Kreuchwig","sequence":"additional","affiliation":[]},{"given":"Yoko","family":"Nogami","sequence":"additional","affiliation":[]},{"given":"Jeong","family":"Rih","sequence":"additional","affiliation":[]},{"given":"Raul","family":"Rodriguez-Esteban","sequence":"additional","affiliation":[]},{"given":"Nicolas","family":"Sajot","sequence":"additional","affiliation":[]},{"given":"Joerg","family":"Wichard","sequence":"additional","affiliation":[]},{"given":"Heng-Yi 
Michael","family":"Wu","sequence":"additional","affiliation":[]},{"given":"Philip","family":"Drew","sequence":"additional","affiliation":[]},{"given":"Thomas","family":"Steger-Hartmann","sequence":"additional","affiliation":[]},{"given":"Alfonso","family":"Valencia","sequence":"additional","affiliation":[]},{"given":"Laura I.","family":"Furlong","sequence":"additional","affiliation":[]},{"given":"Salvador","family":"Capella-Gutierrez","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,2,3]]},"reference":[{"issue":"9","key":"925_CR1","doi-asserted-by":"publisher","first-page":"844","DOI":"10.1001\/jama.2020.1166","volume":"323","author":"OJ Wouters","year":"2020","unstructured":"Wouters OJ, McKee M, Luyten J (2020) Estimated research and development investment needed to bring a new medicine to market, 2009\u20132018. JAMA 323(9):844\u2013853. https:\/\/doi.org\/10.1001\/jama.2020.1166","journal-title":"JAMA"},{"key":"925_CR2","doi-asserted-by":"publisher","DOI":"10.1016\/j.heliyon.2023.e17575","author":"R Qureshi","year":"2023","unstructured":"Qureshi R, Irfan M, Gondal TM, Khan S, Wu J, Hadi MU, Heymach J, Le X, Yan H, Alam T (2023) AI in drug discovery and its clinical relevance. Heliyon. https:\/\/doi.org\/10.1016\/j.heliyon.2023.e17575","journal-title":"Heliyon"},{"key":"925_CR3","doi-asserted-by":"publisher","DOI":"10.3390\/ph14030237","author":"F Pognan","year":"2021","unstructured":"Pognan F, Steger-Hartmann T, D\u00edaz C, Blomberg N, Bringezu F, Briggs K, Callegaro G, Capella-Gutierrez S, Centeno E, Corvi J, Drew P, Drewe WC, Fern\u00e1ndez JM, Furlong LI, Guney E, Kors JA, Mayer MA, Pastor M, Pi\u00f1ero J, Ram\u00edrez-anguita JM, Ronzano F, Rowell P, Sa\u00fcch-pitarch J, Valencia A, Water B, Lei J, Mulligen E, Sanz F (2021) The eTRANSAFE project on translational safety assessment through integrative knowledge management: achievements and perspectives. Pharmaceuticals. 
https:\/\/doi.org\/10.3390\/ph14030237","journal-title":"Pharmaceuticals"},{"key":"925_CR4","doi-asserted-by":"publisher","DOI":"10.1038\/d41573-023-00099-5","author":"F Sanz","year":"2023","unstructured":"Sanz F, Pognan F, Steger-Hartmann T, Diaz C, Asakura S, Amberg A, B\u00e9court-Lhote N, Blomberg N, Bosc N, Briggs K, Bringezu F, Brulle-Wohlhueter C, Brunak S, Bueters R, Callegaro G, Capella-Gutierrez S, Centeno E, Corvi J, Cronin M, Wilkinson D (2023) eTRANSAFE: data science to empower translational safety assessment. Nat Rev Drug Discovery. https:\/\/doi.org\/10.1038\/d41573-023-00099-5","journal-title":"Nat Rev Drug Discovery"},{"key":"925_CR5","doi-asserted-by":"publisher","DOI":"10.14573\/altex.2011181","author":"K Briggs","year":"2021","unstructured":"Briggs K, Bosc N, Camara T, Diaz C, Drew P, Drewe WC, Kors JA, Mulligen EV, Pastor Maeso M, Pognan F (2021) Guidelines for FAIR sharing of preclinical safety and off-target pharmacology data. ALTEX. https:\/\/doi.org\/10.14573\/altex.2011181","journal-title":"ALTEX."},{"issue":"1","key":"925_CR6","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1186\/s13321-021-00509-z","volume":"13","author":"M Pastor","year":"2021","unstructured":"Pastor M, G\u00f3mez-Tamayo JC, Sanz F (2021) Flame: an open source framework for model development, hosting, and usage in production environments. J Cheminf 13(1):31. https:\/\/doi.org\/10.1186\/s13321-021-00509-z","journal-title":"J Cheminf"},{"key":"925_CR7","doi-asserted-by":"publisher","first-page":"2110","DOI":"10.1016\/j.csbj.2023.03.014","volume":"21","author":"J Pi\u00f1ero","year":"2023","unstructured":"Pi\u00f1ero J, Fraga PSR, Valls-Margarit J, Ronzano F, Accuosto P, Jane RL, Sanz F, Furlong LI (2023) Genomic and proteomic biomarker landscape in clinical trials. Comput Struct Biotechnol J 21:2110\u20132118. 
https:\/\/doi.org\/10.1016\/j.csbj.2023.03.014","journal-title":"Comput Struct Biotechnol J"},{"key":"925_CR8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s00204-021-03141-w","volume":"95","author":"G Callegaro","year":"2021","unstructured":"Callegaro G, Kunnen S, Trairatphisan P, Grosdidier S, Niemeijer M, Den Hollander W, Guney E, Pi\u00f1ero J, Furlong LI, Webster Y, Saez-Rodriguez J, Sutherland J, Mollon J, Stevens J, Water B (2021) The human hepatocyte TXG-MAPr: gene co-expression network modules to support mechanism-based risk assessment. Arch Toxicol 95:1\u201331. https:\/\/doi.org\/10.1007\/s00204-021-03141-w","journal-title":"Arch Toxicol"},{"key":"925_CR9","unstructured":"CDISC: SEND Controlled Terminology. Accessed: 2024-07-04. https:\/\/evs.nci.nih.gov\/ftp1\/CDISC\/SEND\/"},{"issue":"8","key":"925_CR10","doi-asserted-by":"publisher","first-page":"1006","DOI":"10.1177\/0192623318805743","volume":"46","author":"S Choudhary","year":"2018","unstructured":"Choudhary S, Walker A, Funk K, Keenan C, Khan I, Maratea K (2018) The standard for the exchange of nonclinical data (SEND): challenges and promises. Toxicol Pathol 46(8):1006\u20131012. https:\/\/doi.org\/10.1177\/0192623318805743","journal-title":"Toxicol Pathol"},{"issue":"3","key":"925_CR11","doi-asserted-by":"publisher","first-page":"174","DOI":"10.1016\/j.drudis.2006.12.012","volume":"12","author":"T Souza","year":"2007","unstructured":"Souza T, Kush R, Evans JP (2007) Global clinical data interchange standards are here! Drug Discovery Today 12(3):174\u2013181. 
https:\/\/doi.org\/10.1016\/j.drudis.2006.12.012","journal-title":"Drug Discovery Today"},{"issue":"12","key":"925_CR12","doi-asserted-by":"publisher","first-page":"811","DOI":"10.1038\/nrd.2017.177","volume":"16","author":"F Sanz","year":"2017","unstructured":"Sanz F, Pognan F, Steger-Hartmann T, D\u00edaz C, Cases M, Pastor M, Marc P, Wichard J, Briggs K, Watson DK, Klein\u00f6der T, Yang C, Amberg A, Beaumont M, Brookes AJ, Brunak S, Cronin MTD, Ecker GF, Escher S, Greene N, Guzm\u00e1n A, Hersey A, Jacques P, Lammens L, Mestres J, Muster W, Northeved H, Pinches M, Saiz J, Sajot N, Valencia A, Lei J, Vermeulen NPE, Vock E, Wolber G, Zamora I (2017) eTOX: Legacy data sharing to improve drug safety assessment: the eTOX project. Nat Rev Drug Discovery 16(12):811\u2013812. https:\/\/doi.org\/10.1038\/nrd.2017.177","journal-title":"Nat Rev Drug Discovery"},{"issue":"1","key":"925_CR13","doi-asserted-by":"publisher","first-page":"148","DOI":"10.1093\/bioinformatics\/btw579","volume":"33","author":"C Ravagli","year":"2016","unstructured":"Ravagli C, Pognan F, Marc P (2016) OntoBrowser: a collaborative tool for curation of ontologies by subject matter experts. Bioinformatics 33(1):148\u2013149. https:\/\/doi.org\/10.1093\/bioinformatics\/btw579","journal-title":"Bioinformatics"},{"key":"925_CR14","doi-asserted-by":"publisher","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention Is All You Need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS\u201917, pp. 6000\u20136010. Curran Associates Inc., Red Hook, NY, USA. https:\/\/doi.org\/10.48550\/arXiv.1706.03762","DOI":"10.48550\/arXiv.1706.03762"},{"key":"925_CR15","doi-asserted-by":"publisher","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 
In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171\u20134186. Association for Computational Linguistics, Minneapolis, Minnesota. https:\/\/doi.org\/10.18653\/v1\/N19-1423","DOI":"10.18653\/v1\/N19-1423"},{"key":"925_CR16","doi-asserted-by":"publisher","unstructured":"OpenAI et al (2024) GPT-4 Technical Report. https:\/\/doi.org\/10.48550\/arXiv.2303.08774","DOI":"10.48550\/arXiv.2303.08774"},{"issue":"4","key":"925_CR17","doi-asserted-by":"publisher","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","volume":"36","author":"J Lee","year":"2020","unstructured":"Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4):1234\u20131240. https:\/\/doi.org\/10.1093\/bioinformatics\/btz682","journal-title":"Bioinformatics"},{"key":"925_CR18","doi-asserted-by":"publisher","unstructured":"Michalopoulos G, Wang Y, Kaka H, Chen H, Wong A (2021) UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1744\u20131753. Association for Computational Linguistics, Online. https:\/\/doi.org\/10.18653\/v1\/2021.naacl-main.139","DOI":"10.18653\/v1\/2021.naacl-main.139"},{"key":"925_CR19","doi-asserted-by":"publisher","unstructured":"Alsentzer E, Murphy J, Boag W, Weng W-H, Jindi D, Naumann T, McDermott M (2019) Publicly Available Clinical BERT Embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72\u201378. Association for Computational Linguistics, Minneapolis, Minnesota, USA. 
https:\/\/doi.org\/10.18653\/v1\/W19-1909","DOI":"10.18653\/v1\/W19-1909"},{"issue":"3","key":"925_CR20","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1093\/bioinformatics\/btad103","volume":"39","author":"O Rohanian","year":"2023","unstructured":"Rohanian O, Nouriborji M, Kouchaki S, Clifton DA (2023) On the effectiveness of compact biomedical transformers. Bioinformatics 39(3):103. https:\/\/doi.org\/10.1093\/bioinformatics\/btad103","journal-title":"Bioinformatics"},{"key":"925_CR21","doi-asserted-by":"publisher","DOI":"10.1093\/bib\/bbac409","author":"R Luo","year":"2022","unstructured":"Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, Liu T-Y (2022) BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform. https:\/\/doi.org\/10.1093\/bib\/bbac409","journal-title":"Brief Bioinform"},{"key":"925_CR22","doi-asserted-by":"publisher","unstructured":"Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, Naumann T, Gao J, Poon H (2020) Domain-specific language model pretraining for biomedical natural language processing. https:\/\/doi.org\/10.1145\/3458754","DOI":"10.1145\/3458754"},{"issue":"1","key":"925_CR23","doi-asserted-by":"publisher","first-page":"467","DOI":"10.1093\/bib\/bbad467","volume":"25","author":"Y Zhang","year":"2024","unstructured":"Zhang Y, Liu C, Liu M, Liu T, Lin H, Huang C-B, Ning L (2024) Attention is all you need: utilizing attention in AI-enabled drug discovery. Brief Bioinform 25(1):467. https:\/\/doi.org\/10.1093\/bib\/bbad467","journal-title":"Brief Bioinform"},{"issue":"11","key":"925_CR24","doi-asserted-by":"publisher","first-page":"2593","DOI":"10.1016\/j.drudis.2021.06.009","volume":"26","author":"Z Liu","year":"2021","unstructured":"Liu Z, Roberts RA, Lal-Nag M, Chen X, Huang R, Tong W (2021) AI-based language models powering drug discovery and development. Drug Discovery Today 26(11):2593\u20132607. 
https:\/\/doi.org\/10.1016\/j.drudis.2021.06.009","journal-title":"Drug Discovery Today"},{"issue":"D1","key":"925_CR25","doi-asserted-by":"publisher","first-page":"845","DOI":"10.1093\/nar\/gkz1021","volume":"48","author":"J Pi\u00f1ero","year":"2019","unstructured":"Pi\u00f1ero J, Ram\u00edrez-Anguita JM, Sa\u00fcch-Pitarch J, Ronzano F, Centeno E, Sanz F, Furlong LI (2019) The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res 48(D1):845\u2013855. https:\/\/doi.org\/10.1093\/nar\/gkz1021","journal-title":"Nucleic Acids Res"},{"issue":"1","key":"925_CR26","doi-asserted-by":"publisher","first-page":"37","DOI":"10.1186\/s13321-018-0290-y","volume":"10","author":"P Thompson","year":"2018","unstructured":"Thompson P, Daikou S, Ueno K, Batista-Navarro R, Tsujii J, Ananiadou S (2018) Annotation and detection of drug effects in text for pharmacovigilance. J Cheminf 10(1):37. https:\/\/doi.org\/10.1186\/s13321-018-0290-y","journal-title":"J Cheminf"},{"issue":"1","key":"925_CR27","doi-asserted-by":"publisher","first-page":"0279842","DOI":"10.1371\/journal.pone.0279842","volume":"18","author":"RM Murphy","year":"2023","unstructured":"Murphy RM, Klopotowska JE, Keizer NF, Jager KJ, Leopold JH, Dongelmans DA, Abu-Hanna A, Schut MC (2023) Adverse drug event detection using natural language processing: a scoping review of supervised learning methods. PLoS ONE 18(1):0279842. https:\/\/doi.org\/10.1371\/journal.pone.0279842","journal-title":"PLoS ONE"},{"issue":"2","key":"925_CR28","doi-asserted-by":"publisher","first-page":"8214","DOI":"10.2196\/publichealth.8214","volume":"4","author":"D Bollegala","year":"2018","unstructured":"Bollegala D, Maskell S, Sloane R, Hajne J, Pirmohamed M (2018) Causality patterns for detecting adverse drug reactions from social media: text mining approach. JMIR Public Health Surveill 4(2):8214. 
https:\/\/doi.org\/10.2196\/publichealth.8214","journal-title":"JMIR Public Health Surveill"},{"key":"925_CR29","doi-asserted-by":"publisher","DOI":"10.1111\/bcp.15068","author":"ATM Wasylewicz","year":"2021","unstructured":"Wasylewicz ATM, Burgt B, Weterings A, Jessurun N, Korsten E, Egberts T, Bouwman A, Kerskes M, Grouls R, Linden C (2021) Identifying adverse drug reactions from free-text electronic hospital health record notes. Br J Clin Pharmacol. https:\/\/doi.org\/10.1111\/bcp.15068","journal-title":"Br J Clin Pharmacol"},{"key":"925_CR30","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2021.103960","volume":"125","author":"S Narayanan","year":"2022","unstructured":"Narayanan S, Mannam K, Achan P, Ramesh MV, Rangan PV, Rajan SP (2022) A contextual multi-task neural approach to medication and adverse events identification from clinical text. J Biomed Inf 125:103960. https:\/\/doi.org\/10.1016\/j.jbi.2021.103960","journal-title":"J Biomed Inf"},{"key":"925_CR31","doi-asserted-by":"publisher","DOI":"10.1016\/j.bbiosy.2022.100061","volume":"7","author":"MPF Corradi","year":"2022","unstructured":"Corradi MPF, de Haan AM, Staumont B, Piersma AH, Geris L, Pieters RHH, Krul CAM, Teunis MAT (2022) Natural language processing in toxicology: delineating adverse outcome pathways and guiding the application of new approach methodologies. Biomater Biosyst 7:100061. https:\/\/doi.org\/10.1016\/j.bbiosy.2022.100061","journal-title":"Biomater Biosyst"},{"key":"925_CR32","unstructured":"PDS Consultants: SR-Domain template and concept. Copyright notice (2024)"},{"key":"925_CR33","doi-asserted-by":"publisher","DOI":"10.1093\/bib\/bbz130","author":"M Neves","year":"2021","unstructured":"Neves M, \u0160eva J (2021) An extensive review of tools for manual annotation of documents. Brief Bioinform. 
https:\/\/doi.org\/10.1093\/bib\/bbz130","journal-title":"Brief Bioinform"},{"key":"925_CR34","unstructured":"Yimam SM, Gurevych I, Castilho R, Biemann C (2013) WebAnno: a flexible, web-based and visually supported system for distributed annotations. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 1\u20136. Association for Computational Linguistics, Sofia, Bulgaria. https:\/\/www.aclweb.org\/anthology\/P13-4001"},{"issue":"4","key":"925_CR35","doi-asserted-by":"publisher","first-page":"555","DOI":"10.1162\/coli.07-034-R2","volume":"34","author":"R Artstein","year":"2008","unstructured":"Artstein R, Poesio M (2008) Survey article: inter-coder agreement for computational linguistics. Comput Linguist 34(4):555\u2013596. https:\/\/doi.org\/10.1162\/coli.07-034-R2","journal-title":"Comput Linguist"},{"key":"925_CR36","unstructured":"Castro S (2017) Fast Krippendorff: fast computation of Krippendorff\u2019s alpha agreement measure. GitHub"},{"key":"925_CR37","unstructured":"GROBID. GitHub. Accessed: 2024-07-18 (2008\u20132024). https:\/\/github.com\/kermitt2\/grobid"},{"key":"925_CR38","doi-asserted-by":"publisher","unstructured":"Smith R (2007) An overview of the tesseract OCR engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 629\u2013633. https:\/\/doi.org\/10.1109\/ICDAR.2007.4376991","DOI":"10.1109\/ICDAR.2007.4376991"},{"issue":"10","key":"925_CR39","doi-asserted-by":"publisher","first-page":"50","DOI":"10.5120\/8794-2784","volume":"55","author":"C Patel","year":"2012","unstructured":"Patel C, Patel A, Patel D (2012) Optical character recognition by open source OCR tool tesseract: a case study. Int J Comput Appl 55(10):50\u201356. 
https:\/\/doi.org\/10.5120\/8794-2784","journal-title":"Int J Comput Appl"},{"key":"925_CR40","doi-asserted-by":"publisher","DOI":"10.3390\/sym12050715","author":"D Sporici","year":"2020","unstructured":"Sporici D, Cusnir E, Boiangiu C-A (2020) Improving the accuracy of tesseract 4.0 ocr engine using convolution-based preprocessing. Symmetry. https:\/\/doi.org\/10.3390\/sym12050715","journal-title":"Symmetry"},{"key":"925_CR41","doi-asserted-by":"publisher","unstructured":"Brisinello M, Grbi\u0107 R, Pul M, An\u0111eli\u0107 T (2017) Improving optical character recognition performance for low quality images. In: 2017 International Symposium ELMAR, pp. 167\u2013171. https:\/\/doi.org\/10.23919\/ELMAR.2017.8124460","DOI":"10.23919\/ELMAR.2017.8124460"},{"key":"925_CR42","doi-asserted-by":"publisher","unstructured":"Manning C, Surdeanu M, Bauer J, Finkel J, Bethard S, McClosky D (2014) The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55\u201360. Association for Computational Linguistics, Baltimore, Maryland. https:\/\/doi.org\/10.3115\/v1\/P14-5010","DOI":"10.3115\/v1\/P14-5010"},{"key":"925_CR43","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pcbi.1002854","author":"H Cunningham","year":"2013","unstructured":"Cunningham H, Tablan V, Roberts A, Bontcheva K (2013) Getting more out of biomedical documents with GATE\u2019s full lifecycle open source text analytics. PLoS Comput Biol. https:\/\/doi.org\/10.1371\/journal.pcbi.1002854","journal-title":"PLoS Comput Biol"},{"key":"925_CR44","doi-asserted-by":"publisher","unstructured":"Chalkidis I, Fergadiotis M, Malakasiotis P, Aletras N, Androutsopoulos I (2020) LEGAL-BERT: the Muppets straight out of Law School. In: Cohn T, He Y, Liu Y (eds.) Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 2898\u20132904. Association for Computational Linguistics, Online. 
https:\/\/doi.org\/10.18653\/v1\/2020.findings-emnlp.261","DOI":"10.18653\/v1\/2020.findings-emnlp.261"},{"key":"925_CR45","doi-asserted-by":"publisher","unstructured":"Araci D (2019) FinBERT: financial sentiment analysis with pre-trained language models. CoRR abs\/1908.10063 https:\/\/doi.org\/10.48550\/arXiv.1908.10063","DOI":"10.48550\/arXiv.1908.10063"},{"key":"925_CR46","doi-asserted-by":"publisher","unstructured":"Chithrananda S, Grand G, Ramsundar B (2020) ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. https:\/\/doi.org\/10.48550\/arXiv.2010.09885","DOI":"10.48550\/arXiv.2010.09885"},{"key":"925_CR47","doi-asserted-by":"publisher","unstructured":"Loshchilov I, Hutter F (2019) Decoupled weight decay regularization. https:\/\/doi.org\/10.48550\/arXiv.1711.05101","DOI":"10.48550\/arXiv.1711.05101"},{"issue":"suppl\u20131","key":"925_CR48","doi-asserted-by":"publisher","first-page":"267","DOI":"10.1093\/nar\/gkh061","volume":"32","author":"O Bodenreider","year":"2004","unstructured":"Bodenreider O (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(suppl\u20131):267\u2013270. https:\/\/doi.org\/10.1093\/nar\/gkh061","journal-title":"Nucleic Acids Res"},{"key":"925_CR49","unstructured":"Cunningham H, Maynard D, Tablan V (2000) JAPE: a Java Annotation Patterns Engine. https:\/\/api.semanticscholar.org\/CorpusID:59651445"},{"key":"925_CR50","unstructured":"Stenetorp P, Pyysalo S, Topi\u0107 G, Ohta T, Ananiadou S, Tsujii J (2012) brat: a Web-based Tool for NLP-Assisted Text Annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 102\u2013107. Association for Computational Linguistics, Avignon, France. https:\/\/aclanthology.org\/E12-2021"},{"key":"925_CR51","unstructured":"Kumar V, Choudhary A, Cho E (2020) Data Augmentation using Pre-trained Transformer Models. 
In: Campbell WM, Waibel A, Hakkani-Tur D, Hazen TJ, Kilgour K, Cho E, Kumar V, Glaude H (eds.) Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems, pp. 18\u201326. Association for Computational Linguistics, Suzhou, China. https:\/\/aclanthology.org\/2020.lifelongnlp-1.3"},{"key":"925_CR52","unstructured":"Edwards A, Ushio A, Camacho-collados J, Ribaupierre H, Preece A (2022) Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification. In: Dragut E, Li Y, Popa L, Vucetic S, Srivastava S (eds.) Proceedings of the Fourth Workshop on Data Science with Human-in-the-Loop (Language Advances), pp. 51\u201363. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid). https:\/\/aclanthology.org\/2022.dash-1.8\/"},{"key":"925_CR53","doi-asserted-by":"publisher","unstructured":"Cai J, Huang S, Jiang Y, Tan Z, Xie P, Tu K (2023) Improving Low-resource Named Entity Recognition with Graph Propagated Data Augmentation. In: Rogers A, Boyd-Graber J, Okazaki N (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 110\u2013118. Association for Computational Linguistics, Toronto, Canada. https:\/\/doi.org\/10.18653\/v1\/2023.acl-short.11","DOI":"10.18653\/v1\/2023.acl-short.11"},{"key":"925_CR54","doi-asserted-by":"publisher","unstructured":"Zhou R, Li X, He R, Bing L, Cambria E, Si L, Miao C (2022) MELM: data augmentation with masked entity language modeling for low-resource NER. In: Muresan S, Nakov P, Villavicencio A (eds.) Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2251\u20132262. Association for Computational Linguistics, Dublin, Ireland. 
https:\/\/doi.org\/10.18653\/v1\/2022.acl-long.160","DOI":"10.18653\/v1\/2022.acl-long.160"},{"issue":"1","key":"925_CR55","doi-asserted-by":"publisher","first-page":"493","DOI":"10.1093\/bib\/bbad493","volume":"25","author":"S Tian","year":"2024","unstructured":"Tian S, Jin Q, Yeganova L, Lai P-T, Zhu Q, Chen X, Yang Y, Chen Q, Kim W, Comeau DC, Islamaj R, Kapoor A, Gao X, Lu Z (2024) Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform 25(1):493. https:\/\/doi.org\/10.1093\/bib\/bbad493","journal-title":"Brief Bioinform"},{"key":"925_CR56","unstructured":"Chen Q, Hu Y, Peng X, Xie Q, Jin Q, Gilson A, Singer MB, Ai X, Lai P-T, Wang Z, Keloth VK, Raja K, Huang J, He H, Lin F, Du J, Zhang R, Zheng WJ, Adelman RA, Lu Z, Xu H (2024) A systematic evaluation of large language models for biomedical natural language processing: benchmarks, baselines, and recommendations. https:\/\/arxiv.org\/abs\/2305.16326"},{"issue":"3","key":"925_CR57","doi-asserted-by":"publisher","first-page":"343","DOI":"10.14573\/altex.2001311","volume":"37","author":"T Steger-Hartmann","year":"2020","unstructured":"Steger-Hartmann T, Kreuchwig A, Vaas L, Wichard J, Bringezu F, Amberg A, Muster W, Pognan F, Barber C (2020) Introducing the concept of virtual control groups into preclinical toxicology testing. ALTEX 37(3):343\u2013349. https:\/\/doi.org\/10.14573\/altex.2001311","journal-title":"ALTEX"},{"issue":"2","key":"925_CR58","doi-asserted-by":"publisher","first-page":"282","DOI":"10.14573\/altex.2310041","volume":"41","author":"E Golden","year":"2024","unstructured":"Golden E, Allen D, Amberg A, Anger LT, Baker E, Baran SW, Bringezu F, Clark M, Duchateau-Nguyen G, Escher SE et al (2024) Toward implementing virtual control groups in nonclinical safety studies: workshop report and roadmap to implementation. ALTEX 41(2):282\u2013301. 
https:\/\/doi.org\/10.14573\/altex.2310041","journal-title":"ALTEX"},{"issue":"4","key":"925_CR59","doi-asserted-by":"publisher","first-page":"1029","DOI":"10.1161\/hypertensionaha.120.16340","volume":"77","author":"A Vlahou","year":"2021","unstructured":"Vlahou A, Hallinan D, Apweiler R, Argiles A, Beige J, Benigni A, Bischoff R, Black PC, Boehm F, C\u00e9raline J, Chrousos GP, Delles C, Evenepoel P, Fridolin I, Glorieux G, Gool AJ, Heidegger I, Ioannidis JPA, Jankowski J, Jankowski V, Jeronimo C, Kamat AM, Masereeuw R, Mayer G, Mischak H, Ortiz A, Remuzzi G, Rossing P, Schanstra JP, Schmitz-Dr\u00e4ger BJ, Spasovski G, Staessen JA, Stamatialis D, Stenvinkel P, Wanner C, Williams SB, Zannad F, Zoccali C, Vanholder R (2021) Data sharing under the general data protection regulation. Hypertension 77(4):1029\u20131035. https:\/\/doi.org\/10.1161\/hypertensionaha.120.16340","journal-title":"Hypertension"},{"issue":"7","key":"925_CR60","doi-asserted-by":"publisher","first-page":"692","DOI":"10.1038\/ng.3312","volume":"47","author":"I Lappalainen","year":"2015","unstructured":"Lappalainen I, Almeida-King J, Kumanduri V, Senf A, Spalding JD, Ur-Rehman S, Saunders G, Kandasamy J, Caccamo M, Leinonen R, Vaughan B, Laurent T, Rowland F, Marin-Garcia P, Barker J, Jokinen P, Torres AC, Argila JR, Llobet OM, Medina I, Puy MS, Alberich M, Torre S, Navarro A, Paschall J, Flicek P (2015) The European Genome-phenome Archive of human data consented for biomedical research. Nat Genet 47(7):692\u2013695. 
https:\/\/doi.org\/10.1038\/ng.3312","journal-title":"Nat Genet"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-024-00925-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-024-00925-x\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-024-00925-x.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,3]],"date-time":"2025-02-03T12:32:08Z","timestamp":1738585928000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-024-00925-x"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,3]]},"references-count":60,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["925"],"URL":"https:\/\/doi.org\/10.1186\/s13321-024-00925-x","relation":{},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,3]]},"assertion":[{"value":"8 August 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"4 November 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 February 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no Conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing 
interests"}}],"article-number":"15"}}