{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,21]],"date-time":"2026-01-21T16:05:57Z","timestamp":1769011557523,"version":"3.49.0"},"reference-count":33,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2024,5,28]],"date-time":"2024-05-28T00:00:00Z","timestamp":1716854400000},"content-version":"vor","delay-in-days":148,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100003130","name":"Fonds Wetenschappelijk Onderzoek","doi-asserted-by":"publisher","award":["I002819N"],"award-info":[{"award-number":["I002819N"]}],"id":[{"id":"10.13039\/501100003130","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100008530","name":"European Regional Development Fund","doi-asserted-by":"publisher","award":["35276964"],"award-info":[{"award-number":["35276964"]}],"id":[{"id":"10.13039\/501100008530","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100009595","name":"Service Public de Wallonie","doi-asserted-by":"publisher","award":["2010235-ARIAC"],"award-info":[{"award-number":["2010235-ARIAC"]}],"id":[{"id":"10.13039\/501100009595","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100002661","name":"Fonds De La Recherche Scientifique - FNRS","doi-asserted-by":"publisher","award":["2.5020.11"],"award-info":[{"award-number":["2.5020.11"]}],"id":[{"id":"10.13039\/501100002661","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004744","name":"Innoviris","doi-asserted-by":"publisher","award":["2020 RDIR 55b"],"award-info":[{"award-number":["2020 RDIR 55b"]}],"id":[{"id":"10.13039\/501100004744","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2024,5,28]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite the reports in literature that epistatic effects between combinations of variants in different loci (or genes) are important to understand disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared to train tools to help in the curation of scientific literature. 
To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation, thus getting assistance in finding the most informative subset of samples to label. By pre-annotating 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) with PubTator, text fragments featuring potential digenic variant combinations, i.e. gene\u2013variant\u2013gene\u2013variant, were extracted. The resulting fragments of texts were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, is used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500\u2009000 text fragments were considered for annotation, finally resulting in a dataset with 8442 fragments, 794 of them being positive instances, covering 95% of the original annotated articles. When applied to gene\u2013variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model, underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE dataset relevant for biomedical curation applications. DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. 
It is made freely available for research on GitHub and Hugging Face.<\/jats:p>\n               <jats:p>Database URL: https:\/\/huggingface.co\/datasets\/cnachteg\/duvel or https:\/\/doi.org\/10.57967\/hf\/1571<\/jats:p>","DOI":"10.1093\/database\/baae039","type":"journal-article","created":{"date-parts":[[2024,5,28]],"date-time":"2024-05-28T12:54:03Z","timestamp":1716900843000},"source":"Crossref","is-referenced-by-count":3,"title":["DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations"],"prefix":"10.1093","volume":"2024","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5034-8975","authenticated-orcid":false,"given":"Charlotte","family":"Nachtegael","sequence":"first","affiliation":[{"name":"Interuniversity Institute of Bioinformatics in Brussels, Universit\u00e9 Libre de Bruxelles-Vrije Universiteit Brussel , Boulevard du Triomphe, CP 263, Brussels 1050, Belgium"},{"name":"Machine Learning Group, Universit\u00e9 Libre de Bruxelles , Boulevard du Triomphe, CP 212, Brussels 1050, Belgium"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0257-4537","authenticated-orcid":false,"given":"Jacopo","family":"De Stefani","sequence":"additional","affiliation":[{"name":"Machine Learning Group, Universit\u00e9 Libre de Bruxelles , Boulevard du Triomphe, CP 212, Brussels 1050, Belgium"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6363-6506","authenticated-orcid":false,"given":"Anthony","family":"Cnudde","sequence":"additional","affiliation":[{"name":"Machine Learning Group, Universit\u00e9 Libre de Bruxelles , Boulevard du Triomphe, CP 212, Brussels 1050, Belgium"},{"name":"Pharmacologie, Pharmacoth\u00e9rapie et Suivi Pharmaceutique, Universit\u00e9 Libre de Bruxelles , Boulevard du Triomphe, CP 205, Brussels 1050, Belgium"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3645-1455","authenticated-orcid":false,"given":"Tom","family":"Lenaerts","sequence":"additional","affiliation":[{"name":"Interuniversity Institute of Bioinformatics in 
Brussels, Universit\u00e9 Libre de Bruxelles-Vrije Universiteit Brussel , Boulevard du Triomphe, CP 263, Brussels 1050, Belgium"},{"name":"Machine Learning Group, Universit\u00e9 Libre de Bruxelles , Boulevard du Triomphe, CP 212, Brussels 1050, Belgium"},{"name":"Artificial Intelligence Laboratory, Vrije Universiteit Brussel , Pleinlaan 2, Brussels 1050, Belgium"}]}],"member":"286","published-online":{"date-parts":[[2024,5,28]]},"reference":[{"key":"2024062912230297800_R1","doi-asserted-by":"crossref","first-page":"W587","DOI":"10.1093\/nar\/gkz389","article-title":"PubTator central: automated concept annotation for biomedical full text articles","volume":"47","author":"Wei","year":"2019","journal-title":"Nucleic Acids Res."},{"key":"2024062912230297800_R2","doi-asserted-by":"crossref","first-page":"139","DOI":"10.1016\/j.artmed.2004.07.016","article-title":"Comparative experiments on learning information extractors for proteins and their interactions","volume":"33","author":"Bunescu","year":"2005","journal-title":"Artif. Intell. Med."},{"key":"2024062912230297800_R3","doi-asserted-by":"crossref","DOI":"10.1186\/1471-2105-8-50","article-title":"BioInfer: a corpus for information extraction in the biomedical domain","volume":"8","author":"Pyysalo","year":"2007","journal-title":"BMC Bioinf."},{"key":"2024062912230297800_R4","doi-asserted-by":"crossref","first-page":"914","DOI":"10.1016\/j.jbi.2013.07.011","article-title":"The DDI corpus: an annotated corpus with pharmacological substances and drug-drug interactions","volume":"46","author":"Herrero-Zazo","year":"2013","journal-title":"J. Biomed. 
Inform."},{"key":"2024062912230297800_R5","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2022.naacl-main.233","article-title":"A dataset for N-ary relation extraction of drug combinations","author":"Tiktinsky","year":"2022"},{"key":"2024062912230297800_R6","doi-asserted-by":"crossref","DOI":"10.1093\/database\/baad080","article-title":"Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations","volume":"2023","author":"Miranda-Escalada","year":"2023","journal-title":"Database"},{"key":"2024062912230297800_R7","doi-asserted-by":"crossref","first-page":"101","DOI":"10.1162\/tacl_a_00049","article-title":"Cross-sentence N-ary relation extraction with graph LSTMs","volume":"5","author":"Peng","year":"2017","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"2024062912230297800_R8","article-title":"BioCreative V CDR task corpus: a resource for chemical disease relation extraction","volume":"2016","author":"Li","year":"2016","journal-title":"Database"},{"key":"2024062912230297800_R9","doi-asserted-by":"crossref","first-page":"408","DOI":"10.1093\/bioinformatics\/btq667","article-title":"Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature","volume":"27","author":"Doughty","year":"2011","journal-title":"Bioinformatics"},{"key":"2024062912230297800_R10","article-title":"RENET2: high-performance full-text gene-disease relation extraction with iterative training data expansion","volume":"3","author":"Su","year":"2021","journal-title":"NAR Genom. Bioinform."},{"key":"2024062912230297800_R11","doi-asserted-by":"crossref","DOI":"10.1093\/bib\/bbac282","article-title":"BioRED: a rich biomedical relation extraction dataset","volume":"23","author":"Luo","year":"2022","journal-title":"Brief. 
Bioinform."},{"key":"2024062912230297800_R12","doi-asserted-by":"crossref","DOI":"10.1093\/database\/baac023","article-title":"Scaling up oligogenic diseases research with OLIDA: the Oligogenic Diseases Database","volume":"2022","author":"Nachtegael","year":"2022","journal-title":"Database"},{"key":"2024062912230297800_R13","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1093\/bib\/bbv024","article-title":"Community challenges in biomedical text mining over 10 years: success, failure and the future","volume":"17","author":"Huang","year":"2016","journal-title":"Brief. Bioinform."},{"key":"2024062912230297800_R14","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1038\/nature14539","article-title":"Deep learning","volume":"521","author":"LeCun","year":"2015","journal-title":"Nature"},{"key":"2024062912230297800_R15","first-page":"1070","article-title":"An analysis of active learning strategies for sequence labeling tasks","author":"Settles","year":"2008"},{"key":"2024062912230297800_R16","doi-asserted-by":"crossref","DOI":"10.1371\/journal.pone.0292356","article-title":"A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction","volume":"18","author":"Nachtegael","year":"2023","journal-title":"PLoS One"},{"key":"2024062912230297800_R17","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/W19-5034","article-title":"ScispaCy: Fast and robust models for biomedical natural language processing","author":"Neumann","year":"2019"},{"key":"2024062912230297800_R18","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3458754","article-title":"Domain-specific language model pretraining for biomedical natural language processing","volume":"3","author":"Gu","year":"2022","journal-title":"ACM Trans. Comput. 
Healthc."},{"key":"2024062912230297800_R19","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/W19-5006","article-title":"Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets","author":"Peng","year":"2019"},{"key":"2024062912230297800_R20","first-page":"161","article-title":"An Improved Baseline for Sentence-level Relation Extraction","author":"Zhou","year":"2022"},{"key":"2024062912230297800_R21","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2023.eacl-demo.14","article-title":"ALAMBIC: Active learning automation methods to battle inefficient curation","author":"Nachtegael","year":"2023"},{"key":"2024062912230297800_R22","first-page":"309","volume-title":"Advances in Intelligent Data Analysis, Lecture Notes in Computer Science","author":"Scheffer","year":"2001"},{"key":"2024062912230297800_R23","article-title":"Active learning to recognize multiple types of plankton","author":"Luo","year":"2004"},{"key":"2024062912230297800_R24","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2020.emnlp-demos.6","article-title":"Transformers: State-of-the-art natural language processing","author":"Wolf","year":"2020"},{"key":"2024062912230297800_R25","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2022.acl-long.551","article-title":"LinkBERT: Pretraining language models with document links","author":"Yasunaga","year":"2022"},{"key":"2024062912230297800_R26","first-page":"221","article-title":"BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA","author":"Alrowili","year":"2021"},{"key":"2024062912230297800_R27","doi-asserted-by":"crossref","first-page":"3708","DOI":"10.1007\/s11227-015-1541-6","article-title":"Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms","volume":"72","author":"Li","year":"2016","journal-title":"J. 
Supercomput."},{"key":"2024062912230297800_R28","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1037\/h0026256","article-title":"Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit","volume":"70","author":"Cohen","year":"1968","journal-title":"Psychol. Bull"},{"key":"2024062912230297800_R29","article-title":"ELECTRA: Pre-training text encoders as discriminators rather than generators","author":"Clark","year":"2019"},{"key":"2024062912230297800_R30","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2021.naacl-main.102","article-title":"Understanding by understanding not: Modeling negation in language models","author":"Hosseini","year":"2021"},{"key":"2024062912230297800_R31","doi-asserted-by":"crossref","first-page":"5678","DOI":"10.1093\/bioinformatics\/btaa1087","article-title":"BERT-GT: cross-sentence n -ary relation extraction with BERT and Graph Transformer","volume":"36","author":"Lai","year":"2021","journal-title":"Bioinformatics"},{"key":"2024062912230297800_R32","doi-asserted-by":"crossref","DOI":"10.1016\/j.jbi.2023.104445","article-title":"Extracting biomedical relation from cross-sentence text using syntactic dependency graph attention network","volume":"144","author":"Zhou","year":"2023","journal-title":"J. Biomed. 
Inform."},{"key":"2024062912230297800_R33","first-page":"159","article-title":"Label sleuth: from unlabeled text to a classifier in a few hours","author":"Shnarch","year":"2022"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baae039\/58365505\/baae039.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baae039\/58365505\/baae039.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,6,29]],"date-time":"2024-06-29T12:32:25Z","timestamp":1719664345000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baae039\/7683721"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,1]]},"references-count":33,"URL":"https:\/\/doi.org\/10.1093\/database\/baae039","relation":{},"ISSN":["1758-0463"],"issn-type":[{"value":"1758-0463","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024,1,1]]},"published":{"date-parts":[[2024,1,1]]},"article-number":"baae039"}}