{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,22]],"date-time":"2025-11-22T11:19:24Z","timestamp":1763810364310,"version":"3.41.2"},"reference-count":44,"publisher":"Oxford University Press (OUP)","license":[{"start":{"date-parts":[[2020,12,1]],"date-time":"2020-12-01T00:00:00Z","timestamp":1606780800000},"content-version":"vor","delay-in-days":335,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001871","name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","doi-asserted-by":"publisher","award":["PTDC\/CCI-BIO\/28685\/2017"],"award-info":[{"award-number":["PTDC\/CCI-BIO\/28685\/2017"]}],"id":[{"id":"10.13039\/501100001871","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001871","name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","doi-asserted-by":"publisher","award":["UIDB\/00408\/2020"],"award-info":[{"award-number":["UIDB\/00408\/2020"]}],"id":[{"id":"10.13039\/501100001871","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001871","name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","doi-asserted-by":"publisher","award":["UIDP\/00408\/2020"],"award-info":[{"award-number":["UIDP\/00408\/2020"]}],"id":[{"id":"10.13039\/501100001871","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100004895","name":"Fundo Social Europeu","doi-asserted-by":"crossref","award":["SFRH\/BD\/145221\/2019"],"award-info":[{"award-number":["SFRH\/BD\/145221\/2019"]}],"id":[{"id":"10.13039\/501100004895","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,12,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Biomedical relation extraction (RE) datasets are vital in the construction of knowledge bases and to potentiate the discovery of new interactions. 
There are several ways to create biomedical RE datasets, some more reliable than others, such as resorting to domain expert annotations. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can potentially reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. There is a lack of power of the researcher to control who, how and in what context workers engage in crowdsourcing platforms. Hence, allying distant supervision with crowdsourcing can be a more reliable alternative. The crowdsourcing workers would be asked only to rectify or discard already existing annotations, which would make the process less dependent on their ability to interpret complex biomedical sentences. In this work, we use a previously created distantly supervised human phenotype\u2013gene relations (PGR) dataset to perform crowdsourcing validation. We divided the original dataset into two annotation tasks: Task 1, 70% of the dataset annotated by one worker, and Task 2, 30% of the dataset annotated by seven workers. Also, for Task 2, we added an extra rater on-site and a domain expert to further assess the crowdsourcing validation quality. Here, we describe a detailed pipeline for RE crowdsourcing validation, creating a new release of the PGR dataset with partial domain expert revision, and assess the quality of the MTurk platform. We applied the new dataset to two state-of-the-art deep learning systems (BiOnt and BioBERT) and compared its performance with the original PGR dataset, as well as combinations between the two, achieving a 0.3494 increase in average F-measure. 
The code supporting our work and the new release of the PGR dataset is available at https:\/\/github.com\/lasigeBioTM\/PGR-crowd.<\/jats:p>","DOI":"10.1093\/database\/baaa104","type":"journal-article","created":{"date-parts":[[2020,11,13]],"date-time":"2020-11-13T04:20:07Z","timestamp":1605241207000},"source":"Crossref","is-referenced-by-count":7,"title":["A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing"],"prefix":"10.1093","volume":"2020","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0597-9273","authenticated-orcid":false,"given":"Diana","family":"Sousa","sequence":"first","affiliation":[{"name":"LASIGE, Departamento de Inform\u00e1tica, Faculdade de Ci\u00eancias, Universidade de Lisboa, Lisboa 1749-016, Portugal"}]},{"given":"Andre","family":"Lamurias","sequence":"additional","affiliation":[{"name":"LASIGE, Departamento de Inform\u00e1tica, Faculdade de Ci\u00eancias, Universidade de Lisboa, Lisboa 1749-016, Portugal"}]},{"given":"Francisco M","family":"Couto","sequence":"additional","affiliation":[{"name":"LASIGE, Departamento de Inform\u00e1tica, Faculdade de Ci\u00eancias, Universidade de Lisboa, Lisboa 1749-016, Portugal"}]}],"member":"286","published-online":{"date-parts":[[2020,12,1]]},"reference":[{"key":"2020120107365293400_R1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/database\/baaa006","article-title":"Text mining meets community curation: a newly designed curation platform to improve author experience and participation at WormBase","volume":"2020","author":"Arnaboldi","year":"2020","journal-title":"Database"},{"key":"2020120107365293400_R2","doi-asserted-by":"crossref","first-page":"914","DOI":"10.1016\/j.jbi.2013.07.011","article-title":"The DDI corpus: an annotated corpus with pharmacological substances and drug-drug interactions","volume":"46","author":"Herrero-Zazo","year":"2013","journal-title":"J. Biomed. 
Inform."},{"key":"2020120107365293400_R3","doi-asserted-by":"crossref","first-page":"1226","DOI":"10.1093\/bioinformatics\/btz678","article-title":"Applying citizen science to gene, drug and disease relationship extraction from biomedical abstracts","volume":"36","author":"Tsueng","year":"2020","journal-title":"Bioinformatics"},{"key":"2020120107365293400_R4","first-page":"1487","article-title":"A silver standard corpus of human phenotype-gene relations","author":"Sousa","year":"2019"},{"key":"2020120107365293400_R5","first-page":"1747","article-title":"Ranking sentences for extractive summarization with reinforcement learning","author":"Narayan","year":"2018"},{"key":"2020120107365293400_R6","first-page":"204","article-title":"Non-expert correction of automatically generated relation annotations","author":"Gormley","year":"2010"},{"key":"2020120107365293400_R7","first-page":"897","article-title":"Effective crowd annotation for relation extraction","author":"Liu","year":"2016"},{"key":"2020120107365293400_R8","first-page":"290","article-title":"Annotating relations between named entities with crowdsourcing","author":"Collovini","year":"2018"},{"key":"2020120107365293400_R9","first-page":"1","article-title":"Creating speech and language data with Amazon\u2019s Mechanical Turk","author":"Callison-Burch","year":"2010"},{"key":"2020120107365293400_R10","first-page":"64","article-title":"Quality management on Amazon Mechanical Turk","author":"Ipeirotis","year":"2010"},{"key":"2020120107365293400_R11","first-page":"180","article-title":"Preliminary experience with Amazon\u2019s Mechanical Turk for annotating medical named entities","author":"Yetisgen-Yildiz","year":"2010"},{"key":"2020120107365293400_R12","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/database\/bav016","article-title":"Scaling drug indication curation through 
crowdsourcing","volume":"2015","author":"Khare","year":"2015","journal-title":"Database"},{"key":"2020120107365293400_R13","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1007\/s10579-012-9176-1","article-title":"Perspectives on crowdsourcing annotations for natural language processing","volume":"47","author":"Wang","year":"2013","journal-title":"Lang. Resour. Eval."},{"key":"2020120107365293400_R14","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1093\/database\/baw051","article-title":"A crowdsourcing workflow for extracting chemical-induced disease relations from free text","volume":"2016","author":"Li","year":"2016","journal-title":"Database"},{"key":"2020120107365293400_R15","first-page":"525","article-title":"Towards hybrid NER: a study of content and crowdsourcing-related performance factors","author":"Feyisetan","year":"2015"},{"key":"2020120107365293400_R16","doi-asserted-by":"crossref","first-page":"533","DOI":"10.1007\/s11606-017-4246-0","article-title":"Comparing Amazon\u2019s Mechanical Turk platform to conventional data collection methods in the health and medical research literature","volume":"33","author":"Mortensen","year":"2018","journal-title":"J. Gen. Intern. Med."},{"key":"2020120107365293400_R17","doi-asserted-by":"crossref","first-page":"413","DOI":"10.1162\/COLI_a_00057","article-title":"Amazon Mechanical Turk: gold mine or coal mine?","volume":"37","author":"Fort","year":"2011","journal-title":"Comput. Linguist."},{"key":"2020120107365293400_R18","doi-asserted-by":"crossref","first-page":"184","DOI":"10.1177\/0963721414531598","article-title":"Inside the Turk: understanding Mechanical Turk as a participant pool","volume":"23","author":"Paolacci","year":"2014","journal-title":"Curr. Dir. Psychol. 
Sci."},{"key":"2020120107365293400_R19","first-page":"3651","article-title":"Learning latent forests for medical relation extraction","author":"Guo","year":"2020"},{"key":"2020120107365293400_R20","first-page":"208","article-title":"Leveraging dependency forest for neural medical relation extraction","author":"Song","year":"2019"},{"key":"2020120107365293400_R21","first-page":"8034","article-title":"Relation extraction exploiting full dependency forests","author":"Jin","year":"2020"},{"key":"2020120107365293400_R22","first-page":"4585","article-title":"ProGene-A large-scale, high-quality protein-gene annotated benchmark corpus","author":"Faessler","year":"2020"},{"key":"2020120107365293400_R23","doi-asserted-by":"crossref","first-page":"276","DOI":"10.11613\/BM.2012.031","article-title":"Interrater reliability: the kappa statistic","volume":"22","author":"McHugh","year":"2012","journal-title":"Biochem. Med."},{"key":"2020120107365293400_R24","first-page":"1","volume-title":"Computing Krippendorff\u2019s Alpha-reliability","author":"Krippendorff","year":"2011"},{"key":"2020120107365293400_R25","first-page":"367","article-title":"BiOnt: deep learning using multiple biomedical ontologies for relation extraction","author":"Sousa","year":"2020"},{"key":"2020120107365293400_R26","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: a pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"key":"2020120107365293400_R27","doi-asserted-by":"crossref","first-page":"D865","DOI":"10.1093\/nar\/gkw1039","article-title":"The human phenotype ontology","volume":"45","author":"K\u00f6hler","year":"2017","journal-title":"Nucleic Acids Res."},{"key":"2020120107365293400_R28","doi-asserted-by":"crossref","first-page":"48","DOI":"10.5808\/GI.2020.18.2.e20","article-title":"Improving accessibility and distinction between 
negative results in biomedical relation extraction","volume":"18","author":"Sousa","year":"2020","journal-title":"Genomics Inform."},{"key":"2020120107365293400_R29","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1017\/S1930297500002205","article-title":"Running experiments on Amazon Mechanical Turk","volume":"5","author":"Paolacci","year":"2010","journal-title":"Judgm. Decis. Mak."},{"key":"2020120107365293400_R30","first-page":"282","article-title":"Microtask crowdsourcing for disease mention annotation in PubMed abstracts","author":"Good","year":"2014"},{"key":"2020120107365293400_R31","doi-asserted-by":"crossref","first-page":"80","DOI":"10.1504\/IJWET.2019.100344","article-title":"Finding and validating medical information shared on Twitter: experiences using a crowdsourcing approach","volume":"14","author":"Duberstein","year":"2019","journal-title":"Int. J. Web Eng. Tech."},{"key":"2020120107365293400_R32","first-page":"273","article-title":"A crowdsourcing framework for medical data sets","volume":"2018","author":"Ye","year":"2018","journal-title":"AMIA Summits Transl. Sci. Proc."},{"key":"2020120107365293400_R33","doi-asserted-by":"crossref","first-page":"86","DOI":"10.1016\/j.jbi.2017.04.003","article-title":"Crowd control: effectively utilizing unscreened crowd workers for biomedical data annotation","volume":"69","author":"Cocos","year":"2017","journal-title":"J. Biomed. Inform."},{"key":"2020120107365293400_R34","doi-asserted-by":"crossref","DOI":"10.2196\/jmir.9380","article-title":"ComprehENotes, an instrument to assess patient reading comprehension of electronic health record notes: development and validation","volume":"20","author":"Lalor","year":"2018","journal-title":"J. Med. 
Internet Res."},{"key":"2020120107365293400_R35","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13637-017-0057-1","article-title":"Autism spectrum disorder detection from semi-structured and unstructured medical data","volume":"2017","author":"Yuan","year":"2016","journal-title":"EURASIP J. Bioinform. Syst. Biol."},{"key":"2020120107365293400_R36","doi-asserted-by":"crossref","first-page":"253","DOI":"10.1007\/978-1-4614-1034-8_13","volume-title":"Expert Knowledge and Its Application in Landscape Ecology","author":"Kappel","year":"2012"},{"volume-title":"Highlights of the Expert Judgment Policy Symposium and Technical Workshop","year":"2006","author":"Cooke","key":"2020120107365293400_R37"},{"key":"2020120107365293400_R38","doi-asserted-by":"crossref","DOI":"10.1186\/s12874-016-0200-9","article-title":"Measuring inter-rater reliability for nominal data\u2013which coefficients and confidence intervals are appropriate?","volume":"16","author":"Zapf","year":"2016","journal-title":"BMC Med. Res. Methodol."},{"key":"2020120107365293400_R39","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12859-018-2584-5","article-title":"BO-LSTM: classifying relations via long short-term memory networks along biomedical ontologies","volume":"20","author":"Lamurias","year":"2019","journal-title":"BMC Bioinform."},{"key":"2020120107365293400_R40","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1038\/75556","article-title":"Gene ontology: tool for the unification of biology","volume":"25","author":"Ashburner","year":"2000","journal-title":"Nat. 
Genet."},{"key":"2020120107365293400_R41","first-page":"4171","article-title":"BERT: pre-training of deep bidirectional transformers for language understanding","author":"Devlin","year":"2019"},{"key":"2020120107365293400_R42","first-page":"1","article-title":"Efficient road lane marking detection with deep learning","author":"Chen","year":"2018"},{"key":"2020120107365293400_R43","first-page":"41","article-title":"Unbabel: how to combine AI with the crowd to scale professional-quality translation","author":"Gra\u00e7a","year":"2018"},{"key":"2020120107365293400_R44","doi-asserted-by":"crossref","first-page":"2765","DOI":"10.1093\/bioinformatics\/btx283","article-title":"Foldit Standalone: a video game-derived protein structure manipulation interface using Rosetta","volume":"33","author":"Kleffner","year":"2017","journal-title":"Bioinformatics"}],"container-title":["Database"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaa104\/34612108\/baaa104.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"http:\/\/academic.oup.com\/database\/article-pdf\/doi\/10.1093\/database\/baaa104\/34612108\/baaa104.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,10,12]],"date-time":"2023-10-12T07:12:12Z","timestamp":1697094732000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/database\/article\/doi\/10.1093\/database\/baaa104\/6013761"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020]]},"references-count":44,"URL":"https:\/\/doi.org\/10.1093\/database\/baaa104","relation":{},"ISSN":["1758-0463"],"issn-type":[{"type":"electronic","value":"1758-0463"}],"subject":[],"published-other":{"date-parts":[[2020]]},"published":{"date-parts":[[2020]]},"article-number":"baaa104"}}