{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T04:06:00Z","timestamp":1760241960268,"version":"build-2065373602"},"reference-count":30,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2018,11,23]],"date-time":"2018-11-23T00:00:00Z","timestamp":1542931200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most of the resources used in biomedical NLP tasks are mainly in English. Little effort has been reported for other languages including Romanian and, thus, access to such language resources is poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed in three medical subdomains: cardiology, diabetes and endocrinology, extracted from books, journals and blogposts. In order to automatically annotate the corpus with POS tags, we used a Romanian tag set which has 715 labels, while diseases, anatomy, procedures and chemicals and drugs labels were manually annotated for bioNER with a Cohen Kappa coefficient of 92.8% and revealed the occurrence of 1877 medical named entities. The automatic annotation of the corpus has been manually checked. 
The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.<\/jats:p>","DOI":"10.3390\/data3040053","type":"journal-article","created":{"date-parts":[[2018,11,23]],"date-time":"2018-11-23T12:20:28Z","timestamp":1542975628000},"page":"53","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language"],"prefix":"10.3390","volume":"3","author":[{"given":"Maria","family":"Mitrofan","sequence":"first","affiliation":[{"name":"Romanian Academy Research Institute for Artificial Intelligence, 13 Calea 13 Septembrie, Bucharest 050711, Romania"}]},{"given":"Verginica","family":"Barbu Mititelu","sequence":"additional","affiliation":[{"name":"Romanian Academy Research Institute for Artificial Intelligence, 13 Calea 13 Septembrie, Bucharest 050711, Romania"}]},{"given":"Grigorina","family":"Mitrofan","sequence":"additional","affiliation":[{"name":"National Institute of Diabetes and Metabolic Diseases \u201cN.C. Paulescu\u201d, 5-7 Ion Movil\u0103 Street, Bucharest 020475, Romania"}]}],"member":"1968","published-online":{"date-parts":[[2018,11,23]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Pakhomov, S., Coden, A., and Chute, C. (2004, January 28\u201329). Creating a test corpus of clinical notes manually tagged for part-of-speech information. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, Geneva, Switzerland.","DOI":"10.3115\/1567594.1567607"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.jbi.2013.12.006","article-title":"NCBI disease corpus: A resource for disease name recognition and concept normalization","volume":"47","author":"Islamaj","year":"2014","journal-title":"J. Biomed. 
Inform."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Lee, K., Lee, S., Park, S., Kim, S., Kim, S., Choi, K., Tan, A.C., and Kang, J. (2016). BRONCO: Biomedical entity relation oncology corpus for extracting gene-variant-disease-drug relations. J. Biol. Databases Curation.","DOI":"10.1093\/database\/baw043"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Verspoor, K., Yepes, A.J., Cavedon, L., McIntosh, T., Herten-Crabb, A., Thomas, Z., and Plazzer, J.P. (2013). Annotating the biomedical literature for the human variome. J. Biol. Databases Curation.","DOI":"10.1093\/database\/bat019"},{"key":"ref_5","unstructured":"Boytcheva, S., Nikolova, I., Paskaleva, E., Angelova, G., Tcharaktchiev, D., and Dimitrova, N. (2009, January 14\u201316). Extraction and exploration of correlations in patient status data. Proceedings of the Workshop on Biomedical Information Extraction, Borovets, Bulgaria."},{"key":"ref_6","unstructured":"N\u00e9v\u00e9ol, L.A., Jeremy, C.G., Rosset, S., and Zweigenbaum, P. (2018, November 23). The Quaero French Medical Corpus: A Resource for Medical Entity Recognition and Normalization. Available online: http:\/\/nactem.ac.uk\/biotxtm2014\/papers\/Neveoletal.pdf."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"8","DOI":"10.1016\/j.jbi.2017.06.013","article-title":"DrugSemantics: A corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics","volume":"72","author":"Moreno","year":"2017","journal-title":"J. Biomed. Inform."},{"key":"ref_8","unstructured":"Brants, T. (May, January 29). TnT: A statistical part-of-speech tagger. Proceedings of the Sixth Conference on Association for Computational Linguistics Applied Natural Language, Washington, DC, USA."},{"key":"ref_9","unstructured":"Ion, R. (2007). Word Sense Disambiguation Methods Applied to English and Romanian. [Ph.D. Thesis, Romanian Academy]. 
(In Romanian)."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Mitrofan, M., and Ion, R. (2017, January 4\u20136). Adapting the TTL Romanian POS Tagger to the Biomedical Domain. Proceedings of the Biomedical NLP Workshop Associated with RANLP, Varna, Bulgaria.","DOI":"10.26615\/978-954-452-044-1_002"},{"key":"ref_11","unstructured":"Do\u011fan, R.I., and Zhiyong, L. (2012, January 8). An improved corpus of disease mentions in PubMed citations. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, Montr\u00e9al, QC, Canada."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"S2","DOI":"10.1186\/1758-2946-7-S1-S2","article-title":"The CHEMDNER corpus of chemicals and drugs and its annotation principles","volume":"7","author":"Krallinger","year":"2015","journal-title":"J. Cheminform."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Pyysalo, S., Ginter, F., Heimonen, J., Bj\u00f6rne, J., Boberg, J., J\u00e4rvinen, J., and Salakoski, T. (2007). BioInfer: A corpus for information extraction in the biomedical domain. BMC Bioinform., 8.","DOI":"10.1186\/1471-2105-8-50"},{"key":"ref_14","unstructured":"Gy\u00f6rgy, M., and Vincze, V. (2012, January 3\u20137). Joint Part-of-Speech Tagging and Named Entity Recognition Using Factor Graphs. Proceedings of the International Conference on Text, Speech and Dialogue, Brno, Czech Republic."},{"key":"ref_15","unstructured":"Barbu Mititelu, V., Tufi\u0219, D., and Irimia, E. (2018, January 7\u201312). The Reference Corpus of the Contemporary Romanian Language (CoRoLa). Proceedings of the 11th Language Resources and Evaluation Conference-LREC, Miyazaki, Japan."},{"key":"ref_16","unstructured":"Mitrofan, M., and Tufi\u0219, D. (2018, January 7\u201312). BioRo: The Biomedical Corpus for the Romanian Language. 
Proceedings of the 11th edition of the Language Resources and Evaluation Conference, Miyazaki, Japan."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"131","DOI":"10.1007\/s10579-011-9174-8","article-title":"MULTEXT-East: Morphosyntactic resources for Central and Eastern European languages","volume":"46","author":"Erjavec","year":"2012","journal-title":"Lang. Resour. Eval."},{"key":"ref_18","unstructured":"Tufi\u0219, D., Barbu, A.M., P\u0103tra\u0219cu, V., Rotariu, G., and Popescu, C. (1997). Corpora and Corpus-Based Morpho-Lexical Processing. Recent Advances in Romanian Language Technology, Romanian Academy Publishing House."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Tufi\u015f, D. (1999). Tiered tagging and combined language models classifiers. International Workshop on Text, Speech and Dialogue, Springer.","DOI":"10.1007\/3-540-48239-3_5"},{"key":"ref_20","unstructured":"(2018, April 04). Available online: https:\/\/metamap.nlm.nih.gov\/Docs\/SemGroups_2013.txt."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Sang, E.F., and Veenstra, J. (1999, January 8\u201312). Representing text chunks. Proceedings of the Ninth Conference on European Chapters of the Association for Computational Linguistics, Bergen, Norway.","DOI":"10.3115\/977035.977059"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1075\/li.30.1.03nad","article-title":"A survey of named entity recognition and classification","volume":"30","author":"Nadeau","year":"2007","journal-title":"Lingvist. Investig."},{"key":"ref_23","first-page":"144","article-title":"Building gold standard corpora for medical natural language processing tasks","volume":"2012","author":"Deleger","year":"2012","journal-title":"AMIA Annual Symposium Proceedings"},{"key":"ref_24","unstructured":"Tufi\u015f, D., and Irimia, E. (2006, January 22\u201328). RoCo-News\u2014A Hand Validated Journalistic Corpus of Romanian. 
Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), Genoa, Italy."},{"key":"ref_25","unstructured":"Ion, R., Irimia, E., \u0218tef\u0103nescu, D., and Tufi\u0219, D. (2012, January 23\u201325). ROMBAC: The Romanian Balanced Annotated Corpus. Proceedings of the Language Resources and Evaluation Conference (LREC), Istanbul, Turkey."},{"key":"ref_26","first-page":"183","article-title":"A corpus-based investigation of definite description use","volume":"24","author":"Poesio","year":"1998","journal-title":"Comput. Linguist."},{"key":"ref_27","first-page":"249","article-title":"Assessing agreement on classification tasks: The kappa statistic","volume":"22","author":"Carletta","year":"1996","journal-title":"Comput. Linguist."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1186\/s13326-018-0179-8","article-title":"Clinical Natural Language Processing in languages other than English: Opportunities and challenges","volume":"9","author":"Dalianis","year":"2018","journal-title":"J. Biomed. Semant."},{"key":"ref_29","unstructured":"(2018, April 05). Available online: http:\/\/slp.racai.ro\/index.php\/resources\/monero-3\/."},{"key":"ref_30","unstructured":"Ion, R., Irimia, E., and Barbu Mititelu, V. (2018, January 7\u201312). Ensemble Romanian Dependency Parsing with Neural Networks. 
Proceedings of the LREC, Miyazaki, Japan."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/3\/4\/53\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T15:31:38Z","timestamp":1760196698000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/3\/4\/53"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,11,23]]},"references-count":30,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2018,12]]}},"alternative-id":["data3040053"],"URL":"https:\/\/doi.org\/10.3390\/data3040053","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2018,11,23]]}}}