{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T03:19:15Z","timestamp":1770175155219,"version":"3.49.0"},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2026,1,6]],"date-time":"2026-01-06T00:00:00Z","timestamp":1767657600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,2,3]],"date-time":"2026-02-03T00:00:00Z","timestamp":1770076800000},"content-version":"vor","delay-in-days":28,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Biomed Semant"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Background<\/jats:title>\n                    <jats:p>Around 30 million people in Europe are affected by a rare (or orphan) disease, defined as a condition occurring in fewer than 1 in 2,000 individuals. The primary challenge is to automatically and efficiently identify scientific articles and guidelines that address a particular rare disease. We present a novel methodology to annotate and index scientific text with taxonomical concepts describing rare diseases from the OrphaNet taxonomy. 
This task is complicated by several technical challenges, including the lack of sufficiently large, human-annotated datasets for supervised training and the polysemy\/synonymy and surface-form variation of rare disease names, which can hinder any annotation engine.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>\n                      We introduce a framework that operationalizes OrphaNet for large-scale literature annotation by integrating the TERMite engine with curated synonym expansion, label normalization (including deprecated\/renamed concepts), and fuzzy matching. On benchmark datasets, the approach achieves\n                      <jats:bold>precision\u2009=\u200992%<\/jats:bold>\n                      ,\n                      <jats:bold>recall\u2009=\u200975%<\/jats:bold>\n                      , and\n                      <jats:bold>F1\u2009=\u200983%<\/jats:bold>\n                      , outperforming a string-matching baseline. Applying the pipeline to Scopus produces disease-specific corpora suitable for bibliometric and scientometric analyses (e.g., institution, country, and subject-area profiles). These outputs power the\n                      <jats:italic>Rare Diseases Monitor<\/jats:italic>\n                      dashboard for exploring national and global research activity.\n                    <\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Conclusion<\/jats:title>\n                    <jats:p>To our knowledge, this is the first systematic, scalable semantic framework for annotating and indexing rare disease literature at scale. 
By operationalizing OrphaNet in an automated, reproducible pipeline and addressing data scarcity and lexical variability, the work advances biomedical semantics for rare diseases and enables disease-centric monitoring, evaluation, and discovery across the research landscape.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1186\/s13326-025-00346-1","type":"journal-article","created":{"date-parts":[[2026,1,6]],"date-time":"2026-01-06T06:25:12Z","timestamp":1767680712000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Annotating and indexing scientific articles with rare diseases"],"prefix":"10.1186","volume":"17","author":[{"given":"Hosein","family":"Azarbonyad","sequence":"first","affiliation":[]},{"given":"Zubair","family":"Afzal","sequence":"additional","affiliation":[]},{"given":"Rik","family":"Iping","sequence":"additional","affiliation":[]},{"given":"Max","family":"Dumoulin","sequence":"additional","affiliation":[]},{"given":"Ilse","family":"Nederveen","sequence":"additional","affiliation":[]},{"given":"Jiangtao","family":"Yu","sequence":"additional","affiliation":[]},{"given":"Georgios","family":"Tsatsaronis","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2026,1,6]]},"reference":[{"issue":"5","key":"346_CR1","doi-asserted-by":"publisher","first-page":"803","DOI":"10.1002\/humu.22078","volume":"33","author":"A Rath","year":"2012","unstructured":"Rath A, Olry A, Dhombres F, Brandt MM, Urbero B, Ayme S. Representation of rare diseases in health information systems: the orphanet approach to serve a wide range of end users. Hum Mutat. 2012;33(5):803\u201308.","journal-title":"Hum Mutat"},{"issue":"4","key":"346_CR2","doi-asserted-by":"publisher","first-page":"150","DOI":"10.3390\/info10040150","volume":"10","author":"K Kowsari","year":"2019","unstructured":"Kowsari K, Jafari Meimandi K, Heidarysafa M, Mendu S, Barnes L, Brown D. 
Text classification algorithms: a survey. Information. 2019;10(4):150.","journal-title":"Information"},{"issue":"8","key":"346_CR3","doi-asserted-by":"publisher","first-page":"1819","DOI":"10.1109\/TKDE.2013.39","volume":"26","author":"M-L Zhang","year":"2013","unstructured":"Zhang M-L, Zhou Z-H. A review on multi-label learning algorithms. IEEE Trans Knowl Data Eng. 2013;26(8):1819\u201337.","journal-title":"IEEE Trans Knowl Data Eng"},{"key":"346_CR4","doi-asserted-by":"crossref","unstructured":"Joachims T. Text categorization with support vector machines: learning with many relevant features. European Conference on Machine Learning. 1998, pp. 137\u201342.","DOI":"10.1007\/BFb0026683"},{"key":"346_CR5","first-page":"379","volume-title":"ICML","author":"S Scott","year":"1999","unstructured":"Scott S, Matwin S. Feature engineering for text classification. In: ICML. Vol. 99. 1999. p. 379\u201388."},{"issue":"3","key":"346_CR6","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3439726","volume":"54","author":"S Minaee","year":"2021","unstructured":"Minaee S, Kalchbrenner N, Cambria E, Nikzad N, Chenaghlu M, Gao J. Deep learning\u2013based text classification: a comprehensive review. ACM Comput Surv (CSUR). 2021;54(3):1\u201340.","journal-title":"ACM Comput Surv (CSUR)"},{"issue":"2","key":"346_CR7","first-page":"1","volume":"13","author":"Q Li","year":"2022","unstructured":"Li Q, Peng H, Li J, Xia C, Yang R, Sun L, Yu PS, He L. A survey on text classification: from traditional to deep learning. ACM Trans Intell Syst Technol (TIST). 2022;13(2):1\u201341.","journal-title":"ACM Trans Intell Syst Technol (TIST)"},{"key":"346_CR8","doi-asserted-by":"crossref","unstructured":"Liu J, Chang W-C, Wu Y, Yang Y. Deep learning for extreme multi-label text classification. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2017, pp. 
115\u201324.","DOI":"10.1145\/3077136.3080834"},{"issue":"1","key":"346_CR9","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1017\/S1351324920000029","volume":"27","author":"H Azarbonyad","year":"2021","unstructured":"Azarbonyad H, Dehghani M, Marx M, Kamps J. Learning to rank for multi-label text classification: combining different sources of information. Nat Language Eng. 2021;27(1):89\u2013111.","journal-title":"Nat Language Eng"},{"issue":"3","key":"346_CR10","first-page":"265","volume":"88","author":"CE Lipscomb","year":"2000","unstructured":"Lipscomb CE. Medical subject headings (mesh). Bull Med Lib Assoc. 2000;88(3):265.","journal-title":"Bull Med Lib Assoc"},{"issue":"9","key":"346_CR11","first-page":"518","volume":"152","author":"SS Weinreich","year":"2008","unstructured":"Weinreich SS, Mangon R, Sikkens J, Teeuw ME, Cornel M. Orphanet: a european database for rare diseases. Nederlands tijdschrift voor geneeskunde. 2008;152(9):518\u201319.","journal-title":"Nederlands tijdschrift voor geneeskunde"},{"key":"346_CR12","doi-asserted-by":"publisher","first-page":"125","DOI":"10.1016\/j.asoc.2019.03.041","volume":"79","author":"F Gargiulo","year":"2019","unstructured":"Gargiulo F, Silvestri S, Ciampi M, De Pietro G. Deep neural network for hierarchical extreme multi-label text classification. Appl Soft Comput. 2019;79:125\u201338.","journal-title":"Appl Soft Comput"},{"key":"346_CR13","doi-asserted-by":"crossref","unstructured":"Peng H, Li J, He Y, Liu Y, Bao M, Wang L, Song Y, Yang Q. Large-scale hierarchical text classification with recursively regularized deep graph-cnn. Proceedings of the 2018 World Wide Web Conference. 2018, pp. 1063\u201372.","DOI":"10.1145\/3178876.3186005"},{"key":"346_CR14","doi-asserted-by":"crossref","unstructured":"Zangari A, Marcuzzo M, Schiavinato M, Rizzo M, Gasparetto A, Albarelli A, et al. Hierarchical text classification: a review of current research. Expert Syst Appl 224. 
2023.","DOI":"10.3390\/electronics13071199"},{"key":"346_CR15","doi-asserted-by":"publisher","first-page":"101104","DOI":"10.1016\/j.csl.2020.101104","volume":"65","author":"B \u0160krlj","year":"2021","unstructured":"\u0160krlj B, Martinc M, Kralj J, Lavra\u010d N, Pollak S. tax2vec: constructing interpretable features from taxonomies for short text classification. Comput Speech Lan. 2021;65:101104.","journal-title":"Comput Speech Lan"},{"key":"346_CR16","doi-asserted-by":"crossref","unstructured":"Jiang T, Wang D, Sun L, Chen Z, Zhuang F, Yang Q. Exploiting global and local hierarchies for hierarchical text classification. arXiv preprint arXiv:2205.02613 (2022).","DOI":"10.18653\/v1\/2022.emnlp-main.268"},{"key":"346_CR17","first-page":"2074","volume":"35","author":"S Kharbanda","year":"2022","unstructured":"Kharbanda S, Banerjee A, Schultheis E, Babbar R. Cascadexml: rethinking transformers for end-to-end multi-resolution training in extreme multi-label classification. Adv Neural Inf Process Syst. 2022;35:2074\u201387.","journal-title":"Adv Neural Inf Process Syst"},{"key":"346_CR18","doi-asserted-by":"crossref","unstructured":"Gururangan S, Dang T, Card D, Smith NA. Variational pretraining for semi-supervised text classification. arXiv preprint arXiv:1906.02242 (2019).","DOI":"10.18653\/v1\/P19-1590"},{"key":"346_CR19","unstructured":"Berthelot D, Carlini N, Goodfellow I, Papernot N, Oliver A, Raffel CA. Mixmatch: a holistic approach to semi-supervised learning. Adv Neural Inf Process Syst 32. 2019."},{"key":"346_CR20","doi-asserted-by":"crossref","unstructured":"Shen J, Qiu W, Meng Y, Shang J, Ren X, Han J. Taxoclass: hierarchical multi-label text classification using only class names. NAACL\u201921: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021. 
2021;2021.","DOI":"10.18653\/v1\/2021.naacl-main.335"},{"key":"346_CR21","doi-asserted-by":"crossref","unstructured":"Meng Y, Zhang Y, Huang J, Xiong C, Ji H, Zhang C, Han J. Text classification using label names only: a language model self-training approach. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020, pp. 9006\u201317.","DOI":"10.18653\/v1\/2020.emnlp-main.724"},{"key":"346_CR22","doi-asserted-by":"publisher","first-page":"139","DOI":"10.1162\/tacl_a_00259","volume":"7","author":"N Pappas","year":"2019","unstructured":"Pappas N, Henderson J. Gile: a generalized input-label embedding for text classification. Trans Assoc Comput Linguist. 2019;7:139\u201355.","journal-title":"Trans Assoc Comput Linguistics"},{"key":"346_CR23","doi-asserted-by":"crossref","unstructured":"Jett\u00e9 N, Quan H, Hemmelgarn B, Drosler S, Maass C, Oec D-G, Moskal L, Paoin W, Sundararajan V, Gao S, et al. The development, evolution, and modifications of icd-10: challenges to the international comparability of morbidity data. Med Care. 2010;1105\u201310.","DOI":"10.1097\/MLR.0b013e3181ef9d3e"},{"issue":"suppl_1","key":"346_CR24","doi-asserted-by":"publisher","first-page":"267","DOI":"10.1093\/nar\/gkh061","volume":"32","author":"O Bodenreider","year":"2004","unstructured":"Bodenreider O. The unified medical language system (umls): integrating biomedical terminology. Nucleic Acids Res. 2004;32(suppl_1):267\u201370.","journal-title":"Nucleic Acids Res"},{"issue":"2","key":"346_CR25","doi-asserted-by":"publisher","first-page":"109","DOI":"10.2165\/00002018-199920020-00002","volume":"20","author":"EG Brown","year":"1999","unstructured":"Brown EG, Wood L, Wood S. The medical dictionary for regulatory activities (meddra). Drug Saf. 1999;20(2):109\u201317.","journal-title":"Drug Saf"},{"key":"346_CR26","doi-asserted-by":"crossref","unstructured":"Azam SS, Raju M, Pagidimarri V, Kasivajjala VC. 
Cascadenet: an lstm based deep learning model for automated icd-10 coding. Advances in Information and Communication: Proceedings of the 2019 Future of Information and Communication Conference (FICC). 2020, pp. 55\u201374, Volume 2.","DOI":"10.1007\/978-3-030-12385-7_6"},{"key":"346_CR27","unstructured":"Isaradech N, Khumrin P. Auto-mapping clinical documents to icd-10 using snomed-ct. AMIA Summits Transl Sci Proc. 2021;2021:296."},{"issue":"5","key":"346_CR28","doi-asserted-by":"publisher","first-page":"660","DOI":"10.1136\/amiajnl-2010-000055","volume":"18","author":"M Huang","year":"2011","unstructured":"Huang M, N\u00e9v\u00e9ol A, Lu Z. Recommending mesh terms for annotating biomedical articles. J Am Med Inf Assoc. 2011;18(5):660\u201367.","journal-title":"J Am Med Inf Assoc"},{"key":"346_CR29","doi-asserted-by":"crossref","unstructured":"Jin Q, Dhingra B, Cohen W, Lu X. Attentionmesh: simple, effective and interpretable automatic mesh indexer. Proceedings of the 6th BioASQ Workshop A Challenge on Large-scale Biomedical Semantic Indexing and Question Answering. 2018, pp. 47\u201356.","DOI":"10.18653\/v1\/W18-5306"},{"key":"346_CR30","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s13326-017-0123-3","volume":"8","author":"Y Mao","year":"2017","unstructured":"Mao Y, Lu Z. Mesh now: automatic mesh indexing at pubmed scale via learning to rank. J Biomed Semant. 2017;8:1\u20139.","journal-title":"J Biomed Semant"},{"key":"346_CR31","volume-title":"Multi-lingual icd-10 coding using a hybrid rule-based and supervised classification approach at clef ehealth 2017","author":"J Seva","year":"2017","unstructured":"Seva J, Kittner M, Roller R, Leser U. Multi-lingual icd-10 coding using a hybrid rule-based and supervised classification approach at clef ehealth 2017. In: CLEF (Working Notes); 2017."},{"key":"346_CR32","unstructured":"Boytcheva S. Automatic matching of icd-10 codes to diagnoses in discharge letters. 
Proceedings of the Second Workshop on Biomedical Natural Language Processing. 2011, pp. 11\u201318."},{"issue":"8","key":"346_CR33","doi-asserted-by":"publisher","first-page":"23230","DOI":"10.2196\/23230","volume":"9","author":"P-F Chen","year":"2021","unstructured":"Chen P-F, Wang S-M, Liao W-C, Kuo L-C, Chen K-C, Lin Y-C, Yang C-Y, Chiu C-H, Chang S-C, Lai F, et al. Automatic icd-10 coding and training system: deep neural network based on supervised learning. JMIR Med Inf. 2021;9(8):23230.","journal-title":"JMIR Med Inf"},{"key":"346_CR34","first-page":"1181","volume-title":"Results of the bioasq track of the question answering lab at clef 2014","author":"G Balikas","year":"2014","unstructured":"Balikas G, Partalas I, Ngomo A-CN, Krithara A, Paliouras G. Results of the bioasq track of the question answering lab at clef 2014. In: CLEF (Working Notes); 2014. p. 1181\u201393."},{"issue":"3","key":"346_CR35","doi-asserted-by":"publisher","first-page":"704","DOI":"10.1162\/qss_a_00320","volume":"5","author":"R Iping","year":"2024","unstructured":"Iping R, Nederveen I, Ranjbar-Sahraei B, Azarbonyad H, Dumoulin M, Tsatsaronis G, Mathijssen IM. The development of a research intelligence tool for rare disease research in the Netherlands. Quant Sci Stud. 
2024;5(3):704\u201317.","journal-title":"Quant Sci Stud"}],"container-title":["Journal of Biomedical Semantics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13326-025-00346-1","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13326-025-00346-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13326-025-00346-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,3]],"date-time":"2026-02-03T13:45:58Z","timestamp":1770126358000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1186\/s13326-025-00346-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,6]]},"references-count":35,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,12]]}},"alternative-id":["346"],"URL":"https:\/\/doi.org\/10.1186\/s13326-025-00346-1","relation":{},"ISSN":["2041-1480"],"issn-type":[{"value":"2041-1480","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,6]]},"assertion":[{"value":"30 June 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 December 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 January 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not 
Applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"3"}}