{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,28]],"date-time":"2025-10-28T03:09:01Z","timestamp":1761620941724,"version":"3.32.0"},"reference-count":47,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2006,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Background<\/jats:title><jats:p>Word sense disambiguation (WSD) is critical in the biomedical domain for improving the precision of natural language processing (NLP), text mining, and information retrieval systems because ambiguous words negatively impact accurate access to literature containing biomolecular entities, such as genes, proteins, cells, diseases, and other important entities. Automated techniques have been developed that address the WSD problem for a number of text processing situations, but the problem is still a challenging one. Supervised WSD machine learning (ML) methods have been applied in the biomedical domain and have shown promising results, but the results typically incorporate a number of confounding factors, and it is problematic to truly understand the effectiveness and generalizability of the methods because these factors interact with each other and affect the final results. Thus, there is a need to explicitly address the factors and to systematically quantify their effects on performance.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>Experiments were designed to measure the effect of \"sample size\" (i.e. size of the datasets), \"sense distribution\" (i.e. the distribution of the different meanings of the ambiguous word) and \"degree of difficulty\" (i.e. the measure of the distances between the meanings of the senses of an ambiguous word) on the performance of WSD classifiers. 
Support Vector Machine (SVM) classifiers were applied to an automatically generated data set containing four ambiguous biomedical abbreviations:<jats:italic>BPD<\/jats:italic>,<jats:italic>BSA<\/jats:italic>,<jats:italic>PCA<\/jats:italic>, and<jats:italic>RSV<\/jats:italic>, which were chosen because of varying degrees of differences in their respective senses. Results showed that: 1) increasing the sample size generally reduced the error rate, but this was limited mainly to well-separated senses (i.e. cases where the distances between the senses were large); in difficult cases an unusually large increase in sample size was needed to increase performance slightly, which was impractical, 2) the sense distribution did not have an effect on performance when the senses were separable, 3) when there was a majority sense of over 90%, the WSD classifier was not better than use of the simple majority sense, 4) error rates were proportional to the similarity of senses, and 5) there was no statistical difference between results when using a 5-fold or 10-fold cross-validation method. Other issues that impact performance are also enumerated.<\/jats:p><\/jats:sec><jats:sec><jats:title>Conclusion<\/jats:title><jats:p>Several different independent aspects affect performance when using ML techniques for WSD. We found that combining them into one single result obscures understanding of the underlying methods. Although we studied only four abbreviations, we utilized a well-established statistical method that guarantees the results are likely to be generalizable for abbreviations with similar characteristics. The results of our experiments show that in order to understand the performance of these ML methods it is critical that papers report on the baseline performance, the distribution and sample size of the senses in the datasets, and the standard deviation or confidence intervals. 
In addition, papers should also characterize the difficulty of the WSD task, the WSD situations addressed and not addressed, as well as the ML methods and features used. This should lead to an improved understanding of the generalizability and the limitations of the methodology.<\/jats:p><\/jats:sec>","DOI":"10.1186\/1471-2105-7-334","type":"journal-article","created":{"date-parts":[[2006,8,16]],"date-time":"2006-08-16T18:16:03Z","timestamp":1155752163000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":33,"title":["Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues"],"prefix":"10.1186","volume":"7","author":[{"given":"Hua","family":"Xu","sequence":"first","affiliation":[]},{"given":"Marianthi","family":"Markatou","sequence":"additional","affiliation":[]},{"given":"Rositsa","family":"Dimova","sequence":"additional","affiliation":[]},{"given":"Hongfang","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Carol","family":"Friedman","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2006,7,5]]},"reference":[{"key":"1073_CR1","doi-asserted-by":"publisher","first-page":"224","DOI":"10.1186\/gb-2005-6-7-224","volume":"6","author":"M Krallinger","year":"2005","unstructured":"Krallinger M, Valencia A: Text-mining and information-retrieval services for molecular biology. Genome Biol 2005, 6: 224. 10.1186\/gb-2005-6-7-224","journal-title":"Genome Biol"},{"key":"1073_CR2","doi-asserted-by":"publisher","first-page":"821","DOI":"10.1089\/106652703322756104","volume":"10","author":"H Shatkay","year":"2003","unstructured":"Shatkay H, Feldman R: Mining the biomedical literature in the genomic era: an overview. J Comput Biol 2003, 10: 821\u2013855. 
10.1089\/106652703322756104","journal-title":"J Comput Biol"},{"key":"1073_CR3","first-page":"707","volume":"707\u201318","author":"K Fukuda","year":"1998","unstructured":"Fukuda K, Tamura A, Tsunoda T, Takagi T: Toward information extraction: identifying protein names from biological papers. Pac Symp Biocomput 1998, 707\u201318: 707\u2013718.","journal-title":"Pac Symp Biocomput"},{"key":"1073_CR4","first-page":"17","volume":"17\u201321","author":"AR Aronson","year":"2001","unstructured":"Aronson AR: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp 2001, 17\u201321: 17\u201321.","journal-title":"Proc AMIA Symp"},{"key":"1073_CR5","first-page":"903","volume":"903\u20137","author":"M Weeber","year":"2000","unstructured":"Weeber M, Klein H, Aronson AR, Mork JG, de Jong-van den Berg LT, Vos R: Text-based discovery in biomedicine: the architecture of the DAD-system. Proc AMIA Symp 2000, 903\u20137: 903\u2013907.","journal-title":"Proc AMIA Symp"},{"key":"1073_CR6","volume-title":"UMLS Knowledge Sources","author":"NLM","year":"2000","unstructured":"NLM: UMLS Knowledge Sources. 11th edition. 2000.","edition":"11"},{"key":"1073_CR7","volume-title":"Ambiguity of UMLS metathesaurus 2004 Edition","author":"AR Aronson","year":"2004","unstructured":"Aronson AR, Shooshan SE:Ambiguity of UMLS metathesaurus 2004 Edition. 2004. [http:\/\/skr.nlm.nih.gov\/papers\/references\/ambiguity04.pdf]"},{"key":"1073_CR8","doi-asserted-by":"publisher","first-page":"621","DOI":"10.1197\/jamia.M1101","volume":"9","author":"H Liu","year":"2002","unstructured":"Liu H, Johnson SB, Friedman C: Automatic resolution of ambiguous terms based on machine learning and conceptual relations in the UMLS. J Am Med Inform Assoc 2002, 9: 621\u2013636. 10.1197\/jamia.M1101","journal-title":"J Am Med Inform Assoc"},{"key":"1073_CR9","unstructured":"Sehgal AK, Srinivasan P, Bodenreider O: Gene terms and English words: An ambiguous mix. 
SIGIR'04 Workshop on Search and Discovery in BioInformatics"},{"issue":"Suppl 1","key":"1073_CR10","doi-asserted-by":"publisher","first-page":"S97","DOI":"10.1093\/bioinformatics\/17.suppl_1.S97","volume":"17","author":"V Hatzivassiloglou","year":"2001","unstructured":"Hatzivassiloglou V, Duboue PA, Rzhetsky A: Disambiguating proteins, genes, and RNA in text: a machine learning approach. Bioinformatics 2001, 17(Suppl 1):S97\u2013106.","journal-title":"Bioinformatics"},{"key":"1073_CR11","doi-asserted-by":"publisher","first-page":"248","DOI":"10.1093\/bioinformatics\/bth496","volume":"21","author":"L Chen","year":"2005","unstructured":"Chen L, Liu H, Friedman C: Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics 2005, 21: 248\u2013256. 10.1093\/bioinformatics\/bth496","journal-title":"Bioinformatics"},{"key":"1073_CR12","doi-asserted-by":"publisher","first-page":"2597","DOI":"10.1093\/bioinformatics\/bth291","volume":"20","author":"MJ Schuemie","year":"2004","unstructured":"Schuemie MJ, Weeber M, Schijvenaars BJA, van Mulligen EM, van der Eijk CC, Jelier R, et al.: Distribution of information in biomedical abstracts and full-text publications. Bioinformatics 2004, 20: 2597\u20132604. 10.1093\/bioinformatics\/bth291","journal-title":"Bioinformatics"},{"key":"1073_CR13","doi-asserted-by":"publisher","first-page":"149","DOI":"10.1186\/1471-2105-6-149","volume":"6","author":"BJ Schijvenaars","year":"2005","unstructured":"Schijvenaars BJ, Mons B, Weeber M, Schuemie MJ, van Mulligen EM, Wain HM, et al.: Thesaurus-based disambiguation of gene symbols. BMC Bioinformatics 2005, 6: 149. 10.1186\/1471-2105-6-149","journal-title":"BMC Bioinformatics"},{"key":"1073_CR14","first-page":"D54","volume":"3","author":"D Maglott","year":"2005","unstructured":"Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: Gene-centered information at NCBI. 
Nucleic Acids Res 2005, 3: D54-D58.","journal-title":"Nucleic Acids Res"},{"key":"1073_CR15","first-page":"208","volume-title":"Machine Translation of Languages","author":"VH Yngve","year":"1955","unstructured":"Yngve VH: Syntax and the problem of multiple meaning. In Machine Translation of Languages. New York, John Wiley & Sons; 1955:208\u2013226."},{"key":"1073_CR16","unstructured":"Mooney RJ: Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. Proc 1996 Conf on Empirical Methods in Natural Language Processing 82\u201391."},{"key":"1073_CR17","doi-asserted-by":"crossref","unstructured":"Ng HT, Lee HB: Integrating multiple knowledge sources to disambiguate word sense: An examplar-based approach. Proc 34th Ann Meeting Assoc for Comput Ling 40\u201347.","DOI":"10.3115\/981863.981869"},{"key":"1073_CR18","unstructured":"Merkel M, Andersson M: Combination of contextual features for word sense disambiguation. SENSEVAL-2 Workshop 123\u2013127."},{"key":"1073_CR19","doi-asserted-by":"crossref","unstructured":"Bruce R, Wiebe J: Word sense disambiguation using decomposable models. Proceedings of the Thirty-second Annual Meeting of the Association of Computational Linguistics 139\u2013146.","DOI":"10.3115\/981732.981752"},{"key":"1073_CR20","doi-asserted-by":"crossref","first-page":"41","DOI":"10.3115\/1118693.1118699","volume-title":"Proc EMNLP","author":"YK Lee","year":"2002","unstructured":"Lee YK, Ng HT: An empirical evaluation of knowledge sources and learning algorithms for word sense disambiguation. Proc EMNLP 2002, 41\u201348."},{"key":"1073_CR21","volume-title":"SENSEVAL-2","author":"S Cotton","year":"1998","unstructured":"Cotton S, Edmonds P, Kilgarriff A, Palmer M: SENSEVAL-2.1998. [http:\/\/www.sle.sharp.co.uk\/senseval2\/]"},{"key":"1073_CR22","unstructured":"Mohammad S, Pedersen T: Combining lexical and syntactic features for supervised word sense disambiguation. 
Proc of the CoNLL"},{"key":"1073_CR23","volume-title":"Providing Machine Tractable Dictionary Tools","author":"Y Wilks","year":"1990","unstructured":"Wilks Y, Fass D, Guo C, MacDonald J, Plate T, Slator B: Providing Machine Tractable Dictionary Tools. Cambridge, MA: MIT Press; 1990."},{"key":"1073_CR24","unstructured":"Liddy ED, Paik W: Statistically-guided word sense disambiguation. AAAI Fall Symp 93 98\u2013107."},{"key":"1073_CR25","doi-asserted-by":"publisher","first-page":"52","DOI":"10.1093\/nar\/30.1.52","volume":"30","author":"A Hamosh","year":"2002","unstructured":"Hamosh A, Scott AF, Amberger J, Bocchini C, Valle D, McKusick VA: Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucl Acids Res 2002, 30: 52\u201355. 10.1093\/nar\/30.1.52","journal-title":"Nucl Acids Res"},{"key":"1073_CR26","doi-asserted-by":"publisher","first-page":"96","DOI":"10.1002\/asi.20257","volume":"57","author":"SM Humphrey","year":"2006","unstructured":"Humphrey SM, Rogers WJ, Kilicoglu H, Demner-Fushman D, Rindflesch TC: Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: Preliminary experiment. Journal of the American Society for Information Science and Technology 2006, 57: 96\u2013113. 10.1002\/asi.20257","journal-title":"Journal of the American Society for Information Science and Technology"},{"key":"1073_CR27","first-page":"605","volume":"5","author":"F Ginter","year":"2004","unstructured":"Ginter F, Boberg J, Salakoski T, Salakoski T: New Techniques for Disambiguation in Natural Language and Their Application to Biological Text. Journal of Machine Learning Research 2004, 5: 605\u2013621.","journal-title":"Journal of Machine Learning Research"},{"key":"1073_CR28","doi-asserted-by":"crossref","unstructured":"Podowski RM, Cleary JG, Goncharoff NT, Amoutzias G, Hayes WS: AZuRE, a scalable system for automated term disambiguation of gene and protein names. 
Proc 2004 IEEE CSB","DOI":"10.1142\/S0219720005001223"},{"key":"1073_CR29","doi-asserted-by":"publisher","first-page":"320","DOI":"10.1197\/jamia.M1533","volume":"11","author":"H Liu","year":"2004","unstructured":"Liu H, Teller V, Friedman C: A multi-aspect comparison study of supervised word sense disambiguation. J Am Med Inform Assoc 2004, 11: 320\u2013331. 10.1197\/jamia.M1533","journal-title":"J Am Med Inform Assoc"},{"key":"1073_CR30","doi-asserted-by":"publisher","first-page":"573","DOI":"10.1016\/j.ijmedinf.2005.03.013","volume":"74","author":"G Leroy","year":"2005","unstructured":"Leroy G, Rindflesch TC: Effects of information and machine learning algorithms on word sense disambiguation with small datasets. Int J Med Inform 2005, 74: 573\u2013585. 10.1016\/j.ijmedinf.2005.03.013","journal-title":"Int J Med Inform"},{"key":"1073_CR31","doi-asserted-by":"publisher","first-page":"3658","DOI":"10.1093\/bioinformatics\/bti586","volume":"21","author":"S Gaudan","year":"2005","unstructured":"Gaudan S, Krisch H, Rebholz-Schuhmann D: Resolving abbreviations to their senses in Medline. Bioinformatics 2005, 21: 3658\u20133664. 10.1093\/bioinformatics\/bti586","journal-title":"Bioinformatics"},{"key":"1073_CR32","doi-asserted-by":"publisher","first-page":"554","DOI":"10.1089\/cmb.2005.12.554","volume":"12","author":"MJ Schuemie","year":"2005","unstructured":"Schuemie MJ, Kors JA, Mons B: Word sense disambiguation in the biomedical domain: an overview. J Comput Biol 2005, 12: 554\u2013565. 10.1089\/cmb.2005.12.554","journal-title":"J Comput Biol"},{"key":"1073_CR33","doi-asserted-by":"publisher","first-page":"675","DOI":"10.1080\/01621459.1937.10503522","volume":"32","author":"M Friedman","year":"1937","unstructured":"Friedman M: The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 1937, 32: 675\u2013701. 
10.2307\/2279372","journal-title":"Journal of the American Statistical Association"},{"key":"1073_CR34","first-page":"141","volume":"5","author":"R Rifkin","year":"2004","unstructured":"Rifkin R, Klatau A: In defense of one-vs-all classification. Journal of Machine Learning Research 2004, 5: 141.","journal-title":"Journal of Machine Learning Research"},{"key":"1073_CR35","first-page":"1127","volume":"6","author":"M Markatou","year":"2005","unstructured":"Markatou M, Tian H, Biswas S, Hripcsak G: Analysis of variance of cross-validation estimators of the generalization error. Journal of Machine Learning Research 2005, 6: 1127\u20131168.","journal-title":"Journal of Machine Learning Research"},{"key":"1073_CR36","doi-asserted-by":"publisher","first-page":"113","DOI":"10.1017\/S1351324999002211","volume":"5","author":"P Resnik","year":"2000","unstructured":"Resnik P, Yarowsky D: Distinguishing systems and distinguishing senses: New evaluation tools for words sense disambiguation. Natural Lang Eng 2000, 5: 113\u2013133. 10.1017\/S1351324999002211","journal-title":"Natural Lang Eng"},{"key":"1073_CR37","unstructured":"Pedersen T, Bruce R: Distinguishing word senses in untagged text. Second Conference on Empirical Methods in Natural Language Processing"},{"key":"1073_CR38","doi-asserted-by":"publisher","first-page":"415","DOI":"10.1109\/72.991427","volume":"13","author":"G Hsu","year":"2006","unstructured":"Hsu G, Lin C: A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 2006, 13: 415\u2013425. 10.1109\/72.991427","journal-title":"IEEE Transactions on Neural Networks"},{"key":"1073_CR39","doi-asserted-by":"publisher","first-page":"1895","DOI":"10.1162\/089976698300017197","volume":"10","author":"TG Dietterich","year":"1998","unstructured":"Dietterich TG: Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 1998, 10: 1895\u20131924. 
10.1162\/089976698300017197","journal-title":"Neural Computation"},{"key":"1073_CR40","doi-asserted-by":"publisher","first-page":"317","DOI":"10.1023\/A:1009752403260","volume":"1","author":"SL Salzberg","year":"1997","unstructured":"Salzberg SL: On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery 1997, 1: 317\u2013328. 10.1023\/A:1009752403260","journal-title":"Data Mining and Knowledge Discovery"},{"key":"1073_CR41","doi-asserted-by":"crossref","unstructured":"Engelson SP, Dagan I: Minimizing manual annotation cost in supervised training from corpora. 34th Annual Meeting of Association for Computational Linguistics 34:319\u2013326.","DOI":"10.3115\/981863.981905"},{"key":"1073_CR42","first-page":"371","volume":"10","author":"J Pustejovsky","year":"2001","unstructured":"Pustejovsky J, Castano J, Cochran B, Kotecki M, Morrell M: Automatic extraction of acronym-meaning pairs from MEDLINE databases. Medinfo 2001, 10: 371\u2013375.","journal-title":"Medinfo"},{"key":"1073_CR43","volume-title":"Construction and assessment of classification rules","author":"DJ Hand","year":"1997","unstructured":"Hand DJ: Construction and assessment of classification rules. Chichester, England: John Wiley & Sons; 1997."},{"key":"1073_CR44","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1080\/757583645","volume":"21","author":"DJ Hand","year":"1994","unstructured":"Hand DJ: Assessing Classification Rules. Journal of Applied Statistics 1994, 21: 3\u201316.","journal-title":"Journal of Applied Statistics"},{"key":"1073_CR45","doi-asserted-by":"publisher","first-page":"873","DOI":"10.1109\/34.31448","volume":"11","author":"K Fukunaga","year":"1989","unstructured":"Fukunaga K, Hayes RR: Effect of sample size in classifier design. IEEE Transactions in Pattern Analysis and MachineIntelligence 1989, 11: 873\u2013885. 
10.1109\/34.31448","journal-title":"IEEE Transactions in Pattern Analysis and MachineIntelligence"},{"key":"1073_CR46","volume-title":"Spider-MachineLearning Package","author":"J Weston","year":"2005","unstructured":"Weston J, Elisseeff A, BakIr G, Sinz F: Spider-MachineLearning Package.2005. [http:\/\/www.kyb.tuebingen.mpg.de\/bs\/people\/spider\/index.html]"},{"key":"1073_CR47","unstructured":"Weston J, Watkins C: Multiclass support vector machines. Proceedings of ESANN99. D.Facto Press.; 1999."}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-7-334.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,1,10]],"date-time":"2025-01-10T13:24:02Z","timestamp":1736515442000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-7-334"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2006,7,5]]},"references-count":47,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2006,12]]}},"alternative-id":["1073"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-7-334","relation":{},"ISSN":["1471-2105"],"issn-type":[{"type":"electronic","value":"1471-2105"}],"subject":[],"published":{"date-parts":[[2006,7,5]]},"assertion":[{"value":"26 January 2006","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 July 2006","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 July 2006","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"334"}}