{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,9,15]],"date-time":"2024-09-15T18:41:26Z","timestamp":1726425686570},"reference-count":70,"publisher":"Cambridge University Press (CUP)","issue":"3","license":[{"start":{"date-parts":[[2013,2,11]],"date-time":"2013-02-11T00:00:00Z","timestamp":1360540800000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2014,7]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>This paper addresses the task of automatic extraction of definitions by thoroughly exploring an approach that solely relies on machine learning techniques, and by focusing on the issue of the imbalance of relevant datasets. We obtained a breakthrough in terms of the automatic extraction of definitions, by extensively and systematically experimenting with different sampling techniques and their combination, as well as a range of different types of classifiers. Performance consistently scored in the range of 0.95\u20130.99 of area under the receiver operating characteristics, with a notorious improvement between 17 and 22 percentage points regarding the baseline of 0.73\u20130.77, for datasets with different rates of imbalance. Thus, the present paper also represents a contribution to the seminal work in natural language processing that points toward the importance of exploring the research path of applying sampling techniques to mitigate the bias induced by highly imbalanced datasets, and thus greatly improving the performance of a large range of tools that rely on them.<\/jats:p>","DOI":"10.1017\/s1351324912000381","type":"journal-article","created":{"date-parts":[[2013,2,11]],"date-time":"2013-02-11T10:23:31Z","timestamp":1360578211000},"page":"327-359","source":"Crossref","is-referenced-by-count":11,"title":["Coping with highly imbalanced datasets: A case study with definition extraction in a multilingual setting"],"prefix":"10.1017","volume":"20","author":[{"given":"ROSA","family":"DEL GAUDIO","sequence":"first","affiliation":[]},{"given":"GUSTAVO","family":"BATISTA","sequence":"additional","affiliation":[]},{"given":"ANT\u00d3NIO","family":"BRANCO","sequence":"additional","affiliation":[]}],"member":"56","published-online":{"date-parts":[[2013,2,11]]},"reference":[{"key":"S1351324912000381_ref70","first-page":"783","volume-title":"Proceeding Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning","author":"Zhu","year":"2007"},{"key":"S1351324912000381_ref68","first-page":"786","volume-title":"Proceedings of the Twentieth International Conference on Machine Learning \u2013 ICML 2003 Workshop on Learning from Imbalanced Data Sets","author":"Wu","year":"2003"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref66","DOI":"10.1109\/TSMC.1972.4309137"},{"unstructured":"Westerhout E. , and Monachesi P. 2008. Creating glossaries using pattern-based and machine learning techniques. In Proceedings of the International Conference on Language Resources and Evaluation, pp. 3074\u201381.","key":"S1351324912000381_ref65"},{"key":"S1351324912000381_ref64","first-page":"219","volume-title":"Proceedings of the Computational Linguistics in the Netherlands (CLIN 2007)","author":"Westerhout","year":"2007"},{"key":"S1351324912000381_ref62","first-page":"88","volume-title":"Proceedings of the Student Research Workshop at EACL","author":"Westerhout","year":"2009"},{"key":"S1351324912000381_ref61","first-page":"35","volume-title":"Proceedings of the International Conference on Data Mining","author":"Weiss","year":"2007"},{"key":"S1351324912000381_ref60","first-page":"20","volume-title":"Proceedings of the Workshop on Information Extraction Beyond The Document","author":"Walter","year":"2006"},{"key":"S1351324912000381_ref58","first-page":"63","volume-title":"Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora: Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics (EMNLP\u201900), vol. 13","author":"Toutanova","year":"2000"},{"key":"S1351324912000381_ref57","first-page":"769","article-title":"Two modifications of CNN","volume":"6","author":"Tomek","year":"1976","journal-title":"IEEE Transactions on Systems, Man and Cybernetics"},{"volume-title":"Definition Extraction for Glossary Creation: A Study on Extracting Definitions for Semi-automatic Glossary Creation in Dutch","year":"2010","author":"Westerhout","key":"S1351324912000381_ref63"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref56","DOI":"10.1145\/1597735.1597754"},{"key":"S1351324912000381_ref53","doi-asserted-by":"crossref","first-page":"74","DOI":"10.1075\/term.14.1.05sie","article-title":"Definitional verbal patterns for semantic relation extraction","volume":"14","author":"Sierra","year":"2008","journal-title":"Terminology"},{"key":"S1351324912000381_ref52","first-page":"229","volume-title":"Proceeding of the 12th EURALEX International Congress","author":"Sierra","year":"2006"},{"key":"S1351324912000381_ref51","first-page":"47","volume-title":"Proceedings of the First Workshop on Definition Extraction at the Recent Advances in Natural Language Processing Conference (RANLP 2009)","author":"Sepp\u00e4l\u00e4","year":"2009"},{"key":"S1351324912000381_ref50","first-page":"1927","volume-title":"Proceedings of the International Conference on Language Resources and Evaluation","author":"Saggion","year":"2004"},{"key":"S1351324912000381_ref49","first-page":"898","volume-title":"Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI\u201999), vol. 2","author":"Roth","year":"1999"},{"key":"S1351324912000381_ref6","first-page":"31","volume-title":"Actes des 6 \u00c9mes Rencontres Terminologie et Intelligence Artificielle (TIA 2005)","author":"Baneyx","year":"2005"},{"key":"S1351324912000381_ref45","first-page":"817","volume-title":"Seventh International Congress on Lexicography (EURALEX 96)","author":"Pearson","year":"1996"},{"key":"S1351324912000381_ref24","first-page":"837","volume-title":"Proceedings of the Sixth International Language Resources and Evaluation (LREC\u201908)","author":"Deg\u00f3rski","year":"2008"},{"key":"S1351324912000381_ref32","first-page":"338","volume-title":"Eleventh Conference on Uncertainty in Artificial Intelligence","author":"John","year":"1995"},{"unstructured":"Tjong E. , Sang K. , Bouma G. and de Rijke M. 2005. Developing offline strategies for answering medical questions. In Proceedings of the AAAI-05 Workshop on Question Answering in Restricted Domains, pp. 41\u20135.","key":"S1351324912000381_ref55"},{"key":"S1351324912000381_ref37","first-page":"231","volume-title":"Encyclopedia of Machine Learning","author":"Ling","year":"2008"},{"key":"S1351324912000381_ref33","doi-asserted-by":"crossref","first-page":"180","DOI":"10.1145\/354756.354817","volume-title":"Proceeding of the Ninth International Conference on Information and Knowledge Management","author":"Joho","year":"2000"},{"volume-title":"Data Mining: Practical Machine Learning Tools and Techniques","year":"2005","author":"Witten","key":"S1351324912000381_ref67"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref41","DOI":"10.3115\/1220355.1220554"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref59","DOI":"10.1145\/1557019.1557112"},{"key":"S1351324912000381_ref1","first-page":"42","volume-title":"Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages","author":"Aceda\u0144ski","year":"2012"},{"key":"S1351324912000381_ref35","first-page":"237","volume-title":"International Conference on Natural Language Processing (GoTAL 2008)","author":"Kobyli\u0144ski","year":"2008"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref2","DOI":"10.1007\/BF00153759"},{"key":"S1351324912000381_ref8","first-page":"35","volume-title":"Proceedings of the Second Brazilian Workshop on Bioinformatics","author":"Batista","year":"2003"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref30","DOI":"10.3115\/992133.992154"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref16","DOI":"10.1023\/A:1010933404324"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref40","DOI":"10.1075\/nlp.2.15mey"},{"key":"S1351324912000381_ref9","first-page":"20","article-title":"A study of the behavior of several methods for balancing machine learning training data","volume":"6","author":"Batista","year":"2004","journal-title":"Special Interest Group on Knowledge Discovery and Data Mining Explorations Newsletter \u2013 Special Issue on Learning from Imbalanced Datasets"},{"unstructured":"Muresan S. , and Klavans J. 2002. A method for automatically building and evaluating dictionary resources. In Proceedings of the Language Resources and Evaluation Conference (LREC), pp. 231\u20134.","key":"S1351324912000381_ref42"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref3","DOI":"10.1007\/978-3-642-04235-5_33"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref5","DOI":"10.3115\/1220575.1220616"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref54","DOI":"10.1016\/j.jbi.2008.09.001"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref7","DOI":"10.1075\/scl.11"},{"key":"S1351324912000381_ref4","first-page":"195","article-title":"Processing dictionary definitions with phrasal pattern hierarchies","volume":"13","author":"Alshawi","year":"1987","journal-title":"American Journal of Computational Linguistics"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref29","DOI":"10.1109\/TIT.1968.1054155"},{"key":"S1351324912000381_ref12","first-page":"1063","article-title":"Analysis of a random forests model","volume":"13","author":"Biau","year":"2012","journal-title":"Journal of Machine Learning Research"},{"key":"S1351324912000381_ref13","first-page":"26","volume-title":"Proceedings of the First Workshop on Definition Extraction (WDE\u201909)","author":"Borg","year":"2009"},{"volume-title":"XML, corpus encoding standard, document XCES 0.2","year":"2002","author":"Ide","key":"S1351324912000381_ref31"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref14","DOI":"10.1016\/S0031-3203(96)00142-2"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref11","DOI":"10.1109\/ICDM.2006.93"},{"doi-asserted-by":"crossref","unstructured":"Branco A. , and Silva J. R. 2006. LX-Suite: shallow processing tools for Portuguese. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL\u201906), pp. 179\u201383.","key":"S1351324912000381_ref15","DOI":"10.3115\/1608974.1609003"},{"unstructured":"Chang C.-C. , and Lin C.-J. 2001. LIBSVM: a library for support vector machines. http:\/\/www.csie.ntu.edu.tw\/cjlin\/libsvm.","key":"S1351324912000381_ref17"},{"unstructured":"de Freitas M. C. 2007. Elabora\u00e7\u00e3o autom\u00e1tica de ontologias de Dom\u00ednio: Discuss\u00e3o e Resultados. PhD thesis, Pontif\u00edcia Universidade Cat\u00f3lica de Rio de Janeiro.","key":"S1351324912000381_ref22"},{"key":"S1351324912000381_ref48","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1613\/jair.279","article-title":"Improved use of continuous attributes in C4.5","volume":"4","author":"Quinlan","year":"1996","journal-title":"Journal of Artificial Intelligence Research"},{"key":"S1351324912000381_ref18","first-page":"1286","article-title":"Offline definition extraction using machine learning for knowledge-oriented question answering","volume":"3","author":"Chang","year":"2007","journal-title":"Proceeding of International Conference on Intelligent Computing ICIC"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref20","DOI":"10.1145\/1007730.1007733"},{"volume-title":"Using random forest to learn imbalanced data","year":"2004","author":"Chen","key":"S1351324912000381_ref21"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref23","DOI":"10.1109\/IMCSIT.2008.4747264"},{"key":"S1351324912000381_ref25","first-page":"85","volume-title":"Proceedings of the 9th European Conference on Machine Learning","author":"Demir\u00f6z","year":"1997"},{"key":"S1351324912000381_ref26","first-page":"973","volume-title":"Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI\u201901)","author":"Elkan","year":"2001"},{"key":"S1351324912000381_ref34","first-page":"324","volume-title":"Proceedings of the American Medical Informatics Association Symposium (AMIA 2001)","author":"Klavans","year":"2001"},{"key":"S1351324912000381_ref44","first-page":"1","volume-title":"Proceeding of the 19th International Conference on Computational Linguistics","author":"Park","year":"2002"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref47","DOI":"10.1007\/978-3-540-87391-4_23"},{"key":"S1351324912000381_ref27","first-page":"64","volume-title":"Proceedings of the EACL workshop on Learning Structured Information in Natural Language Applications","author":"Fahmi","year":"2006"},{"volume-title":"ROC graphs: notes and practical considerations for researchers","year":"2004","author":"Fawcett","key":"S1351324912000381_ref28"},{"key":"S1351324912000381_ref10","first-page":"24","volume-title":"Advances in Intelligent Data Analysis VI, Sixth International Symposium on Intelligent Data Analysis, IDA 2005","author":"Batista","year":"2005"},{"key":"S1351324912000381_ref36","first-page":"63","volume-title":"AIME \u201801: Proceedings of the Eighth Conference on AI in Medicine in Europe","author":"Laurikkala","year":"2001"},{"key":"S1351324912000381_ref19","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1613\/jair.953","article-title":"SMOTE: synthetic minority over-sampling technique","volume":"16","author":"Chawla","year":"2002","journal-title":"Journal of Artificial Intelligence Research"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref38","DOI":"10.1016\/j.csl.2005.06.002"},{"unstructured":"Malaise V. , Zweigenbaum P. , and Bachimont B. 2004. Detecting semantic relations between terms in definitions. In The Third Edition of CompuTerm Workshop (CompuTerm 2004) at Coling, pp. 55\u201362.","key":"S1351324912000381_ref39"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref69","DOI":"10.1142\/S0218001405003983"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref43","DOI":"10.3115\/991719.991734"},{"doi-asserted-by":"publisher","key":"S1351324912000381_ref46","DOI":"10.1109\/TKDE.2011.59"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324912000381","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,2,8]],"date-time":"2022-02-08T06:29:27Z","timestamp":1644301767000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324912000381\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,2,11]]},"references-count":70,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2014,7]]}},"alternative-id":["S1351324912000381"],"URL":"https:\/\/doi.org\/10.1017\/s1351324912000381","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"type":"print","value":"1351-3249"},{"type":"electronic","value":"1469-8110"}],"subject":[],"published":{"date-parts":[[2013,2,11]]}}}