{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T01:53:54Z","timestamp":1760234034570,"version":"build-2065373602"},"reference-count":32,"publisher":"MDPI AG","issue":"6","license":[{"start":{"date-parts":[[2021,3,17]],"date-time":"2021-03-17T00:00:00Z","timestamp":1615939200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Applied Sciences"],"abstract":"<jats:p>The exponential growth of documents on the web makes it very hard for researchers to be aware of the relevant work being done within the scientific community. The task of efficiently retrieving information has therefore become an important research topic. The objective of this study is to test how the efficiency of text classification changes when different weights are assigned in advance to the sections that compose the documents. The proposal takes into account the place (section) where terms are located in the document, and each section has a weight that can be modified depending on the corpus. To carry out the study, an extended version of the OHSUMED corpus with full documents has been created. Using WEKA, we compared the use of abstracts only with that of full texts, as well as combinations of section weights, to assess their significance in the scientific article classification process using SMO (Sequential Minimal Optimization), the WEKA implementation of the Support Vector Machine (SVM) algorithm. The experimental results show that the proposed combinations of preprocessing techniques and feature selection achieve promising results for the task of full-text scientific document classification. 
We also have evidence to conclude that enriched datasets with text from certain sections achieve better results than using only titles and abstracts.<\/jats:p>","DOI":"10.3390\/app11062674","type":"journal-article","created":{"date-parts":[[2021,3,17]],"date-time":"2021-03-17T11:48:22Z","timestamp":1615981702000},"page":"2674","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["Classification of Full Text Biomedical Documents: Sections Importance Assessment"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8107-8747","authenticated-orcid":false,"given":"Carlos Adriano","family":"Oliveira Gon\u00e7alves","sequence":"first","affiliation":[{"name":"Computer Science Department, University of Vigo, Escuela Superior de Ingenier\u00eda Inform\u00e1tica, 32004 Ourense, Spain"},{"name":"CINBIO\u2014Biomedical Research Centre, University of Vigo, 36310 Vigo, Spain"},{"name":"SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, 36310 Vigo, Spain"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0940-3554","authenticated-orcid":false,"given":"Rui","family":"Camacho","sequence":"additional","affiliation":[{"name":"Faculdade de Engenharia da Universidade do Porto, LIAAD-INESC TEC, 4200-465 Porto, Portugal"}]},{"given":"C\u00e9lia Talma","family":"Gon\u00e7alves","sequence":"additional","affiliation":[{"name":"ISCAP\u2014P.PORTO, CEOS.PP, LIACC, Campus da FEUP, 4369-00 Porto, Portugal"}]},{"given":"Adri\u00e1n","family":"Seara Vieira","sequence":"additional","affiliation":[{"name":"Computer Science Department, University of Vigo, Escuela Superior de Ingenier\u00eda Inform\u00e1tica, 32004 Ourense, Spain"},{"name":"CINBIO\u2014Biomedical Research Centre, University of Vigo, 36310 Vigo, Spain"},{"name":"SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, 36310 Vigo, 
Spain"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6089-6166","authenticated-orcid":false,"given":"Lourdes","family":"Borrajo Diz","sequence":"additional","affiliation":[{"name":"Computer Science Department, University of Vigo, Escuela Superior de Ingenier\u00eda Inform\u00e1tica, 32004 Ourense, Spain"},{"name":"CINBIO\u2014Biomedical Research Centre, University of Vigo, 36310 Vigo, Spain"},{"name":"SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, 36310 Vigo, Spain"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7172-6947","authenticated-orcid":false,"given":"Eva","family":"Lorenzo Iglesias","sequence":"additional","affiliation":[{"name":"Computer Science Department, University of Vigo, Escuela Superior de Ingenier\u00eda Inform\u00e1tica, 32004 Ourense, Spain"},{"name":"CINBIO\u2014Biomedical Research Centre, University of Vigo, 36310 Vigo, Spain"},{"name":"SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, 36310 Vigo, Spain"}]}],"member":"1968","published-online":{"date-parts":[[2021,3,17]]},"reference":[{"key":"ref_1","unstructured":"Salton, G. (1971). The SMART Retrieval System\u2014Experiments in Automatic Document Processing, Prentice-Hall Inc."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"N\u00e9dellec, C., and Rouveirol, C. (1998). Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98, Springer.","DOI":"10.1007\/BFb0026664"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/505282.505283","article-title":"Machine learning in automated text categorization","volume":"34","author":"Sebastiani","year":"2002","journal-title":"ACM Comput. Surv."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Sun, Z., Errami, M., Long, T., Renard, C., Choradia, N., and Garner, H. (2010). Systematic characterizations of text similarity in full text biomedical publications. 
PLoS ONE, 5.","DOI":"10.1371\/journal.pone.0012704"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Westergaard, D., St\u00e6rfeldt, H.H., T\u00f8nsberg, C., Jensen, L.J., and Brunak, S. (2018). A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Comput. Biol., 14.","DOI":"10.1371\/journal.pcbi.1005962"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Lin, J. (2009). Is searching full text more effective than searching abstracts?. BMC Bioinform., 10.","DOI":"10.1186\/1471-2105-10-46"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"P\u00e9rez-Ag\u00fcera, J.R., Arroyo, J., Greenberg, J., Iglesias, J.P., and Fresno, V. (2010, January 26\u201330). Using BM25F for Semantic Search. Proceedings of the 3rd International Semantic Search Workshop (SEMSEARCH\u201910), Raleigh, NC, USA.","DOI":"10.1145\/1863879.1863881"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Guo, Y., Chen, D., and Le, J. (2009, January 23\u201325). An Extended Vector Space Model for XML Information Retrieval. Proceedings of the Second International Workshop on Knowledge Discovery and Data Mining, Moscow, Russia.","DOI":"10.1109\/WKDD.2009.218"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Ai, Q., Yang, L., Guo, J., and Croft, W.B. (2016, January 12\u201316). Analysis of the Paragraph Vector Model for Information Retrieval. Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, Newark, DE, USA.","DOI":"10.1145\/2970398.2970409"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Sinclair, G., and Webber, B.L. (2004, January 28\u201329). Classification from full text: A comparison of canonical sections of scientific papers. 
Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA\/BioNLP), Geneva, Switzerland.","DOI":"10.3115\/1567594.1567608"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"52","DOI":"10.1145\/1089815.1089823","article-title":"A baseline feature set for learning rhetorical zones using full articles in the biomedical domain","volume":"7","author":"Mullen","year":"2005","journal-title":"SIGKDD Explor. Newsl."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"643","DOI":"10.1007\/s11192-019-03053-8","article-title":"Sections-based bibliographic coupling for research paper recommendation","volume":"119","author":"Habib","year":"2019","journal-title":"Scientometrics"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Collins, E., Augenstein, I., and Riedel, S. (2017, January 3\u20134). A supervised approach to extractive summarisation of scientific papers. Proceedings of the CoNLL 2017\u201421st Conference on Computational Natural Language Learning, Vancouver, BC, Canada.","DOI":"10.18653\/v1\/K17-1021"},{"key":"ref_14","unstructured":"Li, T., and Lepage, Y. (2019, January 12\u201315). Informative sections and relevant words for the generation of NLP article abstracts. Proceedings of the 25th Annual Meeting of the Japanese Association for Natural Language Processing, Nagoya, Japan."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"835","DOI":"10.1007\/s11192-020-03583-6","article-title":"Using neural-network based paragraph embeddings for the calculation of within and between document similarities","volume":"155","author":"Thijs","year":"2020","journal-title":"Scientometrics"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Hebler, N., Rottmann, M., and Ziegler, A. (2020). Empirical analysis of the text structure of original research articles in medical journals. 
PLoS ONE, 15.","DOI":"10.1371\/journal.pone.0240288"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/1747-5333-1-2","article-title":"A tutorial on information retrieval: Basic terms and concepts","volume":"1","author":"Zhou","year":"2006","journal-title":"J. Biomed. Discov. Collab."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1016\/0306-4573(88)90021-0","article-title":"Term-weighting approaches in automatic text retrieval","volume":"24","author":"Salton","year":"1988","journal-title":"Inform. Process. Manag."},{"key":"ref_19","unstructured":"Croft, B.W., and van Rijsbergen, C.J. (1994). Ohsumed: An Interactive Retrieval Evaluation and New Large Test Collection for Research, Springer."},{"key":"ref_20","unstructured":"Gon\u00e7alves, C.A., Gon\u00e7alves, C.T., Camacho, R., and Oliveira, E.C. (2010, January 8\u20139). The impact of pre-processing on the classification of MEDLINE documents. Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems, Porto, Portugal."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Fellbaum, C. (1998). WordNet: An Electronic Lexical Database, MIT Press.","DOI":"10.7551\/mitpress\/7287.001.0001"},{"key":"ref_22","unstructured":"Rebholz-Schuhmann, D., Pezik, P., Lee, V., Kim, J.-J., del Gratta, R., Sasaki, Y., McNaught, J., Montemagni, S., Monachini, M., and Calzolari, N. (2008, January 19\u201323). Biolexicon: Towards a reference terminological resource in the biomedical domain. Proceedings of the 16th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB-2008), Toronto, ON, Canada."},{"key":"ref_23","unstructured":"Porter, M.F. (1997). An Algorithm for Suffix Stripping. Readings in Information Retrieval, Morgan Kaufmann Publishers Inc."},{"key":"ref_24","unstructured":"Hall, M.A. (1999). Correlation-Based Feature Selection for Machine Learning. [Ph.D. 
Thesis, Department of Computer Science, Waikato University]."},{"key":"ref_25","unstructured":"Borase, P.N., Kinariwala, S.A., and Rustagi, J.S. (2016). Image Re-Ranking Using Information Gain and Relative Consistency through Multi-Graph Learning, Foundation of Computer Science (FCS)."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"503","DOI":"10.2174\/1574893611666160617094720","article-title":"An HMM-based text classifier less sensitive to document management problems","volume":"11","author":"Iglesias","year":"2016","journal-title":"Curr. Bioinform."},{"key":"ref_27","unstructured":"Mitchell, T.M. (1997). Machine Learning, McGraw-Hill Inc. [1st ed.]."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1145\/1656274.1656278","article-title":"The WEKA data mining software: An update","volume":"11","author":"Hall","year":"2009","journal-title":"SIGKDD Explor. Newsl."},{"key":"ref_29","unstructured":"Witten, I.H., Frank, E., Trigg, L., Hall, M., Holmes, G., and Cunningham, S.J. (2021, March 07). Weka: Practical Machine Learning Tools and Techniques with Java Implementations. Available online: https:\/\/researchcommons.waikato.ac.nz\/handle\/10289\/1040."},{"key":"ref_30","unstructured":"Witten, I.H., and Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann."},{"key":"ref_31","first-page":"249","article-title":"Assessing Agreement on Classification Tasks: The Kappa Statistic","volume":"22","author":"Carletta","year":"1996","journal-title":"Comput. 
Ling."},{"key":"ref_32","first-page":"502","article-title":"Learnsec: A framework for full text analysis","volume":"Volume 10870","author":"Iglesias","year":"2018","journal-title":"Proceedings of the 13th International Conference on Hybrid Artificial Intelligence Systems HAIS"}],"container-title":["Applied Sciences"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2076-3417\/11\/6\/2674\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:36:59Z","timestamp":1760161019000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2076-3417\/11\/6\/2674"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,3,17]]},"references-count":32,"journal-issue":{"issue":"6","published-online":{"date-parts":[[2021,3]]}},"alternative-id":["app11062674"],"URL":"https:\/\/doi.org\/10.3390\/app11062674","relation":{},"ISSN":["2076-3417"],"issn-type":[{"type":"electronic","value":"2076-3417"}],"subject":[],"published":{"date-parts":[[2021,3,17]]}}}