{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,3]],"date-time":"2026-03-03T20:32:17Z","timestamp":1772569937942,"version":"3.50.1"},"reference-count":40,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2023,9,20]],"date-time":"2023-09-20T00:00:00Z","timestamp":1695168000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Sichuan Science and Technology Program","award":["2023YFSY0026"],"award-info":[{"award-number":["2023YFSY0026"]}]},{"name":"Sichuan Science and Technology Program","award":["2023YFH0004"],"award-info":[{"award-number":["2023YFH0004"]}]},{"name":"Sichuan Science and Technology Program","award":["SC22ZDCY09"],"award-info":[{"award-number":["SC22ZDCY09"]}]},{"name":"Sichuan Social Science Major Project","award":["2023YFSY0026"],"award-info":[{"award-number":["2023YFSY0026"]}]},{"name":"Sichuan Social Science Major Project","award":["2023YFH0004"],"award-info":[{"award-number":["2023YFH0004"]}]},{"name":"Sichuan Social Science Major Project","award":["SC22ZDCY09"],"award-info":[{"award-number":["SC22ZDCY09"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Systems"],"abstract":"<jats:p>Text classification has been highlighted as the key process to organize online texts for better communication in the Digital Media Age. Text classification establishes classification rules based on text features, so the accuracy of feature selection is the basis of text classification. Facing fast-increasing Chinese electronic documents in the digital environment, scholars have accumulated quite a few algorithms for the feature selection for the automatic classification of Chinese texts in recent years. However, discussion about how to adapt existing feature selection algorithms for various types of Chinese texts is still inadequate. 
To address this gap, this study proposes three improved feature selection algorithms and tests their performance on different types of Chinese texts. These include an enhanced CHI square with mutual information (MI) algorithm, which simultaneously introduces word frequency and term adjustment (CHMI); a term frequency\u2013CHI square (TF\u2013CHI) algorithm, which enhances weight calculation; and a term frequency\u2013inverse document frequency (TF\u2013IDF) algorithm enhanced with the extreme gradient boosting (XGBoost) algorithm, which improves the algorithm\u2019s word-filtering ability (TF\u2013XGBoost). This study randomly chooses 3000 texts from six different categories of the Sogou news corpus to obtain the confusion matrix and evaluate the performance of the new algorithms with precision and the F1-score. Experimental comparisons are conducted on support vector machine (SVM) and naive Bayes (NB) classifiers. The experimental results demonstrate that the feature selection algorithms proposed in this paper improve performance across various news corpora, although the best feature selection scheme differs for each type of corpus. 
Further studies of the application of the improved feature selection methods in other languages and the improvement in classifiers are suggested.<\/jats:p>","DOI":"10.3390\/systems11090483","type":"journal-article","created":{"date-parts":[[2023,9,20]],"date-time":"2023-09-20T21:47:03Z","timestamp":1695246423000},"page":"483","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":142,"title":["Adapting Feature Selection Algorithms for the Classification of Chinese Texts"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5599-2607","authenticated-orcid":false,"given":"Xuan","family":"Liu","sequence":"first","affiliation":[{"name":"School of Public Affairs and Administration, University of Electronic Science and Technology of China, Chengdu 611731, China"}]},{"given":"Shuang","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China"}]},{"given":"Siyu","family":"Lu","sequence":"additional","affiliation":[{"name":"School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9818-9205","authenticated-orcid":false,"given":"Zhengtong","family":"Yin","sequence":"additional","affiliation":[{"name":"College of Resource and Environment Engineering, Guizhou University, Guiyang 550025, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8299-7937","authenticated-orcid":false,"given":"Xiaolu","family":"Li","sequence":"additional","affiliation":[{"name":"School of Geographic Science, Southwest University, Chongqing 400715, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5022-610X","authenticated-orcid":false,"given":"Lirong","family":"Yin","sequence":"additional","affiliation":[{"name":"Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA 70803, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6398-7461","authenticated-orcid":false,"given":"Jiawei","family":"Tian","sequence":"additional","affiliation":[{"name":"School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8486-1654","authenticated-orcid":false,"given":"Wenfeng","family":"Zheng","sequence":"additional","affiliation":[{"name":"School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China"}]}],"member":"1968","published-online":{"date-parts":[[2023,9,20]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"306","DOI":"10.1057\/s41599-023-01816-6","article-title":"Emotion classification for short texts: An improved multi-label method","volume":"10","author":"Liu","year":"2023","journal-title":"Humanit. Soc. Sci. Commun."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/505282.505283","article-title":"Machine learning in automated text categorization","volume":"34","author":"Sebastiani","year":"2002","journal-title":"ACM Comput. Surv."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"2947","DOI":"10.1016\/j.ymssp.2010.05.015","article-title":"Mutual information algorithms","volume":"24","author":"Jiang","year":"2010","journal-title":"Mech. Syst. Signal Process."},{"key":"ref_4","unstructured":"Lancaster, H.O., and Seneta, E. (2005). Encyclopedia of Biostatistics, John Wiley & Sons."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"457","DOI":"10.1016\/j.ins.2023.01.069","article-title":"A joint multiobjective optimization of feature selection and classifier design for high-dimensional data classification","volume":"626","author":"Bai","year":"2023","journal-title":"Inf. Sci."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Liu, X., Zhou, G., Kong, M., Yin, Z., Li, X., Yin, L., and Zheng, W. (2023). 
Developing Multi-Labelled Corpus of Twitter Short Texts: A Semi-Automatic Method. Systems, 11.","DOI":"10.3390\/systems11080390"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Bai, R., Wang, X., and Liao, J. (2010, January 23\u201325). Extract semantic information from wordnet to improve text classification performance. Proceedings of the International Conference on Advanced Computer Science and Information Technology, Miyazaki, Japan.","DOI":"10.1007\/978-3-642-13577-4_36"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"111402","DOI":"10.1115\/1.4037649","article-title":"A data-driven text mining and semantic network analysis for design information retrieval","volume":"139","author":"Shi","year":"2017","journal-title":"J. Mech. Des."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1465","DOI":"10.1109\/TIP.2016.2523340","article-title":"Category specific dictionary learning for attribute specific feature selection","volume":"25","author":"Wang","year":"2016","journal-title":"IEEE Trans. Image Process."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Szczepanek, R. (2023). A Deep Learning Model of Spatial Distance and Named Entity Recognition (SD-NER) for Flood Mark Text Classification. Water, 15.","DOI":"10.3390\/w15061197"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"159","DOI":"10.1147\/rd.22.0159","article-title":"The automatic creation of literature abstracts","volume":"2","author":"Luhn","year":"1958","journal-title":"IBM J. Res. Dev."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"216","DOI":"10.1145\/321033.321035","article-title":"On relevance, probabilistic indexing and information retrieval","volume":"7","author":"Maron","year":"1960","journal-title":"J. ACM"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"404","DOI":"10.1145\/321075.321084","article-title":"Automatic indexing: An experimental inquiry","volume":"8","author":"Maron","year":"1961","journal-title":"J. 
ACM"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"613","DOI":"10.1145\/361219.361220","article-title":"A vector space model for automatic indexing","volume":"18","author":"Salton","year":"1975","journal-title":"Commun. ACM"},{"key":"ref_15","unstructured":"Bengio, Y., Ducharme, R., and Vincent, P. (December, January 29). A neural probabilistic language model. Proceedings of the 13th 2000 Neural Information Processing Systems (NIPS) Conference, Denver, CO, USA."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Collobert, R., and Weston, J. (2008, January 5\u20139). A unified architecture for natural language processing: Deep neural networks with multitask learning. Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland.","DOI":"10.1145\/1390156.1390177"},{"key":"ref_17","first-page":"2493","article-title":"Natural language processing (almost) from scratch","volume":"12","author":"Collobert","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_18","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013, January 5\u201310). Distributed representations of words and phrases and their compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, Carson City, NV, USA."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"205","DOI":"10.1016\/j.cognition.2013.07.003","article-title":"The effect of statistical learning on internal stimulus representations: Predictable items are enhanced even when not predicted","volume":"129","author":"Barakat","year":"2013","journal-title":"Cognition"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Kim, Y. (2014). Convolutional neural networks for sentence classification. 
arXiv.","DOI":"10.3115\/v1\/D14-1181"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"2298","DOI":"10.1109\/TPAMI.2016.2646371","article-title":"An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition","volume":"39","author":"Shi","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Cao, S., Lu, W., Zhou, J., and Li, X. (2018, January 2\u20137). cw2vec: Learning Chinese word embeddings with stroke n-gram information. Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA.","DOI":"10.1609\/aaai.v32i1.12029"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"35208","DOI":"10.1109\/ACCESS.2019.2904602","article-title":"Composite feature extraction and selection for text classification","volume":"7","author":"Wan","year":"2019","journal-title":"IEEE Access"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Zhu, M., and Yang, X. (2019, January 14\u201317). Chinese texts classification system. Proceedings of the 2019 IEEE 2nd International Conference on Information and Computer Technologies (ICICT), Kahului, HI, USA.","DOI":"10.1109\/INFOCT.2019.8710894"},{"key":"ref_25","unstructured":"Pan, L., Hang, C.-W., Sil, A., and Potdar, S. (March, January 22). Improved text classification via contrastive adversarial training. Proceedings of the AAAI Conference on Artificial Intelligence, Online."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1819","DOI":"10.1109\/TKDE.2013.39","article-title":"A review on multi-label learning algorithms","volume":"26","author":"Zhang","year":"2013","journal-title":"IEEE Trans. Knowl. 
Data Eng."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"232","DOI":"10.1016\/j.eswa.2016.03.045","article-title":"Ensemble of keyword extraction methods and classifiers in text classification","volume":"57","author":"Onan","year":"2016","journal-title":"Expert Syst. Appl."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"218","DOI":"10.1016\/j.eswa.2017.07.019","article-title":"Opinion mining using ensemble text hidden Markov models for text classification","volume":"94","author":"Kang","year":"2018","journal-title":"Expert Syst. Appl."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"4760","DOI":"10.1016\/j.eswa.2011.09.160","article-title":"Comparison of term frequency and document frequency based feature selection metrics in text categorization","volume":"39","author":"Azam","year":"2012","journal-title":"Expert Syst. Appl."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"114765","DOI":"10.1016\/j.eswa.2021.114765","article-title":"Feature Selection for Classification using Principal Component Analysis and Information Gain","volume":"174","author":"Omuya","year":"2021","journal-title":"Expert Syst. Appl."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Vora, S., and Yang, H. (2017, January 18\u201320). A comprehensive study of eleven feature selection algorithms and their impact on text classification. Proceedings of the 2017 Computing Conference, London, UK.","DOI":"10.1109\/SAI.2017.8252136"},{"key":"ref_32","first-page":"25","article-title":"Text mining: Use of TF-IDF to examine the relevance of words to documents","volume":"181","author":"Qaiser","year":"2018","journal-title":"Int. J. Comput. Appl."},{"key":"ref_33","unstructured":"Sun, J. (2022, September 01). Jieba Chinese Word Segmentation Tool. Available online: https:\/\/github.com\/fxsjy\/jieba."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Yao, Z., and Ze-wen, C. (2011, January 28\u201329). 
Research on the construction and filter method of stop-word list in text preprocessing. Proceedings of the 2011 Fourth International Conference on Intelligent Computation Technology and Automation, Shenzhen, China.","DOI":"10.1109\/ICICTA.2011.64"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Zhang, C., Wang, X., Yu, S., and Wang, Y. (2018, January 6\u20138). Research on keyword extraction of Word2vec model in Chinese corpus. Proceedings of the 2018 IEEE\/ACIS 17th International Conference on Computer and Information Science (ICIS), Singapore.","DOI":"10.1109\/ICIS.2018.8466534"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Shah, F.P., and Patel, V. (2016, January 23\u201325). A review on feature selection and feature extraction for text classification. Proceedings of the 2016 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), Chennai, India.","DOI":"10.1109\/WiSPNET.2016.7566545"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Zhai, Y., Song, W., Liu, X., Liu, L., and Zhao, X. (2018, January 23\u201325). A chi-square statistics-based feature selection method in text classification. Proceedings of the 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS), Beijing, China.","DOI":"10.1109\/ICSESS.2018.8663882"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"271","DOI":"10.1016\/j.ins.2020.08.051","article-title":"Two-stage three-way enhanced technique for ensemble learning in inclusive policy text classification","volume":"547","author":"Liang","year":"2021","journal-title":"Inf. Sci."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Chen, T., and Guestrin, C. (2016, January 14\u201318). Xgboost: A scalable tree boosting system. 
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA.","DOI":"10.1145\/2939672.2939785"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"522","DOI":"10.1016\/j.ins.2021.05.055","article-title":"Approximating XGBoost with an interpretable decision tree","volume":"572","author":"Sagi","year":"2021","journal-title":"Inf. Sci."}],"container-title":["Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2079-8954\/11\/9\/483\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:54:03Z","timestamp":1760129643000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2079-8954\/11\/9\/483"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,20]]},"references-count":40,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2023,9]]}},"alternative-id":["systems11090483"],"URL":"https:\/\/doi.org\/10.3390\/systems11090483","relation":{},"ISSN":["2079-8954"],"issn-type":[{"value":"2079-8954","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,20]]}}}