{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T00:56:10Z","timestamp":1760144170610,"version":"build-2065373602"},"reference-count":37,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2024,3,22]],"date-time":"2024-03-22T00:00:00Z","timestamp":1711065600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>Given the emergence of China as a political and economic power in the 21st century, there is increased interest in analyzing Chinese news articles to better understand developing trends in China. Because of the volume of the material, automating the categorization of Chinese-language news articles by headline text or titles can be an effective way to sort the articles into categories for efficient review. A 383,000-headline dataset labeled with 15 categories from the Toutiao website was evaluated via natural language processing to predict topic categories. The influence of six data preparation variations on the predictive accuracy of four algorithms was studied. The simplest model (Na\u00efve Bayes) achieved 85.1% accuracy on a holdout dataset, while the most complex model (Neural Network using BERT) demonstrated 89.3% accuracy. The most useful data preparation steps were identified, and another goal examined the underlying complexity and computational costs of automating the categorization process. It was discovered the BERT model required 170x more time to train, was slower to predict by a factor of 18,600, and required 27x more disk space to save, indicating it may be the best choice for low-volume applications when the highest accuracy is needed. 
However, for larger-scale operations where a slight performance degradation is tolerated, the Na\u00efve Bayes algorithm could be the best choice. Nearly one in four records in the Toutiao dataset are duplicates, and this is the first published analysis with duplicates removed.<\/jats:p>","DOI":"10.3390\/a17040132","type":"journal-article","created":{"date-parts":[[2024,3,22]],"date-time":"2024-03-22T10:03:59Z","timestamp":1711101839000},"page":"132","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["The Impact of Data Preparation and Model Complexity on the Natural Language Classification of Chinese News Headlines"],"prefix":"10.3390","volume":"17","author":[{"given":"Torrey","family":"Wagner","sequence":"first","affiliation":[{"name":"Data Analytics Certificate Program, Graduate School of Engineering and Management, Air Force Institute of Technology, Wright-Patterson AFB, OH 45433, USA"}]},{"given":"Dennis","family":"Guhl","sequence":"additional","affiliation":[{"name":"Data Analytics Certificate Program, Graduate School of Engineering and Management, Air Force Institute of Technology, Wright-Patterson AFB, OH 45433, USA"}]},{"given":"Brent","family":"Langhals","sequence":"additional","affiliation":[{"name":"Data Analytics Certificate Program, Graduate School of Engineering and Management, Air Force Institute of Technology, Wright-Patterson AFB, OH 45433, USA"}]}],"member":"1968","published-online":{"date-parts":[[2024,3,22]]},"reference":[{"key":"ref_1","unstructured":"Policy Planning Staff (2020). The Elements of the China Challenge, U.S. Secretary of State."},{"key":"ref_2","unstructured":"Williams, H.J., and Blum, I. (2022, August 01). Defining Second Generation Open Source Intelligence (OSINT) for the Defense Enterprise. Available online: https:\/\/www.rand.org\/pubs\/research_reports\/RR1964.html."},{"key":"ref_3","unstructured":"Li, J., Wang, B., Ni, A.J., and Liu, Q. 
(2020, January 19\u201321). Text Mining Analysis on Users\u2019 Reviews for News Aggregator Toutiao. Proceedings of the International Conference on Artificial Intelligence in Information and Communication, Fukuoka, Japan."},{"key":"ref_4","unstructured":"Github User Aceimnorstuvwxz (2022, July 21). Github User Aceimnorstuvwxz. Github Toutiao Text Classfication Dataset (Public). July 2018. Available online: https:\/\/github.com\/aceimnorstuvwxz\/toutiao-text-classfication-dataset."},{"key":"ref_5","first-page":"1","article-title":"Short Text Classification of Chinese with Label Information Assisting","volume":"22","author":"Xu","year":"2023","journal-title":"ACM Trans. Asian Low-Resour. Lang. Inf. Process."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Xu, L., Hu, H., Zhang, X., Li, L., Cao, C., and Lan, Z. (2020). CLUE: A Chinese Language Understanding Evaluation Benchmark. arXiv.","DOI":"10.18653\/v1\/2020.coling-main.419"},{"key":"ref_7","unstructured":"Wang, S., Sun, Y., Xiang, Y., Wu, Z., Ding, S., Gong, W., and Wang, H. (2021). Ernie 3.0 titan: Exploring larger-scale knowledge enhanced pre-training for language understanding and generation. arXiv."},{"key":"ref_8","unstructured":"Zhang, A., and ChatGPT and Other Transformers: How to Select Large Language Model for Your NLP Projects (2023, March 07). Medium, 2 2023. Available online: https:\/\/alina-li-zhang.medium.com\/chatgpt-and-other-transformers-how-to-select-large-language-model-for-your-nlp-projects-908de1a152d8."},{"key":"ref_9","unstructured":"Zhang, J., Zhao, Y., Saleh, M., and Liu, P.J. (2020, January 13\u201318). PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. (2020, January 5\u201310). 
BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online.","DOI":"10.18653\/v1\/2020.acl-main.703"},{"key":"ref_11","unstructured":"Di Pietro, M. (2022, August 02). Text Classification with NLP: Tf-Idf vs. Word2Vec vs. BERT. Towards Data Science, 18 July 2020. Available online: https:\/\/towardsdatascience.com\/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"032036","DOI":"10.1088\/1742-6596\/1748\/3\/032036","article-title":"A Text Classification Algorithm Based on Topic Model and Convolutional Neural Network","volume":"1748","author":"Ge","year":"2021","journal-title":"J. Phys. Conf. Ser."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"199629","DOI":"10.1109\/ACCESS.2020.3035669","article-title":"Feature Enhanced Non-Equilibrium Bi-Directional Long Short-Term Memory Model for Chinese Text Classification","volume":"8","author":"Huan","year":"2020","journal-title":"IEEE Access"},{"key":"ref_14","unstructured":"Duan, W., He, X., Zhou, Z., Rao, H., and Thiele, L. (September, January 30). Injecting Descriptive Meta-Information Into Pre-trained Language Models with Hypernetworks. Proceedings of the Interspeech, Brno, Czechia."},{"key":"ref_15","first-page":"53","article-title":"Label Oriented Hierarchical Attention Neural Network for Short Text Classification","volume":"5","author":"Xia","year":"2022","journal-title":"Acad. J. Eng. Technol. Sci."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"942","DOI":"10.1016\/j.dcan.2022.09.015","article-title":"Effective short text classification via the fusion of hybrid features for IoT social data","volume":"8","author":"Luo","year":"2022","journal-title":"Digit. Commun. 
Netw."},{"key":"ref_17","first-page":"9840836","article-title":"Chinese Short Text Classification by ERNIE Based on LTC_Block","volume":"2023","author":"Zhang","year":"2023","journal-title":"Hindawi Wirel. Commun. Mob. Comput."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Liu, B., and Lin, G. (2020, January 25\u201330). Chinese Document Classification with Bi-Directional Convolutional Language Model. Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China.","DOI":"10.1145\/3397271.3401248"},{"key":"ref_19","unstructured":"IBM Corporation (2011). IBM SPSS Modeler CRISP-DM Guide, IBM Corporation."},{"key":"ref_20","unstructured":"(2022, July 21). Github User fxsjy (Sun Junyi), \u201cfxsjy\/jieba,\u201d 15 February 2020. Available online: https:\/\/github.com\/fxsjy\/jieba."},{"key":"ref_21","unstructured":"Kung, S., and Chinese Natural Language (Pre)processing: An Introduction (2022, August 02). Towards Data Science, 20 November 2020. Available online: https:\/\/towardsdatascience.com\/chinese-natural-language-pre-processing-an-introduction-995d16c2705f."},{"key":"ref_22","first-page":"2469","article-title":"A Comparative Analysis Of News Categorization Using Machine Learning Approaches","volume":"9","author":"Deb","year":"2020","journal-title":"Int. J. Sci. Technol. Res."},{"key":"ref_23","unstructured":"Grandini, M., Bagli, E., and Visani, G. (2022, August 17). Metrics for Multi-Class Classification: An Overview. 14 August 2020. Available online: https:\/\/arxiv.org\/pdf\/2008.05756.pdf."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). 
An Introduction to Statistical Learning with Applications in R, Springer.","DOI":"10.1007\/978-1-4614-7138-7"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"427","DOI":"10.1016\/j.ipm.2009.03.002","article-title":"A Systematic Analysis of Performance Measures for Classification Tasks","volume":"45","author":"Sokolova","year":"2009","journal-title":"Inf. Process. Manag."},{"key":"ref_26","unstructured":"G\u00e9ron, A. (2019). Hands-on Machine Learning with Scikit-learn, Keras, and TensorFlow, O\u2019Reilly."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Liu, X., Wang, S., Lu, S., Yin, Z., Li, X., Yin, L., Tian, J., and Zheng, W. (2023). Adapting Feature Selection Algorithms for the Classification of Chinese Texts. Systems, 11.","DOI":"10.3390\/systems11090483"},{"key":"ref_28","unstructured":"Das, M., Kamalanathan, S., and Alphonse, P. (2021, January 22\u201323). A Comparative Study on TF-IDF Feature Weighting Method and its Analysis using Unstructured Dataset. Proceedings of the COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, Kharkiv, Ukraine."},{"key":"ref_29","unstructured":"Soma, J. (2022, August 31). TF-IDF with Chinese Sentences. Data Science for Journalism. Available online: https:\/\/investigate.ai\/text-analysis\/using-tf-idf-with-chinese\/."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"286","DOI":"10.48175\/IJARSCT-1241","article-title":"Efficient Implementation using Multinomial Naive Bayes for Prediction of Fake Job Profile","volume":"5","author":"Shishupal","year":"2021","journal-title":"Int. J. Adv. Res. Sci. Commun. Technol."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Saul, J., Wagner, T., Mbonimpa, E., and Langhals, B. (2023, January 24\u201327). Atmospheric Meteorological Effects on Forecasting Daily Lightning Occurrence at Cape Canaveral Space Force Station. 
Proceedings of the World Congress in Computer Science, Computer Engineering, and Applied Computing, Las Vegas, NV, USA.","DOI":"10.1109\/CSCE60160.2023.00305"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Tucker, T., Wagner, T., Auclair, P., and Langhals, B. (2023, January 24\u201327). Machine Learning Prediction of DoD Personal Property Shipment Costs. Proceedings of the World Congress in Computer Science, Computer Engineering, and Applied Computing, Las Vegas, NV, USA.","DOI":"10.1109\/CSCE60160.2023.00303"},{"key":"ref_33","unstructured":"Lakshmanan, V., Robinson, S., and Munn, M. (2020). Machine Learning Design Patterns, O\u2019Reilly Media."},{"key":"ref_34","unstructured":"(2023, October 22). Google. Google Machine Learning Course Step 3: Prepare Your Data. 18 July 2022. Available online: https:\/\/developers.google.com\/machine-learning\/guides\/text-classification\/step-3."},{"key":"ref_35","unstructured":"Widrow, B. (1987, January 23). ADALINE and MADALINE. Proceedings of the 1st International Conference on Neural Networks, San Diego, CA, USA."},{"key":"ref_36","unstructured":"Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_37","unstructured":"Loshchilov, I., and Hutter, F. (2019, January 6\u20139). Decoupled Weight Decay Regularization. 
Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA."}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/17\/4\/132\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:18:05Z","timestamp":1760105885000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/17\/4\/132"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,3,22]]},"references-count":37,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2024,4]]}},"alternative-id":["a17040132"],"URL":"https:\/\/doi.org\/10.3390\/a17040132","relation":{},"ISSN":["1999-4893"],"issn-type":[{"type":"electronic","value":"1999-4893"}],"subject":[],"published":{"date-parts":[[2024,3,22]]}}}