{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:00:43Z","timestamp":1760144443885,"version":"build-2065373602"},"reference-count":51,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2024,4,7]],"date-time":"2024-04-07T00:00:00Z","timestamp":1712448000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>The precise categorization of brief texts holds significant importance in various applications within the ever-changing realm of artificial intelligence (AI) and natural language processing (NLP). Short texts are everywhere in the digital world, from social media updates to customer reviews and feedback. Nevertheless, short texts\u2019 limited length and context pose unique challenges for accurate classification. This research article delves into the influence of data sorting methods on the quality of manual labeling in hierarchical classification, with a particular focus on short texts. The study is set against the backdrop of the increasing reliance on manual labeling in AI and NLP, highlighting its significance in the accuracy of hierarchical text classification. Methodologically, the study integrates AI, notably zero-shot learning, with human annotation processes to examine the efficacy of various data-sorting strategies. The results demonstrate how different sorting approaches impact the accuracy and consistency of manual labeling, a critical aspect of creating high-quality datasets for NLP applications. The study\u2019s findings reveal a significant time efficiency improvement in terms of labeling, where ordered manual labeling required 760 min per 1000 samples, compared to 800 min for traditional manual labeling, illustrating the practical benefits of optimized data sorting strategies. Comparatively, ordered manual labeling achieved the highest mean accuracy rates across all hierarchical levels, with figures reaching up to 99% for segments, 95% for families, 92% for classes, and 90% for bricks, underscoring the efficiency of structured data sorting. It offers valuable insights and practical guidelines for improving labeling quality in hierarchical classification tasks, thereby advancing the precision of text analysis in AI-driven research. This abstract encapsulates the article\u2019s background, methods, results, and conclusions, providing a comprehensive yet succinct study overview.<\/jats:p>","DOI":"10.3390\/bdcc8040041","type":"journal-article","created":{"date-parts":[[2024,4,8]],"date-time":"2024-04-08T10:11:55Z","timestamp":1712571115000},"page":"41","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Data Sorting Influence on Short Text Manual Labeling Quality for Hierarchical Classification"],"prefix":"10.3390","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-0628-8218","authenticated-orcid":false,"given":"Olga","family":"Narushynska","sequence":"first","affiliation":[{"name":"Department of Automated Control Systems, Lviv Polytechnic National University, 79013 Lviv, Ukraine"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5974-9310","authenticated-orcid":false,"given":"Vasyl","family":"Teslyuk","sequence":"additional","affiliation":[{"name":"Department of Automated Control Systems, Lviv Polytechnic National University, 79013 Lviv, Ukraine"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7214-5108","authenticated-orcid":false,"given":"Anastasiya","family":"Doroshenko","sequence":"additional","affiliation":[{"name":"Department of Automated Control Systems, Lviv Polytechnic National University, 79013 Lviv, Ukraine"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-9978-7072","authenticated-orcid":false,"given":"Maksym","family":"Arzubov","sequence":"additional","affiliation":[{"name":"Department of Automated Control Systems, Lviv Polytechnic National University, 79013 Lviv, Ukraine"}]}],"member":"1968","published-online":{"date-parts":[[2024,4,7]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"89","DOI":"10.1007\/978-3-030-31787-4_7","article-title":"Automatic Content Analysis of Social Media Short Texts: Scoping Review of Methods and Tools","volume":"Volume 1068","author":"Costa","year":"2020","journal-title":"Computer Supported Qualitative Research"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"45181","DOI":"10.1109\/ACCESS.2023.3274199","article-title":"Chat2VIS: Generating Data Visualizations via Natural Language Using ChatGPT, Codex and GPT-3 Large Language Models","volume":"11","author":"Maddigan","year":"2023","journal-title":"IEEE Access"},{"key":"ref_3","unstructured":"Zhou, X., Wu, T., Chen, H., Yang, Q., and He, X. (2019, January 16\u201319). Automatic Annotation of Text Classification Data Set in Specific Field Using Named Entity Recognition. Proceedings of the 2019 IEEE 19th International Conference on Communication Technology (ICCT), Xi\u2019an, China."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Doroshenko, A., and Tkachenko, R. (2018, January 11\u201314). Classification of Imbalanced Classes Using the Committee of Neural Networks. Proceedings of the 2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), Lviv, Ukraine.","DOI":"10.1109\/STC-CSIT.2018.8526611"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Chang, C.-M., Mishra, S.D., and Igarashi, T. (2019, January 14\u201318). A Hierarchical Task Assignment for Manual Image Labeling. Proceedings of the 2019 IEEE Symposium on Visual Languages and Human-Centric Computing (VL\/HCC), Memphis, TN, USA.","DOI":"10.1109\/VLHCC.2019.8818828"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Savchuk, D., and Doroshenko, A. (2021, January 22\u201325). Investigation of Machine Learning Classification Methods Effectiveness. Proceedings of the 2021 IEEE 16th International Conference on Computer Sciences and Information Technologies (CSIT), Lviv, Ukraine.","DOI":"10.1109\/CSIT52700.2021.9648582"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"117215","DOI":"10.1016\/j.eswa.2022.117215","article-title":"Comprehensive Comparative Study of Multi-Label Classification Methods","volume":"203","author":"Bogatinovski","year":"2022","journal-title":"Expert Syst. Appl."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Nava-Mu\u00f1oz, S., Graff, M., and Escalante, H.J. (2024). Analysis of Systems\u2019 Performance in Natural Language Processing Competitions. arXiv.","DOI":"10.1016\/j.patrec.2024.03.010"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"8627","DOI":"10.1007\/s00500-023-08048-5","article-title":"Multi-Label Classification via Closed Frequent Labelsets and Label Taxonomies","volume":"27","author":"Ferrandin","year":"2023","journal-title":"Soft Comput."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Narushynska, O., Teslyuk, V., and Vovchuk, B.-D. (2017, January 5\u20138). Search Model of Customer\u2019s Optimal Route in the Store Based on Algorithm of Machine Learning A*. Proceedings of the 2017 12th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), Lviv, Ukraine.","DOI":"10.1109\/STC-CSIT.2017.8098787"},{"key":"ref_11","unstructured":"Chang, C., Zhang, J., Ge, J., Zhang, Z., Wei, J., and Li, L. (2024). Interaction-Based Driving Scenario Classification and Labeling. arXiv."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Xu, X., Li, B., Shen, Y., Luo, B., Zhang, C., and Hao, F. (2023). Short Text Classification Based on Hierarchical Heterogeneous Graph and LDA Fusion. Electronics, 12.","DOI":"10.3390\/electronics12122560"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Tang, H., Kamei, S., and Morimoto, Y. (2023). Data Augmentation Methods for Enhancing Robustness in Text Classification Tasks. Algorithms, 16.","DOI":"10.3390\/a16010059"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"86038","DOI":"10.1109\/ACCESS.2022.3197769","article-title":"Robust Natural Language Processing: Recent Advances, Challenges, and Future Directions","volume":"10","author":"Omar","year":"2022","journal-title":"IEEE Access"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Jin, R., Du, J., Huang, W., Liu, W., Luan, J., Wang, B., and Xiong, D. (2024). A Comprehensive Evaluation of Quantization Strategies for Large Language Models. arXiv.","DOI":"10.18653\/v1\/2024.findings-acl.726"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Peng, Z., Abdollahi, B., Xie, M., and Fang, Y. (2021, January 11\u201315). Multi-Label Classification of Short Texts with Label Correlated Recurrent Neural Networks. Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, Virtual Event, Canada.","DOI":"10.1145\/3471158.3472246"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Arzubov, M., Narushynska, O., Batyuk, A., and Cherkas, N. (2023, January 19\u201321). Concept of Server-Side Clusterization of Semi-Static Big Geodata for Web Maps. Proceedings of the 2023 IEEE 18th International Conference on Computer Science and Information Technologies (CSIT), Lviv, Ukraine.","DOI":"10.1109\/CSIT61576.2023.10324155"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Chen, M., Ubul, K., Xu, X., Aysa, A., and Muhammat, M. (2022). Connecting Text Classification with Image Classification: A New Preprocessing Method for Implicit Sentiment Text Classification. Sensors, 22.","DOI":"10.3390\/s22051899"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Bercaru, G., Truic\u0103, C.-O., Chiru, C.-G., and Rebedea, T. (2023). Improving Intent Classification Using Unlabeled Data from Large Corpora. Mathematics, 11.","DOI":"10.3390\/math11030769"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Gilardi, F., Alizadeh, M., and Kubli, M. (2023). ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks. arXiv.","DOI":"10.1073\/pnas.2305016120"},{"key":"ref_21","first-page":"200308","article-title":"ChatGPT and Finetuned BERT: A Comparative Study for Developing Intelligent Design Support Systems","volume":"21","author":"Qiu","year":"2024","journal-title":"Intell. Syst. Appl."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Shah, A., and Chava, S. (2023). Zero Is Not Hero Yet: Benchmarking Zero-Shot Performance of LLMs for Financial Tasks. arXiv.","DOI":"10.2139\/ssrn.4458613"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Reiss, M.V. (2023). Testing the Reliability of ChatGPT for Text Annotation and Classification: A Cautionary Remark. arXiv.","DOI":"10.31219\/osf.io\/rvy5p"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P.K., and Aroyo, L.M. (2021, January 8\u201313). \u201cEveryone Wants to Do the Model Work, Not the Data Work\u201d: Data Cascades in High-Stakes AI. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan.","DOI":"10.1145\/3411764.3445518"},{"key":"ref_25","unstructured":"Troxler, A., and Schelldorfer, J. (2022). Actuarial Applications of Natural Language Processing Using Transformers: Case Studies for Using Text Features in an Actuarial Context. arXiv."},{"key":"ref_26","unstructured":"Scott, D., Bel, N., and Zong, C. (2020, January 8\u201313). Improving Human-Labeled Data through Dynamic Automatic Conflict Resolution. Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1016\/j.procir.2021.05.005","article-title":"Extracting Functional Requirements from Design Documentation Using Machine Learning","volume":"100","author":"Akay","year":"2021","journal-title":"Procedia CIRP"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Braylan, A., Alonso, O., and Lease, M. (2022, January 25\u201329). Measuring Annotator Agreement Generally across Complex Structured, Multi-Object, and Free-Text Annotation Tasks. Proceedings of the ACM Web Conference 2022, Lyon, France.","DOI":"10.1145\/3485447.3512242"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Church, K., Liberman, M., and Kordoni, V. (2021). Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, Association for Computational Linguistics.","DOI":"10.18653\/v1\/2021.bppf-1.1"},{"key":"ref_30","unstructured":"Zhu, Y., and Zamani, H. (2023). ICXML: An In-Context Learning Framework for Zero-Shot Extreme Multi-Label Classification. arXiv."},{"key":"ref_31","unstructured":"Doshi, I., Sajjalla, S., Choudhari, J., Bhatt, R., and Dasgupta, A. (2020). Efficient Hierarchical Clustering for Classification and Anomaly Detection. arXiv."},{"key":"ref_32","unstructured":"Kasundra, J., Schulz, C., Mirsafian, M., and Skylaki, S. (2023). A Framework for Monitoring and Retraining Language Models in Real-World Applications. arXiv."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Xu, H., Chen, M., Huang, L., Vucetic, S., and Yin, W. (2024). X-Shot: A Unified System to Handle Frequent, Few-Shot and Zero-Shot Learning Simultaneously in Classification. arXiv.","DOI":"10.18653\/v1\/2024.findings-acl.276"},{"key":"ref_34","unstructured":"(2023, December 07). Global Product Classification (GPC). Available online: https:\/\/www.gs1.org\/standards\/gpc."},{"key":"ref_35","unstructured":"(2023, December 07). Directionsforme. Available online: https:\/\/www.directionsforme.org\/."},{"key":"ref_36","unstructured":"Martorana, M., Kuhn, T., Stork, L., and van Ossenbruggen, J. (2024). Text Classification of Column Headers with a Controlled Vocabulary: Leveraging LLMs for Metadata Enrichment. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Miranda, L.J.V. (2023). Developing a Named Entity Recognition Dataset for Tagalog. arXiv.","DOI":"10.18653\/v1\/2023.sealp-1.2"},{"key":"ref_38","unstructured":"Lukasik, M., Narasimhan, H., Menon, A.K., Yu, F., and Kumar, S. (2024). Metric-Aware LLM Inference. arXiv."},{"key":"ref_39","unstructured":"Luengo, D., and Subbotin, S. (2019). Computer Modeling and Intelligent Systems. Proceedings of the 2nd International Conference CMIS-2019, Vol-2353: Main Conference, Zaporizhzhia, Ukraine, 15\u201319 April 2019, Available online: http:\/\/ceur-ws.org\/Vol-2353\/."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"107202","DOI":"10.1016\/j.infsof.2023.107202","article-title":"Zero-Shot Learning for Requirements Classification: An Exploratory Study","volume":"159","author":"Alhoshan","year":"2023","journal-title":"Inf. Softw. Technol."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Rondinelli, A., Bongiovanni, L., and Basile, V. (2022). Zero-Shot Topic Labeling for Hazard Classification. Information, 13.","DOI":"10.3390\/info13100444"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Wang, Y., Zhang, H., Zhu, B., Chen, S., and Zhang, D. (2022, January 27). OneLabeler: A Flexible System for Building Data Labeling Tools. Proceedings of the CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA.","DOI":"10.1145\/3491102.3517612"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Zhao, X., Ouyang, S., Yu, Z., Wu, M., and Li, L. (2022). Pre-Trained Language Models Can Be Fully Zero-Shot Learners. arXiv.","DOI":"10.18653\/v1\/2023.acl-long.869"},{"key":"ref_44","unstructured":"Yadav, S., Kaushik, A., and McDaid, K. (2024). Leveraging Weakly Annotated Data for Hate Speech Detection in Code-Mixed Hinglish: A Feasibility-Driven Transfer Learning Approach with Large Language Models. arXiv."},{"key":"ref_45","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. (2019). BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. arXiv.","DOI":"10.18653\/v1\/2020.acl-main.703"},{"key":"ref_47","first-page":"462","article-title":"Generating Training Data with Language Models: Towards Zero-Shot Language Understanding","volume":"Volume 35","author":"Koyejo","year":"2022","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"ref_48","unstructured":"Gurevych, I., and Miyao, Y. (2018). Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics."},{"key":"ref_49","unstructured":"Vidra, N., Clifford, T., Jijo, K., Chung, E., and Zhang, L. (2024). Improving Classification Performance With Human Feedback: Label a Few, We Label the Rest. arXiv."},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"S22","DOI":"10.1093\/bioinformatics\/17.suppl_1.S22","article-title":"Fast Optimal Leaf Ordering for Hierarchical Clustering","volume":"17","author":"Gifford","year":"2001","journal-title":"Bioinformatics"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Novoselova, N., Wang, J., and Klawonn, F. (2015). Optimized Leaf Ordering with Class Labels for Hierarchical Clustering. J. Bioinform. Comput. Biol., 13.","DOI":"10.1142\/S0219720015500122"}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/8\/4\/41\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:24:20Z","timestamp":1760106260000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/8\/4\/41"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,7]]},"references-count":51,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2024,4]]}},"alternative-id":["bdcc8040041"],"URL":"https:\/\/doi.org\/10.3390\/bdcc8040041","relation":{},"ISSN":["2504-2289"],"issn-type":[{"type":"electronic","value":"2504-2289"}],"subject":[],"published":{"date-parts":[[2024,4,7]]}}}