{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T00:34:43Z","timestamp":1759970083228,"version":"build-2065373602"},"reference-count":79,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2025,1,5]],"date-time":"2025-01-05T00:00:00Z","timestamp":1736035200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"SDGs Global Leadership Program from the Japan International Cooperation Agency (JICA)"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>The now-globally recognized concerns of AI\u2019s environmental implications resulted in a growing awareness of the need to reduce AI carbon footprints, as well as to carry out AI processes responsibly and in an environmentally friendly manner. Benchmarking, a critical step when evaluating AI solutions with machine learning models, particularly with language models, has recently become a focal point of research aimed at reducing AI carbon emissions. Contemporary approaches to AI model benchmarking, however, do not enforce (nor do they assume) a model initial selection process. Consequently, modern model benchmarking is no different from a \u201cbrute force\u201d testing of all candidate models before the best-performing one could be deployed. Obviously, the latter approach is inefficient and environmentally harmful. To address the carbon footprint challenges associated with language model selection, this study presents an original benchmarking approach with a model initial selection on a proxy evaluative task. 
The proposed approach, referred to as Language Model-Dataset Fit (LMDFit) benchmarking, is devised to complement the standard model benchmarking process with a procedure that eliminates underperforming models from computationally intensive and, therefore, environmentally unfriendly tests. The LMDFit approach draws a parallel with the organizational personnel selection process, where job candidates are first evaluated through a number of basic skill assessments before they are hired, thus mitigating the consequences for the organization of hiring unfit candidates. LMDFit benchmarking compares candidate model performances on a small target-task dataset to disqualify less-relevant models from further testing. A semantic similarity assessment of random texts is used as the proxy task for the initial selection, and the approach is explicated in the context of various text classification assignments. Extensive experiments across eight text classification tasks (both single- and multi-class) from diverse domains are conducted with seven popular pre-trained language models (both general-purpose and domain-specific). 
The results obtained demonstrate the efficiency of the proposed LMDFit approach in terms of the overall benchmarking time as well as estimated emissions (a 37% reduction, on average) in comparison to the conventional benchmarking process.<\/jats:p>","DOI":"10.3390\/make7010003","type":"journal-article","created":{"date-parts":[[2025,1,6]],"date-time":"2025-01-06T08:08:52Z","timestamp":1736150932000},"page":"3","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Benchmarking with a Language Model Initial Selection for Text Classification Tasks"],"prefix":"10.3390","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8807-6864","authenticated-orcid":false,"given":"Agus","family":"Riyadi","sequence":"first","affiliation":[{"name":"Graduate School of Information Science and Engineering, Ritsumeikan University, Ibaraki 5678570, Osaka, Japan"},{"name":"Ministry of National Development Planning\/BAPPENAS, Jakarta 10310, Indonesia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5999-8061","authenticated-orcid":false,"given":"Mate","family":"Kovacs","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Ritsumeikan University, Ibaraki 5678570, Osaka, Japan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2383-3158","authenticated-orcid":false,"given":"Uwe","family":"Serd\u00fclt","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Ritsumeikan University, Ibaraki 5678570, Osaka, Japan"},{"name":"Center for Democracy Studies Aarau (ZDA), University of Zurich, 8006 Zurich, Switzerland"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-6610-5015","authenticated-orcid":false,"given":"Victor","family":"Kryssanov","sequence":"additional","affiliation":[{"name":"College of Information Science and Engineering, Ritsumeikan University, Ibaraki 5678570, Osaka, 
Japan"}]}],"member":"1968","published-online":{"date-parts":[[2025,1,5]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Strubell, E., Ganesh, A., and McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. arXiv.","DOI":"10.18653\/v1\/P19-1355"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"e1507","DOI":"10.1002\/widm.1507","article-title":"A systematic review of Green AI","volume":"13","author":"Verdecchia","year":"2023","journal-title":"WIREs Data Min. Knowl. Discov."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"54","DOI":"10.1145\/3381831","article-title":"Green AI","volume":"63","author":"Schwartz","year":"2020","journal-title":"Commun. ACM"},{"key":"ref_4","first-page":"1","article-title":"Trustworthy AI: From Principles to Practices","volume":"55","author":"Li","year":"2023","journal-title":"ACM Comput. Surv."},{"key":"ref_5","unstructured":"Blagec, K., Dorffner, G., Moradi, M., and Samwald, M. (2021). A critical analysis of metrics used for measuring progress in artificial intelligence. arXiv."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"S104","DOI":"10.1002\/job.1891","article-title":"The role of trustworthiness in recruitment and selection: A review and guide for future research","volume":"34","author":"Klotz","year":"2013","journal-title":"J. Organ. Behav."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Ahuja, K., Dandapat, S., Sitaram, S., and Choudhury, M. (2022). Beyond Static Models and Test Sets: Benchmarking the Potential of Pre-trained Models Across Tasks and Languages. arXiv.","DOI":"10.18653\/v1\/2022.nlppower-1.7"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Ahuja, K., Kumar, S., Dandapat, S., and Choudhury, M. (2022). Multi task learning for zero shot performance prediction of multilingual models. 
arXiv.","DOI":"10.18653\/v1\/2022.acl-long.374"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Xia, M., Anastasopoulos, A., Xu, R., Yang, Y., and Neubig, G. (2020). Predicting performance for natural language processing tasks. arXiv.","DOI":"10.18653\/v1\/2020.acl-main.764"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Kadi\u0137is, E., Vaibhav, S., and Klinger, R. (2022, January 10\u201315). Embarrassingly simple performance prediction for abductive natural language inference. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua, Online.","DOI":"10.18653\/v1\/2022.naacl-main.441"},{"key":"ref_11","first-page":"1","article-title":"A Survey on Text Classification: From Traditional to Deep Learning","volume":"13","author":"Li","year":"2022","journal-title":"ACM Trans. Intell. Syst. Technol."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"1129","DOI":"10.1016\/j.ipm.2018.08.001","article-title":"Semantic text classification: A survey of past and recent advances","volume":"54","author":"Ganiz","year":"2018","journal-title":"Inf. Process. Manag."},{"key":"ref_13","first-page":"352","article-title":"Comparing BERT against traditional machine learning models in text classification","volume":"2","year":"2023","journal-title":"J. Comput. Cogn. Eng."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"e6815","DOI":"10.1002\/cpe.6815","article-title":"Towards a sustainable artificial intelligence: A case study of energy efficiency in decision tree algorithms","volume":"35","author":"Ferro","year":"2023","journal-title":"Concurr. Comput. Pract. Exp."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Guti\u00e9rrez, M., Moraga, M.\u00c1., and Garc\u00eda, F. (2022, January 13\u201317). Analysing the energy impact of different optimisations for machine learning models. 
Proceedings of the 2022 International Conference on ICT for Sustainability (ICT4S), Plovdiv, Bulgaria.","DOI":"10.1109\/ICT4S55073.2022.00016"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1108\/14635770810876593","article-title":"Benchmarking the Benchmarking Models","volume":"15","author":"Gurumurthy","year":"2008","journal-title":"Benchmarking Int. J."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S.R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv.","DOI":"10.18653\/v1\/W18-5446"},{"key":"ref_18","first-page":"3266","article-title":"Superglue: A stickier benchmark for general-purpose language understanding systems","volume":"32","author":"Wang","year":"2019","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Liang, Y., Duan, N., Gong, Y., Wu, N., Guo, F., Qi, W., Gong, M., Shou, L., Jiang, D., and Cao, G. (2020). XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. arXiv.","DOI":"10.18653\/v1\/2020.emnlp-main.484"},{"key":"ref_20","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Lundgard, A. (2020). Measuring justice in machine learning. arXiv.","DOI":"10.1145\/3351095.3372838"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3616865","article-title":"Fairness in machine learning: A survey","volume":"56","author":"Caton","year":"2020","journal-title":"ACM Comput. Surv."},{"key":"ref_23","first-page":"6000","article-title":"Attention is all you need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"ref_24","first-page":"100334","article-title":"Pre-trained transformers: An empirical comparison","volume":"9","author":"Casola","year":"2022","journal-title":"Mach. Learn. Appl."},{"key":"ref_25","unstructured":"Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv."},{"key":"ref_26","unstructured":"Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv."},{"key":"ref_27","first-page":"5753","article-title":"Xlnet: Generalized autoregressive pretraining for language understanding","volume":"32","author":"Yang","year":"2019","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_28","unstructured":"Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. (2019). Albert: A lite bert for self-supervised learning of language representations. arXiv."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Ye, Z., Liu, P., Fu, J., and Neubig, G. (2021). Towards more fine-grained and reliable NLP performance prediction. arXiv.","DOI":"10.18653\/v1\/2021.eacl-main.324"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Boughorbel, S., Jarray, F., and El-Anbari, M. (2017). Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE, 12.","DOI":"10.1371\/journal.pone.0177678"},{"key":"ref_31","unstructured":"Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. (2019). Fantastic Generalization Measures and Where to Find Them. arXiv."},{"key":"ref_32","unstructured":"Dziugaite, G.K., Drouin, A., Neal, B., Rajkumar, N., Caballero, E., Wang, L., Mitliagkas, I., and Roy, D.M. (2020, January 6\u201312). In search of robust measures of generalization. 
Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS \u201920, Red Hook, NY, USA."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"4122","DOI":"10.1038\/s41467-021-24025-8","article-title":"Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data","volume":"12","author":"Martin","year":"2021","journal-title":"Nat. Commun."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Yang, Y., Theisen, R., Hodgkinson, L., Gonzalez, J.E., Ramchandran, K., Martin, C.H., and Mahoney, M.W. (2023, January 6\u201310). Test Accuracy vs. Generalization Gap: Model Selection in NLP without Accessing Training or Testing Data. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA.","DOI":"10.1145\/3580305.3599518"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Beltagy, I., Lo, K., and Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. arXiv.","DOI":"10.18653\/v1\/D19-1371"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., and Androutsopoulos, I. (2020). LEGAL-BERT: The muppets straight out of law school. arXiv.","DOI":"10.18653\/v1\/2020.findings-emnlp.261"},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"225","DOI":"10.1016\/j.aiopen.2021.08.002","article-title":"Pre-trained models: Past, present and future","volume":"2","author":"Han","year":"2021","journal-title":"AI Open"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"1234","DOI":"10.1093\/bioinformatics\/btz682","article-title":"BioBERT: A pre-trained biomedical language representation model for biomedical text mining","volume":"36","author":"Lee","year":"2020","journal-title":"Bioinformatics"},{"key":"ref_39","unstructured":"Hazourli, A. (2024, October 23). 
FinancialBERT\u2014A Pretrained Language Model for Financial Text Mining. Available online: https:\/\/huggingface.co\/ahmedrachid\/FinancialBERT."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"ValizadehAslani, T., Shi, Y., Ren, P., Wang, J., Zhang, Y., Hu, M., Zhao, L., and Liang, H. (2023). PharmBERT: A domain-specific BERT model for drug labels. Briefings Bioinform., 24.","DOI":"10.1093\/bib\/bbad226"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Stammbach, D., Webersinke, N., Bingler, J.A., Kraus, M., and Leippold, M. (2022). A Dataset for Detecting Real-World Environmental Claims. arXiv.","DOI":"10.2139\/ssrn.4207369"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Webersinke, N., Kraus, M., Bingler, J.A., and Leippold, M. (2021). Climatebert: A pretrained language model for climate-related text. arXiv.","DOI":"10.2139\/ssrn.4229146"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Schimanski, T., Bingler, J., Hyslop, C., Kraus, M., and Leippold, M. (2023). Climatebert-netzero: Detecting and assessing net zero and reduction targets. arXiv.","DOI":"10.2139\/ssrn.4599483"},{"key":"ref_44","first-page":"649","article-title":"Character-level convolutional networks for text classification","volume":"28","author":"Zhang","year":"2015","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Li, Z., Xu, J., Zeng, J., Li, L., Zheng, X., Zhang, Q., Chang, K.W., and Hsieh, C.J. (2021). Searching for an effective defender: Benchmarking defense against adversarial word substitution. arXiv.","DOI":"10.18653\/v1\/2021.emnlp-main.251"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Xiong, Y., Feng, Y., Wu, H., Kamigaito, H., and Okumura, M. (2021, January 1\u20136). Fusing label embedding into bert: An efficient improvement for text classification. 
Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online.","DOI":"10.18653\/v1\/2021.findings-acl.152"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"782","DOI":"10.1002\/asi.23062","article-title":"Good debt or bad debt: Detecting semantic orientations in economic texts","volume":"65","author":"Malo","year":"2014","journal-title":"J. Assoc. Inf. Sci. Technol."},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Soong, G.H., and Tan, C.C. (2021, January 6). Sentiment Analysis on 10-K Financial Reports using Machine Learning Approaches. Proceedings of the 2021 IEEE 11th International Conference on System Engineering and Technology (ICSET), Shah Alam, Malaysia.","DOI":"10.1109\/ICSET53708.2021.9612552"},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"103957","DOI":"10.1016\/j.frl.2023.103957","article-title":"Sentiment spin: Attacking financial sentiment with GPT-3","volume":"55","author":"Leippold","year":"2023","journal-title":"Financ. Res. Lett."},{"key":"ref_50","unstructured":"Sitaram, S., Beigman Klebanov, B., and Williams, J.D. (2023). Chemical Language Understanding Benchmark. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track), Association for Computational Linguistic."},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1038\/s41597-022-01350-1","article-title":"Plant phenotype relationship corpus for biomedical relationships between plants and phenotypes","volume":"9","author":"Cho","year":"2022","journal-title":"Sci. 
Data"},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s12859-016-1249-5","article-title":"A corpus for plant-chemical relationships in the biomedical domain","volume":"17","author":"Choi","year":"2016","journal-title":"BMC Bioinform."},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Scharpf, P., Schubotz, M., Youssef, A., Hamborg, F., Meuschke, N., and Gipp, B. (2020, January 1\u20135). Classification and clustering of arxiv documents, sections, and abstracts, comparing encodings of natural and mathematical language. Proceedings of the ACM\/IEEE Joint Conference on Digital Libraries in 2020, Online.","DOI":"10.1145\/3383583.3398529"},{"key":"ref_54","doi-asserted-by":"crossref","unstructured":"Patadia, D., Kejriwal, S., Mehta, P., and Joshi, A.R. (2021, January 3\u20134). Zero-shot approach for news and scholarly article classification. Proceedings of the 2021 International Conference on Advances in Computing, Communication, and Control (ICAC3), Mumbai, India.","DOI":"10.1109\/ICAC353642.2021.9697327"},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Chalkidis, I., Jana, A., Hartung, D., Bommarito, M., Androutsopoulos, I., Katz, D.M., and Aletras, N. (2021). LexGLUE: A benchmark dataset for legal language understanding in English. arXiv.","DOI":"10.2139\/ssrn.3936759"},{"key":"ref_56","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/s11063-024-11599-9","article-title":"Novel GCN Model Using Dense Connection and Attention Mechanism for Text Classification","volume":"56","author":"Peng","year":"2024","journal-title":"Neural Process. Lett."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"859","DOI":"10.1007\/s10994-017-5689-6","article-title":"Online multi-label dependency topic models for text classification","volume":"107","author":"Burkhardt","year":"2018","journal-title":"Mach. 
Learn."},{"key":"ref_58","doi-asserted-by":"crossref","first-page":"8669","DOI":"10.1039\/D4GC01745E","article-title":"Balancing computational chemistry\u2019s potential with its environmental impact","volume":"26","author":"Schilter","year":"2024","journal-title":"Green Chem."},{"key":"ref_59","doi-asserted-by":"crossref","first-page":"101968","DOI":"10.1016\/j.aei.2023.101968","article-title":"CO2 impact on convolutional network model training for autonomous driving through behavioral cloning","volume":"56","author":"Parada","year":"2023","journal-title":"Adv. Eng. Inform."},{"key":"ref_60","first-page":"85","article-title":"Human capital and the economy","volume":"136","author":"Becker","year":"1992","journal-title":"Proc. Am. Philos. Soc."},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"109","DOI":"10.1108\/02683940710726375","article-title":"Person-organization fit","volume":"22","author":"Morley","year":"2007","journal-title":"J. Manag. Psychol."},{"key":"ref_62","unstructured":"Edwards, J.R. (1991). Person-Job Fit: A Conceptual Integration, Literature Review, and Methodological Critique, John Wiley & Sons."},{"key":"ref_63","doi-asserted-by":"crossref","first-page":"545","DOI":"10.1080\/1367886042000299843","article-title":"Human capital theory: Implications for human resource development","volume":"7","author":"Nafukho","year":"2004","journal-title":"Hum. Resour. Dev. Int."},{"key":"ref_64","unstructured":"Harris, Z. (1954). Distributional Structure, Taylor & Francis Group."},{"key":"ref_65","unstructured":"Firth, J.R. (1957). A synopsis of linguistic theory, 1930\u20131955. Studies in Linguistic Analysis. Special Volume of the Philological Society, Blackwell."},{"key":"ref_66","doi-asserted-by":"crossref","first-page":"141","DOI":"10.1613\/jair.2934","article-title":"From frequency to meaning: Vector space models of semantics","volume":"37","author":"Turney","year":"2010","journal-title":"J. Artif. Intell. 
Res."},{"key":"ref_67","doi-asserted-by":"crossref","first-page":"665","DOI":"10.1162\/COLI_a_00237","article-title":"Simlex-999: Evaluating semantic models with (genuine) similarity estimation","volume":"41","author":"Hill","year":"2015","journal-title":"Comput. Linguist."},{"key":"ref_68","doi-asserted-by":"crossref","unstructured":"Gao, T., Yao, X., and Chen, D. (2021). Simcse: Simple contrastive learning of sentence embeddings. arXiv.","DOI":"10.18653\/v1\/2021.emnlp-main.552"},{"key":"ref_69","unstructured":"Aggarwal, C.C., Hinneburg, A., and Keim, D.A. (2001, January 4\u20136). On the surprising behavior of distance metrics in high dimensional space. Proceedings of the Database Theory\u2014ICDT 2001: 8th International Conference, London, UK. proceedings 8."},{"key":"ref_70","unstructured":"Huang, A. (2008, January 14\u201318). Similarity measures for text document clustering. Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand."},{"key":"ref_71","doi-asserted-by":"crossref","first-page":"234","DOI":"10.1214\/ECP.v12-1294","article-title":"Asymptotic distribution of coordinates on high dimensional spheres","volume":"12","author":"Spruill","year":"2007","journal-title":"Electron. Commun. Probab."},{"key":"ref_72","unstructured":"Paukkeri, M.S., Kivim\u00e4ki, I., Tirunagari, S., Oja, E., and Honkela, T. (2011, January 13\u201317). Effect of dimensionality reduction on different distance measures in document clustering. Proceedings of the Neural Information Processing: 18th International Conference, ICONIP 2011, Shanghai, China. Proceedings, Part III 18."},{"key":"ref_73","first-page":"506","article-title":"Selecting a text similarity measure for a content-based recommender system: A comparison in two corpora","volume":"37","author":"Wijewickrema","year":"2019","journal-title":"Electron. Libr."},{"key":"ref_74","doi-asserted-by":"crossref","unstructured":"Parsons, V.L. (2017). 
Stratified Sampling. Wiley StatsRef: Statistics Reference Online, John Wiley & Sons, Ltd.","DOI":"10.1002\/9781118445112.stat05999.pub2"},{"key":"ref_75","doi-asserted-by":"crossref","first-page":"2","DOI":"10.1080\/10691898.2011.11889611","article-title":"Measuring skewness: A forgotten statistic?","volume":"19","author":"Doane","year":"2011","journal-title":"J. Stat. Educ."},{"key":"ref_76","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1109\/TIT.1982.1056489","article-title":"Least squares quantization in PCM","volume":"28","author":"Lloyd","year":"1982","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_77","doi-asserted-by":"crossref","first-page":"1171","DOI":"10.1016\/j.procs.2024.08.179","article-title":"Ancient Chinese Poetry Collation Based on BERT","volume":"242","author":"Yu","year":"2024","journal-title":"Procedia Comput. Sci."},{"key":"ref_78","doi-asserted-by":"crossref","first-page":"26839","DOI":"10.1109\/ACCESS.2024.3365742","article-title":"A review on large Language Models: Architectures, applications, taxonomies, open issues and challenges","volume":"12","author":"Raiaan","year":"2024","journal-title":"IEEE Access"},{"key":"ref_79","doi-asserted-by":"crossref","unstructured":"Gasparetto, A., Marcuzzo, M., Zangari, A., and Albarelli, A. (2022). A survey on text classification algorithms: From text to predictions. 
Information, 13.","DOI":"10.3390\/info13020083"}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/7\/1\/3\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,8]],"date-time":"2025-10-08T10:23:11Z","timestamp":1759918991000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/7\/1\/3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1,5]]},"references-count":79,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,3]]}},"alternative-id":["make7010003"],"URL":"https:\/\/doi.org\/10.3390\/make7010003","relation":{},"ISSN":["2504-4990"],"issn-type":[{"type":"electronic","value":"2504-4990"}],"subject":[],"published":{"date-parts":[[2025,1,5]]}}}