{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:42:21Z","timestamp":1760060541718,"version":"build-2065373602"},"reference-count":39,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2025,9,2]],"date-time":"2025-09-02T00:00:00Z","timestamp":1756771200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Loss functions play a significant role in shaping model behavior in machine learning, yet their design implications remain underexplored in natural language processing tasks such as Named Entity Recognition (NER). This study investigates the performance and optimization behavior of five loss functions\u2014L1, L2, Cross-Entropy (CE), KL Divergence (KL), and the proposed DLITE (Discounted Least Information Theory of Entropy) Loss\u2014within transformer-based NER models. DLITE introduces a bounded, entropy-discounting approach to penalization, prioritizing recall and training stability, especially under noisy or imbalanced data conditions. We conducted empirical evaluations across three benchmark NER datasets: Basic NER, CoNLL-2003, and the Broad Twitter Corpus. While CE and KL achieved the highest weighted F1-scores in clean datasets, DLITE Loss demonstrated distinct advantages in macro recall, precision\u2013recall balance, and convergence stability\u2014particularly in noisy environments. Our findings suggest that the choice of loss function should align with application-specific priorities, such as minimizing false negatives or managing uncertainty. DLITE adds a new dimension to model design by enabling more measured predictions, making it a valuable alternative in high-stakes or real-world NLP deployments.<\/jats:p>","DOI":"10.3390\/info16090760","type":"journal-article","created":{"date-parts":[[2025,9,2]],"date-time":"2025-09-02T14:16:55Z","timestamp":1756822615000},"page":"760","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Beyond Cross-Entropy: Discounted Least Information Theory of Entropy (DLITE) Loss and the Impact of Loss Functions on AI-Driven Named Entity Recognition"],"prefix":"10.3390","volume":"16","author":[{"given":"Sonia","family":"Pascua","sequence":"first","affiliation":[{"name":"Information Science Department, College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael","family":"Pan","sequence":"additional","affiliation":[{"name":"Department of Computer Science and Statistics, College of Arts and Sciences, University of Rhode Island, Kingston, RI 02881, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Weimao","family":"Ke","sequence":"additional","affiliation":[{"name":"Information Science Department, College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,9,2]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1007\/s10479-005-5724-z","article-title":"A Tutorial on the Cross-Entropy Method","volume":"134","author":"Kroese","year":"2005","journal-title":"Ann. Oper. Res."},{"key":"ref_2","unstructured":"Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep Learning, MIT Press."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"187","DOI":"10.1007\/s40745-020-00253-5","article-title":"A comprehensive survey of loss functions in machine learning","volume":"9","author":"Wang","year":"2022","journal-title":"Ann. Data Sci."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Doll\u00e1r, P. (2017, January 22\u201329). Focal Loss for Dense Object Detection. Proceedings of the IEEE ICCV, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_5","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA."},{"key":"ref_6","first-page":"1","article-title":"Attention is All You Need","volume":"30","author":"Vaswani","year":"2017","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1214\/aoms\/1177729694","article-title":"On information and sufficiency","volume":"22","author":"Kullback","year":"1951","journal-title":"Ann. Math. Stat."},{"key":"ref_8","unstructured":"Kingma, D.P., and Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv."},{"key":"ref_9","first-page":"8778","article-title":"Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels","volume":"31","author":"Zhang","year":"2018","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"160018","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR Guiding Principles for Scientific Data Management and Stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Sci. Data"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Ke, W. (2022, January 17\u201320). Alternatives to classic BM25-IDF based on a new information theoretical framework. Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan.","DOI":"10.1109\/BigData55660.2022.10020937"},{"key":"ref_12","unstructured":"Winograd, T. (1971). Procedures as a Representation for Data in a Computer Program for Understanding Natural Language, MIT AI Technical."},{"key":"ref_13","unstructured":"Chinchor, N., Hirschman, L., and Lewis, D.D. (1993, January 21\u201324). Evaluating message understanding systems: An analysis of the Message Understanding Conference (MUC) results. Proceedings of the Workshop on Human Language Technology, Stroudsburg, PA, USA."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"361","DOI":"10.1016\/S0959-440X(96)80056-X","article-title":"Hidden Markov models","volume":"6","author":"Eddy","year":"1996","journal-title":"Curr. Opin. Struct. Biol."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1075\/li.30.1.03nad","article-title":"A survey of named entity recognition and classification","volume":"30","author":"Nadeau","year":"2007","journal-title":"Lingvisticae Investig."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Mandic, D.P., and Chambers, J. (2001). Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures, and Stability, John Wiley & Sons, Inc.","DOI":"10.1002\/047084535X"},{"key":"ref_17","unstructured":"Egan, S., Fedorko, W., Lister, A., Pearkes, J., and Gay, C. (2017). Long short-term memory (LSTM) networks with jet constituents for boosted top tagging at the LHC. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3560260","article-title":"QA dataset explosion: A taxonomy of NLP resources for question answering and reading comprehension","volume":"55","author":"Rogers","year":"2023","journal-title":"ACM Comput. Surv."},{"key":"ref_19","unstructured":"Derczynski, L., Bontcheva, K., and Roberts, I. (2016, January 11\u201316). Broad Twitter corpus: A diverse named entity recognition resource. Proceedings of the COLING 2016, The 26th International Conference on Computational Linguistics, Osaka, Japan."},{"key":"ref_20","unstructured":"Jaswani, N. (2025, June 29). NER Dataset. Kaggle. Available online: https:\/\/www.kaggle.com\/datasets\/namanj27\/ner-dataset."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Tjong Kim Sang, E.F., and De Meulder, F. (2003, January 31). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Stroudsburg, PA, USA. Available online: https:\/\/aclanthology.org\/W03-0419.","DOI":"10.3115\/1119176.1119195"},{"key":"ref_22","unstructured":"Weisstein, E.W. (2025, June 29). \u201cNorm\u201d. MathWorld. Available online: http:\/\/mathworld.wolfram.com\/Norm.html."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Hastie, T., Tibshirani, R., and Friedman, J.H. (2001). The Elements of Statistical Learning, Springer.","DOI":"10.1007\/978-0-387-21606-5"},{"key":"ref_24","unstructured":"Dhinakaran, A. (2025, June 29). Understanding KL Divergence. Towards Data Science. Available online: https:\/\/towardsdatascience.com\/understanding-kl-divergence-f3ddc8dff254."},{"key":"ref_25","first-page":"146","article-title":"I-divergence geometry of probability distributions and minimization problems","volume":"3","year":"1975","journal-title":"Ann. Probab."},{"key":"ref_26","unstructured":"Ke, W. (2012). Least information modeling for information retrieval. arXiv, Available online: https:\/\/arxiv.org\/pdf\/1205.0312."},{"key":"ref_27","unstructured":"Ke, W. (2025, June 29). Beyond Cross-Entropy: DLITE Loss and the Impact of Loss Functions on AI-Driven Named Entity Recognition. [Computer Software]. GitHub. Available online: https:\/\/github.com\/keweimao\/DeepDelight\/tree\/main\/Thread4\/Beyond%20Cross-Entropy%3A%20DLITE%20Loss%20and%20the%20Impact%20of%20Loss%20Functions%20on%20AI-Driven%20Named%20Entity%20Recognition."},{"key":"ref_28","unstructured":"OpenAI (2023). GPT-4 Technical Report (Tech. Rep.). arXiv."},{"key":"ref_29","unstructured":"OpenAI (2025, June 29). ChatGPT [Large Language Model]. Available online: https:\/\/chat.openai.com\/."},{"key":"ref_30","unstructured":"Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., and Macherey, K. (2016). Google\u2019s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv, Available online: https:\/\/arxiv.org\/abs\/1609.08144."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"S2","DOI":"10.1186\/1758-2946-7-S1-S2","article-title":"The CHEMDNER corpus of chemicals and drugs and its annotation principles","volume":"7","author":"Krallinger","year":"2015","journal-title":"J. Cheminform."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"e12239","DOI":"10.2196\/12239","article-title":"Natural language processing of clinical notes on chronic diseases: Systematic review","volume":"7","author":"Sheikhalishahi","year":"2019","journal-title":"JMIR Med. Inform."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Ghosh, A., Kumar, H., and Sastry, P.S. (2017, January 4\u20139). Robust loss functions under label noise for deep neural networks. Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA.","DOI":"10.1609\/aaai.v31i1.10894"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"i180","DOI":"10.1093\/bioinformatics\/btg1023","article-title":"GENIA corpus\u2014A semantically annotated corpus for bio-textmining","volume":"19","author":"Kim","year":"2003","journal-title":"Bioinformatics"},{"key":"ref_35","unstructured":"Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2021). CUAD: An expert-annotated NLP dataset for legal contract review. arXiv, Available online: https:\/\/arxiv.org\/abs\/2103.06268."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., and Androutsopoulos, I. (2022, January 22\u201327). LexGLUE: A benchmark dataset for legal language understanding in English. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Dublin, Ireland.","DOI":"10.18653\/v1\/2022.acl-long.297"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Wiegreffe, S., and Pinter, Y. (2019, January 3\u20137). Attention is not not explanation. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, China.","DOI":"10.18653\/v1\/D19-1002"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"111022","DOI":"10.1016\/j.knosys.2023.111022","article-title":"OWAdapt: An adaptive loss function for deep learning using OWA operators","volume":"280","author":"Maldonado","year":"2023","journal-title":"Knowl.-Based Syst."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Janocha, K., and Czarnecki, W.M. (2017). On loss functions for deep neural networks in classification. arXiv, Available online: https:\/\/arxiv.org\/abs\/1702.05659.","DOI":"10.4467\/20838476SI.16.004.6185"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/9\/760\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:37:46Z","timestamp":1760035066000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/9\/760"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,2]]},"references-count":39,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2025,9]]}},"alternative-id":["info16090760"],"URL":"https:\/\/doi.org\/10.3390\/info16090760","relation":{},"ISSN":["2078-2489"],"issn-type":[{"type":"electronic","value":"2078-2489"}],"subject":[],"published":{"date-parts":[[2025,9,2]]}}}