{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T15:57:55Z","timestamp":1774367875111,"version":"3.50.1"},"reference-count":51,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,4,21]],"date-time":"2025-04-21T00:00:00Z","timestamp":1745193600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"governments of Czechia, Hungary, Poland, and Slovakia","award":["#22310057"],"award-info":[{"award-number":["#22310057"]}]},{"name":"governments of Czechia, Hungary, Poland, and Slovakia","award":["RRF-2.3.1-21-2022-00004"],"award-info":[{"award-number":["RRF-2.3.1-21-2022-00004"]}]},{"name":"governments of Czechia, Hungary, Poland, and Slovakia","award":["151324"],"award-info":[{"award-number":["151324"]}]},{"name":"Hungarian Academy of Sciences, MOMENTUM V-SHIFT","award":["#22310057"],"award-info":[{"award-number":["#22310057"]}]},{"name":"Hungarian Academy of Sciences, MOMENTUM V-SHIFT","award":["RRF-2.3.1-21-2022-00004"],"award-info":[{"award-number":["RRF-2.3.1-21-2022-00004"]}]},{"name":"Hungarian Academy of Sciences, MOMENTUM V-SHIFT","award":["151324"],"award-info":[{"award-number":["151324"]}]},{"name":"Ministry of Innovation and Technology National Research, Development and Innovation (NRDI) Office","award":["#22310057"],"award-info":[{"award-number":["#22310057"]}]},{"name":"Ministry of Innovation and Technology National Research, Development and Innovation (NRDI) Office","award":["RRF-2.3.1-21-2022-00004"],"award-info":[{"award-number":["RRF-2.3.1-21-2022-00004"]}]},{"name":"Ministry of Innovation and Technology National Research, Development and Innovation (NRDI) Office","award":["151324"],"award-info":[{"award-number":["151324"]}]},{"name":"Hungarian National Research, Development and Innovation Office\u2019s National Research Excellence 
Programme","award":["#22310057"],"award-info":[{"award-number":["#22310057"]}]},{"name":"Hungarian National Research, Development and Innovation Office\u2019s National Research Excellence Programme","award":["RRF-2.3.1-21-2022-00004"],"award-info":[{"award-number":["RRF-2.3.1-21-2022-00004"]}]},{"name":"Hungarian National Research, Development and Innovation Office\u2019s National Research Excellence Programme","award":["151324"],"award-info":[{"award-number":["151324"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Emotion classification in natural language processing (NLP) has recently witnessed significant advancements. However, class imbalance in emotion datasets remains a critical challenge, as dominant emotion categories tend to overshadow less frequent ones, leading to biased model predictions. Traditional techniques, such as undersampling and oversampling, offer partial solutions. More recently, synthetic data generation using large language models (LLMs) has emerged as a promising strategy for augmenting minority classes and improving model robustness. In this study, we investigate the impact of synthetic data augmentation on German-language emotion classification. Using an imbalanced dataset, we systematically evaluate multiple balancing strategies, including undersampling overrepresented classes and generating synthetic data for underrepresented emotions using a GPT-4\u2013based model in a few-shot prompting setting. Beyond enhancing model performance, we conduct a detailed linguistic analysis of the synthetic samples, examining their lexical diversity, syntactic structures, and semantic coherence to determine their contribution to overall model generalization. Our results demonstrate that integrating synthetic data significantly improves classification performance, particularly for minority emotion categories, while maintaining overall model stability. 
However, our linguistic evaluation reveals that synthetic examples exhibit reduced lexical diversity and simplified syntactic structures, which may introduce limitations in certain real-world applications. These findings highlight both the potential and the challenges of synthetic data augmentation in emotion classification. By providing a comprehensive evaluation of balancing techniques and the linguistic properties of generated text, this study contributes to the ongoing discourse on improving NLP models for underrepresented linguistic phenomena.<\/jats:p>","DOI":"10.3390\/info16040330","type":"journal-article","created":{"date-parts":[[2025,4,21]],"date-time":"2025-04-21T20:38:26Z","timestamp":1745267906000},"page":"330","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["Evaluating the Impact of Synthetic Data on Emotion Classification: A Linguistic and Structural Analysis"],"prefix":"10.3390","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5897-9379","authenticated-orcid":false,"given":"Istv\u00e1n","family":"\u00dcveges","sequence":"first","affiliation":[{"name":"HUN-REN Centre for Social Sciences, T\u00f3th K\u00e1lm\u00e1n u. 4, 1097 Budapest, Hungary"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3710-1118","authenticated-orcid":false,"given":"Orsolya","family":"Ring","sequence":"additional","affiliation":[{"name":"HUN-REN Centre for Social Sciences, T\u00f3th K\u00e1lm\u00e1n u. 4, 1097 Budapest, Hungary"}]}],"member":"1968","published-online":{"date-parts":[[2025,4,21]]},"reference":[{"key":"ref_1","first-page":"1877","article-title":"Language Models are Few-Shot Learners","volume":"Volume 33","author":"Larochelle","year":"2020","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"ref_2","unstructured":"Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). 
RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv."},{"key":"ref_3","first-page":"1","article-title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"J. Mach. Learn. Res."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"64","DOI":"10.1177\/1081180X05286065","article-title":"Understanding variations in media coverage of US Supreme Court decisions: Comparing media outlets in their coverage of Lawrence v. Texas","volume":"11","author":"Allen","year":"2006","journal-title":"Harv. Int. J. Press."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1561\/1500000011","article-title":"Opinion Mining and Sentiment Analysis","volume":"2","author":"Pang","year":"2008","journal-title":"Found. Trends Inf. Retr."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"615","DOI":"10.1177\/009365002237829","article-title":"Cynical and engaged: Strategic campaign coverage, public opinion, and mobilization in a referendum","volume":"29","author":"Semetko","year":"2002","journal-title":"Commun. Res."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1080\/07343460509507679","article-title":"Reporting on two presidencies: News coverage of George W. Bush\u2019s first year in office","volume":"32","author":"Farnsworth","year":"2005","journal-title":"Congr. Pres."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"223","DOI":"10.1080\/09296174.2014.911506","article-title":"Can Type-Token Ratio be Used to Show Morphological Complexity of Languages?","volume":"21","author":"Kettunen","year":"2014","journal-title":"J. Quant. Linguist."},{"key":"ref_9","unstructured":"Rogers, A., Boyd-Graber, J., and Okazaki, N. (2023, January 9\u201314). S2ynRE: Two-stage Self-training with Synthetic data for Low-resource Relation Extraction. 
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada."},{"key":"ref_10","unstructured":"Hancock, J.M. (2025, March 13). Jaccard Distance (Jaccard Index, Jaccard Similarity Coefficient). Available online: https:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/9780471650126.dob0956."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"972","DOI":"10.1111\/j.1460-2466.2002.tb02584.x","article-title":"Mediatization of politics: Theory and data","volume":"52","author":"Kepplinger","year":"2002","journal-title":"J. Commun."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1177\/0267323107076770","article-title":"Are Sensational News Stories More Likely to Trigger Viewers\u2019 Emotions than Non-Sensational News Stories? A Content Analysis of British TV News","volume":"22","author":"Uribe","year":"2007","journal-title":"Eur. J. Commun."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"2623","DOI":"10.1007\/s11135-016-0412-4","article-title":"Sentiment analysis of political communication: Combining a dictionary approach with crowdcoding","volume":"51","author":"Haselmayer","year":"2017","journal-title":"Qual. Quant."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1080\/19312458.2019.1671966","article-title":"What\u2019s the Tone? Easy Doesn\u2019t Do It: Analyzing Performance and Agreement Between Off-the-Shelf Sentiment Analysis Tools","volume":"14","author":"Boukes","year":"2020","journal-title":"Commun. Methods Meas."},{"key":"ref_15","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 
arXiv, 4171\u20134186."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"319","DOI":"10.1080\/19331681.2018.1485608","article-title":"Validating a sentiment dictionary for German political language\u2014A workbench note","volume":"15","author":"Rauh","year":"2018","journal-title":"J. Inf. Technol. Politics"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"626","DOI":"10.1017\/pan.2022.15","article-title":"Creating and Comparing Dictionary, Word Embedding, and Transformer-Based Models to Measure Discrete Emotions in German Political Text","volume":"31","author":"Widmann","year":"2022","journal-title":"Political Anal."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"858","DOI":"10.1108\/14684521211287936","article-title":"Sentiment analysis of online news text: A case study of appraisal theory","volume":"36","author":"Khoo","year":"2012","journal-title":"Online Inf. Rev."},{"key":"ref_19","first-page":"1","article-title":"Rethinking Sentiment Analysis in the News: From Theory to Practice and back","volume":"9","author":"Balahur","year":"2009","journal-title":"Proceed. WOMSA"},{"key":"ref_20","unstructured":"Mullen, T., and Malouf, R. (2006, January 27\u201329). A Preliminary Investigation into Sentiment Analysis of Informal Political Discourse. Proceedings of the Computational Approaches to Analyzing Weblogs, Stanford, CA, USA."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"73","DOI":"10.1080\/19331680802154145","article-title":"Good News or Bad News? Conducting Sentiment Analysis on Dutch Text to Distinguish Between Positive and Negative Relations","volume":"5","author":"Kleinnijenhuis","year":"2008","journal-title":"J. Inf. Technol. Politics"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Kaya, M., Fidan, G., and Toroslu, I.H. (2012, January 4\u20137). Sentiment analysis of Turkish political news. 
Proceedings of the 2012 IEEE\/WIC\/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Macau, China.","DOI":"10.1109\/WI-IAT.2012.115"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Sa\u011flam, F., Sever, H., and Gen\u00e7, B. (December, January 29). Developing Turkish sentiment lexicon for sentiment analysis using online news media. Proceedings of the 2016 IEEE\/ACS 13th International Conference of Computer Systems and Applications (AICCSA), Agadir, Morocco.","DOI":"10.1109\/AICCSA.2016.7945670"},{"key":"ref_24","unstructured":"Bakken, P.F., Bratlie, T.A., S\u00e1nchez-Marco, C., and Gulla, J.A. (2016, January 11\u201316). Political news sentiment analysis for under-resourced languages. Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Biba, M., and Mane, M. (2013, January 23\u201324). Sentiment analysis through machine learning: An experimental evaluation for Albanian. Proceedings of the Recent Advances in Intelligent Informatics: Proceedings of the Second International Symposium on Intelligent Informatics (ISI\u201913), Mysore, India.","DOI":"10.1007\/978-3-319-01778-5_20"},{"key":"ref_26","unstructured":"Bobicev, V., and Sokolova, M. (2017, January 2\u20138). Inter-Annotator Agreement in Sentiment Analysis: Machine Learning Perspective. Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, Varna, Bulgaria."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Suryono, R.R., and Indra, B. (2019, January 16). P2P Lending sentiment analysis in Indonesian online news. 
Proceedings of the Sriwijaya International Conference on Information Technology and Its Applications (SICONIAN 2019), Palembang, Indonesia.","DOI":"10.2991\/aisr.k.200424.006"},{"key":"ref_28","first-page":"1","article-title":"Discovered and Undiscovered Fields of Digital Politics: Mapping Online Political Communication and Online News Media Literature in Hungary","volume":"7","author":"Bene","year":"2021","journal-title":"Intersections. East Eur. J. Soc. Politics"},{"key":"ref_29","first-page":"5","article-title":"Emotional communication and participation in politics","volume":"6","year":"2020","journal-title":"Intersections. East Eur. J. Soc. Politics"},{"key":"ref_30","unstructured":"Szab\u00f3, G., and Szil\u00e1gyi, S. (2025, April 15). Mor\u00e1l a M\u00e9di\u00e1ban: Az Ukrajnai H\u00e1bor\u00fa az Online H\u00edrport\u00e1lokon a 2022-es Orsz\u00e1ggy\u0171l\u00e9si Kamp\u00e1ny Idej\u00e9n. Available online: https:\/\/real.mtak.hu\/154704\/."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"60267","DOI":"10.1109\/ACCESS.2023.3285536","article-title":"HunEmBERT: A fine-tuned BERT-model for classifying sentiment and emotion in political communication","volume":"11","author":"Ring","year":"2023","journal-title":"IEEE Access"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Al-Twairesh, N. (2021). The Evolution of Language Models Applied to Emotion Analysis of Arabic Tweets. Information, 12.","DOI":"10.3390\/info12020084"},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"309","DOI":"10.1207\/s15506878jobem4703_1","article-title":"Media, Terrorism, and Emotionality: Emotional Differences in Media Content and Public Reactions to the September 11th Terrorist Attacks","volume":"47","author":"Cho","year":"2003","journal-title":"J. Broadcast. Electron. 
Media"},{"key":"ref_34","first-page":"64","article-title":"Reader Perspective Emotion Analysis in Text through Ensemble based Multi-Label Classification Framework","volume":"2","author":"Bhowmick","year":"2009","journal-title":"Comput. Inf. Sci."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Boomgaarden, H.G., and Schmitt-Beck, R. (2019). The Media and Political Behavior. Oxford Research Encyclopedia of Politics, Oxford University Press.","DOI":"10.1093\/acrefore\/9780190228637.013.621"},{"key":"ref_36","unstructured":"Kuila, A., and Sarkar, S. (2024). Deciphering Political Entity Sentiment in News with Large Language Models: Zero-Shot and Few-Shot Strategies. arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Rozado, D., Hughes, R., and Halberstadt, J. (2022). Longitudinal analysis of sentiment and emotion in news media headlines using automated labelling with Transformer language models. PLoS ONE, 17.","DOI":"10.1371\/journal.pone.0276367"},{"key":"ref_38","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. arXiv."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzm\u00e1n, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. (2020). Unsupervised Cross-lingual Representation Learning at Scale. arXiv.","DOI":"10.18653\/v1\/2020.acl-main.747"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"84","DOI":"10.1145\/3065386","article-title":"ImageNet classification with deep convolutional neural networks","volume":"60","author":"Krizhevsky","year":"2017","journal-title":"Commun. ACM"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Wei, J., and Zou, K. (2019). EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. 
arXiv.","DOI":"10.18653\/v1\/D19-1670"},{"key":"ref_42","unstructured":"Jurafsky, D., and Martin, J.H. (2025, April 15). Speech and Language Processing. Available online: https:\/\/web.stanford.edu\/~jurafsky\/slp3\/."},{"key":"ref_43","unstructured":"Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. (2018, January 7\u201312). Advances in pre-training distributed word representations. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C.D. (2014, January 25\u201329). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Feng, S., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., and Hovy, E. (2021, January 1\u20136). A Survey of Data Augmentation Approaches for NLP. Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online.","DOI":"10.18653\/v1\/2021.findings-acl.84"},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"5652","DOI":"10.1109\/TIP.2018.2861573","article-title":"A Unified Approach for Conventional Zero-Shot, Generalized Zero-Shot, and Few-Shot Learning","volume":"27","author":"Rahman","year":"2018","journal-title":"IEEE Trans. Image Process."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Li, Z., Zhu, H., Lu, Z., and Yin, M. (2023, January 16). Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore.","DOI":"10.18653\/v1\/2023.emnlp-main.647"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Pramana, R., Subroto, J.J., Gunawan, A.A.S. 
(2022, January 4\u20135). Systematic Literature Review of Stemming and Lemmatization Performance for Sentence Similarity. Proceedings of the 2022 IEEE 7th International Conference on Information Technology and Digital Applications (ICITDA), Yogyakarta, Indonesia.","DOI":"10.1109\/ICITDA55840.2022.9971451"},{"key":"ref_49","unstructured":"Manning, C.D., and Sch\u00fctze, H. (1999). Foundations of Statistical Natural Language Processing, The MIT Press."},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Sidorov, G., Gomez-Adorno, H., Markov, I., Pinto, D., and Loya, N. (2015, January 17\u201319). Computing text similarity using Tree Edit Distance. Proceedings of the 2015 Annual Conference of the North American Fuzzy Information Processing Society (NAFIPS) Held Jointly with 2015 5th World Conference on Soft Computing (WConSC), Redmond, WA, USA.","DOI":"10.1109\/NAFIPS-WConSC.2015.7284129"},{"key":"ref_51","first-page":"351","article-title":"Plagiarism Detection Using Machine Learning-Based Paraphrase Recognizer","volume":"25","author":"Chitra","year":"2016","journal-title":"J. Intell. Syst."}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/4\/330\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:18:45Z","timestamp":1760030325000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/16\/4\/330"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,21]]},"references-count":51,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,4]]}},"alternative-id":["info16040330"],"URL":"https:\/\/doi.org\/10.3390\/info16040330","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,4,21]]}}}