{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,25]],"date-time":"2026-03-25T14:09:52Z","timestamp":1774447792898,"version":"3.50.1"},"reference-count":17,"publisher":"MDPI AG","issue":"2","license":[{"start":{"date-parts":[[2026,2,1]],"date-time":"2026-02-01T00:00:00Z","timestamp":1769904000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"state assignment of the Ministry of Science and Higher Education of the Russian Federation for the Federal Research Center for Information and Computational Technologies"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>This data descriptor presents two fully synthetic corpora for sentiment analysis and named entity recognition (NER) in Uzbek. The first corpus contains 12,000 hybrid synthetic sentences generated from templates with lexical randomization, automatic insertion of named entities (PER\/ORG\/LOC), lexicon-based polarity scoring, and a controlled emoji distribution. The second corpus includes 3000 \u201cmanual-style\u201d sentences designed to resemble short, naturally structured messages. Although the manual-style subset was initially intended to be emoji-free, the released version includes a 39.6% emoji presence (sentences containing at least one emoji) to maintain comparability in emotional markers across corpora. Both corpora are released in CSV, XLSX, and JSONL formats and share a unified schema (id, text, sentiment, entities, entity_type, polarity_score, polarity_source, token_count, emojis, emoji_position, emoji_sentiment, conflict_flag, sentiment_from_polarity_score, split). The dataset is publicly available via Mendeley Data (DOI: 10.17632\/y2d5pcyrzz.3).<\/jats:p>","DOI":"10.3390\/data11020028","type":"journal-article","created":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T12:49:44Z","timestamp":1770036584000},"page":"28","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Dual-Source Synthetic Uzbek Corpora for Sentiment Analysis and NER with Controlled Emoji Signals"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-5540-2013","authenticated-orcid":false,"given":"Bobur","family":"Saidov","sequence":"first","affiliation":[{"name":"Faculty of Mechanics and Mathematics, Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3299-0507","authenticated-orcid":false,"given":"Vladimir","family":"Barakhnin","sequence":"additional","affiliation":[{"name":"Faculty of Mechanics and Mathematics, Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia"},{"name":"Federal Research Center for Information and Computational Technologies, Novosibirsk 630090, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-2128-8769","authenticated-orcid":false,"given":"Shohrux","family":"Madirimov","sequence":"additional","affiliation":[{"name":"Faculty of Mechanics and Mathematics, Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia"},{"name":"Tashkent Institute of Textile and Light Industry, Tashkent 100100, Uzbekistan"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-9995-7202","authenticated-orcid":false,"given":"Umid","family":"Ibragimov","sequence":"additional","affiliation":[{"name":"Faculty of Mechanics and Mathematics, Novosibirsk State University, 1 Pirogova str., Novosibirsk 630090, Russia"}]},{"given":"Shakhboz","family":"Meylikulov","sequence":"additional","affiliation":[{"name":"Department of Information Technology and Exact Sciences, Termez University of Economics and Service, 38-B, Ibn-Sino str., Termez 190100, Uzbekistan"}]},{"given":"Sultonbek","family":"Normamatov","sequence":"additional","affiliation":[{"name":"Department of Computer Linguistics and Digital Technologies, Faculty of Social and Humanitarian Sciences, Alisher Navo\u2032i Tashkent State University of Uzbek Language and Literature, 103, Yusuf Xos Khojib Str., Tashkent 100013, Uzbekistan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3565-5432","authenticated-orcid":false,"given":"Feruza","family":"Bahodirova","sequence":"additional","affiliation":[{"name":"Department of Interfaculty Foreign Languages, Urgench State University, 14, Kh. Alimdjan str., Urgench 220100, Uzbekistan"}]},{"given":"Javlonbek","family":"Matnazarov","sequence":"additional","affiliation":[{"name":"Department of Language and Literature, Mamun University, 2, Bol-xovuz str., Khiva 220901, Uzbekistan"}]},{"given":"Zarnigor","family":"Fayzullaeva","sequence":"additional","affiliation":[{"name":"Department of Software Engineering, Tashkent University of Information Technologies Named After Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan"}]}],"member":"1968","published-online":{"date-parts":[[2026,2,1]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Saidov, B.R., Barakhnin, V.B., Rixsibayev, U.T., Sharipov, E.J., Sobirov, O.O., and Bekchanov, M.K. (July, January 27). Methods of Automatic Selection of Named Entities (NER) in Uzbek Language for Text Tone Analysis. Proceedings of the 2025 IEEE 26th International Conference of Young Professionals in Electron Devices and Materials (EDM), Altai Republic, Russia.","DOI":"10.1109\/EDM65517.2025.11096748"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"110413","DOI":"10.1016\/j.dib.2024.110413","article-title":"Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation","volume":"54","author":"Mengliev","year":"2024","journal-title":"Data Brief"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"232","DOI":"10.1007\/978-3-031-05328-3_15","article-title":"Construction and Evaluation of Sentiment Datasets for Low-Resource Languages: The Case of Uzbek","volume":"Volume 13212","author":"Kuriyozov","year":"2022","journal-title":"Human Language Technology. Challenges for Computer Science and Linguistics (LTC 2019)"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Mengliev, D., Abdurakhmonova, N., Allamov, O., Ibragimov, B., Saidov, B., and Boltayev, N. (2025, January 1\u20135). Development of a Hybrid Algorithm for Identifying Named Entities in 20th Century Uzbek Texts. Proceedings of the AIP Conference Proceedings, Wollongong, Australia.","DOI":"10.1063\/5.0296177"},{"key":"ref_5","unstructured":"Sharipov, M., Kuriyozov, E., Yuldashev, O., and Sobirov, O. (2023). UzbekTagger: The Rule-Based POS Tagger for the Uzbek Language. arXiv."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Mengliev, D., Barakhnin, V., Madirimov, S., Ibragimov, B., Eshkulov, M., and Saidov, B. (2024, January 22\u201323). Unveiling the Variance of Uzbek Language: A Rule-Based Algorithm for Dialect Recognition. Proceedings of the AIP Conference Proceedings, Melbourne, Australia.","DOI":"10.1063\/5.0241409"},{"key":"ref_7","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Johanson, L., and Csat\u00f3, \u00c9.\u00c1. (2015). The Turkic Languages, Routledge.","DOI":"10.4324\/9780203066102"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"160018","DOI":"10.1038\/sdata.2016.18","article-title":"The FAIR Guiding Principles for scientific data management and stewardship","volume":"3","author":"Wilkinson","year":"2016","journal-title":"Sci. Data"},{"key":"ref_10","unstructured":"Unicode Consortium (2026, January 08). Full Emoji List, v17.0. Available online: https:\/\/unicode.org\/emoji\/charts\/full-emoji-list.html."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Mengliev, D., Barakhnin, V.B., Saidov, B.R., and Ibragimov, B.B. (October, January 30). A Computational Approach to Recognizing Poetry Genres in Uzbek Texts. Proceedings of the 2024 IEEE International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON), Novosibirsk, Russia.","DOI":"10.1109\/SIBIRCON63777.2024.10758540"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Saidov, B.R., Barakhnin, V.B., Sharipov, E.J., Maksetbaev, A.B., Ruzimov, J.O., and Abdullayev, R.M. (2024, January 15\u201317). Development and Realization of Software Application for Syntax Checking of Karakalpak Language Text. Proceedings of the 2024 IEEE 3rd International Conference on Problems of Informatics, Electronics and Radio Engineering (PIERE), Novosibirsk, Russia.","DOI":"10.1109\/PIERE62470.2024.10804984"},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"111702","DOI":"10.1016\/j.dib.2025.111702","article-title":"An annotated morphological dataset for Uzbek word forms: Towards rule-based and machine learning approaches","volume":"61","author":"Abdurakhmonova","year":"2025","journal-title":"Data Brief"},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"109118","DOI":"10.1016\/j.dib.2023.109118","article-title":"A Twitter dataset for Monkeypox, May 2022","volume":"48","author":"Nia","year":"2023","journal-title":"Data Brief"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"106179","DOI":"10.1016\/j.dib.2020.106179","article-title":"A first public dataset from Brazilian Twitter and news on COVID-19 in Portuguese","volume":"32","author":"Figueiredo","year":"2020","journal-title":"Data Brief"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"111249","DOI":"10.1016\/j.dib.2024.111249","article-title":"A comprehensive dataset and neural network approach for named entity recognition in the Uzbek language","volume":"58","author":"Mengliev","year":"2024","journal-title":"Data Brief"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"111004","DOI":"10.1016\/j.dib.2024.111004","article-title":"Arabic paraphrased parallel synthetic dataset","volume":"57","year":"2024","journal-title":"Data Brief"}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/11\/2\/28\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T09:24:38Z","timestamp":1770197078000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/11\/2\/28"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,1]]},"references-count":17,"journal-issue":{"issue":"2","published-online":{"date-parts":[[2026,2]]}},"alternative-id":["data11020028"],"URL":"https:\/\/doi.org\/10.3390\/data11020028","relation":{},"ISSN":["2306-5729"],"issn-type":[{"value":"2306-5729","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,1]]}}}