{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T01:53:25Z","timestamp":1760234005577,"version":"build-2065373602"},"reference-count":49,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2021,3,14]],"date-time":"2021-03-14T00:00:00Z","timestamp":1615680000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Committee of Science under the Ministry of Education and Science of the Republic of Kazakhstan","award":["AP08856034"],"award-info":[{"award-number":["AP08856034"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>Mass media is one of the most important elements influencing the information environment of society. The mass media is not only a source of information about what is happening but is often the authority that shapes the information agenda, the boundaries, and forms of discussion on socially relevant topics. A multifaceted and, where possible, quantitative assessment of mass media performance is crucial for understanding their objectivity, tone, thematic focus, and quality. The paper presents a corpus of Kazakhstan media, which contains over 4 million publications from 36 primary sources (each of which has at least 500 publications). The corpus also includes more than 2 million texts of Russian media for comparative analysis of publication activity of the countries, as well as about 4000 sections of state policy documents. 
The paper briefly describes the natural language processing and multiple-criteria decision-making methods that form the algorithmic basis of the text and mass media evaluation method, and describes the results of several research cases, such as identification of propaganda, assessment of the tone of publications, calculation of the level of socially relevant negativity, and comparative analysis of publication activity in the field of renewable energy. Experiments confirm that socially significant news, texts with propagandistic content, and the sentiment of publications can in general be evaluated using the topic model of the text corpus: area under the receiver operating characteristic curve (ROC AUC) values of 0.81, 0.73, and 0.93 were achieved on these tasks. The described cases do not exhaust the possibilities of thematic, tonal, dynamic, etc., analysis of the considered corpus of texts. The corpus will be interesting to researchers considering both multiple publications and mass media analysis, including comparative analysis and identification of common patterns inherent in the media of different countries.<\/jats:p>","DOI":"10.3390\/data6030031","type":"journal-article","created":{"date-parts":[[2021,3,14]],"date-time":"2021-03-14T22:13:10Z","timestamp":1615759990000},"page":"31","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["KazNewsDataset: Single Country Overall Digital Mass Media Publication Corpus"],"prefix":"10.3390","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7378-9212","authenticated-orcid":false,"given":"Kirill","family":"Yakunin","sequence":"first","affiliation":[{"name":"Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan"},{"name":"Institute of Cybernetics and Information Technology, Satbayev University (KazNRTU), Almaty 050013, 
Kazakhstan"}]},{"given":"Maksat","family":"Kalimoldayev","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3727-043X","authenticated-orcid":false,"given":"Ravil I.","family":"Mukhamediev","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan"},{"name":"Institute of Cybernetics and Information Technology, Satbayev University (KazNRTU), Almaty 050013, Kazakhstan"},{"name":"Department of Natural Science and Computer Technologies, ISMA University, LV-1011 Riga, Latvia"}]},{"given":"Rustam","family":"Mussabayev","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3299-0507","authenticated-orcid":false,"given":"Vladimir","family":"Barakhnin","sequence":"additional","affiliation":[{"name":"Federal Research Center for Information and Computational Technologies, 630090 Novosibirsk, Russia"},{"name":"Department of Information Technologies, Novosibirsk State University, 630090 Novosibirsk, Russia"}]},{"given":"Yan","family":"Kuchin","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5271-9071","authenticated-orcid":false,"given":"Sanzhar","family":"Murzakhmetov","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan"}]},{"given":"Timur","family":"Buldybayev","sequence":"additional","affiliation":[{"name":"Information-Analytical Center, Nur-Sultan 010000, Kazakhstan"}]},{"given":"Ulzhan","family":"Ospanova","sequence":"additional","affiliation":[{"name":"Information-Analytical Center, Nur-Sultan 010000, 
Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4203-800X","authenticated-orcid":false,"given":"Marina","family":"Yelis","sequence":"additional","affiliation":[{"name":"Institute of Cybernetics and Information Technology, Satbayev University (KazNRTU), Almaty 050013, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1389-239X","authenticated-orcid":false,"given":"Akylbek","family":"Zhumabayev","sequence":"additional","affiliation":[{"name":"Institute of Cybernetics and Information Technology, Satbayev University (KazNRTU), Almaty 050013, Kazakhstan"}]},{"given":"Viktors","family":"Gopejenko","sequence":"additional","affiliation":[{"name":"Department of Natural Science and Computer Technologies, ISMA University, LV-1011 Riga, Latvia"},{"name":"International Radio Astronomy Centre, Ventspils University of Applied Sciences, LV-3601 Ventspils, Latvia"}]},{"given":"Zhazirakhanym","family":"Meirambekkyzy","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies, Almaty 050010, Kazakhstan"}]},{"given":"Alibek","family":"Abdurazakov","sequence":"additional","affiliation":[{"name":"Institute of Cybernetics and Information Technology, Satbayev University (KazNRTU), Almaty 050013, Kazakhstan"}]}],"member":"1968","published-online":{"date-parts":[[2021,3,14]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"357","DOI":"10.1016\/j.eswa.2018.07.063","article-title":"Document-based topic coherence measures for news media text","volume":"114","author":"Ristov","year":"2018","journal-title":"Expert Syst. Appl."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"102048","DOI":"10.1016\/j.ijinfomgt.2019.102048","article-title":"Big data analytics and international negotiations: Sentiment analysis of Brexit negotiating outcomes","volume":"51","author":"Georgiadou","year":"2020","journal-title":"Int. J. Inf. Manag."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Neuendorf, A. 
(2016). The Content Analysis Guidebook, Sage.","DOI":"10.4135\/9781071802878"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"102","DOI":"10.1080\/21670811.2012.714928","article-title":"Research methods in the age of digital journalism: Massive-scale automated analysis of news-content topics, style and gender","volume":"1","author":"Flaounas","year":"2013","journal-title":"Digit. Journal."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"689","DOI":"10.1016\/j.dss.2012.05.029","article-title":"Creating sentiment dictionaries via triangulation","volume":"53","author":"Steinberger","year":"2012","journal-title":"Decis. Support Syst."},{"key":"ref_6","unstructured":"Vossen, P., Rigau, G., Serafini, L., Stouten, P., Irving, F., and Van Hage, W.R. (2014, January 26\u201331). NewsReader: Recording history from daily news streams. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), Reykjavik, Iceland."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"3168","DOI":"10.1016\/j.eswa.2013.11.020","article-title":"Modeling and broadening temporal user interest in personalized news recommendation","volume":"41","author":"Li","year":"2014","journal-title":"Expert Syst. Appl."},{"key":"ref_8","first-page":"519","article-title":"Enter the robot journalist: Users\u2019 perceptions of automated content","volume":"8","author":"Clerwall","year":"2014","journal-title":"Journal. Pract."},{"key":"ref_9","unstructured":"Popescu, O., and Strapparava, C. (2017, January 7). Natural Language Processing meets Journalism. Proceedings of the 2017 EMNLP Workshop, Copenhagen, Denmark."},{"key":"ref_10","first-page":"233","article-title":"A Survey report on Evolution of Machine Translation","volume":"9","author":"Sreelekha","year":"2016","journal-title":"Int. J. 
Control Theory Appl."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"895","DOI":"10.3233\/SW-160247","article-title":"Survey on challenges of Question Answering in the Semantic Web","volume":"8","author":"Walter","year":"2017","journal-title":"Semant. Web"},{"key":"ref_12","unstructured":"Jurafsky, D., and Martin, J.H. (2014). Speech and Language Processing, Pearson."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"778","DOI":"10.26483\/ijarcs.v9i1.5505","article-title":"A survey paper on information retrieval system","volume":"9","author":"Deo","year":"2018","journal-title":"Int. J. Adv. Res. Comput. Sci."},{"key":"ref_14","unstructured":"Shokin, Y.I., Fedotov, A.M., and Barakhnin, V.B. (2010). Problems in Finding Information, Science."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1016\/j.inffus.2016.10.004","article-title":"A review of natural language processing techniques for opinion mining systems","volume":"36","author":"Sun","year":"2017","journal-title":"Inf. Fusion"},{"key":"ref_16","unstructured":"Manning, C., and Schutze, H. (1999). Foundations of Statistical Natural Language Processing, MIT Press."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"345","DOI":"10.1613\/jair.4992","article-title":"A primer on neural network models for natural language processing","volume":"57","author":"Goldberg","year":"2016","journal-title":"J. Artif. Intell. Res."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"102","DOI":"10.1109\/MIS.2016.31","article-title":"Affective Computing and Sentiment Analysis","volume":"31","author":"Cambria","year":"2016","journal-title":"IEEE Intelligent Systems"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Vilares, D., Peng, H., Satapathy, R., and Cambria, E. (2018, January 18\u201321). BabelSenticNet: A Commonsense Reasoning Framework for Multilingual Sentiment Analysis. 
Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India.","DOI":"10.1109\/SSCI.2018.8628718"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"499","DOI":"10.1007\/s10462-016-9508-4","article-title":"Multilingual sentiment analysis: From formal to informal and scarce resource languages","volume":"48","author":"Lo","year":"2017","journal-title":"Artif. Intell. Rev."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"261","DOI":"10.1126\/science.aaa8685","article-title":"Advances in natural language processing","volume":"349","author":"Hirschberg","year":"2015","journal-title":"Science"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"436","DOI":"10.1038\/nature14539","article-title":"Deep learning","volume":"521","author":"LeCun","year":"2015","journal-title":"Nature"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"146","DOI":"10.1016\/j.inffus.2017.10.006","article-title":"A survey on deep learning for big data","volume":"42","author":"Zhang","year":"2018","journal-title":"Inf. Fusion"},{"key":"ref_24","unstructured":"(2021, March 12). Coronavirus Tweets NLP\u2014Text Classification. Available online: https:\/\/www.kaggle.com\/datatattle\/covid-19-nlp-text-classification."},{"key":"ref_25","unstructured":"(2021, March 12). Spam Text Message Classification. Available online: https:\/\/www.kaggle.com\/team-ai\/spam-text-message-classification."},{"key":"ref_26","unstructured":"(2021, March 12). Open Food Facts. Available online: https:\/\/www.kaggle.com\/openfoodfacts\/world-food-facts."},{"key":"ref_27","unstructured":"(2021, March 12). Getting Real about Fake News. Available online: https:\/\/www.kaggle.com\/mrisdal\/fake-news."},{"key":"ref_28","unstructured":"(2021, March 12). Credit Card Fraud Detection. Available online: https:\/\/www.kaggle.com\/konradb\/text-recognition-total-text-daset."},{"key":"ref_29","unstructured":"(2021, March 12). 
Kazakhstani and Russian News Corpus. Available online: https:\/\/data.mendeley.com\/datasets\/2vz7vtbhn2\/1."},{"key":"ref_30","unstructured":"(2021, March 12). Kazakhstani News Corpus for Social Significance Identification with Topic Modelling Results. Available online: https:\/\/data.mendeley.com\/datasets\/hwj24p9gkh\/1."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Mukhamediev, R.I., Yakunin, K., Mussabayev, R., Buldybayev, T., Kuchin, Y., Murzakhmetov, S., and Yelis, M. (2020). Classification of Negative Information on Socially Significant Topics in Mass Media. Symmetry, 12.","DOI":"10.3390\/sym12121945"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"60","DOI":"10.17323\/1998-0663.2019.4.60.72","article-title":"The design of the structure of the software system for processing text document corpus","volume":"13","author":"Barakhnin","year":"2019","journal-title":"Bus. Inform."},{"key":"ref_33","unstructured":"Yakunin, K. (2020, September 14). Media Monitoring System. Available online: https:\/\/github.com\/KindYAK\/NLPMonitor."},{"key":"ref_34","first-page":"91","article-title":"Methods for calculating the relevance of text fragments based on thematic models in the problem of automatic annotation","volume":"14","author":"Mashechkin","year":"2013","journal-title":"Comput. Methods Program."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"693","DOI":"10.20537\/2076-7633-2012-4-4-693-706","article-title":"Regularization, robustness and sparseness of probabilistic thematic models","volume":"4","author":"Vorontsov","year":"2012","journal-title":"Comput. Res. Modeling"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"161","DOI":"10.15514\/ISPRAS-2017-29(2)-6","article-title":"A survey and an experimental comparison of methods for text clustering: Application to scientific articles","volume":"29","author":"Parhomenko","year":"2017","journal-title":"Proc. Inst. Syst. Program. 
RAS"},{"key":"ref_37","first-page":"993","article-title":"Latent dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_38","first-page":"15169","article-title":"Latent Dirichlet Allocation (LDA) and Topic modeling: Models, applications, a survey","volume":"78","author":"Hamed","year":"2017","journal-title":"Multimed. Tools Appl."},{"key":"ref_39","unstructured":"Mimno, D., Wallach, H., Talley, E., Leenders, M., and McCallum, A. (2011, January 27\u201331). Optimizing Semantic Coherence in Topic Models. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Scotland, UK."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Vorontsov, K., Frei, O., Apishev, M., Romov, P., and Dudarenko, M. (2015). BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections. International Conference on Analysis of Images, Social Networks and Texts, Springer.","DOI":"10.1007\/978-3-319-26123-2_36"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Saaty, T. (1989). Group Decision Making and the AHP, Springer.","DOI":"10.1007\/978-3-642-50244-6_4"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"341","DOI":"10.1016\/j.renene.2016.10.054","article-title":"Developing a Novel Risk-based Methodology for Multi-Criteria Decision Making in Marine Renewable Energy Applications","volume":"102","author":"Mohammad","year":"2017","journal-title":"Renew. 
Energy"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"122275","DOI":"10.1109\/ACCESS.2019.2937627","article-title":"Multi-Criteria Spatial Decision Making Support System for Renewable Energy Development in Kazakhstan","volume":"7","author":"Mukhamediev","year":"2019","journal-title":"IEEE Access"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"205","DOI":"10.1016\/j.procs.2020.11.022","article-title":"Propaganda Identification Using Topic Modelling","volume":"178","author":"Yakunin","year":"2020","journal-title":"Procedia Comput. Sci."},{"key":"ref_45","first-page":"87","article-title":"Media assessment experiments based on a thematic corpus model","volume":"7","author":"Mukhamediev","year":"2020","journal-title":"Cloud Sci."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Yakunin, K., Mukhamediev, R., Kuchin, Y., Musabayev, R., Buldybayev, T., and Murzakhmetov, S. (2020). Classification of negative publication in mass media using topic modeling. J. Phys. Conf. Ser, in print.","DOI":"10.1088\/1742-6596\/1727\/1\/012019"},{"key":"ref_47","unstructured":"Yakunin, K., Musabaev, R., Yelis, M., and Mukhamediev, R. (2020, January 24\u201325). The topic of energy in news publications. Proceedings of the All-Russian Scientific Conference and the XIII Youth School with International Participation, Moscow, Russia."},{"key":"ref_48","first-page":"5","article-title":"On a method of multimodal media ranking using corpus based topic modelling","volume":"4","author":"Musabaev","year":"2019","journal-title":"Inf. Technol. Manag. Soc."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Yakunin, K., Mukhamediev, R., Mussabayev, R., Buldybayev, T., Kuchin, Y., Murzakhmetov, S., Rassul, Y., and Ospanova, U. (2020). Mass Media Evaluation Using Topic Modelling. 
International Conference on Digital Transformation and Global Society, Springer.","DOI":"10.1007\/978-3-030-65218-0_13"}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/6\/3\/31\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:35:31Z","timestamp":1760160931000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/6\/3\/31"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,3,14]]},"references-count":49,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2021,3]]}},"alternative-id":["data6030031"],"URL":"https:\/\/doi.org\/10.3390\/data6030031","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2021,3,14]]}}}