{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T01:58:35Z","timestamp":1760234315625,"version":"build-2065373602"},"reference-count":30,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2021,5,10]],"date-time":"2021-05-10T00:00:00Z","timestamp":1620604800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"the Committee of Science of Ministry of Education and Science of the Republic of Kazakhstan","award":["AP09259324"],"award-info":[{"award-number":["AP09259324"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Symmetry"],"abstract":"<jats:p>This article presents a new conceptual approach to the interpretative topic modeling problem. It uses sentences as the basic units of analysis, instead of the words or n-grams commonly used in standard approaches. The specifics of the proposed approach are the use of sentence probability evaluations within the text corpus and the clustering of sentence embeddings. The topic model estimates discrete distributions of sentence occurrences within topics and discrete distributions of topic occurrences within the text. Our approach makes explicit interpretation of topics possible, since sentences, unlike words, are more informative and contain complete grammatical and semantic constructions. A method for automatic topic labeling is also provided. Contextual embeddings based on the BERT model are used to obtain the corresponding sentence embeddings for subsequent analysis. Moreover, our approach supports big data processing and shows the possibility of combining internal and external knowledge sources in the process of topic modeling. 
The internal knowledge source is the text corpus itself, which is often the only knowledge source in traditional topic modeling approaches. The external knowledge source is BERT, a machine learning model pretrained on a huge amount of textual data and used here to generate context-dependent sentence embeddings.<\/jats:p>","DOI":"10.3390\/sym13050837","type":"journal-article","created":{"date-parts":[[2021,5,10]],"date-time":"2021-05-10T05:30:08Z","timestamp":1620624608000},"page":"837","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["A New Sentence-Based Interpretative Topic Modeling and Automatic Topic Labeling"],"prefix":"10.3390","volume":"13","author":[{"given":"Olzhas","family":"Kozbagarov","sequence":"first","affiliation":[{"name":"Institute of Information and Computational Technologies, Pushkin Str., 125, Almaty 050010, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7283-5144","authenticated-orcid":false,"given":"Rustam","family":"Mussabayev","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies, Pushkin Str., 125, Almaty 050010, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6655-0409","authenticated-orcid":false,"given":"Nenad","family":"Mladenovic","sequence":"additional","affiliation":[{"name":"Khalifa University, Abu Dhabi 41009, United Arab Emirates"}]}],"member":"1968","published-online":{"date-parts":[[2021,5,10]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1145\/2133806.2133826","article-title":"Probabilistic topic models","volume":"55","author":"Blei","year":"2012","journal-title":"Commun. ACM"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"143","DOI":"10.1561\/1500000030","article-title":"Applications of topic models","volume":"11","author":"Hu","year":"2017","journal-title":"Found. 
Trends Inf. Retr."},{"key":"ref_3","first-page":"327","article-title":"Topic modeling in marketing: Recent advances and research opportunities","volume":"89","author":"Reisenbichler","year":"2019","journal-title":"J. Bus. Econ."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1608","DOI":"10.1186\/s40064-016-3252-8","article-title":"An overview of topic modeling and its current applications in bioinformatics","volume":"5","author":"Liu","year":"2019","journal-title":"SpringerPlus"},{"key":"ref_5","unstructured":"Yanina, A., Golitsyn, L., and Vorontsov, K. (2017, January 20\u201323). Multi-objective topic modeling for exploratory search in tech news. Proceedings of the Communications in Computer and Information Science, vol 789. AINL-6: Artificial Intelligence and Natural Language Conference, St. Petersburg, Russia."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Mukhamediev, R., Yakunin, K., Mussabayev, R., Buldybayev, T., Kuchin, Y., Murzakhmetov, S., and Yelis, M. (2020). Classification of Negative Information on Socially Significant Topics in Mass Media. Symmetry, 12.","DOI":"10.3390\/sym12121945"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"205","DOI":"10.1016\/j.procs.2020.11.022","article-title":"Propaganda Identification Using Topic Modeling","volume":"178","author":"Yakunin","year":"2020","journal-title":"Procedia Comput. Sci."},{"key":"ref_8","first-page":"165","article-title":"Mass Media Evaluation Using Topic Modeling","volume":"1242","author":"Yakunin","year":"2020","journal-title":"Commun. Comput. Inf. Sci."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Cristani, M., Tomazzoli, C., and Olivieri, F. (2016, January 24\u201326). Semantic social network analysis foresees message flows. Proceedings of the 8th International Conference on Agents and Artificial Intelligence, ICAART, Roma, Italy.","DOI":"10.5220\/0005832902960303"},{"key":"ref_10","unstructured":"Hofmann, T. (August, January 30). 
Probabilistic latent semantic analysis. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence\u2014UAI, Stockholm, Sweden."},{"key":"ref_11","first-page":"993","article-title":"Latent Dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Apishev, M., and Vorontsov, K. (2020, January 23\u201325). Learning topic models with arbitrary loss. Proceedings of the 26th Conference of FRUCT (Finnish-Russian University Cooperation in Telecommunications) Association, Yaroslavl, Russia.","DOI":"10.23919\/FRUCT48808.2020.9087559"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Kochedykov, D., Apishev, M., Golitsyn, L., and Vorontsov, K. (2017, January 6\u201310). Fast and modular regularized topic modeling. Proceedings of the 21st Conference of FRUCT (Finnish-Russian University Cooperation in Telecommunications) Association, Helsinki, Finland.","DOI":"10.23919\/FRUCT.2017.8250181"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Ianina, A., and Vorontsov, K. (2019, January 5\u20138). Regularized multimodal hierarchical topic model for document-by-document exploratory search. Proceedings of the 25th Conference of FRUCT (Finnish-Russian University Cooperation in Telecommunications) Association, Helsinki, Finland.","DOI":"10.23919\/FRUCT48121.2019.8981493"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Pagliardini, M., Gupta, P., and Jaggi, M. (2017). Unsupervised learning of sentence embeddings using compositional n-gram features. arXiv.","DOI":"10.18653\/v1\/N18-1049"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Balikas, G., Amini, M., and Clausel, M. (2016, January 17\u201321). On a topic model for sentences. 
Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy.","DOI":"10.1145\/2911451.2914714"},{"key":"ref_17","unstructured":"Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"842","DOI":"10.1162\/tacl_a_00349","article-title":"A primer in BERTology: What we know about how BERT works","volume":"8","author":"Rogers","year":"2020","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_19","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013, January 5\u20138). Distributed Representations of Words and Phrases and their Compositionality. Proceedings of the Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_20","unstructured":"Wiedemann, G., Remus, S., Chawla, A., and Biemann, C. (2019, January 9\u201311). Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings. Proceedings of the Konferenz zur Verarbeitung nat\u00fcrlicher Sprache\/Conference on Natural Language Processing (KONVENS), Erlangen, Germany."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018, January 1\u20136). Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana.","DOI":"10.18653\/v1\/N18-1202"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Howard, J., and Ruder, S. (2018, January 15\u201320). Universal Language Model Fine-tuning for Text Classification. 
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia.","DOI":"10.18653\/v1\/P18-1031"},{"key":"ref_23","unstructured":"Bhatia, S., Lau, J., and Baldwin, T. (2016, January 11\u201316). Automatic labeling of topics with neural embeddings. Proceedings of the 26th COLING International Conference on Computational Linguistics, Osaka, Japan."},{"key":"ref_24","unstructured":"(2021, April 12). News Aggregator Dataset. Available online: https:\/\/archive.ics.uci.edu\/ml\/datasets\/News+Aggregator."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"502","DOI":"10.1007\/s10618-016-0482-x","article-title":"Modeling user interests from web browsing activities","volume":"31","author":"Gasparetti","year":"2017","journal-title":"Data Min. Knowl. Discov."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"405","DOI":"10.1016\/S0031-3203(99)00216-2","article-title":"J-Means: A new local search heuristic for minimum sum of squares clustering","volume":"34","author":"Hansen","year":"2001","journal-title":"Pattern Recognit."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"569","DOI":"10.1016\/j.patcog.2018.12.022","article-title":"HG-means: A scalable hybrid genetic algorithm for minimum sum of squares clustering","volume":"88","author":"Gribel","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"409","DOI":"10.1007\/978-3-030-58657-7_32","article-title":"Decomposition\/Aggregation K-means for Big Data","volume":"Volume 1275","author":"Krassovitskiy","year":"2020","journal-title":"International Conference on Mathematical Optimization Theory and Operations Research (MOTOR 2020)"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"95","DOI":"10.1016\/j.patcog.2019.04.014","article-title":"How much can k-means be improved by using better initialization and 
repeats?","volume":"93","author":"Franti","year":"2019","journal-title":"Pattern Recognit."},{"key":"ref_30","unstructured":"Arthur, D., and Vassilvitskii, S. (2007, January 7\u20139). K-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA."}],"container-title":["Symmetry"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-8994\/13\/5\/837\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:58:37Z","timestamp":1760162317000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-8994\/13\/5\/837"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,10]]},"references-count":30,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2021,5]]}},"alternative-id":["sym13050837"],"URL":"https:\/\/doi.org\/10.3390\/sym13050837","relation":{},"ISSN":["2073-8994"],"issn-type":[{"type":"electronic","value":"2073-8994"}],"subject":[],"published":{"date-parts":[[2021,5,10]]}}}