{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,9]],"date-time":"2026-04-09T01:13:15Z","timestamp":1775697195053,"version":"3.50.1"},"reference-count":47,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2019,1,30]],"date-time":"2019-01-30T00:00:00Z","timestamp":1548806400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>Although topic models have been used to build clusters of documents for more than ten years, there is still a problem of choosing the optimal number of topics. The authors analyzed many fundamental studies undertaken on the subject in recent years. The main problem is the lack of a stable metric of the quality of topics obtained during the construction of the topic model. The authors analyzed the internal metrics of the topic model: coherence, contrast, and purity to determine the optimal number of topics and concluded that they are not applicable to solve this problem. The authors analyzed the approach to choosing the optimal number of topics based on the quality of the clusters. For this purpose, the authors considered the behavior of the cluster validation metrics: the Davies Bouldin index, the silhouette coefficient, and the Calinski-Harabaz index. A new method for determining the optimal number of topics proposed in this paper is based on the following principles: (1) Setting up a topic model with additive regularization (ARTM) to separate noise topics; (2) Using dense vector representation (GloVe, FastText, Word2Vec); (3) Using a cosine measure for the distance in cluster metric that works better than Euclidean distance on vectors with large dimensions. The methodology developed by the authors for obtaining the optimal number of topics was tested on the collection of scientific articles from the OnePetro library, selected by specific themes. The experiment showed that the method proposed by the authors allows assessing the optimal number of topics for the topic model built on a small collection of English documents.<\/jats:p>","DOI":"10.3390\/make1010025","type":"journal-article","created":{"date-parts":[[2019,1,30]],"date-time":"2019-01-30T10:58:27Z","timestamp":1548845907000},"page":"416-426","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":29,"title":["The Number of Topics Optimization: Clustering Approach"],"prefix":"10.3390","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9881-7371","authenticated-orcid":false,"given":"Fedor","family":"Krasnov","sequence":"first","affiliation":[{"name":"Gazpromneft STC, 75-79 Moika River Emb.,  190000 Saint Petersburg, Russia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1949-1642","authenticated-orcid":false,"given":"Anastasiia","family":"Sen","sequence":"additional","affiliation":[{"name":"Faculty of Applied Mathematics and Control Processes, Saint Petersburg State University, 7-9 Universitetskaya Emb., 199034 Saint Petersburg, Russia"}]}],"member":"1968","published-online":{"date-parts":[[2019,1,30]]},"reference":[{"key":"ref_1","first-page":"993","article-title":"Latent Dirichlet Allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Vorontsov, K., Potapenko, A., and Plavin, A. (2015). Additive Regularization of Topic Models for Topic Selection and Sparse Factorization. Statistical Learning and Data Sciences, Springer International Publishing.","DOI":"10.1007\/978-3-319-17091-6_14"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Staab, S., Koltsova, O., and Ignatov, D.I. (2018). A Full-Cycle Methodology for News Topic Modeling and User Feedback Research. Social Informatics, Springer International Publishing.","DOI":"10.1007\/978-3-030-01129-1_19"},{"key":"ref_4","first-page":"264","article-title":"Authorship Attribution with Author-aware Topic Models","volume":"Volume 2","author":"Seroussi","year":"2012","journal-title":"Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"400","DOI":"10.1108\/LHT-06-2017-0132","article-title":"Discovering research topics from library electronic references using latent Dirichlet allocation","volume":"36","author":"Fang","year":"2018","journal-title":"Libr. Hi Tech"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Binkley, D., Heinz, D., Lawrie, D., and Overfelt, J. (June, January 31). Understanding LDA in Source Code Analysis. Proceedings of the 22nd International Conference on Program Comprehension (ICPC 2014), Hyderabad, India.","DOI":"10.1145\/2597008.2597150"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"74","DOI":"10.1016\/j.infsof.2018.02.005","article-title":"What is wrong with topic modeling? And how to fix it using search-based software engineering","volume":"98","author":"Agrawal","year":"2018","journal-title":"Inf. Softw. Technol."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"341","DOI":"10.1023\/A:1008202821328","article-title":"Differential Evolution\u2014A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces","volume":"11","author":"Storn","year":"1997","journal-title":"J. Glob. Optim."},{"key":"ref_9","unstructured":"Asuncion, A., Welling, M., Smyth, P., and Teh, Y.W. (2009, January 18\u201321). On Smoothing and Inference for Topic Models. Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Wallach, H.M., Murray, I., Salakhutdinov, R., and Mimno, D. (2009, January 14\u201318). Evaluation Methods for Topic Models. Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada.","DOI":"10.1145\/1553374.1553515"},{"key":"ref_11","unstructured":"Chang, J., Boyd-Graber, J., Gerrish, S., Wang, C., and Blei, D.M. (2009, January 7\u201310). Reading Tea Leaves: How Humans Interpret Topic Models. Proceedings of the 22nd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Koltcov, S., Koltsova, O., and Nikolenko, S. (2014, January 23\u201326). Latent Dirichlet Allocation: Stability and Applications to Studies of User-generated Content. Proceedings of the 2014 ACM Conference on Web Science, Bloomington, IN, USA.","DOI":"10.1145\/2615569.2615680"},{"key":"ref_13","unstructured":"Mimno, D., and Blei, D. (2011, January 27\u201331). Bayesian Checking for Topic Models. Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK."},{"key":"ref_14","unstructured":"Teh, Y.W., Jordan, M.I., Beal, M.J., and Blei, D.M. (,  2004). Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes. Proceedings of the 17th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"7:1","DOI":"10.1145\/1667053.1667056","article-title":"The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies","volume":"57","author":"Blei","year":"2010","journal-title":"J. ACM"},{"key":"ref_16","unstructured":"Blei, D.M., Jordan, M.I., Griffiths, T.L., and Tenenbaum, J.B. (2003, January 9\u201311). Hierarchical Topic Models and the Nested Chinese Restaurant Process. Proceedings of the 16th International Conference on Neural Information Processing Systems, Whistler, BC, Canada."},{"key":"ref_17","first-page":"2699","article-title":"Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes","volume":"Volume 2","author":"Bryant","year":"2012","journal-title":"Proceedings of the 25th International Conference on Neural Information Processing Systems"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Rossetti, M., Stella, F., and Zanker, M. (2013, January 26\u201329). Towards Explaining Latent Factors with Topic Models in Collaborative Recommender Systems. Proceedings of the 2013 24th International Workshop on Database and Expert Systems Applications, Prague, Czech Republic.","DOI":"10.1109\/DEXA.2013.26"},{"key":"ref_19","unstructured":"Newman, D., Lau, J.H., Grieser, K., and Baldwin, T. (2010, January 2\u20134). Automatic Evaluation of Topic Coherence. Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, CA, USA."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1192","DOI":"10.1016\/j.physa.2018.08.050","article-title":"Application of R\u00e9nyi and Tsallis entropies to topic modeling optimization","volume":"512","author":"Koltcov","year":"2018","journal-title":"Phys. A Stat. Mech. Its Appl."},{"key":"ref_21","unstructured":"Bing, X., Bunea, F., and Wegkamp, M.H. (arXiv, 2018). A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics, arXiv."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"30:31","DOI":"10.1145\/3236386.3241340","article-title":"The Mythos of Model Interpretability","volume":"16","author":"Lipton","year":"2018","journal-title":"Queue"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"382","DOI":"10.1109\/TVCG.2017.2745080","article-title":"Progressive Learning of Topic Modeling Parameters: A Visual Analytics Framework","volume":"24","author":"Sevastjanova","year":"2018","journal-title":"IEEE Trans. Vis. Comput. Graph."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"88","DOI":"10.1177\/0165551515617393","article-title":"Topic modelling for qualitative studies","volume":"43","author":"Nikolenko","year":"2016","journal-title":"J. Inf. Sci."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Batmanghelich, K., Saeedi, A., Narasimhan, K., and Gershman, S. (2016, January 7\u201312). Nonparametric Spherical Topic Modeling with Word Embeddings. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany.","DOI":"10.18653\/v1\/P16-2087"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Law, J., Zhuo, H.H., He, J., and Rong, E. (2018). LTSG: Latent Topical Skip-Gram for Mutually Improving Topic Model and Vector Representations. Pattern Recognition and Computer Vision, Springer International Publishing.","DOI":"10.1007\/978-3-030-03338-5_32"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Das, R., Zaheer, M., and Dyer, C. (2015, January 26\u201331). Gaussian LDA for Topic Models with Word Embeddings. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China.","DOI":"10.3115\/v1\/P15-1077"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"299","DOI":"10.1162\/tacl_a_00140","article-title":"Improving Topic Models with Latent Feature Word Representations","volume":"3","author":"Nguyen","year":"2015","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Mantyla, M.V., Claes, M., and Farooq, U. (2018, January 11\u201312). Measuring LDA Topic Stability from Clusters of Replicated Runs. Proceedings of the 12th ACM\/IEEE International Symposium on Empirical Software Engineering and Measurement, Oulu, Finland.","DOI":"10.1145\/3239235.3267435"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Mehta, V., Caceres, R.S., and Carter, K.M. (2014, January 9\u201312). Evaluating topic quality using model clustering. Proceedings of the 2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Orlando, FL, USA.","DOI":"10.1109\/CIDM.2014.7008665"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"58","DOI":"10.1080\/01969727308546047","article-title":"Cluster Validity with Fuzzy Sets","volume":"3","author":"Bezdek","year":"1973","journal-title":"J. Cybern."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"95","DOI":"10.1080\/01969727408546059","article-title":"Well-Separated Clusters and Optimal Fuzzy Partitions","volume":"4","author":"Dunn","year":"1974","journal-title":"J. Cybern."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"224","DOI":"10.1109\/TPAMI.1979.4766909","article-title":"A Cluster Separation Measure","volume":"1","author":"Davies","year":"1979","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1145\/601858.601862","article-title":"Clustering Validity Checking Methods: Part II","volume":"31","author":"Halkidi","year":"2002","journal-title":"SIGMOD Rec."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"841","DOI":"10.1109\/34.85677","article-title":"A Validity Measure for Fuzzy Clustering","volume":"13","author":"Xie","year":"1991","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1016\/0377-0427(87)90125-7","article-title":"Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis","volume":"20","author":"Rousseeuw","year":"1987","journal-title":"J. Comput. Appl. Math."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C. (2014, January 25\u201329). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1162\/tacl_a_00051","article-title":"Enriching Word Vectors with Subword Information","volume":"5","author":"Bojanowski","year":"2017","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Wu, L.Y., Fisch, A., Chopra, S., Adams, K., Bordes, A., and Weston, J. (2018). StarSpace: Embed All The Things!, AAAI.","DOI":"10.1609\/aaai.v32i1.11996"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Bicalho, P.V., de Oliveira Cunha, T., Mourao, F.H.J., Pappa, G.L., and Meira, W. (2014, January 18\u201322). Generating Cohesive Semantic Topics from Latent Factors. Proceedings of the 2014 Brazilian Conference on Intelligent Systems, Sao Paulo, Brazil.","DOI":"10.1109\/BRACIS.2014.56"},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"230","DOI":"10.1016\/j.infsof.2006.10.017","article-title":"Semantic clustering: Identifying topics in source code","volume":"49","author":"Kuhn","year":"2007","journal-title":"Inf. Softw. Technol."},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Chuang, J., Roberts, M.E., Stewart, B.M., Weiss, R., Tingley, D., Grimmer, J., and Heer, J. (June, January 31). TopicCheck: Interactive Alignment for Assessing Topic Model Stability. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Denver, CO, USA.","DOI":"10.3115\/v1\/N15-1018"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Greene, D., O\u2019Callaghan, D., and Cunningham, P. (2014). How Many Topics? Stability Analysis for Topic Models. Machine Learning and Knowledge Discovery in Databases, Springer.","DOI":"10.1007\/978-3-662-44848-9_32"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Koltcov, S., Nikolenko, S.I., Koltsova, O., Filippov, V., and Bodrunova, S. (2016). Stable Topic Modeling with Local Density Regularization. Internet Science, Springer International Publishing.","DOI":"10.1145\/2908131.2908184"},{"key":"ref_45","first-page":"7","article-title":"Exploration of Hidden Research Directions in Oil and Gas Industry via Full Text Analysis of OnePetro Digital Library","volume":"6","author":"Krasnov","year":"2018","journal-title":"Int. J. Open Inf. Technol."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"277","DOI":"10.1111\/j.1745-3984.2003.tb01108.x","article-title":"Modern Multidimensional Scaling: Theory and Applications","volume":"40","author":"Borg","year":"2003","journal-title":"J. Educ. Meas."},{"key":"ref_47","first-page":"1","article-title":"A dendrite method for cluster analysis","volume":"3","author":"Calinski","year":"1974","journal-title":"Commun. Stat."}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/1\/1\/25\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T12:29:45Z","timestamp":1760185785000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/1\/1\/25"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,1,30]]},"references-count":47,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2019,3]]}},"alternative-id":["make1010025"],"URL":"https:\/\/doi.org\/10.3390\/make1010025","relation":{},"ISSN":["2504-4990"],"issn-type":[{"value":"2504-4990","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,1,30]]}}}