{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,12]],"date-time":"2026-02-12T08:25:44Z","timestamp":1770884744666,"version":"3.50.1"},"reference-count":32,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,12,8]],"date-time":"2025-12-08T00:00:00Z","timestamp":1765152000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"\u201cImplementation of cutting-edge research and its application as part of the Scientific Center of Excellence for Quantum and Complex Systems, and Representations of Lie Algebras\u201d","award":["PK.1.1.10.0004"],"award-info":[{"award-number":["PK.1.1.10.0004"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["MAKE"],"abstract":"<jats:p>Topic modeling is a fundamental technique in natural language processing used to uncover latent themes in large text corpora, yet existing approaches struggle to jointly achieve interpretability, semantic coherence, and scalability. Classical probabilistic models such as LDA and NMF rely on bag-of-words assumptions that obscure contextual meaning, while embedding-based methods (e.g., BERTopic, Top2Vec) improve coherence at the expense of diversity and stability. Prompt-based frameworks (e.g., TopicGPT) enhance interpretability but remain sensitive to prompt design and are computationally costly on large datasets. This study introduces VISTA (Vector-Similarity Topic Analysis), a multi-view, hierarchical, and interpretable framework that integrates complementary document embeddings, mutual-nearest-neighbor hierarchical clustering with selective dimension analysis, and large language model (LLM)-based topic labeling enforcing hierarchical coherence. Experiments on three heterogeneous corpora\u2014BBC News, BillSum, and a mixed U.S. Government agency news + Twitter dataset\u2014show that VISTA consistently ranks among the top-performing models, achieving the highest C_UCI coherence and a strong balance between topic diversity and semantic consistency. Qualitative analyses confirm that VISTA identifies domain-relevant themes overlooked by probabilistic or prompt-based models. Overall, VISTA provides a scalable, semantically robust, and interpretable framework for topic discovery, bridging probabilistic, embedding-based, and LLM-driven paradigms in a unified and reproducible design.<\/jats:p>","DOI":"10.3390\/make7040162","type":"journal-article","created":{"date-parts":[[2025,12,8]],"date-time":"2025-12-08T16:41:41Z","timestamp":1765212101000},"page":"162","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["VISTA: A Multi-View, Hierarchical, and Interpretable Framework for Robust Topic Modelling"],"prefix":"10.3390","volume":"7","author":[{"given":"Tvrtko","family":"Glun\u010di\u0107","sequence":"first","affiliation":[{"name":"Faculty of Science, University of Zagreb, 10000 Zagreb, Croatia"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-1757-8028","authenticated-orcid":false,"given":"Domjan","family":"Bari\u0107","sequence":"additional","affiliation":[{"name":"Faculty of Science, University of Zagreb, 10000 Zagreb, Croatia"},{"name":"Aras\u2122 Digital Products, 10000 Zagreb, Croatia"}]},{"given":"Matko","family":"Glun\u010di\u0107","sequence":"additional","affiliation":[{"name":"Faculty of Science, University of Zagreb, 10000 Zagreb, Croatia"}]}],"member":"1968","published-online":{"date-parts":[[2025,12,8]]},"reference":[{"key":"ref_1","first-page":"993","article-title":"Latent Dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"788","DOI":"10.1038\/44565","article-title":"Learning the parts of objects by non-negative matrix factorization","volume":"401","author":"Lee","year":"1999","journal-title":"Nature"},{"key":"ref_3","first-page":"288","article-title":"Reading Tea Leaves: How Humans Interpret Topic Models","volume":"22","author":"Chang","year":"2009","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"143","DOI":"10.1561\/1500000030","article-title":"Applications of Topic Models","volume":"11","author":"Hu","year":"2017","journal-title":"Found. Trends\u00ae Inf. Retr."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"1775","DOI":"10.1016\/j.neucom.2008.06.011","article-title":"A density-based method for adaptive LDA model selection","volume":"72","author":"Cao","year":"2009","journal-title":"Neurocomputing"},{"key":"ref_6","unstructured":"Srivastava, A., and Sutton, C. (2017). Autoencoding Variational Inference for Topic Models. arXiv."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Sia, S., Dalmia, A., and Mielke, S.J. (2020). Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!. arXiv.","DOI":"10.18653\/v1\/2020.emnlp-main.135"},{"key":"ref_8","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C. (2014, January 25\u201329). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_10","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv.","DOI":"10.18653\/v1\/N18-1202"},{"key":"ref_12","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI Blog"},{"key":"ref_13","unstructured":"Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"439","DOI":"10.1162\/tacl_a_00325","article-title":"Topic modeling in embedding spaces","volume":"8","author":"Dieng","year":"2020","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Meng, Y., Huang, J.X., Wang, G.Y., Wang, Z.H., Zhang, C., Zhang, Y., Han, J.W., and Assoc Comp, M. (2020, January 20\u201324). Discriminative Topic Mining via Category-Name Guided Text Embedding. Proceedings of the 29th World Wide Web Conference (WWW), Taipei, Taiwan.","DOI":"10.1145\/3366423.3380278"},{"key":"ref_16","unstructured":"Thompson, L., and Mimno, D. (2020). Topic Modeling with Contextualized Word Representation Clusters. arXiv."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Pham, C.M., Hoyle, A., Sun, S., Resnik, P., and Iyyer, M. (2024). TopicGPT: A Prompt-based Topic Modeling Framework. arXiv.","DOI":"10.18653\/v1\/2024.naacl-long.164"},{"key":"ref_18","unstructured":"Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., and Askell, A. (2020). Language Models are Few-Shot Learners. arXiv."},{"key":"ref_19","unstructured":"OpenAi, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., and Altman, S. (2024). GPT-4 Technical Report. arXiv."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Wang, Z., Shang, J., and Zhong, R. (2023, January 6\u201310). Goal-Driven Explainable Clustering via Language Descriptions. Proceedings of the 2023 Conference on Empirical Methods in Natural Language, Singapore.","DOI":"10.18653\/v1\/2023.emnlp-main.657"},{"key":"ref_21","unstructured":"Xu, C., Tao, D., and Xu, C. (2013). A Survey on Multi-view Learning. arXiv."},{"key":"ref_22","unstructured":"Martin, F., and Johnson, M. (2015, January 8\u20139). More Efficient Topic Modelling Through a Noun Only Approach. Proceedings of the Australasian Language Technology Association Workshop 2015, Parramatta, Australia."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., and Li, X. (2011). Comparing Twitter and Traditional Media Using Topic Models. Advances in Information Retrieval, Proceedings of the 33rd European Conference on IR Resarch, ECIR 2011, Dublin, Ireland, 18\u201321 April 2011, Springer.","DOI":"10.1007\/978-3-642-20161-5_34"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"421","DOI":"10.1038\/nbt.4091","article-title":"Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors","volume":"36","author":"Haghverdi","year":"2018","journal-title":"Nat. Biotechnol."},{"key":"ref_25","first-page":"845","article-title":"Feature Selection for Unsupervised Learning","volume":"5","author":"Dy","year":"2004","journal-title":"J. Mach. Learn. Res."},{"key":"ref_26","unstructured":"Van den Bussche, J., and Vianu, V. (2001). On the Surprising Behavior of Distance Metrics in High Dimensional Space. Database Theory, Proceedings of the 8th International Conference London, UK, 4\u20136 January 2001, Springer."},{"key":"ref_27","unstructured":"Niu, L.-Q., and Dai, X.-Y. (2015). Topic2Vec: Learning Distributed Representations of Topics. arXiv."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"R\u00f6der, M., Both, A., and Hinneburg, A. (2015, January 2\u20136). Exploring the Space of Topic Coherence Measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM), Shanghai, China.","DOI":"10.1145\/2684822.2685324"},{"key":"ref_29","unstructured":"Mimno, D., Wallach, H., Talley, E., Leenders, M., and McCallum, A. (2011, January 27\u201331). Optimizing Semantic Coherence in Topic Models. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, Edinburgh, UK."},{"key":"ref_30","unstructured":"Bouma, G.J. (October, January 30). Normalized (pointwise) Mutual Information in Collocation Extraction. Proceedings of the Biennial GSCL Conference 2009, Postdam, Germany."},{"key":"ref_31","unstructured":"Newman, D., Lau, J.H., Grieser, K., and Baldwin, T. (2010, January 2\u20134). Automatic evaluation of topic coherence. Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, CA, USA."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"391","DOI":"10.1002\/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9","article-title":"Indexing By Latent Semantic Analysis","volume":"41","author":"Deerwester","year":"1990","journal-title":"J. Am. Soc. Inf. Sci."}],"container-title":["Machine Learning and Knowledge Extraction"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-4990\/7\/4\/162\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,8]],"date-time":"2025-12-08T16:53:21Z","timestamp":1765212801000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-4990\/7\/4\/162"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,8]]},"references-count":32,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["make7040162"],"URL":"https:\/\/doi.org\/10.3390\/make7040162","relation":{},"ISSN":["2504-4990"],"issn-type":[{"value":"2504-4990","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,8]]}}}