{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T23:34:38Z","timestamp":1775345678021,"version":"3.50.1"},"reference-count":45,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,4,10]],"date-time":"2025-04-10T00:00:00Z","timestamp":1744243200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>The vast amount of social media and web data offers valuable insights for purposes such as brand reputation management, topic research, competitive analysis, product development, and public opinion surveys. However, analysing these data to identify patterns and extract valuable insights is challenging due to the vast number of posts, which can number in the thousands within a single day. One practical approach is topic clustering, which creates clusters of mentions that refer to a specific topic. Following this process will create several manageable clusters, each containing hundreds or thousands of posts. These clusters offer a more meaningful overview of the discussed topics, eliminating the need to categorise each post manually. Several topic detection algorithms can achieve clustering of posts, such as LDA, NMF, BERTopic, etc. The existing algorithms, however, have several important drawbacks, including language constraints and slow or resource-intensive data processing. Moreover, the labels for the clusters typically consist of a few keywords that may not make sense unless one explores the mentions within the cluster. Recently, with the introduction of AI large language models, such as GPT-4, new techniques can be realised for topic clustering to address the aforementioned issues. Our novel approach (AI Mention Clustering) employs LLMs at its core to produce an algorithm for efficient and accurate topic clustering of web and social data. Our solution was tested on social and web data and compared to the popular existing algorithm of BERTopic, demonstrating superior resource efficiency and absolute accuracy of clustered documents. Furthermore, it produces summaries of the clusters that are easily understood by humans instead of just representative keywords. This approach enhances the productivity of social and web data researchers by providing more meaningful and interpretable results.<\/jats:p>","DOI":"10.3390\/computers14040142","type":"journal-article","created":{"date-parts":[[2025,4,10]],"date-time":"2025-04-10T05:28:07Z","timestamp":1744262887000},"page":"142","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["An Innovative Approach to Topic Clustering for Social Media and Web Data Using AI"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-7196-4686","authenticated-orcid":false,"given":"Ioannis","family":"Kapantaidakis","sequence":"first","affiliation":[{"name":"Department of Management Science and Technology, Hellenic Mediterranean University, 72100 Agios Nikolaos, Greece"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9974-0820","authenticated-orcid":false,"given":"Emmanouil","family":"Perakakis","sequence":"additional","affiliation":[{"name":"Department of Management Science and Technology, Hellenic Mediterranean University, 72100 Agios Nikolaos, Greece"},{"name":"Mentionlytics Ltd., 20\u201322 Wenlock Road, London N1 7GU, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6733-5652","authenticated-orcid":false,"given":"George","family":"Mastorakis","sequence":"additional","affiliation":[{"name":"Department of Management Science and Technology, Hellenic Mediterranean University, 72100 Agios Nikolaos, Greece"},{"name":"Mentionlytics Ltd., 20\u201322 Wenlock Road, London N1 7GU, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ioannis","family":"Kopanakis","sequence":"additional","affiliation":[{"name":"Department of Management Science and Technology, Hellenic Mediterranean University, 72100 Agios Nikolaos, Greece"},{"name":"Mentionlytics Ltd., 20\u201322 Wenlock Road, London N1 7GU, UK"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,4,10]]},"reference":[{"key":"ref_1","first-page":"186","article-title":"A cloud-based big data sentiment analysis application for enterprises\u2019 brand monitoring in social media streams","volume":"2","author":"Tedeschi","year":"2015","journal-title":"Proc. IEEE RSI Conf. Robot. Mechatron."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Perakakis, E., Mastorakis, G., and Kopanakis, I. (2019). Social Media Monitoring: An Innovative Intelligent Approach. Designs, 3.","DOI":"10.3390\/designs3020024"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"3056","DOI":"10.1177\/1460458220962652","article-title":"The social life of COVID-19: Early insights from social media monitoring data collected in Poland","volume":"26","author":"Bartosiewicz","year":"2020","journal-title":"Health Inform. J."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1080\/00913367.2020.1809576","article-title":"Can Social Media Listening Platforms\u2019 Artificial Intelligence Be Trusted? Examining the Accuracy of Crimson Hexagon\u2019s (Now Brandwatch Consumer Research\u2019s) AI-Driven Analyses","volume":"50","author":"Hayes","year":"2020","journal-title":"J. Advert."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"23","DOI":"10.4236\/jcc.2016.413003","article-title":"Statistical Analysis of Network-Based Issues and Their Impact on Social Computing Practices in Pakistan","volume":"4","author":"Hussain","year":"2016","journal-title":"J. Comput. Commun."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"566","DOI":"10.1016\/j.inffus.2022.11.017","article-title":"A survey on cross-media search based on user intention understanding in social networks","volume":"91","author":"Shi","year":"2022","journal-title":"Inf. Fusion"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"899","DOI":"10.25300\/MISQ\/2023\/17381","article-title":"Timely, Granular, and Actionable: Designing a Social Listening Platform for Public Health 3.0","volume":"48","author":"Kitchens","year":"2024","journal-title":"MIS Q."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"1795","DOI":"10.1109\/TPAMI.2009.203","article-title":"Keep It Simple with Time: A Reexamination of Probabilistic Topic Detection Models","volume":"32","author":"He","year":"2010","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"3858","DOI":"10.1109\/ACCESS.2020.3047458","article-title":"Topic Detection and Tracking Based on Windowed DBSCAN and Parallel KNN","volume":"9","author":"Li","year":"2020","journal-title":"IEEE Access"},{"key":"ref_10","unstructured":"Ahmed, A., Ho, Q., Smola, A.J., Teo, C.H., Xing, E., and Eisenstein, J. (2011, January 21\u201324). Unified analysis of streaming news. Proceedings of the 2011 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA."},{"key":"ref_11","unstructured":"Lu, Q., Conrad, J.G., Al-Kofahi, K., and Keenan, W. (2011, January 5\u20138). Legal document clustering with built-in topic segmentation. Proceedings of the Fifth International Conference on Statistical Data Analysis Based on the L1-Norm and Related Methods, Shanghai, China."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"e87","DOI":"10.7717\/peerj-cs.87","article-title":"OSoMe: The IUNI Observatory on Social Media","volume":"2","author":"Davis","year":"2016","journal-title":"PeerJ Comput. Sci."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"246","DOI":"10.1109\/TLT.2013.2296520","article-title":"Mining Social Media Data for Understanding Students\u2019 Learning Experiences","volume":"7","author":"Chen","year":"2014","journal-title":"IEEE Trans. Learn. Technol."},{"key":"ref_14","first-page":"993","article-title":"Latent Dirichlet Allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"788","DOI":"10.1038\/44565","article-title":"Learning the Parts of Objects by Non-negative Matrix Factorization","volume":"401","author":"Lee","year":"1999","journal-title":"Nature"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"417","DOI":"10.1002\/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9","article-title":"Indexing by latent semantic analysis","volume":"41","author":"Deerwester","year":"1990","journal-title":"J. Am. Soc. Inf. Sci."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Zhou, K., and Yang, Q. (2018, January 25\u201327). LDA-PSTR: A Topic Modeling Method for Short Text. Proceedings of the 2018 International Conference on Big Data Analysis, Beijing, China.","DOI":"10.1007\/978-3-030-05090-0_29"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1002\/meet.14504901385","article-title":"Enriching Text Representation with Frequent Pattern Mining for Probabilistic Topic Modeling","volume":"49","author":"Kim","year":"2012","journal-title":"Proc. Am. Soc. Inf. Sci. Technol."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"21","DOI":"10.5121\/acij.2011.2603","article-title":"Improving Text Categorization By Using A Topic Model","volume":"2","author":"Sriurai","year":"2011","journal-title":"Adv. Comput. Int. J."},{"key":"ref_20","unstructured":"Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based embedding model. arXiv."},{"key":"ref_21","unstructured":"Angelov, D. (2020). Top2Vec: Distributed Representations of Topics. arXiv."},{"key":"ref_22","unstructured":"Milios, E., and Zhang, X. (2023). MPTopic: Improving Topic Modeling via Masked Permuted Pre-training. arXiv."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Zhang, Y., Wang, Z., and Shang, J. (2023, January 7\u201311). ClusterLLM: Large Language Models as a Guide for Text Clustering. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), Singapore.","DOI":"10.18653\/v1\/2023.emnlp-main.858"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Viswanathan, V., Gashteovski, K., Lawrence, C., Wu, T., and Neubig, G. (2023). Large Language Models Enable Few-Shot Clustering. arXiv.","DOI":"10.1162\/tacl_a_00648"},{"key":"ref_25","unstructured":"Mu, Y., Dong, C., Bontcheva, K., and Song, X. (2024). Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling. arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Miller, J.K., and Alexander, T.J. (2024). Human-Interpretable Clustering of Short-Text Using Large Language Models. arXiv.","DOI":"10.1098\/rsos.241692"},{"key":"ref_27","unstructured":"OpenAI (2023). GPT-4 Technical Report. arXiv."},{"key":"ref_28","unstructured":"Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., Van Den Driessche, G.B., Lespiau, J.B., Damoc, B., and Clark, A. (2022, January 17\u201323). Improving Language Models by Retrieving from Trillions of Tokens. Proceedings of the International Conference on Machine Learning (ICML 2022), Baltimore, MD, USA."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"18553","DOI":"10.1007\/s00521-023-08680-0","article-title":"HierMDS: A hierarchical multi-document summarization model with global\u2013local document dependencies","volume":"35","author":"Li","year":"2023","journal-title":"Neural Comput. Appl."},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Moro, G., Ragazzi, L., Valgimigli, L., Frisoni, G., Sartori, C., and Marfia, G. (2022). Efficient Memory-Enhanced Transformer for Long-Document Summarization in Low-Resource Regimes. Sensors, 23.","DOI":"10.3390\/s23073542"},{"key":"ref_31","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013, January 5\u201310). Distributed Representations of Words and Phrases and Their Compositionality. Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NeurIPS 2013), Lake Tahoe, NV, USA."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1016\/0306-4573(88)90021-0","article-title":"Term-weighting Approaches in Automatic Text Retrieval","volume":"24","author":"Salton","year":"1988","journal-title":"Inf. Process. Manag."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Reimers, N., and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019), Hong Kong, China, 3\u20137 November 2019, Association for Computational Linguistics.","DOI":"10.18653\/v1\/D19-1410"},{"key":"ref_34","unstructured":"DBSCAN (Density-Based Spatial Clustering of Applications with Noise), Ester, M., Kriegel, H.P., Sander, J., and Xu, X. (1996, January 2\u20134). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA."},{"key":"ref_35","first-page":"281","article-title":"Some Methods for Classification and Analysis of Multivariate Observations","volume":"Volume 1","author":"MacQueen","year":"1967","journal-title":"Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Ankerst, M., Breunig, M.M., Kriegel, H.P., and Sander, J. (1999, January 1\u20133). OPTICS: Ordering Points to Identify the Clustering Structure. Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, USA.","DOI":"10.1145\/304182.304187"},{"key":"ref_37","unstructured":"Kaufman, L., and Rousseeuw, P.J. (2005). Finding Groups in Data: An Introduction to Cluster Analysis, Wiley."},{"key":"ref_38","unstructured":"(2025, March 15). Mentionlytics [Computer Software]. Available online: https:\/\/www.mentionlytics.com."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"224","DOI":"10.1109\/TPAMI.1979.4766909","article-title":"A Cluster Separation Measure","volume":"PAMI-1","author":"Davies","year":"1979","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_40","unstructured":"Newman, D., Lau, J.H., Grieser, K., and Baldwin, T. (2010, January 1\u20136). Automatic Evaluation of Topic Coherence. Proceedings of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2010), Los Angeles, CA, USA."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"R\u00f6der, M., Both, A., and Hinneburg, A. (2015, January 2\u20136). Exploring the Space of Topic Coherence Measures. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM 2015), Shanghai, China.","DOI":"10.1145\/2684822.2685324"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Bianchi, F., Terragni, S., Hovy, D., Nozza, D., and Fersini, E. (2021, January 19\u201323). Cross-lingual Contextualized Topic Models with Zero-shot Learning. Proceedings of the 2021 European Chapter of the Association for Computational Linguistics (EACL 2021), Online. Available online: https:\/\/aclanthology.org\/2021.eacl-main.9\/.","DOI":"10.18653\/v1\/2021.eacl-main.143"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching Word Vectors with Subword Information. arXiv.","DOI":"10.1162\/tacl_a_00051"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Allaoui, M., Kherfi, M.L., and Cheriet, A. (2020, January 6\u20138). Considerably Improving Clustering Algorithms Using UMAP Dimensionality Reduction Technique: A Comparative Study. Proceedings of the 2020 International Conference on Machine Learning and Data Science, Singapore.","DOI":"10.1007\/978-3-030-51935-3_34"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Luo, G., Luo, X., Tian, L., Gooch, T.F., and Qin, K. (2016, January 4\u20136). A Parallel DBSCAN Algorithm Based on Spark. Proceedings of the 2016 IEEE International Conference on Big Data and Cloud Computing, Beijing, China.","DOI":"10.1109\/BDCloud-SocialCom-SustainCom.2016.85"}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/4\/142\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T17:11:57Z","timestamp":1760029917000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/4\/142"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4,10]]},"references-count":45,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,4]]}},"alternative-id":["computers14040142"],"URL":"https:\/\/doi.org\/10.3390\/computers14040142","relation":{},"ISSN":["2073-431X"],"issn-type":[{"value":"2073-431X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,4,10]]}}}