{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,14]],"date-time":"2025-10-14T07:17:06Z","timestamp":1760426226276,"version":"build-2065373602"},"reference-count":55,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2024,4,25]],"date-time":"2024-04-25T00:00:00Z","timestamp":1714003200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Economic and Social Research Council (ESRC), grant \u2018Discribe\u2014Digital Security by Design (DSbD) Programme\u2019","award":["REF ES\/V003666\/1"],"award-info":[{"award-number":["REF ES\/V003666\/1"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Topic modelling is a text mining technique for identifying salient themes from a number of documents. The output is commonly a set of topics consisting of isolated tokens that often co-occur in such documents. Manual effort is often associated with interpreting a topic\u2019s description from such tokens. However, from a human\u2019s perspective, such outputs may not adequately provide enough information to infer the meaning of the topics; thus, their interpretability is often inaccurately understood. Although several studies have attempted to automatically extend topic descriptions as a means of enhancing the interpretation of topic models, they rely on external language sources that may become unavailable, must be kept up to date to generate relevant results, and present privacy issues when training on or processing data. This paper presents a novel approach towards extending the output of traditional topic modelling methods beyond a list of isolated tokens. This approach removes the dependence on external sources by using the textual data themselves by extracting high-scoring keywords and mapping them to the topic model\u2019s token outputs. To compare how the proposed method benchmarks against the state of the art, a comparative analysis against results produced by Large Language Models (LLMs) is presented. Such results report that the proposed method resonates with the thematic coverage found in LLMs and often surpasses such models by bridging the gap between broad thematic elements and granular details. In addition, to demonstrate and reinforce the generalisation of the proposed method, the approach was further evaluated using two other topic modelling methods as the underlying models and when using a heterogeneous unseen dataset. To measure the interpretability of the proposed outputs against those of the traditional topic modelling approach, independent annotators manually scored each output based on their quality and usefulness as well as the efficiency of the annotation task. The proposed approach demonstrated higher quality and usefulness, as well as higher efficiency in the annotation task, in comparison to the outputs of a traditional topic modelling method, demonstrating an increase in their interpretability.<\/jats:p>","DOI":"10.3390\/bdcc8050044","type":"journal-article","created":{"date-parts":[[2024,4,25]],"date-time":"2024-04-25T08:08:32Z","timestamp":1714032512000},"page":"44","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Topic Modelling: Going beyond Token Outputs"],"prefix":"10.3390","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3794-6145","authenticated-orcid":false,"given":"Lowri","family":"Williams","sequence":"first","affiliation":[{"name":"School of Computer Science & Informatics, Cardiff University, Cardiff CF24 4AG, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5274-0727","authenticated-orcid":false,"given":"Eirini","family":"Anthi","sequence":"additional","affiliation":[{"name":"School of Computer Science & Informatics, Cardiff University, Cardiff CF24 4AG, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2517-845X","authenticated-orcid":false,"given":"Laura","family":"Arman","sequence":"additional","affiliation":[{"name":"School of Social Sciences, Cardiff University, Cardiff CF10 3NN, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0396-633X","authenticated-orcid":false,"given":"Pete","family":"Burnap","sequence":"additional","affiliation":[{"name":"School of Computer Science & Informatics, Cardiff University, Cardiff CF24 4AG, UK"}]}],"member":"1968","published-online":{"date-parts":[[2024,4,25]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Bakshy, E., Rosenn, I., Marlow, C., and Adamic, L. (2012, January 16\u201320). The role of social networks in information diffusion. Proceedings of the 21st International Conference on World Wide Web, Lyon, France.","DOI":"10.1145\/2187836.2187907"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Kang, H.J., Kim, C., and Kang, K. (2019). Analysis of the trends in biochemical research using Latent Dirichlet Allocation (LDA). Processes, 7.","DOI":"10.3390\/pr7060379"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"102034","DOI":"10.1016\/j.ipm.2019.04.002","article-title":"An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit","volume":"57","author":"Curiskis","year":"2020","journal-title":"Inf. Process. Manag."},{"key":"ref_4","unstructured":"Chinnov, A., Kerschke, P., Meske, C., Stieglitz, S., and Trautmann, H. (2015, January 13\u201315). An overview of topic discovery in Twitter communication through social media analytics. Proceedings of the 21st Americas Conference on Information Systems (AMCIS), Fajardo, Puerto Rico."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Weng, J., Lim, E.P., Jiang, J., and He, Q. (2010, January 3\u20136). Twitterrank: Finding topic-sensitive influential twitterers. Proceedings of the Third ACM International Conference on Web Search and Data Mining, New York, NY, USA.","DOI":"10.1145\/1718487.1718520"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Resnik, P., Armstrong, W., Claudino, L., Nguyen, T., Nguyen, V.A., and Boyd-Graber, J. (2015, January 5). Beyond LDA: Exploring supervised topic modeling for depression-related language in Twitter. Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Denver, CO, USA.","DOI":"10.3115\/v1\/W15-1212"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"e21252","DOI":"10.2196\/21252","article-title":"Patient Triage by Topic Modeling of Referral Letters: Feasibility Study","volume":"8","author":"Spasic","year":"2020","journal-title":"JMIR Med. Inform."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"5228","DOI":"10.1073\/pnas.0307752101","article-title":"Finding scientific topics","volume":"101","author":"Griffiths","year":"2004","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"ref_9","first-page":"e0264134","article-title":"Public Opinion about the UK Government during COVID-19 and Implications for Public Health: A Topic Modelling Analysis of Open-Ended Survey Response Data","volume":"17","author":"Wright","year":"2021","journal-title":"medRxiv"},{"key":"ref_10","first-page":"89","article-title":"Quantitative analysis of large amounts of journalistic texts using topic modelling","volume":"4","author":"Jacobi","year":"2016","journal-title":"Digit. J."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Greene, D., O\u2019Callaghan, D., and Cunningham, P. (2014, January 14\u201318). How many topics? stability analysis for topic models. Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Ghent, Belgium.","DOI":"10.1007\/978-3-662-44848-9_32"},{"key":"ref_12","first-page":"1","article-title":"In search of coherence and consensus: Measuring the interpretability of statistical topics","volume":"18","author":"Morstatter","year":"2018","journal-title":"J. Mach. Learn. Res."},{"key":"ref_13","unstructured":"Boyd-Graber, J., Mimno, D., and Newman, D. (2014). Handbook of Mixed Membership Models and Their Applications, Taylor & Francis Group Ltd."},{"key":"ref_14","unstructured":"Mimno, D., Wallach, H., Talley, E., Leenders, M., and McCallum, A. (2011, January 27\u201329). Optimizing semantic coherence in topic models. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK."},{"key":"ref_15","unstructured":"Chuang, J., Gupta, S., Manning, C., and Heer, J. (2013, January 17\u201319). Topic model diagnostics: Assessing domain relevance via topical alignment. Proceedings of the International conference on machine learning, PMLR, Atlanta, GA, USA."},{"key":"ref_16","unstructured":"OpenAI (2024, April 16). ChatGPT-3.5 Version. Available online: https:\/\/chat.openai.com\/."},{"key":"ref_17","unstructured":"Yu, J., and Egger, R. (2021). Information and Communication Technologies in Tourism 2021, Springer International Publishing."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"658","DOI":"10.1177\/0165551519887878","article-title":"Topic modelling and social network analysis of publications and patents in humanoid robot technology","volume":"47","author":"Kumari","year":"2021","journal-title":"J. Inf. Sci."},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Mei, Q., Shen, X., and Zhai, C. (2007, January 12\u201315). Automatic labeling of multinomial topic models. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, CA, USA.","DOI":"10.1145\/1281192.1281246"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1125","DOI":"10.1007\/s10664-012-9209-9","article-title":"Automated topic naming","volume":"18","author":"Hindle","year":"2013","journal-title":"Empir. Softw. Eng."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"28","DOI":"10.1016\/j.ijhcs.2017.03.007","article-title":"The human touch: How non-expert users perceive, interpret, and fix topic models","volume":"105","author":"Lee","year":"2017","journal-title":"Int. J. Hum.-Comput. Stud."},{"key":"ref_22","unstructured":"Lau, J.H., Grieser, K., Newman, D., and Baldwin, T. (2011, January 21). Automatic labelling of topic models. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Stroudsburg, PA, USA."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Magatti, D., Calegari, S., Ciucci, D., and Stella, F. (2009, January 2). Automatic labeling of topics. Proceedings of the 2009 Ninth International Conference on Intelligent Systems Design and Applications, Pisa, Italy.","DOI":"10.1109\/ISDA.2009.165"},{"key":"ref_24","unstructured":"Bhatia, S., Lau, J.H., and Baldwin, T. (2016, January 11\u201316). Automatic Labelling of Topics with Neural Embeddings. Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan."},{"key":"ref_25","unstructured":"Basave, A.E.C., He, Y., and Xu, R. (2014, January 22\u201327). Automatic labelling of topic models learned from twitter by summarisation. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, MD, USA."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Aletras, N., and Stevenson, M. (2014, January 22\u201327). Labelling topics using unsupervised graph-based methods. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, MD, USA.","DOI":"10.3115\/v1\/P14-2103"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Hulpus, I., Hayes, C., Karnstedt, M., and Greene, D. (2013, January 4\u20138). Unsupervised graph-based topic labelling using dbpedia. Proceedings of the sixth ACM International Conference on Web Search and Data Mining, Rome, Italy.","DOI":"10.1145\/2433396.2433454"},{"key":"ref_28","first-page":"335","article-title":"A knowledge-based topic modeling approach for automatic topic labeling","volume":"8","author":"Allahyari","year":"2017","journal-title":"Int. J. Adv. Comput. Sci. Appl."},{"key":"ref_29","first-page":"85","article-title":"Onto_TML: Auto-labeling of topic models","volume":"9","author":"Kinariwala","year":"2021","journal-title":"J. Integr. Sci. Technol."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"20531680211022206","DOI":"10.1177\/20531680211022206","article-title":"Transfer learning for topic labeling: Analysis of the UK House of Commons speeches 1935\u20132014","volume":"8","author":"Herzog","year":"2021","journal-title":"Res. Politics"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Wan, X., and Wang, T. (2016, January 7\u201312). Automatic labeling of topic models using text summaries. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany.","DOI":"10.18653\/v1\/P16-1217"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"154","DOI":"10.1002\/asi.23574","article-title":"Evaluating topic representations for exploring document collections","volume":"68","author":"Aletras","year":"2017","journal-title":"J. Assoc. Inf. Sci. Technol."},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Kou, W., Li, F., and Baldwin, T. (2015, January 2\u20134). Automatic labelling of topic models using word vectors and letter trigram vectors. Proceedings of the AIRS, Brisbane, Australia.","DOI":"10.1007\/978-3-319-28940-3_20"},{"key":"ref_34","unstructured":"Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J.L., and Blei, D.M. (2009, January 7\u201310). Reading tea leaves: How humans interpret topic models. Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Hofmann, T. (1999, January 15\u201319). Probabilistic latent semantic indexing. Proceedings of the 22nd annual international ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, USA.","DOI":"10.1145\/312624.312649"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"188","DOI":"10.1002\/aris.1440380105","article-title":"Latent semantic analysis","volume":"38","author":"Dumais","year":"2004","journal-title":"Annu. Rev. Inf. Sci. Technol."},{"key":"ref_37","first-page":"993","article-title":"Latent dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_38","unstructured":"Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv."},{"key":"ref_39","unstructured":"Angelov, D. (2020). Top2vec: Distributed representations of topics. arXiv."},{"key":"ref_40","first-page":"55","article-title":"Probabilistic topic models","volume":"27","author":"Blei","year":"2010","journal-title":"IEEE Signal Process. Mag."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Hendry, D., Darari, F., Nurfadillah, R., Khanna, G., Sun, M., Condylis, P.C., and Taufik, N. (2021, January 23\u201326). Topic modeling for customer service chats. Proceedings of the 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS), Virtual.","DOI":"10.1109\/ICACSIS53237.2021.9631322"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"886498","DOI":"10.3389\/fsoc.2022.886498","article-title":"A topic modeling comparison between lda, nmf, top2vec, and bertopic to demystify twitter posts","volume":"7","author":"Egger","year":"2022","journal-title":"Front. Sociol."},{"key":"ref_43","unstructured":"(2023, May 05). Scikit-Learn. 0.24.1 Linear Discriminant Analysis. Available online: https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.\\discriminantanalysis.LinearDiscriminantAnalysis.html."},{"key":"ref_44","first-page":"1","article-title":"Automatic keyword extraction from individual documents","volume":"1","author":"Rose","year":"2010","journal-title":"Text Mining Appl. Theory"},{"key":"ref_45","unstructured":"\u0158eh\u016f\u0159ek, R. (2024, April 03). Gensim: Topic Modelling for Humans. Available online: https:\/\/radimrehurek.com\/gensim_3.8.3\/summarization\/keywords.html."},{"key":"ref_46","unstructured":"Mihalcea, R., and Tarau, P. (2004, January 25\u201326). Textrank: Bringing order into text. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain."},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1016\/j.ins.2019.09.013","article-title":"YAKE! Keyword extraction from single documents using multiple local features","volume":"509","author":"Campos","year":"2020","journal-title":"Inf. Sci."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"1784827","DOI":"10.1155\/2016\/1784827","article-title":"An automatic multidocument text summarization approach based on Naive Bayesian classifier using timestamp strategy","volume":"2016","author":"Ramanujam","year":"2016","journal-title":"Sci. World J."},{"key":"ref_49","first-page":"109","article-title":"Text Mining Scientific Data to Extract Relevant Documents and Auto-Summarization","volume":"4","author":"Davare","year":"2017","journal-title":"IJSTE-Int. J. Sci. Technol. Eng."},{"key":"ref_50","unstructured":"Tarasov, A., Delany, S.J., and Cullen, C. (2010, January 1\u20135). Using crowdsourcing for labelling emotional speech assets. Proceedings of the W3C Workshop on Emotion ML, Paris, France."},{"key":"ref_51","unstructured":"Passonneau, R.J., Yano, T., Lippincott, T., and Klavans, J. (2008). Computational Linguistics for Metadata Building, European Language Resources Association (ELRA)."},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Snow, R., O\u2019connor, B., Jurafsky, D., and Ng, A.Y. (2008, January 25\u201327). Cheap and fast\u2013but is it good? Evaluating non-expert annotations for natural language tasks. Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, HI, USA.","DOI":"10.3115\/1613715.1613751"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Krippendorff, K. (2018). Content Analysis: An Introduction to Its Methodology, Sage Publications.","DOI":"10.4135\/9781071878781"},{"key":"ref_54","unstructured":"Meta (2024, April 16). Llama-2 Version. Available online: https:\/\/llama.meta.com\/."},{"key":"ref_55","unstructured":"Lang, K. (2023, November 03). Newsgroups Data Set. Available online: https:\/\/scikit-learn.org\/0.19\/datasets\/twenty_newsgroups.html."}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/8\/5\/44\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T14:33:50Z","timestamp":1760106830000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/8\/5\/44"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,4,25]]},"references-count":55,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2024,5]]}},"alternative-id":["bdcc8050044"],"URL":"https:\/\/doi.org\/10.3390\/bdcc8050044","relation":{},"ISSN":["2504-2289"],"issn-type":[{"type":"electronic","value":"2504-2289"}],"subject":[],"published":{"date-parts":[[2024,4,25]]}}}