{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T23:09:17Z","timestamp":1777590557577,"version":"3.51.4"},"reference-count":78,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,5,31]],"date-time":"2025-05-31T00:00:00Z","timestamp":1748649600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,5,31]],"date-time":"2025-05-31T00:00:00Z","timestamp":1748649600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"NGI Search","award":["Sub-grant Agreement SEARCH OC2_18"],"award-info":[{"award-number":["Sub-grant Agreement SEARCH OC2_18"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>The world is facing a multitude of challenges that hinder the development of human civilization and the well-being of humanity on the planet. The Sustainable Development Goals (SDGs) were formulated by the United Nations in 2015 to address these global challenges by 2030. Natural language processing techniques can help uncover discussions on SDGs within research literature. We propose a completely automated pipeline that (1) fetches content from academic literature and prepares datasets dedicated to five groups of SDGs; (2) performs topic modeling, a statistical technique used to identify topics in large collections of textual data; and (3) enables topic exploration through keywords-based search and topic frequency time series extraction. For topic modeling, we leverage the stack of BERTopic scaled up to be applied on large corpora of textual documents (we find hundreds of topics on hundreds of thousands of documents), introducing (i) a novel LLM-based embeddings computation for representing scientific abstracts in the continuous space, and (ii) a hyperparameter optimizer to efficiently find the best configuration for any new dataset. We additionally produce the visualization of results on interactive dashboards reporting topics\u2019 temporal evolution. Results are made inspectable and explorable, contributing to the interpretability of the topic modeling process. The proposed LLM-based topic modeling pipeline allows users to capture insights on the evolution of the attitude toward SDGs within scientific abstracts in the 2006\u20132023 time span. All the results are reproducible by using our system; the workflow can be generalized to be applied at any point in time to any large corpus of text data.<\/jats:p>","DOI":"10.1186\/s40537-025-01189-4","type":"journal-article","created":{"date-parts":[[2025,5,31]],"date-time":"2025-05-31T17:39:31Z","timestamp":1748713171000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":9,"title":["Capturing research literature attitude towards sustainable development goals: an LLM-based topic modeling approach"],"prefix":"10.1186","volume":"12","author":[{"given":"Francesco","family":"Invernici","sequence":"first","affiliation":[]},{"given":"Francesca","family":"Curati","sequence":"additional","affiliation":[]},{"given":"Jelena","family":"Jakimov","sequence":"additional","affiliation":[]},{"given":"Amirhossein","family":"Samavi","sequence":"additional","affiliation":[]},{"given":"Anna","family":"Bernasconi","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,5,31]]},"reference":[{"key":"1189_CR1","unstructured":"United Nations, Department of Economic and Social Affairs.: Sustainable Development Goals. Last accessed: Nov 24th, 2024. https:\/\/sdgs.un.org\/goals."},{"key":"1189_CR2","unstructured":"Elsevier.: Elsevier Developer Portal \u2013 Academic Research. Last accessed: Nov 24th, 2024. https:\/\/dev.elsevier.com\/academic_research_scopus.html."},{"key":"1189_CR3","doi-asserted-by":"crossref","unstructured":"Krause A, Leskovec J, Guestrin C. Data association for topic intensity tracking. In: Proceedings of the 23rd international conference on Machine learning; 2006. p. 497\u2013504.","DOI":"10.1145\/1143844.1143907"},{"key":"1189_CR4","doi-asserted-by":"crossref","unstructured":"Sia S, Dalmia A, Mielke SJ. Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2020. p. 1728\u20131736.","DOI":"10.18653\/v1\/2020.emnlp-main.135"},{"key":"1189_CR5","doi-asserted-by":"publisher","DOI":"10.4855\/arXiv.1605.02019","author":"CE Moody","year":"2016","unstructured":"Moody CE. Mixing dirichlet topic models and word embeddings to make Lda2vec. arXiv. 2016. https:\/\/doi.org\/10.4855\/arXiv.1605.02019.","journal-title":"arXiv"},{"key":"1189_CR6","doi-asserted-by":"crossref","unstructured":"Meng Y, Zhang Y, Huang J, Zhang Y, Han J. Topic discovery via latent space clustering of pretrained language model representations. In: Proceedings of the ACM web conference 2022; 2022. p. 3143\u20133152.","DOI":"10.1145\/3485447.3512034"},{"key":"1189_CR7","doi-asserted-by":"publisher","DOI":"10.3389\/fsoc.2022.886498","volume":"7","author":"R Egger","year":"2022","unstructured":"Egger R, Yu J. A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify twitter posts. Front Sociol. 2022;7: 886498.","journal-title":"Front Sociol"},{"key":"1189_CR8","doi-asserted-by":"publisher","DOI":"10.4855\/arXiv.2203.05794","author":"M Grootendorst","year":"2022","unstructured":"Grootendorst M. BERTopic: neural topic modeling with a class-based TF-IDF procedure. arXiv. 2022. https:\/\/doi.org\/10.4855\/arXiv.2203.05794.","journal-title":"arXiv"},{"key":"1189_CR9","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2022.102131","volume":"112","author":"A Abdelrazek","year":"2023","unstructured":"Abdelrazek A, Eid Y, Gawish E, Medhat W, Hassan A. Topic modeling algorithms and applications: a survey. Inform Syst. 2023;112: 102131.","journal-title":"Inform Syst"},{"issue":"12","key":"1189_CR10","doi-asserted-by":"publisher","first-page":"1114","DOI":"10.1038\/s41558-022-01527-x","volume":"12","author":"M Falkenberg","year":"2022","unstructured":"Falkenberg M, Galeazzi A, Torricelli M, Di Marco N, Larosa F, Sas M, et al. Growing polarization around climate change on social media. Nat Clim Change. 2022;12(12):1114\u201321.","journal-title":"Nat Clim Change"},{"key":"1189_CR11","doi-asserted-by":"crossref","unstructured":"Ebeling R, S\u00e1enz CAC, Nobre JC, Becker K. Analysis of the influence of political polarization in the vaccination stance: the Brazilian COVID-19 scenario. In: Proceedings of the International AAAI Conference on Web and Social Media. vol.\u00a016; 2022. p. 159\u2013170.","DOI":"10.1609\/icwsm.v16i1.19281"},{"key":"1189_CR12","doi-asserted-by":"publisher","first-page":"1603","DOI":"10.1038\/s41598-022-26796-6","volume":"13","author":"S Scepanovic","year":"2023","unstructured":"Scepanovic S, Constantinides M, Quercia D, Kim S. Quantifying the impact of positive stress on companies from online employee reviews. Sci Rep. 2023;13:1603.","journal-title":"Sci Rep"},{"key":"1189_CR13","doi-asserted-by":"crossref","unstructured":"Petrescu A, Truic\u0103 CO, Apostol ES. Sentiment analysis of events in social media. In: 2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP). IEEE; 2019. p. 143\u2013149.","DOI":"10.1109\/ICCP48234.2019.8959677"},{"key":"1189_CR14","doi-asserted-by":"crossref","unstructured":"Mitroi M, Truic\u0103 CO, Apostol ES, Florea AM, Sentiment analysis using topic-document embeddings. In,. IEEE 16th international conference on intelligent computer communication and processing (ICCP). IEEE. 2020;2020:75\u201382.","DOI":"10.1109\/ICCP51029.2020.9266181"},{"key":"1189_CR15","doi-asserted-by":"publisher","DOI":"10.4855\/arXiv.2311.16162","author":"H Yin","year":"2023","unstructured":"Yin H, Aryani A, Lambert G, White M, Salvador-Carulla L, Sadiq S, et al. Leveraging artificial intelligence technology for mapping research to sustainable development goals: a case study. arXiv preprint. 2023. https:\/\/doi.org\/10.4855\/arXiv.2311.16162.","journal-title":"arXiv preprint"},{"key":"1189_CR16","doi-asserted-by":"crossref","unstructured":"Kharlashkin L, Macias M, Huovinen L, H\u00e4m\u00e4l\u00e4inen M. Predicting Sustainable Development Goals Using Course Descriptions\u2013from LLMs to Conventional Foundation Models. Journal of Data Mining & Digital Humanities. 2024;.","DOI":"10.46298\/jdmdh.13127"},{"key":"1189_CR17","unstructured":"Corallo L, Li G, Reagan K, Saxena A, Varde AS, Wilde B. A Framework for German-English Machine Translation with GRU RNN. In: EDBT\/ICDT Workshops; 2022. ."},{"issue":"2","key":"1189_CR18","doi-asserted-by":"publisher","first-page":"733","DOI":"10.1007\/s10579-022-09584-6","volume":"57","author":"M Puri","year":"2023","unstructured":"Puri M, Varde AS, de Melo G. Commonsense based text mining on urban policy. Lang Resour Eval. 2023;57(2):733\u201363.","journal-title":"Lang Resour Eval"},{"key":"1189_CR19","doi-asserted-by":"publisher","DOI":"10.1108\/jices-05-2023-0073\/full\/html","author":"A Verma","year":"2024","unstructured":"Verma A, Nayak JK. Understanding public sentiments and misbeliefs about sustainable development goals: a sentiment and topic modeling analysis. J Inform, Commun Ethics Soc. 2024. https:\/\/doi.org\/10.1108\/jices-05-2023-0073\/full\/html.","journal-title":"J Inform, Commun Ethics Soc"},{"key":"1189_CR20","doi-asserted-by":"publisher","first-page":"144106","DOI":"10.1109\/ACCESS.2021.3122086","volume":"9","author":"D Roldan-Alvarez","year":"2021","unstructured":"Roldan-Alvarez D, Mart\u00ednez-Mart\u00ednez F, Martin E, Haya PA. Understanding discussions of citizen Science around sustainable development goals in Twitter. IEEE Access. 2021;9:144106\u201320.","journal-title":"IEEE Access"},{"issue":"3","key":"1189_CR21","first-page":"82","volume":"5","author":"H Fitri","year":"2021","unstructured":"Fitri H, Widyawan W, Soesanti I. Topic modeling in the news document on sustainable development goals. IJITEE (Int J Inform Technol Electric Eng). 2021;5(3):82\u20139.","journal-title":"IJITEE (Int J Inform Technol Electric Eng)"},{"key":"1189_CR22","volume":"35","author":"T Saheb","year":"2022","unstructured":"Saheb T, Dehghani M, Saheb T. Artificial intelligence for sustainable energy: a contextual topic modeling and content analysis. Sustain Comput: Inform Syst. 2022;35: 100699.","journal-title":"Sustain Comput: Inform Syst"},{"issue":"1","key":"1189_CR23","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1186\/s40537-024-00920-x","volume":"11","author":"R Raman","year":"2024","unstructured":"Raman R, Pattnaik D, Lathabai HH, Kumar C, Govindan K, Nedungadi P. Green and sustainable AI research: an integrated thematic and topic modeling analysis. J Big Data. 2024;11(1):55.","journal-title":"J Big Data"},{"issue":"1","key":"1189_CR24","doi-asserted-by":"publisher","first-page":"44","DOI":"10.1186\/s12992-023-00943-8","volume":"19","author":"TB Smith","year":"2023","unstructured":"Smith TB, Vacca R, Mantegazza L, Capua I. Discovering new pathways toward integration between health and sustainable development goals with natural language processing and network science. Glob Health. 2023;19(1):44.","journal-title":"Glob Health"},{"key":"1189_CR25","unstructured":"Bernasconi A, et al. TETYS: towards the next-generation open-source web topic explorer. In: CEUR proceedings, vol. 3692. CEUR-WS; 2024. pp. 26\u201333."},{"key":"1189_CR26","unstructured":"Elsevier.: Elsevier Portal. Last accessed: Nov 24th, 2024. https:\/\/www.elsevier.com\/."},{"key":"1189_CR27","doi-asserted-by":"publisher","first-page":"213","DOI":"10.1007\/s11192-015-1765-5","volume":"106","author":"P Mongeon","year":"2016","unstructured":"Mongeon P, Paul-Hus A. The journal coverage of web of science and scopus: a comparative analysis. Scientometrics. 2016;106:213\u201328.","journal-title":"Scientometrics"},{"key":"1189_CR28","doi-asserted-by":"publisher","first-page":"4066","DOI":"10.1109\/ACCESS.2022.3232939","volume":"11","author":"AAM Grisales","year":"2023","unstructured":"Grisales AAM, Robledo S, Zuluaga M. Topic modeling: perspectives from a literature review. IEEE Access. 2023;11:4066\u201378.","journal-title":"IEEE Access"},{"key":"1189_CR29","doi-asserted-by":"publisher","first-page":"244","DOI":"10.1016\/j.cose.2017.03.007","volume":"67","author":"HS Choi","year":"2017","unstructured":"Choi HS, Lee WS, Sohn SY. Analyzing research trends in personal information privacy using topic modeling. Comput Secur. 2017;67:244\u201353.","journal-title":"Comput Secur"},{"issue":"2","key":"1189_CR30","doi-asserted-by":"publisher","first-page":"145","DOI":"10.1108\/JOPP-06-2022-0031","volume":"23","author":"A Rejeb","year":"2023","unstructured":"Rejeb A, Rejeb K, Appolloni A, Kayikci Y, Iranmanesh M. The landscape of public procurement research: a bibliometric analysis and topic modelling based on Scopus. J Public Procure. 2023;23(2):145\u201378.","journal-title":"J Public Procure"},{"key":"1189_CR31","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2024.124028","volume":"252","author":"F Invernici","year":"2024","unstructured":"Invernici F, Bernasconi A, Ceri S. Exploring the evolution of research topics during the COVID-19 pandemic. Expert Syst Appl. 2024;252: 124028.","journal-title":"Expert Syst Appl"},{"key":"1189_CR32","unstructured":"Grootendorst M.: BERTopic \u2013 Best practices on large datasets. Last accessed: Nov 24th, 2024. https:\/\/github.com\/MaartenGr\/BERTopic\/issues\/491."},{"key":"1189_CR33","unstructured":"OpenAI.: ChatGPT. Last accessed: Nov 24th, 2024. https:\/\/chatgpt.com\/."},{"key":"1189_CR34","doi-asserted-by":"crossref","unstructured":"Dai S, Xu C, Xu S, Pang L, Dong Z, Xu J. Bias and unfairness in information retrieval systems: New challenges in the llm era. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; 2024. p. 6437\u20136447.","DOI":"10.1145\/3637528.3671458"},{"key":"1189_CR35","unstructured":"Elsevier.: Elsevier Developer Portal. Last accessed: Nov 24th, 2024. https:\/\/dev.elsevier.com\/."},{"key":"1189_CR36","doi-asserted-by":"crossref","unstructured":"Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In: Inui K, Jiang J, Ng V, Wan X, editors. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics; 2019. p. 3982\u20133992.","DOI":"10.18653\/v1\/D19-1410"},{"key":"1189_CR37","doi-asserted-by":"publisher","DOI":"10.4855\/arXiv.1802.03426","author":"L McInnes","year":"2018","unstructured":"McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv. 2018. https:\/\/doi.org\/10.4855\/arXiv.1802.03426.","journal-title":"arXiv"},{"issue":"11","key":"1189_CR38","doi-asserted-by":"publisher","first-page":"205","DOI":"10.21105\/joss.00205","volume":"2","author":"L McInnes","year":"2017","unstructured":"McInnes L, Healy J, Astels S. hdbscan: hierarchical density based clustering. J Open Sour Soft. 2017;2(11):205.","journal-title":"J Open Sour Soft"},{"issue":"85","key":"1189_CR39","first-page":"2825","volume":"12","author":"F Pedregosa","year":"2011","unstructured":"Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12(85):2825\u201330.","journal-title":"J Mach Learn Res"},{"key":"1189_CR40","doi-asserted-by":"publisher","first-page":"27","DOI":"10.1007\/978-3-642-39314-3_3","volume-title":"Web information retrieval","author":"S Ceri","year":"2013","unstructured":"Ceri S, Bozzon A, Brambilla M, Della Valle E, Fraternali P, Quarteroni S. Information retrieval models. In: Fraternali P, Quarteroni S, Valle ED, Brambilla M, Ceri S, Bozzon A, editors. Web information retrieval. Berlin: Springer; 2013. p. 27\u201337."},{"key":"1189_CR41","doi-asserted-by":"publisher","DOI":"10.4855\/arXiv.2210.07316","author":"N Muennighoff","year":"2022","unstructured":"Muennighoff N, Tazi N, Magne L, Reimers N. MTEB: massive text embedding benchmark. arXiv preprint. 2022. https:\/\/doi.org\/10.4855\/arXiv.2210.07316.","journal-title":"arXiv preprint"},{"key":"1189_CR42","unstructured":"Meng R, Liu Y, Joty SR, Xiong C, Zhou Y, Yavuz S.: SFR-Embedding-2: Advanced Text Embedding with Multi-stage Training. Last accessed: Nov 24th, 2024. https:\/\/huggingface.co\/Salesforce\/SFR-Embedding-2_R."},{"key":"1189_CR43","unstructured":"Salesforce AI Research.: SFR-Embedding-Mistral. Last accessed: Nov 24th, 2024. https:\/\/huggingface.co\/Salesforce\/SFR-Embedding-Mistral."},{"key":"1189_CR44","unstructured":"Hugging Face.: Pipelines. Last accessed: Nov 24th, 2024. https:\/\/huggingface.co\/docs\/transformers\/en\/main_classes\/pipelines."},{"key":"1189_CR45","doi-asserted-by":"crossref","unstructured":"Moulavi D, Jaskowiak PA, Campello RJ, Zimek A, Sander J. Density-Based Clustering Validation. In: Proceedings of the 2014 SIAM international conference on data mining. SIAM; 2014. p. 839\u2013847.","DOI":"10.1137\/1.9781611973440.96"},{"key":"1189_CR46","unstructured":"The PyTorch Foundation.: PyTorch. Last accessed: Nov 24th, 2024. https:\/\/pytorch.org\/."},{"key":"1189_CR47","unstructured":"Mueller AC.: Wordcloud. Last accessed: Nov 24th, 2024. https:\/\/github.com\/amueller\/word_cloud."},{"issue":"260","key":"1189_CR48","doi-asserted-by":"publisher","first-page":"583","DOI":"10.1080\/01621459.1952.10483441","volume":"47","author":"WH Kruskal","year":"1952","unstructured":"Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952;47(260):583\u2013621.","journal-title":"J Am Stat Assoc"},{"issue":"3","key":"1189_CR49","doi-asserted-by":"publisher","first-page":"261","DOI":"10.1038\/s41592-019-0686-2","volume":"17","author":"P Virtanen","year":"2020","unstructured":"Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261\u201372.","journal-title":"Nat Methods"},{"issue":"3","key":"1189_CR50","doi-asserted-by":"publisher","first-page":"241","DOI":"10.1080\/00401706.1964.10490181","volume":"6","author":"OJ Dunn","year":"1964","unstructured":"Dunn OJ. Multiple comparisons using rank sums. Technometrics. 1964;6(3):241\u201352.","journal-title":"Technometrics"},{"issue":"10","key":"1189_CR51","doi-asserted-by":"publisher","first-page":"3900","DOI":"10.1021\/cr050200z","volume":"107","author":"W Lubitz","year":"2007","unstructured":"Lubitz W, Tumas W. Hydrogen: an overview. Chem Rev. 2007;107(10):3900\u20133.","journal-title":"Chem Rev"},{"key":"1189_CR52","unstructured":"NVIDIA Corporation.: NVIDIA A100 Tensor Core GPU. Last accessed: Nov 24th, 2024. https:\/\/www.nvidia.com\/en-eu\/data-center\/a100\/."},{"key":"1189_CR53","doi-asserted-by":"crossref","unstructured":"Cohan A, Feldman S, Beltagy I, Downey D, Weld D. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In: Jurafsky D, Chai J, Schluter N, Tetreault J, editors. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics; 2020. p. 2270\u20132282.","DOI":"10.18653\/v1\/2020.acl-main.207"},{"key":"1189_CR54","doi-asserted-by":"crossref","unstructured":"Beltagy I, Lo K, Cohan A. SciBERT: A Pretrained Language Model for Scientific Text. In: Inui K, Jiang J, Ng V, Wan X, editors. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics; 2019. p. 3615\u20133620.","DOI":"10.18653\/v1\/D19-1371"},{"key":"1189_CR55","unstructured":"Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Burstein J, Doran C, Solorio T, editors. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics; 2019. p. 4171\u20134186."},{"key":"1189_CR56","doi-asserted-by":"crossref","unstructured":"R\u00f6der M, Both A, Hinneburg A. Exploring the space of topic coherence measures. In: Proceedings of the eighth ACM international conference on Web search and data mining; 2015. p. 399\u2013408.","DOI":"10.1145\/2684822.2685324"},{"key":"1189_CR57","volume-title":"Gensim-python framework for vector space modelling","author":"R Rehurek","year":"2011","unstructured":"Rehurek R, Sojka P. Gensim-python framework for vector space modelling. Brno: NLP Centre, Faculty of Informatics, Masaryk University; 2011."},{"key":"1189_CR58","doi-asserted-by":"crossref","unstructured":"Terragni S, Fersini E, Galuzzi BG, Tropeano P, Candelieri A. OCTIS: Comparing and Optimizing Topic Models is Simple! In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics; 2021. p. 263\u2013270. Available from: https:\/\/www.aclweb.org\/anthology\/2021.eacl-demos.31.","DOI":"10.18653\/v1\/2021.eacl-demos.31"},{"key":"1189_CR59","unstructured":"Stevens K, Kegelmeyer P, Andrzejewski D, Buttler D. Exploring topic coherence over many models and many topics. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning; 2012. p. 952\u2013961."},{"issue":"5","key":"1189_CR60","doi-asserted-by":"publisher","first-page":"1199","DOI":"10.1017\/S1351324922000535","volume":"29","author":"H Schuff","year":"2023","unstructured":"Schuff H, Vanderlyn L, Adel H, Vu NT. How to do human evaluation: a brief introduction to user studies in NLP. Natl Lang Eng. 2023;29(5):1199\u2013222.","journal-title":"Natl Lang Eng"},{"issue":"2","key":"1189_CR61","doi-asserted-by":"publisher","first-page":"153","DOI":"10.1007\/BF02295996","volume":"12","author":"Q McNemar","year":"1947","unstructured":"McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika. 1947;12(2):153\u20137.","journal-title":"Psychometrika"},{"key":"1189_CR62","unstructured":"Invernici F, Bernasconi A, Curati F, Jakimov J, Samavi A, et\u00a0al. TETYS: Configurable Topic Modeling Exploration for Big Corpora of Text Documents. In: Proceedings of the 28th International Conference on Extending Database Technology (EDBT). vol.\u00a028; 2025. p. 1114\u20131117."},{"key":"1189_CR63","doi-asserted-by":"crossref","unstructured":"Raasveldt M, M\u00fchleisen H. Duckdb: an embeddable analytical database. In: Proceedings of the 2019 International Conference on Management of Data; 2019. p. 1981\u20131984.","DOI":"10.1145\/3299869.3320212"},{"key":"1189_CR64","unstructured":"Apache Software Foundation.: Apache Parquet. Last accessed: Nov 24th, 2024. https:\/\/parquet.apache.org\/."},{"key":"1189_CR65","unstructured":"Crossref.: Documentation - Metadata Retrieval - REST API. Last accessed: Nov 24th, 2024. https:\/\/www.crossref.org\/documentation\/retrieve-metadata\/rest-api\/."},{"key":"1189_CR66","doi-asserted-by":"publisher","DOI":"10.4855\/arXiv.2010.12626","author":"L Thompson","year":"2020","unstructured":"Thompson L, Mimno D. Topic modeling with contextualized word representation clusters. arXiv preprint. 2020. https:\/\/doi.org\/10.4855\/arXiv.2010.12626.","journal-title":"arXiv preprint"},{"key":"1189_CR67","first-page":"100","volume":"6","author":"A Petukhova","year":"2025","unstructured":"Petukhova A, Matos-Carvalho JP, Fachada N. Text clustering with large language model embeddings. Int J Cognit Comput Eng. 2025;6:100\u20138.","journal-title":"Int J Cognit Comput Eng"},{"key":"1189_CR68","unstructured":"Radulescu IM, Truica CO, Apostol ES, Boicea A, Radulescu F, Mocanu M. Performance Evaluation of DBSCAN With Similarity Join Algorithms. In: The 34th International-Business-Information-Management-Association (IBIMA) Conference, Madrid, Spain; 2019. ."},{"key":"1189_CR69","doi-asserted-by":"crossref","unstructured":"Allaoui M, Kherfi ML, Cheriet A. Considerably improving clustering algorithms using UMAP dimensionality reduction technique: a comparative study. In: International conference on image and signal processing. Springer; 2020. p. 317\u2013325.","DOI":"10.1007\/978-3-030-51935-3_34"},{"issue":"1","key":"1189_CR70","doi-asserted-by":"publisher","first-page":"11","DOI":"10.1108\/eb026526","volume":"28","author":"Jones K Sparck","year":"1972","unstructured":"Sparck Jones K. A statistical interpretation of term specificity and its application in retrieval. J Document. 1972;28(1):11\u201321.","journal-title":"J Document"},{"issue":"4","key":"1189_CR71","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1561\/1500000019","volume":"3","author":"S Robertson","year":"2009","unstructured":"Robertson S, Zaragoza H, et al. The probabilistic relevance framework: BM25 and beyond. Found Trends Inform Retriev. 2009;3(4):333\u201389.","journal-title":"Found Trends Inform Retriev"},{"key":"1189_CR72","doi-asserted-by":"crossref","unstructured":"Truica CO, Radulescu F, Boicea A, Comparing different term weighting schemas for topic modeling. In,. 18th international symposium on symbolic and numeric algorithms for scientific computing (SYNASC). IEEE. 2016;2016:307\u201310.","DOI":"10.1109\/SYNASC.2016.055"},{"key":"1189_CR73","doi-asserted-by":"crossref","unstructured":"Zuo Y, Wu J, Zhang H, Lin H, Wang F, Xu K, et\u00a0al. Topic modeling of short texts: A pseudo-document view. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. p. 2105\u20132114.","DOI":"10.1145\/2939672.2939880"},{"key":"1189_CR74","doi-asserted-by":"crossref","unstructured":"Truica CO, Apostol ES, Leordeanu CA, Topic modeling using contextual cues. In,. 19th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC). IEEE. 2017;2017:203\u201310.","DOI":"10.1109\/SYNASC.2017.00041"},{"issue":"6","key":"1189_CR75","doi-asserted-by":"publisher","first-page":"1345","DOI":"10.1016\/j.ipm.2018.05.009","volume":"54","author":"X Li","year":"2018","unstructured":"Li X, Zhang A, Li C, Ouyang J, Cai Y. Exploring coherent topics by topic modeling with term weighting. Inform Process Manag. 2018;54(6):1345\u201358.","journal-title":"Inform Process Manag"},{"key":"1189_CR76","doi-asserted-by":"publisher","first-page":"439","DOI":"10.1162\/tacl_a_00325","volume":"8","author":"AB Dieng","year":"2020","unstructured":"Dieng AB, Ruiz FJ, Blei DM. Topic modeling in embedding spaces. Trans Assoc Comput Linguist. 2020;8:439\u201353.","journal-title":"Trans Assoc Comput Linguist"},{"key":"1189_CR77","doi-asserted-by":"crossref","unstructured":"Wang H, Prakash N, Hoang NK, Hee MS, Naseem U, Lee RKW. Prompting large language models for topic modeling. In: 2023 IEEE International Conference on Big Data (BigData). IEEE; 2023. p. 1236\u20131241.","DOI":"10.1109\/BigData59044.2023.10386113"},{"key":"1189_CR78","doi-asserted-by":"publisher","DOI":"10.4855\/arXiv.2403.16248","author":"Y Mu","year":"2024","unstructured":"Mu Y, Dong C, Bontcheva K, Song X. Large language models offer an alternative to the traditional approach of topic modelling. arXiv preprint. 2024. https:\/\/doi.org\/10.4855\/arXiv.2403.16248.","journal-title":"arXiv preprint"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01189-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-025-01189-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-025-01189-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,5,31]],"date-time":"2025-05-31T17:39:38Z","timestamp":1748713178000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-025-01189-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,5,31]]},"references-count":78,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["1189"],"URL":"https:\/\/doi.org\/10.1186\/s40537-025-01189-4","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,5,31]]},"assertion":[{"value":"5 December 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 May 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"31 May 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that there are no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"139"}}