{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T16:03:25Z","timestamp":1781625805705,"version":"3.54.5"},"reference-count":41,"publisher":"Springer Science and Business Media LLC","issue":"9","license":[{"start":{"date-parts":[[2022,3,30]],"date-time":"2022-03-30T00:00:00Z","timestamp":1648598400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,3,30]],"date-time":"2022-03-30T00:00:00Z","timestamp":1648598400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100010661","name":"Horizon 2020 Framework Programme","doi-asserted-by":"publisher","award":["101004870"],"award-info":[{"award-number":["101004870"]}],"id":[{"id":"10.13039\/100010661","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100010661","name":"Horizon 2020 Framework Programme","doi-asserted-by":"publisher","award":["101004870"],"award-info":[{"award-number":["101004870"]}],"id":[{"id":"10.13039\/100010661","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100010661","name":"Horizon 2020 Framework Programme","doi-asserted-by":"publisher","award":["101004870"],"award-info":[{"award-number":["101004870"]}],"id":[{"id":"10.13039\/100010661","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100014440","name":"Ministerio de Ciencia, Innovaci\u00f3n y Universidades","doi-asserted-by":"publisher","award":["TEC2017-83838-R"],"award-info":[{"award-number":["TEC2017-83838-R"]}],"id":[{"id":"10.13039\/100014440","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100014440","name":"Ministerio de Ciencia, Innovaci\u00f3n y 
Universidades","doi-asserted-by":"publisher","award":["TEC2017-83838-R"],"award-info":[{"award-number":["TEC2017-83838-R"]}],"id":[{"id":"10.13039\/100014440","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100014440","name":"Ministerio de Ciencia, Innovaci\u00f3n y Universidades","doi-asserted-by":"publisher","award":["TEC2017-83838-R"],"award-info":[{"award-number":["TEC2017-83838-R"]}],"id":[{"id":"10.13039\/100014440","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Universidad Carlos III"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Scientometrics"],"published-print":{"date-parts":[[2022,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology and innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which are focused on the quality of topics, are not a good fit in applications where the quality of the document similarity relations inferred from the topic model is especially relevant. Relying on graph analysis techniques, the aim of our work is to state a new methodology for the selection of hyperparameters which is specifically oriented to optimize the similarity metrics emanating from the topic model. 
In order to do this, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second metric measures the alignment between the graph derived from the LDA model and another obtained using metadata available for the corresponding corpus. Through experiments on various corpora related to STI, it is shown that the proposed metrics provide relevant indicators to select the number of topics and build persistent topic models that are consistent with the metadata. Their use, which can be extended to other topic models beyond LDA, could facilitate the systematic adoption of this kind of techniques in STI policy analysis and design.<\/jats:p>","DOI":"10.1007\/s11192-022-04318-5","type":"journal-article","created":{"date-parts":[[2022,3,30]],"date-time":"2022-03-30T04:16:09Z","timestamp":1648613769000},"page":"5441-5458","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":12,"title":["Validation of scientific topic models using graph analysis and corpus metadata"],"prefix":"10.1007","volume":"127","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-3365-2622","authenticated-orcid":false,"given":"Manuel 
A.","family":"V\u00e1zquez","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jorge","family":"Pereira-Delgado","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5243-5992","authenticated-orcid":false,"given":"Jes\u00fas","family":"Cid-Sueiro","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4071-7068","authenticated-orcid":false,"given":"Jer\u00f3nimo","family":"Arenas-Garc\u00eda","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2022,3,30]]},"reference":[{"issue":"10","key":"4318_CR1","doi-asserted-by":"publisher","first-page":"3378","DOI":"10.36478\/jeasci.2019.3378.3382","volume":"14","author":"A Adebiyi","year":"2019","unstructured":"Adebiyi, A., Ogunleye, O. M., Adebiyi, M., & Okesola, J. (2019). A comparative analysis of tf-idf, lsi and lda in semantic information retrieval approach for paper-reviewer assignment. Journal of Engineering and Applied Sciences, 14(10), 3378\u20133382.","journal-title":"Journal of Engineering and Applied Sciences"},{"key":"4318_CR2","unstructured":"Agerri, R., Bermudez, J., & Rigau, G. (2014). Ixa pipeline: Efficient and ready to use multilingual nlp tools. In LREC, 2014, 3823\u20133828."},{"key":"4318_CR3","doi-asserted-by":"publisher","first-page":"74","DOI":"10.1016\/j.infsof.2018.02.005","volume":"98","author":"A Agrawal","year":"2018","unstructured":"Agrawal, A., Fu, W., & Menzies, T. (2018). What is wrong with topic modeling? And how to fix it using search-based software engineering. 
Information and Software Technology, 98, 74\u201388.","journal-title":"Information and Software Technology"},{"key":"4318_CR4","doi-asserted-by":"crossref","unstructured":"Ammar, W., Groeneveld, D., Bhagavatula, C., Beltagy, I., Crawford, M., Downey, D., Dunkelberger, J., Elgohary, A., Feldman, S., Ha, V., Kinney, R., Kohlmeier, S., Lo, K., Murray, T., Ooi, H. H., Peters, M., Power, J., Skjonsberg, S., Wang, L. L., Wilhelm, C., Yuan, Z., van Zuylen, & M., Etzioni, O. (2018) Construction of the literature graph in semantic scholar. In NAACL","DOI":"10.18653\/v1\/N18-3011"},{"key":"4318_CR6","doi-asserted-by":"crossref","unstructured":"Badenes-Olmedo, C., Redondo-Garcia, J. L., & Corcho, O. (2017). Distributing text mining tasks with librAIry. In Proceedings of the 2017 ACM symposium on document engineering, DocEng \u201917 (pp. 63\u201366). ACM.","DOI":"10.1145\/3103010.3121040"},{"key":"4318_CR5","doi-asserted-by":"crossref","unstructured":"Badenes-Olmedo, C., Redondo-Garc\u00eda, J. L., & Corcho O. (2020). Large-scale semantic exploration of scientific literature using topic-based hashing algorithms. Semantic Web, 11, 735\u2013750.","DOI":"10.3233\/SW-200373"},{"key":"4318_CR7","first-page":"147","volume":"18","author":"D Blei","year":"2006","unstructured":"Blei, D., & Lafferty, J. (2006). Correlated topic models. Advances in Neural Information Procesing Systems, 18, 147.","journal-title":"Advances in Neural Information Procesing Systems"},{"key":"4318_CR8","first-page":"993","volume":"3","author":"DM Blei","year":"2003","unstructured":"Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3, 993\u20131022.","journal-title":"Journal of machine Learning research"},{"issue":"3","key":"4318_CR9","doi-asserted-by":"publisher","first-page":"e18029","DOI":"10.1371\/journal.pone.0018029","volume":"6","author":"KW Boyack","year":"2011","unstructured":"Boyack, K. W., Newman, D., Duhon, R. 
J., Klavans, R., Patek, M., Biberstine, J. R., Schijvenaars, B., Skupin, A., Ma, N., & B\u00f6rner, K. (2011). Clustering more than two million biomedical publications: Comparing the accuracies of nine text-based similarity approaches. PLoS ONE, 6(3), e18029.","journal-title":"PLoS ONE"},{"key":"4318_CR10","unstructured":"Burghardt, M., & Luhmann, J. (2021) Same same, but different? On the relation of information science and the digital humanities a scientometric comparison of academic journals using lda and hierarchical clustering"},{"key":"4318_CR11","unstructured":"Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., & Blei, D. M.(2009) Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems."},{"issue":"3","key":"4318_CR12","doi-asserted-by":"publisher","first-page":"2091","DOI":"10.1007\/s11192-020-03666-4","volume":"125","author":"J Chen","year":"2020","unstructured":"Chen, J., Chen, J., Zhao, S., Zhang, Y., & Tang, J. (2020). Exploiting word embedding for heterogeneous topic model towards patent recommendation. Scientometrics, 125(3), 2091\u20132108.","journal-title":"Scientometrics"},{"key":"4318_CR13","doi-asserted-by":"crossref","unstructured":"Chuang, J., Roberts, M. E., Stewart, B. M., Weiss, R., Tingley, D., Grimmer, J., & Heer, J.(2015) Topiccheck: Interactive alignment for assessing topic model stability. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 175\u2013184).","DOI":"10.3115\/v1\/N15-1018"},{"issue":"1","key":"4318_CR14","doi-asserted-by":"publisher","first-page":"e0244839","DOI":"10.1371\/journal.pone.0244839","volume":"16","author":"G Colavizza","year":"2021","unstructured":"Colavizza, G., Costas, R., Traag, V. A., Van Eck, N. J., Van Leeuwen, T., & Waltman, L. (2021). A scientometric overview of cord-19. 
PLoS ONE, 16(1), e0244839.","journal-title":"PLoS ONE"},{"key":"4318_CR15","unstructured":"European-Patent-Office: Data catalog patstat global. (2020). Data retrieved from the European Patent Office, https:\/\/www.epo.org\/."},{"key":"4318_CR16","unstructured":"Grant, J., Hinrichs, S., Gill, A., & Adams, J.(2017) The nature, scale and beneficiaries of research impact"},{"issue":"6","key":"4318_CR17","doi-asserted-by":"publisher","first-page":"1292","DOI":"10.1016\/j.ipm.2018.05.006","volume":"54","author":"L Hagen","year":"2018","unstructured":"Hagen, L. (2018). Content analysis of e-petitions with topic modeling: How to train and evaluate lda models? Information Processing & Management, 54(6), 1292\u20131307.","journal-title":"Information Processing & Management"},{"issue":"3","key":"4318_CR18","doi-asserted-by":"publisher","first-page":"2561","DOI":"10.1007\/s11192-020-03721-0","volume":"125","author":"X Han","year":"2020","unstructured":"Han, X. (2020). Evolution of research topics in lis between 1996 and 2019: An analysis based on latent dirichlet allocation topic model. Scientometrics, 125(3), 2561\u20132595.","journal-title":"Scientometrics"},{"issue":"3","key":"4318_CR19","doi-asserted-by":"publisher","first-page":"263","DOI":"10.1093\/reseval\/rvz015","volume":"28","author":"T Hecking","year":"2019","unstructured":"Hecking, T., & Leydesdorff, L. (2019). Can topic models be used in research evaluations? Reproducibility, validity, and reliability when compared with semantic maps. Research Evaluation, 28(3), 263\u2013272.","journal-title":"Research Evaluation"},{"issue":"1","key":"4318_CR20","doi-asserted-by":"publisher","first-page":"011007","DOI":"10.1103\/PhysRevX.5.011007","volume":"5","author":"A Lancichinetti","year":"2015","unstructured":"Lancichinetti, A., Sirer, M. I., Wang, J. X., Acuna, D., K\u00f6rding, K., & Amaral, L. A. N. (2015). High-reproducibility and high-accuracy method for automated topic classification. 
Physical Review X, 5(1), 011007.","journal-title":"Physical Review X"},{"issue":"2\u20133","key":"4318_CR21","doi-asserted-by":"publisher","first-page":"93","DOI":"10.1080\/19312458.2018.1430754","volume":"12","author":"D Maier","year":"2018","unstructured":"Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G., Reber, U., H\u00e4ussler, T., et al. (2018). Applying lda topic modeling in communication research: Toward a valid and reliable methodology. Communication Methods and Measures, 12(2\u20133), 93\u2013118.","journal-title":"Communication Methods and Measures"},{"key":"4318_CR22","doi-asserted-by":"crossref","unstructured":"Mantyla, M. V., Claes, M., & Farooq, U.(2018) Measuring LDA topic stability from clusters of replicated runs. In Proceedings of the 12th ACM\/IEEE International Symposium on Empirical Software Engineering and Measurement (pp. 1\u20134)","DOI":"10.1145\/3239235.3267435"},{"key":"4318_CR23","unstructured":"McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. http:\/\/mallet. cs. umass. edu"},{"issue":"1","key":"4318_CR24","doi-asserted-by":"publisher","first-page":"665","DOI":"10.1007\/s11192-020-03657-5","volume":"125","author":"Y Miyata","year":"2020","unstructured":"Miyata, Y., Ishita, E., Yang, F., Yamamoto, M., Iwase, A., & Kurata, K. (2020). Knowledge structure transition in library and information science: Topic modeling and visualization. Scientometrics, 125(1), 665\u2013687.","journal-title":"Scientometrics"},{"key":"4318_CR25","unstructured":"Newman, D., Lau, J. H., Grieser, K., & Baldwin, T.(2010) Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 100\u2013108). Association for Computational Linguistics, Los Angeles, California"},{"key":"4318_CR26","unstructured":"OECD: The Digitalisation of Science and Innovation Policy. (2018). 
https:\/\/www.oecd-ilibrary.org\/content\/component\/sti_in_outlook-2018-17-en"},{"key":"4318_CR27","doi-asserted-by":"crossref","unstructured":"Pathik, N., Shukla, P.(2020) Simulated annealing based algorithm for tuning LDA hyper parameters. In Soft Computing: Theories and Applications (pp. 515\u2013521). Springer","DOI":"10.1007\/978-981-15-4032-5_47"},{"key":"4318_CR28","unstructured":"P\u00e9rez-Fern\u00e1ndez, D., Arenas-Garc\u00eda, J., Samy, D., Padilla-Soler, A., & G\u00f3mez-Verdejo, V. (2019). Corpus viewer: NLP and ML-based platform for publicpolicy making and implementation."},{"issue":"1","key":"4318_CR29","doi-asserted-by":"publisher","first-page":"215","DOI":"10.1007\/s11192-019-03275-w","volume":"122","author":"S Ranaei","year":"2020","unstructured":"Ranaei, S., Suominen, A., Porter, A., & Carley, S. (2020). Evaluating technological emergence using text analytics: two case technologies and three approaches. Scientometrics, 122(1), 215\u2013247.","journal-title":"Scientometrics"},{"key":"4318_CR30","doi-asserted-by":"crossref","unstructured":"R\u00f6der, M., Both, A., & Hinneburg, A. (2015) Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (pp. 399\u2013408).","DOI":"10.1145\/2684822.2685324"},{"issue":"8","key":"4318_CR31","doi-asserted-by":"publisher","first-page":"1450","DOI":"10.1016\/j.respol.2014.02.005","volume":"43","author":"H Small","year":"2014","unstructured":"Small, H., Boyack, K. W., & Klavans, R. (2014). Identifying emerging topics in science and technology. Research Policy, 43(8), 1450\u20131467.","journal-title":"Research Policy"},{"key":"4318_CR32","unstructured":"Srivastava, A., & Sutton, C. (2017) Autoencoding variational inference for topic models. 
arXiv preprint arXiv:1703.01488"},{"issue":"10","key":"4318_CR33","doi-asserted-by":"publisher","first-page":"2464","DOI":"10.1002\/asi.23596","volume":"67","author":"A Suominen","year":"2016","unstructured":"Suominen, A., & Toivanen, H. (2016). Map of science with topic modeling: Comparison of unsupervised learning and human-assigned subject classification. Journal of the Association for Information Science and Technology, 67(10), 2464\u20132476 (Project code: 101488).","journal-title":"Journal of the Association for Information Science and Technology"},{"key":"4318_CR34","doi-asserted-by":"crossref","unstructured":"Syed, S., Spruit, M.(2017) Full-text or abstract? examining topic coherence scores using latent dirichlet allocation. In 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 165\u2013174. IEEE","DOI":"10.1109\/DSAA.2017.61"},{"key":"4318_CR35","unstructured":"Vega-Carrasco, M., O\u2019sullivan, J., Prior, R., Manolopoulou, I., & Musolesi, M.(2020) Modelling grocery retail topic distributions: Evaluation, interpretability and stability. arXiv preprint arXiv:2005.10125"},{"issue":"2","key":"4318_CR36","first-page":"691","volume":"1","author":"L Waltman","year":"2020","unstructured":"Waltman, L., Boyack, K. W., Colavizza, G., & van Eck, N. J. (2020). A principled methodology for comparing relatedness measures for clustering publications. Quantitative Science Studies, 1(2), 691\u2013713.","journal-title":"Quantitative Science Studies"},{"key":"4318_CR37","doi-asserted-by":"crossref","unstructured":"Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Church, D. M., et al. (2005). Database resources of the national center for biotechnology information. Nucleic Acids Research, D33(Database Issue), 39\u2013D45.","DOI":"10.1093\/nar\/gki062"},{"key":"4318_CR38","unstructured":"Xiao, H., Stibor, T.(2010) Efficient collapsed gibbs sampling for latent dirichlet allocation. 
In Proceedings of 2nd Asian Conference on Machine Learning, pp. 63\u201378. JMLR Workshop and Conference Proceedings."},{"key":"4318_CR39","doi-asserted-by":"crossref","unstructured":"Xue, M.(2019) A text retrieval algorithm based on the hybrid lda and word2vec model. In 2019 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS) (pp. 373\u2013376). IEEE","DOI":"10.1109\/ICITBS.2019.00098"},{"key":"4318_CR40","doi-asserted-by":"crossref","unstructured":"Yao, L., Mimno, D., McCallum, A. (2009)Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD \u201909 (pp. 937\u2013946). Association for Computing Machinery, New York, NY, USA","DOI":"10.1145\/1557019.1557121"},{"key":"4318_CR41","doi-asserted-by":"crossref","unstructured":"Zhao, W., Chen, J. J., Perkins, R., Liu, Z., Ge, W., Ding, Y., & Zou, W.(2015) A heuristic approach to determine an appropriate number of topics in topic modeling. In BMC bioinformatics (Vol.\u00a016, pp. 1\u201310). 
Springer","DOI":"10.1186\/1471-2105-16-S13-S8"}],"container-title":["Scientometrics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11192-022-04318-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11192-022-04318-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11192-022-04318-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,9,9]],"date-time":"2022-09-09T06:40:56Z","timestamp":1662705656000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11192-022-04318-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3,30]]},"references-count":41,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2022,9]]}},"alternative-id":["4318"],"URL":"https:\/\/doi.org\/10.1007\/s11192-022-04318-5","relation":{},"ISSN":["0138-9130","1588-2861"],"issn-type":[{"value":"0138-9130","type":"print"},{"value":"1588-2861","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,3,30]]},"assertion":[{"value":"8 April 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 February 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 March 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no conflict of interest.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}