{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T08:21:55Z","timestamp":1775550115421,"version":"3.50.1"},"reference-count":32,"publisher":"MIT Press","issue":"2","license":[{"start":{"date-parts":[[2021,2,17]],"date-time":"2021-02-17T00:00:00Z","timestamp":1613520000000},"content-version":"vor","delay-in-days":413,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2020,6,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>There are many different relatedness measures, based for instance on citation relations or textual similarity, that can be used to cluster scientific publications. We propose a principled methodology for evaluating the accuracy of clustering solutions obtained using these relatedness measures. We formally show that the proposed methodology has an important consistency property. The empirical analyses that we present are based on publications in the fields of cell biology, condensed matter physics, and economics. Using the BM25 text-based relatedness measure as the evaluation criterion, we find that bibliographic coupling relations yield more accurate clustering solutions than direct citation relations and cocitation relations. The so-called extended direct citation approach performs similarly to or slightly better than bibliographic coupling in terms of the accuracy of the resulting clustering solutions. The other way around, using a citation-based relatedness measure as evaluation criterion, BM25 turns out to yield more accurate clustering solutions than other text-based relatedness measures.<\/jats:p>","DOI":"10.1162\/qss_a_00035","type":"journal-article","created":{"date-parts":[[2020,3,25]],"date-time":"2020-03-25T13:06:31Z","timestamp":1585141591000},"page":"691-713","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":56,"title":["A principled methodology for comparing relatedness measures for clustering publications"],"prefix":"10.1162","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8249-1752","authenticated-orcid":false,"given":"Ludo","family":"Waltman","sequence":"first","affiliation":[{"name":"Centre for Science and Technology Studies, Leiden University, The Netherlands"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7814-8951","authenticated-orcid":false,"given":"Kevin W.","family":"Boyack","sequence":"additional","affiliation":[{"name":"SciTech Strategies, Inc., Albuquerque, NM, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9806-084X","authenticated-orcid":false,"given":"Giovanni","family":"Colavizza","sequence":"additional","affiliation":[{"name":"University of Amsterdam, The Netherlands"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8448-4521","authenticated-orcid":false,"given":"Nees Jan","family":"van Eck","sequence":"additional","affiliation":[{"name":"Centre for Science and Technology Studies, Leiden University, The Netherlands"}]}],"member":"281","published-online":{"date-parts":[[2020,6,1]]},"reference":[{"key":"2025073014023620400_bib1","doi-asserted-by":"crossref","unstructured":"Bae,  S.-H., Halperin,  D., West,  J. D., Rosvall,  M., & Howe,  B. (2017). Scalable and efficient flow-based community detection for large-scale graph analysis. ACM Transactions on Knowledge Discovery from Data, 11(3), 32.","DOI":"10.1145\/2992785"},{"key":"2025073014023620400_bib2","doi-asserted-by":"crossref","unstructured":"Blondel,  V. D., Guillaume,  J.-L., Lambiotte,  R., & Lefebvre,  E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 10, P10008.","DOI":"10.1088\/1742-5468\/2008\/10\/P10008"},{"key":"2025073014023620400_bib3","doi-asserted-by":"crossref","unstructured":"Boyack,  K. W., & Klavans,  R. (2010). Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately?Journal of the American Society for Information Science and Technology, 61(12), 2389\u20132404.","DOI":"10.1002\/asi.21419"},{"key":"2025073014023620400_bib4","doi-asserted-by":"crossref","unstructured":"Boyack,  K. W., & Klavans,  R. (2014). Including cited non-source items in a large-scale map of science: What difference does it make?Journal of Informetrics, 8(3), 569\u2013580.","DOI":"10.1016\/j.joi.2014.04.001"},{"key":"2025073014023620400_bib5","unstructured":"Boyack,  K. W., & Klavans,  R. (2018). Accurately identifying topics using text: Mapping PubMed. In R.Costas, T.Franssen, & A.Yegros-Yegros (Eds.), Proceedings of the 23rd International Conference on Science and Technology Indicators, pp. 107\u2013115. Leiden, the Netherlands."},{"key":"2025073014023620400_bib6","doi-asserted-by":"crossref","unstructured":"Boyack,  K. W., Newman,  D., Duhon,  R. J., Klavans,  R., Patek,  M., Biberstine,  J. R., \u2026 & B\u00f6rner,  K. (2011). Clustering more than two million biomedical publications: Comparing the accuracies of nine text-based similarity approaches. PLOS ONE, 6(3), e18029.","DOI":"10.1371\/journal.pone.0018029"},{"key":"2025073014023620400_bib7","doi-asserted-by":"crossref","unstructured":"Boyack,  K. W., Small,  H., & Klavans,  R. (2013). Improving the accuracy of co-citation clustering using full text. Journal of the American Society for Information Science and Technology, 64(9), 1759\u20131767.","DOI":"10.1002\/asi.22896"},{"key":"2025073014023620400_bib8","doi-asserted-by":"crossref","unstructured":"Fortunato,  S.\n           (2010). Community detection in graphs. Physics Reports, 486(3\u20135), 75\u2013174.","DOI":"10.1016\/j.physrep.2009.11.002"},{"key":"2025073014023620400_bib9","doi-asserted-by":"crossref","unstructured":"Fortunato,  S., & Barth\u00e9lemy,  M. (2007). Resolution limit in community detection. Proceedings of the National Academy of Sciences of the United States of America, 104(1), 36\u201341.","DOI":"10.1073\/pnas.0605965104"},{"key":"2025073014023620400_bib10","doi-asserted-by":"crossref","unstructured":"Gl\u00e4ser,  J., Scharnhorst,  A., & Gl\u00e4nzel,  W. (2017). Same data\u2014different results? Towards a comparative approach to the identification of thematic structures in science. Scientometrics, 111(2), 981\u2013998.","DOI":"10.1007\/s11192-017-2296-z"},{"key":"2025073014023620400_bib11","doi-asserted-by":"crossref","unstructured":"Haunschild,  R., Schier,  H., Marx,  W., & Bornmann,  L. (2018). Algorithmically generated subject categories based on citation relations: An empirical micro study using papers on overall water splitting. Journal of Informetrics, 12(2), 436\u2013447.","DOI":"10.1016\/j.joi.2018.03.004"},{"key":"2025073014023620400_bib12","doi-asserted-by":"crossref","unstructured":"Klavans,  R., & Boyack,  K. W. (2017). Which type of citation analysis generates the most accurate taxonomy of scientific and technical knowledge?Journal of the Association for Information Science and Technology, 68(4), 984\u2013998.","DOI":"10.1002\/asi.23734"},{"key":"2025073014023620400_bib13","doi-asserted-by":"crossref","unstructured":"Li,  Y., & Ruiz-Castillo,  J. (2013). The comparison of normalization procedures based on different classification systems. Journal of Informetrics, 7(4), 945\u2013958.","DOI":"10.1016\/j.joi.2013.09.005"},{"key":"2025073014023620400_bib14","doi-asserted-by":"crossref","unstructured":"Newman,  M. E. J.\n           (2004). Fast algorithm for detecting community structure in networks. Physical Review E, 69(6), 066133.","DOI":"10.1103\/PhysRevE.69.066133"},{"key":"2025073014023620400_bib15","doi-asserted-by":"crossref","unstructured":"Newman,  M. E. J., & Girvan,  M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113.","DOI":"10.1103\/PhysRevE.69.026113"},{"key":"2025073014023620400_bib16","doi-asserted-by":"crossref","unstructured":"Ozaki,  N., Tezuka,  H., & Inaba,  M. (2016). A simple acceleration method for the Louvain algorithm. International Journal of Computer and Electrical Engineering, 8(3), 207\u2013218.","DOI":"10.17706\/IJCEE.2016.8.3.207-218"},{"key":"2025073014023620400_bib17","doi-asserted-by":"crossref","unstructured":"Perianes-Rodriguez,  A., & Ruiz-Castillo,  J. (2017). A comparison of the Web of Science and publication-level classification systems of science. Journal of Informetrics, 11(1), 32\u201345.","DOI":"10.1016\/j.joi.2016.10.007"},{"key":"2025073014023620400_bib18","doi-asserted-by":"crossref","unstructured":"Perianes-Rodriguez,  A., Waltman,  L., & Van Eck,  N. J. (2016). Constructing bibliometric networks: A comparison between full and fractional counting. Journal of Informetrics, 10(4), 1178\u20131195.","DOI":"10.1016\/j.joi.2016.10.006"},{"key":"2025073014023620400_bib19","doi-asserted-by":"crossref","unstructured":"Persson,  O.\n           (2010). Identifying research themes with weighted direct citation links. Journal of Informetrics, 4(3), 415\u2013422.","DOI":"10.1016\/j.joi.2010.03.006"},{"key":"2025073014023620400_bib20","doi-asserted-by":"crossref","unstructured":"Ruiz-Castillo,  J., & Waltman,  L. (2015). Field-normalized citation impact indicators using algorithmically constructed classification systems of science. Journal of Informetrics, 9(1), 102\u2013117.","DOI":"10.1016\/j.joi.2014.11.010"},{"key":"2025073014023620400_bib21","doi-asserted-by":"crossref","unstructured":"Sj\u00f6g\u00e5rde,  P., & Ahlgren,  P. (2018). Granularity of algorithmically constructed publication-level classifications of research publications: Identification of topics. Journal of Informetrics, 12(1), 133\u2013152.","DOI":"10.1016\/j.joi.2017.12.006"},{"key":"2025073014023620400_bib22","doi-asserted-by":"crossref","unstructured":"Sj\u00f6g\u00e5rde,  P., & Ahlgren,  P. (2020). Granularity of algorithmically constructed publication-level classifications of research publications: Identification of specialties. Quantitative Science Studies, 1(1), 207\u2013238.","DOI":"10.1162\/qss_a_00004"},{"key":"2025073014023620400_bib23","doi-asserted-by":"crossref","unstructured":"Small,  H.\n           (1997). Update on science mapping: Creating large document spaces. Scientometrics, 38(2), 275\u2013293.","DOI":"10.1007\/BF02457414"},{"key":"2025073014023620400_bib24","doi-asserted-by":"crossref","unstructured":"Small,  H., Boyack,  K. W., Klavans,  R. (2014). Identifying emerging topics in science and technology. Research Policy, 43(8), 1450\u20131467.","DOI":"10.1016\/j.respol.2014.02.005"},{"key":"2025073014023620400_bib25","doi-asserted-by":"crossref","unstructured":"Sparck Jones,  K., Walker,  S., & Robertson,  S. E. (2000a). A probabilistic model of information retrieval: Development and comparative experiments: Part 1. Information Processing and Management, 36(6), 779\u2013808.","DOI":"10.1016\/S0306-4573(00)00015-7"},{"key":"2025073014023620400_bib26","doi-asserted-by":"crossref","unstructured":"Sparck Jones,  K., Walker,  S., & Robertson,  S. E. (2000b). A probabilistic model of information retrieval: Development and comparative experiments: Part 2. Information Processing and Management, 36(6), 809\u2013840.","DOI":"10.1016\/S0306-4573(00)00016-9"},{"key":"2025073014023620400_bib27","doi-asserted-by":"crossref","unstructured":"\u0160ubelj,  L., Van Eck,  N. J., & Waltman,  L. (2016). Clustering scientific publications based on citation relations: A systematic comparison of different methods. PLOS ONE, 11(4), e0154404.","DOI":"10.1371\/journal.pone.0154404"},{"key":"2025073014023620400_bib28","doi-asserted-by":"crossref","unstructured":"Traag,  V. A., Van Dooren,  P., & Nesterov,  Y. (2011). Narrow scope for resolution-limit-free community detection. Physical Review E, 84(1), 016114.","DOI":"10.1103\/PhysRevE.84.016114"},{"key":"2025073014023620400_bib29","doi-asserted-by":"crossref","unstructured":"Traag,  V. A., Waltman,  L., & Van Eck,  N. J. (2019). From Louvain to Leiden: Guaranteeing well-connected communities. Scientific Reports, 9, 5233.","DOI":"10.1038\/s41598-019-41695-z"},{"key":"2025073014023620400_bib30","doi-asserted-by":"crossref","unstructured":"Van Eck,  N. J., & Waltman,  L. (2014). CitNetExplorer: A new software tool for analyzing and visualizing citation networks. Journal of Informetrics, 8(4), 802\u2013823.","DOI":"10.1016\/j.joi.2014.07.006"},{"key":"2025073014023620400_bib31","doi-asserted-by":"crossref","unstructured":"Waltman,  L., & Van Eck,  N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63(12), 2378\u20132392.","DOI":"10.1002\/asi.22748"},{"key":"2025073014023620400_bib32","doi-asserted-by":"crossref","unstructured":"Waltman,  L., & Van Eck,  N. J. (2013). A smart local moving algorithm for large-scale modularity-based community detection. European Physical Journal B, 86(11), 471.","DOI":"10.1140\/epjb\/e2013-40829-0"}],"container-title":["Quantitative Science Studies"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/qss\/article-pdf\/1\/2\/691\/1885783\/qss_a_00035.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/qss\/article-pdf\/1\/2\/691\/1885783\/qss_a_00035.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,30]],"date-time":"2025-07-30T18:03:13Z","timestamp":1753898593000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/qss\/article\/1\/2\/691\/96143\/A-principled-methodology-for-comparing-relatedness"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020]]},"references-count":32,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2020,6,1]]}},"URL":"https:\/\/doi.org\/10.1162\/qss_a_00035","relation":{},"ISSN":["2641-3337"],"issn-type":[{"value":"2641-3337","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020]]},"published":{"date-parts":[[2020]]}}}