{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,1]],"date-time":"2026-06-01T19:46:54Z","timestamp":1780343214153,"version":"3.54.1"},"reference-count":53,"publisher":"MDPI AG","issue":"5","license":[{"start":{"date-parts":[[2023,5,2]],"date-time":"2023-05-02T00:00:00Z","timestamp":1682985600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001774","name":"University of Sydney","doi-asserted-by":"publisher","award":["DVC-R"],"award-info":[{"award-number":["DVC-R"]}],"id":[{"id":"10.13039\/501100001774","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Quantifying the dissimilarity of two texts is an important aspect of a number of natural language processing tasks, including semantic information retrieval, topic classification, and document clustering. In this paper, we compared the properties and performance of different dissimilarity measures D using three different representations of texts\u2014vocabularies, word frequency distributions, and vector embeddings\u2014and three simple tasks\u2014clustering texts by author, subject, and time period. Using the Project Gutenberg database, we found that the generalised Jensen\u2013Shannon divergence applied to word frequencies performed strongly across all tasks, that D\u2019s based on vector embedding representations led to stronger performance for smaller texts, and that the optimal choice of approach was ultimately task-dependent. We also investigated, both analytically and numerically, the behaviour of the different D\u2019s when the two texts varied in length by a factor h. 
We demonstrated that the (natural) estimator of the Jaccard distance between vocabularies was inconsistent and computed explicitly the h-dependency of the bias of the estimator of the generalised Jensen\u2013Shannon divergence applied to word frequencies. We also found numerically that the Jensen\u2013Shannon divergence and embedding-based approaches were robust to changes in h, while the Jaccard distance was not.<\/jats:p>","DOI":"10.3390\/info14050271","type":"journal-article","created":{"date-parts":[[2023,5,3]],"date-time":"2023-05-03T01:36:38Z","timestamp":1683077798000},"page":"271","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Quantifying the Dissimilarity of Texts"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8977-8198","authenticated-orcid":false,"given":"Benjamin","family":"Shade","sequence":"first","affiliation":[{"name":"School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1932-3710","authenticated-orcid":false,"given":"Eduardo G.","family":"Altmann","sequence":"additional","affiliation":[{"name":"School of Mathematics and Statistics, The University of Sydney, Sydney, NSW 2006, Australia"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2023,5,2]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Pham, H., Luong, M.T., and Manning, C.D. (2015, January 5). Learning distributed representations for multilingual text sequences. Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, CO, USA.","DOI":"10.3115\/v1\/W15-1512"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Landauer, T.K., McNamara, D.S., Dennis, S., and Kintsch, W. (2007). 
Handbook of Latent Semantic Analysis, Psychology Press.","DOI":"10.4324\/9780203936399"},{"key":"ref_3","unstructured":"Jiang, N., and de Marneffe, M.C. (August, January 28). Do you know that Florence is packed with visitors? Evaluating state-of-the-art models of speaker commitment. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D.F., and Chao, L.S. (2019). Learning deep transformer models for machine translation. arXiv.","DOI":"10.18653\/v1\/P19-1176"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Taghva, K., and Veni, R. (2010, January 12\u201314). Effects of Similarity Metrics on Document Clustering. Proceedings of the Seventh International Conference on Information Technology: New Generations, Las Vegas, NV, USA.","DOI":"10.1109\/ITNG.2010.65"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"3751","DOI":"10.1007\/s11192-022-04415-5","article-title":"Combining dissimilarity measures for quantifying changes in research fields","volume":"127","author":"Zheng","year":"2022","journal-title":"Scientometrics"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"171545","DOI":"10.1098\/rsos.171545","article-title":"Using text analysis to quantify the similarity and evolution of scientific disciplines","volume":"5","author":"Dias","year":"2018","journal-title":"R. Soc. Open Sci."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"600","DOI":"10.1177\/0165551520975359","article-title":"Influence and performance of user similarity metrics in followee prediction","volume":"48","author":"Tommasel","year":"2022","journal-title":"J. Inf. 
Sci."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"570","DOI":"10.1016\/j.pisc.2016.06.023","article-title":"Clustering of people in social network based on textual similarity","volume":"8","author":"Singh","year":"2016","journal-title":"Perspect. Sci."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"681","DOI":"10.1016\/j.knosys.2015.09.008","article-title":"Predicting individual retweet behavior by user similarity: A multi-task learning approach","volume":"89","author":"Tang","year":"2015","journal-title":"Know. Based Syst."},{"key":"ref_11","first-page":"13","article-title":"A Survey of Text Similarity Approaches","volume":"68","author":"Gomaa","year":"2013","journal-title":"Int. J. Comput. Appl."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Wang, J., and Dong, Y. (2020). Measurement of Text Similarity: A Survey. Information, 11.","DOI":"10.3390\/info11090421"},{"key":"ref_13","first-page":"19","article-title":"A Survey on Similarity Measures in Text Mining","volume":"3","author":"Vijaymeena","year":"2016","journal-title":"Mach. Learn. Appl."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"4699","DOI":"10.1007\/s00500-020-05479-2","article-title":"Short text similarity measurement methods: A review","volume":"25","author":"Prakoso","year":"2021","journal-title":"Soft Comput."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Magalh\u00e3es, D., Pozo, A., and Santana, R. (2019, January 15\u201318). An empirical comparison of distance\/similarity measures for Natural Language Processing. Proceedings of the National Meeting of Artificial and Computational Intelligence, Salvador, Brazil.","DOI":"10.5753\/eniac.2019.9328"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Boukhatem, N.M., Buscaldi, D., and Liberti, L. (2022, January 5\u20138). Empirical Comparison of Semantic Similarity Measures for Technical Question Answering. 
Proceedings of the European Conference on Advances in Databases and Information Systems, Turin, Italy.","DOI":"10.1007\/978-3-031-15743-1_16"},{"key":"ref_17","unstructured":"Upadhyay, A., Bhatnagar, A., Bhavsar, N., Singh, M., and Motlicek, P. (2022, January 20\u201322). An Empirical Comparison of Semantic Similarity Methods for Analyzing down-streaming Automatic Minuting task. Proceedings of the Pacific Asia Conference on Language, Information and Computation, Virtual."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Al-Anazi, S., AlMahmoud, H., and Al-Turaiki, I. (2016, January 30). Finding similar documents using different clustering techniques. Proceedings of the Symposium on Data Mining Applications, Riyadh, Saudi Arabia.","DOI":"10.1016\/j.procs.2016.04.005"},{"key":"ref_19","unstructured":"Webb, A.R. (2003). Statistical Pattern Recognition, John Wiley & Sons, Inc.. [2nd ed.]."},{"key":"ref_20","first-page":"993","article-title":"Latent Dirichlet Allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1016\/0306-4573(88)90021-0","article-title":"Term-weighting approaches in automatic text retrieval","volume":"24","author":"Salton","year":"1988","journal-title":"Inf. Process. Manag."},{"key":"ref_22","unstructured":"Herdan, G. (1960). Type-Token Mathematics: A Textbook of Mathematical Linguistics, Mouton & Co.. [1st ed.]."},{"key":"ref_23","first-page":"1137","article-title":"A neural probabilistic language model","volume":"3","author":"Bengio","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_24","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C. (2014, January 25\u201329). GloVe: Global Vectors for Word Representation. 
Proceedings of the Conference on Empirical Methods in Natural Language Processing, Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_26","unstructured":"Le, Q., and Mikolov, T. (2014, January 21\u201326). Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, Beijing, China."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Hill, F., Cho, K., and Korhonen, A. (2016, January 12\u201317). Learning Distributed Representations of Sentences from Unlabelled Data. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA.","DOI":"10.18653\/v1\/N16-1162"},{"key":"ref_28","unstructured":"Wu, L., Yen, I.E., Xu, K., Xu, F., Balakrishnan, A., Chen, P.Y., Ravikumar, P., and Witbrock, M.J. (November, January 31). Word Mover\u2019s Embedding: From Word2Vec to Document Embedding. Proceedings of the Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium."},{"key":"ref_29","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, January 4\u20139). Attention is All you Need. Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_30","unstructured":"Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA."},{"key":"ref_31","unstructured":"Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training, OpenAI. Technical Report."},{"key":"ref_32","unstructured":"Shade, B.
(2022, February 14). Repository with the Code Used in This Paper; Static Version in Zenodo; Dynamic Version in Github; 2023. Available online: https:\/\/doi.org\/10.5281\/zenodo.7861675; https:\/\/github.com\/benjaminshade\/quantifying-dissimilarity."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1109\/18.61115","article-title":"Divergence measures based on the Shannon entropy","volume":"37","author":"Lin","year":"1991","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_34","unstructured":"Cover, T.M., and Thomas, J.A. (2005). Elements of Information Theory, John Wiley & Sons, Inc.. [1st ed.]."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"1858","DOI":"10.1109\/TIT.2003.813506","article-title":"A new metric for probability distributions","volume":"49","author":"Endres","year":"2003","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_36","first-page":"021109","article-title":"Similarity of Symbol Frequency Distributions with Heavy Tails","volume":"6","author":"Gerlach","year":"2016","journal-title":"Phys. Rev. X"},{"key":"ref_37","first-page":"30","article-title":"Quantification Method of Classification Processes. Concept of Structural \u03b1-Entropy","volume":"3","author":"Havrda","year":"1967","journal-title":"Kybernetika"},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"489","DOI":"10.1109\/TIT.1982.1056497","article-title":"On the convexity of some divergence measures based on entropy functions","volume":"28","author":"Burbea","year":"1982","journal-title":"IEEE Trans. Inf. Theory"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"052311","DOI":"10.1103\/PhysRevA.79.052311","article-title":"Properties of classical and quantum Jensen-Shannon divergence","volume":"79","year":"2009","journal-title":"Phys. Rev. A"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Reimers, N., and Gurevych, I. (2019, January 3\u20137). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. 
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China.","DOI":"10.18653\/v1\/D19-1410"},{"key":"ref_41","unstructured":"Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. (2020, January 6\u201312). MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. Proceedings of the 34th Conference on Neural Information Processing Systems, Virtual."},{"key":"ref_42","unstructured":"Beltagy, I., Peters, M.E., and Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv."},{"key":"ref_43","unstructured":"(2022, February 08). Project Gutenberg. Available online: https:\/\/www.gutenberg.org."},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Gerlach, M., and Font-Clos, F. (2020). A Standardized Project Gutenberg Corpus for Statistical Analysis of Natural Language and Quantitative Linguistics. Entropy, 22.","DOI":"10.3390\/e22010126"},{"key":"ref_45","unstructured":"Gerlach, M., and Font-Clos, F. (2022, February 14). Standardized Project Gutenberg Corpus, Github Repository. Available online: https:\/\/github.com\/pgcorpus\/gutenberg."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"014002","DOI":"10.1088\/1742-5468\/aa53f5","article-title":"Generalized entropies and the similarity of texts","volume":"2017","author":"Altmann","year":"2017","journal-title":"J. Stat. Mech. Theory Exp."},{"key":"ref_47","unstructured":"Gerlach, M. (2015). Universality and Variability in the Statistics of Data with Fat-Tailed Distributions: The Case of Word Frequencies in Natural Languages. [Ph.D. Thesis, Max Planck Institute for the Physics of Complex Systems]."},{"key":"ref_48","first-page":"021006","article-title":"Stochastic Model for the Vocabulary Growth in Natural Languages","volume":"3","author":"Gerlach","year":"2013","journal-title":"Phys. Rev. 
X"},{"key":"ref_49","unstructured":"Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O\u2019Reilly Media, Inc.. [1st ed.]."},{"key":"ref_50","unstructured":"Egloff, M., Adamou, A., and Picca, D. (2020, January 2). Enabling Ontology-Based Data Access to Project Gutenberg. Proceedings of the Third Workshop on Humanities in the Semantic Web, Heraklion, Crete, Greece."},{"key":"ref_51","doi-asserted-by":"crossref","first-page":"50","DOI":"10.1214\/aoms\/1177730491","article-title":"On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other","volume":"18","author":"Mann","year":"1947","journal-title":"Ann. Math. Stat."},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"2145","DOI":"10.1256\/003590002320603584","article-title":"Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation","volume":"128","author":"Mason","year":"2002","journal-title":"Q. J. R. Meteorol. 
Soc."},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"29","DOI":"10.1148\/radiology.143.1.7063747","article-title":"The meaning and use of the area under a receiver operating characteristic (ROC) curve","volume":"143","author":"Hanley","year":"1982","journal-title":"Radiology"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/14\/5\/271\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T19:28:17Z","timestamp":1760124497000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/14\/5\/271"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,2]]},"references-count":53,"journal-issue":{"issue":"5","published-online":{"date-parts":[[2023,5]]}},"alternative-id":["info14050271"],"URL":"https:\/\/doi.org\/10.3390\/info14050271","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,5,2]]}}}