{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,3]],"date-time":"2026-07-03T16:27:07Z","timestamp":1783096027405,"version":"3.54.6"},"reference-count":63,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2020,8,31]],"date-time":"2020-08-31T00:00:00Z","timestamp":1598832000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Information"],"abstract":"<jats:p>Text similarity measurement is the basis of natural language processing tasks, which play an important role in information retrieval, automatic question answering, machine translation, dialogue systems, and document matching. This paper systematically combs the research status of similarity measurement, analyzes the advantages and disadvantages of current methods, develops a more comprehensive classification description system of text similarity measurement algorithms, and summarizes the future development direction. With the aim of providing reference for related research and application, the text similarity measurement method is described by two aspects: text distance and text representation. The text distance can be divided into length distance, distribution distance, and semantic distance; text representation is divided into string-based, corpus-based, single-semantic text, multi-semantic text, and graph-structure-based representation. 
Finally, the development of text similarity is also summarized in the discussion section.<\/jats:p>","DOI":"10.3390\/info11090421","type":"journal-article","created":{"date-parts":[[2020,8,31]],"date-time":"2020-08-31T08:11:19Z","timestamp":1598861479000},"page":"421","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":225,"title":["Measurement of Text Similarity: A Survey"],"prefix":"10.3390","volume":"11","author":[{"given":"Jiapeng","family":"Wang","sequence":"first","affiliation":[{"name":"Computer Engineering Department, Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo 315211, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Yihong","family":"Dong","sequence":"additional","affiliation":[{"name":"Computer Engineering Department, Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo 315211, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"1968","published-online":{"date-parts":[[2020,8,31]]},"reference":[{"key":"ref_1","unstructured":"Lin, D. (1998, January 24\u201327). An information-theoretic definition of similarity. Proceedings of the International Conference on Machine Learning, Madison, WI, USA."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"343","DOI":"10.1561\/1500000035","article-title":"Semantic matching in search","volume":"7","author":"Li","year":"2014","journal-title":"Found. Trends Inf. Retr."},{"key":"ref_3","unstructured":"Jiang, N., and de Marneffe, M.C. (August, January 28). Do you know that Florence is packed with visitors? Evaluating state-of-the-art models of speaker commitment. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D.F., and Chao, L.S. (2019). 
Learning deep transformer models for machine translation. arXiv.","DOI":"10.18653\/v1\/P19-1176"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Serban, I.V., Sordoni, A., Bengio, Y., Courville, A., and Pineau, J. (2016, January 12\u201317). Building end-to-end dialogue systems using generative hierarchical neural network models. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA.","DOI":"10.1609\/aaai.v30i1.9883"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Pham, H., Luong, M.T., and Manning, C.D. (2015, January 5). Learning distributed representations for multilingual text sequences. Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, CO, USA.","DOI":"10.3115\/v1\/W15-1512"},{"key":"ref_7","first-page":"13","article-title":"A survey of text similarity approaches","volume":"68","author":"Gomaa","year":"2013","journal-title":"Int. J. Comput. Appl."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Deza, M.M., and Deza, E. (2009). Encyclopedia of distances. Encyclopedia of Distances, Springer.","DOI":"10.1007\/978-3-642-00234-2"},{"key":"ref_9","unstructured":"Norouzi, M., Fleet, D.J., and Salakhutdinov, R.R. (2012, January 3\u20136). Hamming distance metric learning. Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_10","unstructured":"Manning, C.D., and Sch\u00fctze, H. (1999). Foundations of Statistical Natural Language Processing, MIT Press."},{"key":"ref_11","unstructured":"Nielsen, F. (2010). A family of statistical symmetric divergences based on Jensen\u2019s inequality. arXiv."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1214\/aoms\/1177729694","article-title":"On information and sufficiency","volume":"22","author":"Kullback","year":"1951","journal-title":"Ann. Math. Stat."},{"key":"ref_13","unstructured":"Weng, L. (2019). 
From GAN to WGAN. arXiv."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"784","DOI":"10.1137\/1118101","article-title":"Calculation of the Wasserstein distance between probability distributions on the line","volume":"18","author":"Vallender","year":"1974","journal-title":"Theory Probab. Appl."},{"key":"ref_15","unstructured":"Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. (2015, January 6\u201311). From word embeddings to document distances. Proceedings of the International Conference on Machine Learning, Lille, France."},{"key":"ref_16","unstructured":"Andoni, A., Indyk, P., and Krauthgamer, R. (2008, January 20\u201322). Earth mover distance over high-dimensional spaces. Proceedings of the Symposium on Discrete Algorithms, San Francisco, CA, USA."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wu, L., Yen, I.E., Xu, K., Xu, F., Balakrishnan, A., Chen, P.Y., Ravikumar, P., and Witbrock, M.J. (2018). Word mover\u2019s embedding: From word2vec to document embedding. arXiv.","DOI":"10.18653\/v1\/D18-1482"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/S0169-7439(99)00047-7","article-title":"The mahalanobis distance","volume":"50","author":"Massart","year":"2000","journal-title":"Chemom. Intell. Lab. Syst."},{"key":"ref_19","unstructured":"Huang, G., Guo, C., Kusner, M.J., Sun, Y., Sha, F., and Weinberger, K.Q. (2016, January 5\u201310). Supervised word mover\u2019s distance. Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"350","DOI":"10.1145\/359581.359603","article-title":"A fast algorithm for computing longest common subsequences","volume":"20","author":"Hunt","year":"1977","journal-title":"Commun. 
ACM"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"173","DOI":"10.1016\/j.ipl.2003.07.001","article-title":"The constrained longest common subsequence problem","volume":"88","author":"Tsai","year":"2003","journal-title":"Inf. Process. Lett."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"13","DOI":"10.1016\/j.ipl.2007.09.008","article-title":"New efficient algorithms for the LCS and constrained LCS problems","volume":"106","author":"Iliopoulos","year":"2008","journal-title":"Inf. Process. Lett."},{"key":"ref_23","unstructured":"Irving, R.W., and Fraser, C.B. (May, January 29). Two algorithms for the longest common subsequence of three (or more) strings. Proceedings of the Annual Symposium on Combinatorial Pattern Matching, Tucson, AZ, USA."},{"key":"ref_24","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions, and reversals","volume":"10","author":"Levenshtein","year":"1966","journal-title":"Sov. Phys. Dokl."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"171","DOI":"10.1145\/363958.363994","article-title":"A technique for computer detection and correction of spelling errors","volume":"7","author":"Damerau","year":"1964","journal-title":"Commun. ACM"},{"key":"ref_26","unstructured":"Winkler, W.E. (2020, August 31). String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage, Available online: https:\/\/files.eric.ed.gov\/fulltext\/ED325505.pdf."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"297","DOI":"10.2307\/1932409","article-title":"Measures of the amount of ecologic association between species","volume":"26","author":"Dice","year":"1945","journal-title":"Ecology"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1111\/j.1469-8137.1912.tb05611.x","article-title":"The distribution of the flora in the alpine zone. 
1","volume":"11","author":"Jaccard","year":"1912","journal-title":"New Phytol."},{"key":"ref_29","unstructured":"Wang, S., and Manning, C.D. (2012, January 8\u201314). Baselines and bigrams: Simple, good sentiment and topic classification. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, Jeju Island, Korea."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1016\/0306-4573(88)90021-0","article-title":"Term-weighting approaches in automatic text retrieval","volume":"24","author":"Salton","year":"1988","journal-title":"Inf. Process. Manag."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Robertson, S.E., and Walker, S. (1994, January 3\u20136). Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. Proceedings of the International ACM Sigir Conference on Research and Development in Information Retrieval SIGIR\u201994, Dublin, Ireland.","DOI":"10.1007\/978-1-4471-2099-5_24"},{"key":"ref_32","unstructured":"Rong, X. (2014). word2vec parameter learning explained. arXiv."},{"key":"ref_33","unstructured":"Le, Q., and Mikolov, T. (2014, January 22\u201324). Distributed representations of sentences and documents. Proceedings of the International Conference on Machine Learning, Beijing, China."},{"key":"ref_34","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C.D. (2014, January 25\u201329). Glove: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_36","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. 
arXiv."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"391","DOI":"10.1002\/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9","article-title":"Indexing by latent semantic analysis","volume":"41","author":"Deerwester","year":"1990","journal-title":"J. Am. Soc. Inf. Sci."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1016\/j.ipm.2004.11.007","article-title":"A framework for understanding Latent Semantic Indexing (LSI) performance","volume":"42","author":"Kontostathis","year":"2006","journal-title":"Inf. Process. Manag."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1037\/0033-295X.104.2.211","article-title":"A solution to Plato\u2019s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge","volume":"104","author":"Landauer","year":"1997","journal-title":"Psychol. Rev."},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"259","DOI":"10.1080\/01638539809545028","article-title":"An introduction to latent semantic analysis","volume":"25","author":"Landauer","year":"1998","journal-title":"Discourse Process."},{"key":"ref_41","unstructured":"Grossman, D.A., and Frieder, O. (2012). Information Retrieval: Algorithms and Heuristics, Springer Science & Business Media."},{"key":"ref_42","unstructured":"Hofmann, T. (2013). Probabilistic latent semantic analysis. arXiv."},{"key":"ref_43","first-page":"993","article-title":"Latent dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_44","unstructured":"Wei, X., and Croft, W.B. (2016, January 6\u201311). LDA-based document models for ad-hoc retrieval. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, WA, USA."},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Sahami, M., and Heilman, T.D. (2006, January 23\u201326). 
A web-based kernel function for measuring the similarity of short text snippets. Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland, UK.","DOI":"10.1145\/1135777.1135834"},{"key":"ref_46","unstructured":"Li, Q., Wang, B., and Melucci, M. (2019). CNM: An Interpretable Complex-valued Network for Matching. arXiv."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Shen, Y., He, X., Gao, J., Deng, L., and Mesnil, G. (2014, January 3\u20137). A latent semantic model with convolutional-pooling structure for information retrieval. Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, Shanghai, China.","DOI":"10.1145\/2661829.2661935"},{"key":"ref_48","unstructured":"Huang, P.S., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. (November, January 27). Learning deep structured semantic models for web search using clickthrough data. Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, Burlingame, CA, USA."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Sak, H., Senior, A., and Beaufays, F. (2014). Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv.","DOI":"10.21437\/Interspeech.2014-80"},{"key":"ref_50","unstructured":"Hu, B., Lu, Z., Li, H., and Chen, Q. (2014, January 8\u201313). Convolutional neural network architectures for matching natural language sentences. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Wan, S., Lan, Y., Guo, J., Xu, J., Pang, L., and Cheng, X. (2016, January 12\u201317). A deep architecture for semantic matching with multiple positional sentence representations. 
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA.","DOI":"10.1609\/aaai.v30i1.10342"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Pang, L., Lan, Y., Guo, J., Xu, J., Wan, S., and Cheng, X. (2016, January 12\u201317). Text matching as image recognition. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA.","DOI":"10.1609\/aaai.v30i1.10341"},{"key":"ref_53","doi-asserted-by":"crossref","unstructured":"Liu, Z., Xiong, C., Sun, M., and Liu, Z. (2018). Entity-duet neural ranking: Understanding the role of knowledge graph semantics in neural information retrieval. arXiv.","DOI":"10.18653\/v1\/P18-1223"},{"key":"ref_54","doi-asserted-by":"crossref","first-page":"112948","DOI":"10.1016\/j.eswa.2019.112948","article-title":"A review: Knowledge reasoning over knowledge graph","volume":"141","author":"Chen","year":"2020","journal-title":"Expert Syst. Appl."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"72","DOI":"10.1109\/TKDE.2016.2610428","article-title":"Computing semantic similarity of concepts in knowledge graphs","volume":"29","author":"Zhu","year":"2016","journal-title":"IEEE Trans. Knowl. Data Eng."},{"key":"ref_56","unstructured":"Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., and Yakhnenko, O. (2013, January 5\u20138). Translating embeddings for modeling multi-relational data. Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_57","doi-asserted-by":"crossref","unstructured":"Dong, L., Wei, F., Zhou, M., and Xu, K. (2015, January 26\u201331). Question answering over freebase with multi-column convolutional neural networks. 
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China.","DOI":"10.3115\/v1\/P15-1026"},{"key":"ref_58","unstructured":"Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., and Dahl, G.E. (2017, January 6\u201311). Neural message passing for quantum chemistry. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia."},{"key":"ref_59","unstructured":"Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M. (2018). Graph neural networks: A review of methods and applications. arXiv."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Vashishth, S., Yadati, N., and Talukdar, P. (2020, January 5\u20137). Graph-based Deep Learning in Natural Language Processing. Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, Hyderabad, India.","DOI":"10.1145\/3371158.3371232"},{"key":"ref_61","doi-asserted-by":"crossref","unstructured":"Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Philip, S.Y. (2020). A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst.","DOI":"10.1109\/TNNLS.2020.2978386"},{"key":"ref_62","doi-asserted-by":"crossref","unstructured":"Sultan, M.A., Bethard, S., and Sumner, T. (2015, January 4\u20135). Dls@ cu: Sentence similarity from word alignment and semantic vector composition. Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, CO, USA.","DOI":"10.18653\/v1\/S15-2027"},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Liu, B., Guo, W., Niu, D., Wang, C., Xu, S., Lin, J., Lai, K., and Xu, Y. (2019, January 4\u20138). A User-Centered Concept Mining System for Query and Document Understanding at Tencent. 
Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA.","DOI":"10.1145\/3292500.3330727"}],"container-title":["Information"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2078-2489\/11\/9\/421\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T10:05:01Z","timestamp":1760177101000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2078-2489\/11\/9\/421"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,8,31]]},"references-count":63,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2020,9]]}},"alternative-id":["info11090421"],"URL":"https:\/\/doi.org\/10.3390\/info11090421","relation":{},"ISSN":["2078-2489"],"issn-type":[{"value":"2078-2489","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,8,31]]}}}