{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T04:35:00Z","timestamp":1760243700188,"version":"build-2065373602"},"reference-count":31,"publisher":"MDPI AG","issue":"11","license":[{"start":{"date-parts":[[2022,11,8]],"date-time":"2022-11-08T00:00:00Z","timestamp":1667865600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Committee of Science of Ministry of Education and Science of the Republic of Kazakhstan","award":["AP08856034","AP09259324","AP09058174"],"award-info":[{"award-number":["AP08856034","AP09259324","AP09058174"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computation"],"abstract":"<jats:p>The documents similarity metric is a substantial tool applied in areas such as determining topic in relation to documents, plagiarism detection, or problems necessary to capture the semantic, syntactic, or structural similarity of texts. Evaluated results of the similarity measure depend on the types of word represented and the problem statement and can be time-consuming. In this paper, we present a problem-independent algorithm of the similarity metric greedy texts similarity mapping (GTSM), which is computationally efficient to be applied for large datasets with any preferred word vectorization models. GTSM maps words in two texts based on a decision rule that evaluates word similarity and their importance to the texts. We compare it with the well-known word mover\u2019s distance (WMD) algorithm in the k-nearest neighbors text classification problem and find that it leads to similar or better results. In the correlation evaluation task of similarity measures with human-judged scores, we demonstrate its higher correlation scores in comparison with WMD and sentence mover\u2019s similarity (SMS) and show that GTSM is a decent alternative for both word-level and sentence-level tasks.<\/jats:p>","DOI":"10.3390\/computation10110200","type":"journal-article","created":{"date-parts":[[2022,11,8]],"date-time":"2022-11-08T07:00:42Z","timestamp":1667890842000},"page":"200","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Greedy Texts Similarity Mapping"],"prefix":"10.3390","volume":"10","author":[{"given":"Aliya","family":"Jangabylova","sequence":"first","affiliation":[{"name":"Institute of Information and Computational Technologies, Pushkin Str., 125, Almaty 050010, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2948-374X","authenticated-orcid":false,"given":"Alexander","family":"Krassovitskiy","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies, Pushkin Str., 125, Almaty 050010, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7283-5144","authenticated-orcid":false,"given":"Rustam","family":"Mussabayev","sequence":"additional","affiliation":[{"name":"Institute of Information and Computational Technologies, Pushkin Str., 125, Almaty 050010, Kazakhstan"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3853-8896","authenticated-orcid":false,"given":"Irina","family":"Ualiyeva","sequence":"additional","affiliation":[{"name":"Faculty of Information Technology, Al-Farabi Kazakh National University, 71 Al-Farabi Ave., Almaty 050040, Kazakhstan"}]}],"member":"1968","published-online":{"date-parts":[[2022,11,8]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"143","DOI":"10.1007\/s42044-022-00098-6","article-title":"Multi-level text document similarity estimation and its application for plagiarism detection","volume":"5","author":"Veisi","year":"2022","journal-title":"Iran J. Comput. Sci."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Arabi, H., and Akbari, M. (2022). Improving plagiarism detection in text document using hybrid weighted similarity. Expert Syst. Appl., 207.","DOI":"10.1016\/j.eswa.2022.118034"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Rout, R.R., Ghosh, S.K., Jana, P.K., Tripathy, A.K., Sahoo, J.P., and Li, K.C. (2022). Asymmetrically weighted cosine similarity measure for recommendation systems. Proceedings of the Advances in Distributed Computing and Machine Learning, Springer Nature Singapore.","DOI":"10.1007\/978-981-19-1018-0"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"433","DOI":"10.1007\/978-3-031-02156-5","article-title":"Semantic similarity from natural language and ontology analysis","volume":"8","author":"Harispe","year":"2015","journal-title":"Synthesis Lectures on Human Language Technologies"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Wang, J., and Dong, Y. (2020). Measurement of Text Similarity: A Survey. Information, 11.","DOI":"10.3390\/info11090421"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Gupta, D., Khanna, A., Bhattacharyya, S., Hassanien, A.E., Anand, S., and Jaiswal, A. (2023, January 17\u201318). A novel similarity measure for context-based search engine. Proceedings of the International Conference on Innovative Computing and Communications, New Delhi, India.","DOI":"10.1007\/978-981-19-2821-5"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"43","DOI":"10.1007\/s13042-010-0001-0","article-title":"Understanding bag-of-words model: A statistical framework","volume":"1","author":"Zhang","year":"2010","journal-title":"Int. J. Mach. Learn. Cybern."},{"key":"ref_8","unstructured":"Ramos, J. (2003, January 21\u201324). Using tf-idf to determine word relevance in document queries. Proceedings of the First Instructional Conference on Machine Learning, Piscataway, NJ, USA. Available online: https:\/\/www.researchgate.net\/file.PostFileLoader.html?id=587340a5dc332da8fc3aaae3&assetKey=AS%3A448525403201536%401483948197307."},{"key":"ref_9","unstructured":"Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C. (2014, January 26\u201328). Glove: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_11","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA."},{"key":"ref_12","unstructured":"Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. (2015, January 6\u201311). From word embedding to document distances. Proceedings of the 32nd International Conference on Machine Learning, Lille, France."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Wei, C., Wang, B., and Kuo, C.C.J. (2022). SynWMD: Syntax-aware Word Mover\u2019s Distance for Sentence Similarity Evaluation. arXiv.","DOI":"10.2139\/ssrn.4145635"},{"key":"ref_14","unstructured":"Clark, E., Celikyilmaz, A., and Smith, N.A. (August, January 28). Sentence mover\u2019s similarity: Automatic evaluation for multi-sentence texts. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"99","DOI":"10.1023\/A:1026543900054","article-title":"The earth mover\u2019s distance as a metric for image retrieval","volume":"40","author":"Rubner","year":"2000","journal-title":"Int. J. Comput. Vis."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"840","DOI":"10.1109\/TPAMI.2007.1058","article-title":"An efficient earth mover\u2019s distance algorithm for robust histogram comparison","volume":"29","author":"Ling","year":"2007","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"301","DOI":"10.1109\/TDSC.2006.50","article-title":"Detecting phishing web pages with visual similarity assessment based on earth mover\u2019s distance (EMD)","volume":"3","author":"Fu","year":"2006","journal-title":"IEEE Trans. Dependable Secur. Comput."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Kilickaya, M., Erdem, A., Ikizler-Cinbis, N., and Erdem, E. (2016). Re-evaluating automatic metrics for image captioning. arXiv.","DOI":"10.18653\/v1\/E17-1019"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"1616","DOI":"10.1002\/asi.20335","article-title":"Co-occurrence matrices and their applications in information science: Extending ACA to the Web environment","volume":"57","author":"Leydesdorff","year":"2006","journal-title":"J. Am. Soc. Inf. Sci. Technol."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1162\/tacl_a_00134","article-title":"Improving Distributional Similarity with Lessons Learned from Word Embeddings","volume":"3","author":"Levy","year":"2015","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"919","DOI":"10.1016\/j.ipm.2003.10.006","article-title":"Centroid-based summarization of multiple documents","volume":"40","author":"Radev","year":"2004","journal-title":"Inf. Process. Manag."},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"513","DOI":"10.1016\/0306-4573(88)90021-0","article-title":"Term-weighting approaches in automatic text retrieval","volume":"24","author":"Salton","year":"1988","journal-title":"Inf. Process. Manag."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"305","DOI":"10.1007\/s00799-015-0156-0","article-title":"Paper recommender systems: A literature survey","volume":"17","author":"Beel","year":"2016","journal-title":"Int. J. Digit. Libr."},{"key":"ref_24","first-page":"993","article-title":"Latent dirichlet allocation","volume":"3","author":"Blei","year":"2003","journal-title":"J. Mach. Learn. Res."},{"key":"ref_25","first-page":"3735","article-title":"Bayesopt: A bayesian optimization library for nonlinear optimization, experimental design and bandits","volume":"15","year":"2014","journal-title":"J. Mach. Learn. Res."},{"key":"ref_26","unstructured":"Alammar, J. (2022, October 01). The Illustrated Transformer. Available online: http:\/\/jalammar.github.io\/illustrated-bert\/."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Greene, D., and Cunningham, P. (2006, January 25\u201329). Practical solutions to the problem of diagonal dominance in kernel document clustering. Proceedings of the 23rd International Conference on Machine Learning (ICML\u201906), Baltimore, MD, USA.","DOI":"10.1145\/1143844.1143892"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Zhuang, Y., Xie, J., Zheng, Y., and Zhu, X. (November, January 31). Quantifying context overlap for training word embeddings. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.","DOI":"10.18653\/v1\/D18-1057"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Myers, L., and Sirois, M.J. (2004). Spearman correlation coefficients, differences between. Encycl. Stat. Sci., 12.","DOI":"10.1002\/0471667196.ess5050"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Aggarwal, C.C., Hinneburg, A., and Keim, D.A. (2001, January 4\u20136). On the surprising behavior of distance metrics in high dimensional space. Proceedings of the International Conference on Database Theory, London, UK.","DOI":"10.1007\/3-540-44503-X_27"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Ibrahim, O.A., and Landa-Silva, D. (2014, January 8\u201310). A new weighting scheme and discriminative approach for information retrieval in static and dynamic document collections. Proceedings of the 2014 14th UK Workshop on Computational Intelligence (UKCI), Bradford, UK.","DOI":"10.1109\/UKCI.2014.6930160"}],"container-title":["Computation"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2079-3197\/10\/11\/200\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:12:21Z","timestamp":1760145141000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2079-3197\/10\/11\/200"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,11,8]]},"references-count":31,"journal-issue":{"issue":"11","published-online":{"date-parts":[[2022,11]]}},"alternative-id":["computation10110200"],"URL":"https:\/\/doi.org\/10.3390\/computation10110200","relation":{},"ISSN":["2079-3197"],"issn-type":[{"type":"electronic","value":"2079-3197"}],"subject":[],"published":{"date-parts":[[2022,11,8]]}}}