{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,7]],"date-time":"2026-01-07T23:49:08Z","timestamp":1767829748735,"version":"3.49.0"},"reference-count":52,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2025,3,22]],"date-time":"2025-03-22T00:00:00Z","timestamp":1742601600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000001","name":"USA National Science Foundation","doi-asserted-by":"publisher","award":["2148878"],"award-info":[{"award-number":["2148878"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Future Internet"],"abstract":"<jats:p>With the availability of a virtually infinite number of text documents in digital format, automatic comparison of textual data is essential for extracting meaningful insights that are difficult to identify manually. Many existing tools, including AI and large language models, struggle to provide precise and explainable insights into textual similarities. In many cases, they determine the similarity between documents as reflected by the text, rather than the similarities between the subjects being discussed in these documents. This study addresses these limitations by developing an n-gram analysis framework designed to compare documents automatically and uncover explainable similarities. A scoring formula is applied to assign each of the n-grams a weight, where the weight is higher when the n-grams are more frequent in both documents, but is penalized when the n-grams are more frequent in the English language. Visualization tools like word clouds enhance the representation of these patterns, providing clearer insights. 
The findings demonstrate that this framework effectively uncovers similarities between text documents, offering explainable insights that are often difficult to identify manually. This non-parametric approach provides a deterministic solution for identifying similarities across various fields, including biographies, scientific literature, historical texts, and more. Code for the method is publicly available.<\/jats:p>","DOI":"10.3390\/fi17040135","type":"journal-article","created":{"date-parts":[[2025,3,24]],"date-time":"2025-03-24T04:48:18Z","timestamp":1742791698000},"page":"135","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Explainable Identification of Similarities Between Entities for Discovery in Large Text"],"prefix":"10.3390","volume":"17","author":[{"given":"Akhil","family":"Joshi","sequence":"first","affiliation":[{"name":"Department of Computer Science, Kansas State University, Manhattan, KS 66502, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0375-9813","authenticated-orcid":false,"given":"Sai Teja","family":"Erukude","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Kansas State University, Manhattan, KS 66502, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6207-1491","authenticated-orcid":false,"given":"Lior","family":"Shamir","sequence":"additional","affiliation":[{"name":"Department of Computer Science, Kansas State University, Manhattan, KS 66502, USA"}]}],"member":"1968","published-online":{"date-parts":[[2025,3,22]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., and Brown, D. (2019). Text classification algorithms: A survey. Information, 10.","DOI":"10.3390\/info10040150"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Gasparetto, A., Marcuzzo, M., Zangari, A., and Albarelli, A. (2022). 
A survey on text classification algorithms: From text to predictions. Information, 13.","DOI":"10.3390\/info13020083"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3439726","article-title":"Deep learning\u2013based text classification: A comprehensive review","volume":"54","author":"Minaee","year":"2021","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"327","DOI":"10.1093\/llc\/fqn015","article-title":"An evaluation of text classification methods for literary study","volume":"23","author":"Yu","year":"2008","journal-title":"Lit. Linguist. Comput."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1016\/S0065-2458(08)60607-5","article-title":"The present status of automatic translation of languages","volume":"1","year":"1960","journal-title":"Adv. Comput."},{"key":"ref_6","first-page":"194","article-title":"Efficacy of English to Spanish automatic translation","volume":"2","author":"Ablanedo","year":"2007","journal-title":"Int. J. Inf. Oper. Manag. Educ."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Campr, M., and Je\u017eek, K. (2015, January 14\u201317). Comparing semantic models for evaluating automatic document summarization. Proceedings of the 18th International Conference on Text, Speech, and Dialogue, Pilsen, Czech Republic.","DOI":"10.1007\/978-3-319-24033-6_29"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"545","DOI":"10.1111\/j.1467-8640.2012.00417.x","article-title":"Multi-document summarization of evaluative text","volume":"29","author":"Carenini","year":"2013","journal-title":"Comput. Intell."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3529754","article-title":"Multi-document summarization via deep learning techniques: A survey","volume":"55","author":"Ma","year":"2022","journal-title":"ACM Comput. Surv."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Wu, Y. (2024). 
Large Language Model and Text Generation. Natural Language Processing in Biomedicine: A Practical Guide, Springer.","DOI":"10.1007\/978-3-031-55865-8_10"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"94","DOI":"10.1109\/13.28038","article-title":"Computer algorithms for plagiarism detection","volume":"32","author":"Parker","year":"1989","journal-title":"IEEE Trans. Educ."},{"key":"ref_12","first-page":"1","article-title":"Academic plagiarism detection: A systematic literature review","volume":"52","author":"Meuschke","year":"2019","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"ref_13","first-page":"16","article-title":"A review on plagiarism detection tools","volume":"125","author":"Naik","year":"2015","journal-title":"Int. J. Comput. Appl."},{"key":"ref_14","first-page":"7110","article-title":"Online plagiarism detection tools in the digital age: A review","volume":"25","author":"Chandere","year":"2021","journal-title":"Ann. Rom. Soc. Cell Biol."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Kulkarni, S., Govilkar, S., and Amin, D. (2021, January 27\u201329). Analysis of plagiarism detection tools and methods. Proceedings of the 4th International Conference on Advances in Science & Technology (ICAST2021), Bahir Dar, Ethiopia.","DOI":"10.2139\/ssrn.3869091"},{"key":"ref_16","first-page":"49","article-title":"Document\u2013document similarity approaches and science mapping: Experimental comparison of five approaches","volume":"3","author":"Ahlgren","year":"2009","journal-title":"J. Inf."},{"key":"ref_17","unstructured":"Milios, E., Zhang, Y., He, B., and Dong, L. (2003, January 7\u201312). Automatic term extraction and document similarity in special text corpora. 
Proceedings of the Sixth Conference of the Pacific Association for Computational Linguistics, Sapporo, Japan."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"103069","DOI":"10.1016\/j.ipm.2022.103069","article-title":"Legal case document similarity: You need both network and text","volume":"59","author":"Bhattacharya","year":"2022","journal-title":"Inf. Process. Manag."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"102798","DOI":"10.1016\/j.ipm.2021.102798","article-title":"A comparative study of automated legal text classification using random forests and deep learning","volume":"59","author":"Chen","year":"2022","journal-title":"Inf. Process. Manag."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"765","DOI":"10.1111\/1911-3846.12825","article-title":"Textual analysis in accounting: What\u2019s next?","volume":"40","author":"Bochkay","year":"2023","journal-title":"Contemp. Account. Res."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Bezdan, T., Stoean, C., Naamany, A.A., Bacanin, N., Rashid, T.A., Zivkovic, M., and Venkatachalam, K. (2021). Hybrid fruit-fly optimization algorithm with k-means for text document clustering. Mathematics, 9.","DOI":"10.3390\/math9161929"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"187","DOI":"10.1093\/llc\/fqaa007","article-title":"UDAT: Compound quantitative analysis of text using machine learning","volume":"36","author":"Shamir","year":"2021","journal-title":"Digit. Scholarsh. Humanit."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"10829","DOI":"10.46298\/jdmdh.10829","article-title":"A data science and machine learning approach to continuous analysis of Shakespeare\u2019s plays","volume":"2023","author":"Swisher","year":"2023","journal-title":"J. Data Min. Digit. Humanit."},{"key":"ref_24","first-page":"63","article-title":"The performance of text similarity algorithms","volume":"4","author":"Prasetya","year":"2018","journal-title":"Int. J. Adv. 
Intell. Inform."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Wang, J., Li, G., and Fe, J. (2011, January 11\u201316). Fast-join: An efficient method for fuzzy token matching based string similarity join. Proceedings of the IEEE 27th International Conference on Data Engineering, Hannover, Germany.","DOI":"10.1109\/ICDE.2011.5767865"},{"key":"ref_26","first-page":"8","article-title":"Binary codes capable of correcting spurious insertions and deletions of ones","volume":"1","author":"Levenshtein","year":"1965","journal-title":"Probl. Inf. Transm."},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"171","DOI":"10.1145\/363958.363994","article-title":"A technique for computer detection and correction of spelling errors","volume":"7","author":"Damerau","year":"1964","journal-title":"Commun. ACM"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"W20","DOI":"10.1093\/nar\/gkh435","article-title":"BLAST: At the core of a powerful and diverse set of sequence analysis tools","volume":"32","author":"McGinnis","year":"2004","journal-title":"Nucleic Acids Res."},{"key":"ref_29","first-page":"1","article-title":"A review of semantic similarity measures in wordnet","volume":"6","author":"Meng","year":"2013","journal-title":"Int. J. Hybrid Inf. Technol."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1109\/21.24528","article-title":"Development and application of a metric on semantic nets","volume":"19","author":"Rada","year":"1989","journal-title":"IEEE Trans. Syst. Man Cybern."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1037\/1082-989X.11.2.193","article-title":"Assessing heterogeneity in meta-analysis: Q statistic or I2 index?","volume":"11","author":"Botella","year":"2006","journal-title":"Psychol. 
Methods"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"327","DOI":"10.1037\/0033-295X.84.4.327","article-title":"Features of similarity","volume":"84","author":"Tversky","year":"1977","journal-title":"Psychol. Rev."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1037\/0033-295X.104.2.211","article-title":"A solution to Plato\u2019s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge","volume":"104","author":"Landauer","year":"1997","journal-title":"Psychol. Rev."},{"key":"ref_34","unstructured":"Gabrilovich, E., and Markovitch, S. (2007, January 6\u201312). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. Proceedings of the International Joint Conference on Artificial Intelligence, Hyderabad, India."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"24","DOI":"10.5463\/dcid.v24i4.274","article-title":"Empowerment in community-based rehabilitation and disability-inclusive development","volume":"24","author":"Kuipers","year":"2013","journal-title":"Disabil. CBR Incl. Dev."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"6656","DOI":"10.48084\/etasr.3968","article-title":"A survey of text matching techniques","volume":"11","author":"Alqahtani","year":"2021","journal-title":"Eng. Technol. Appl. Sci. Res."},{"key":"ref_37","first-page":"95","article-title":"A new text clustering algorithm based on improved K-means","volume":"7","author":"Xinwu","year":"2012","journal-title":"J. Softw."},{"key":"ref_38","first-page":"636","article-title":"A survey: Clustering ensembles techniques","volume":"50","author":"Ghaemi","year":"2009","journal-title":"World Acad. Sci. Eng. Technol."},{"key":"ref_39","first-page":"551","article-title":"An improved similarity matching based clustering framework for short and sentence level text","volume":"7","author":"Basha","year":"2017","journal-title":"Int. J. Electr. Comput. 
Eng."},{"key":"ref_40","first-page":"30","article-title":"Knowledge discovery in text mining using association rule extraction","volume":"143","author":"Kulkarni","year":"2016","journal-title":"Int. J. Comput. Appl."},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Manimaran, J., and Velmurugan, T. (2013, January 26\u201328). A survey of association rule mining in text applications. Proceedings of the IEEE International Conference on Computational Intelligence and Computing Research, Enathi, India.","DOI":"10.1109\/ICCIC.2013.6724258"},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"419","DOI":"10.1134\/S1054661816020139","article-title":"A stochastic approach for association rule extraction","volume":"26","author":"Oliinyk","year":"2016","journal-title":"Pattern Recognit. Image Anal."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"853","DOI":"10.1016\/j.eswa.2013.08.015","article-title":"Syntactic n-grams as machine learning features for natural language processing","volume":"41","author":"Sidorov","year":"2014","journal-title":"Expert Syst. Appl."},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"48","DOI":"10.1108\/EUM0000000007161","article-title":"Applications of n-grams in textual information systems","volume":"54","author":"Robertson","year":"1998","journal-title":"J. Doc."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"843","DOI":"10.1126\/science.267.5199.843","article-title":"Gauging similarity with n-grams: Language-independent categorization of text","volume":"267","author":"Damashek","year":"1995","journal-title":"Science"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Silva, J.F., and Cunha, J.C. (2024, January 7\u201310). How large corpora sizes influence the distribution of low frequency text n-grams. 
Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, Taipei, Taiwan.","DOI":"10.1007\/978-981-97-2259-4_16"},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Schmitt, X., Kubler, S., Robert, J., Papadakis, M., and LeTraon, Y. (2019, January 22\u201325). A replicable comparison study of NER software: StanfordNLP, NLTK, OpenNLP, SpaCy, Gate. Proceedings of the Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), Granada, Spain.","DOI":"10.1109\/SNAMS.2019.8931850"},{"key":"ref_48","unstructured":"Altinok, D. (2021). Mastering spaCy: An End-to-End Practical Guide to Implementing NLP Applications Using the Python Ecosystem, Packt Publishing Ltd."},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Fantechi, A., Gnesi, S., Livi, S., and Semini, L. (2021, January 6\u201311). A spaCy-based tool for extracting variability from NL requirements. Proceedings of the 25th ACM International Systems and Software Product Line Conference, Leicester, UK.","DOI":"10.1145\/3461002.3473074"},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"447","DOI":"10.1093\/llc\/fqq018","article-title":"The Corpus of Contemporary American English as the first reliable monitor corpus of English","volume":"25","author":"Davies","year":"2010","journal-title":"Lit. Linguist. Comput."},{"key":"ref_51","unstructured":"OpenAI (2024, December 01). ChatGPT (3.5 Version) [Large Language Model]. Available online: https:\/\/chatgpt.com\/g\/g-F00faAwkE-open-a-i-gpt-3-5."},{"key":"ref_52","unstructured":"Le, Q., and Mikolov, T. (2014, January 21\u201326). Distributed representations of sentences and documents. 
Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China."}],"container-title":["Future Internet"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-5903\/17\/4\/135\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:58:26Z","timestamp":1760029106000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-5903\/17\/4\/135"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,22]]},"references-count":52,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,4]]}},"alternative-id":["fi17040135"],"URL":"https:\/\/doi.org\/10.3390\/fi17040135","relation":{},"ISSN":["1999-5903"],"issn-type":[{"value":"1999-5903","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,22]]}}}