{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T08:26:07Z","timestamp":1769156767831,"version":"3.49.0"},"reference-count":36,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,9,14]],"date-time":"2020-09-14T00:00:00Z","timestamp":1600041600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,9,14]],"date-time":"2020-09-14T00:00:00Z","timestamp":1600041600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Research Incentive Fund","award":["R19093"],"award-info":[{"award-number":["R19093"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Similarity measures have long been utilized in information retrieval and machine learning domains for multi-purposes including text retrieval, text clustering, text summarization, plagiarism detection, and several other text-processing applications. However, the problem with these measures is that, until recently, there has never been one single measure recorded to be highly effective and efficient at the same time. Thus, the quest for an efficient and effective similarity measure is still an open-ended challenge. This study, in consequence, introduces a new highly-effective and time-efficient similarity measure for text clustering and classification. Furthermore, the study aims to provide a comprehensive scrutinization for seven of the most widely used similarity measures, mainly concerning their effectiveness and efficiency. 
Using the K-nearest neighbor algorithm (KNN) for classification, the K-means algorithm for clustering, and the bag of word (BoW) model for feature selection, all similarity measures are carefully examined in detail. The experimental evaluation has been made on two of the most popular datasets, namely, Reuters-21 and Web-KB. The obtained results confirm that the proposed set theory-based similarity measure (STB-SM), as a pre-eminent measure, outweighs all state-of-art measures significantly with regards to both effectiveness and efficiency.<\/jats:p>","DOI":"10.1186\/s40537-020-00344-3","type":"journal-article","created":{"date-parts":[[2020,9,14]],"date-time":"2020-09-14T10:02:59Z","timestamp":1600077779000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":37,"title":["A set theory based similarity measure for text clustering and classification"],"prefix":"10.1186","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2002-948X","authenticated-orcid":false,"given":"Ali A.","family":"Amer","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hassan I.","family":"Abdalla","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2020,9,14]]},"reference":[{"key":"344_CR1","unstructured":"Alvarez, J.E. and H. Bast, A review of word embedding and document similarity algorithms applied to academic text. Bachelor thesis, 2017."},{"issue":"1","key":"344_CR2","doi-asserted-by":"publisher","first-page":"52","DOI":"10.1186\/s40537-018-0163-2","volume":"5","author":"M Oghbaie","year":"2018","unstructured":"Oghbaie M, Zanjireh MM. Pairwise document similarity measure based on present term set. J Big Data. 
2018;5(1):52.","journal-title":"J Big Data"},{"issue":"1","key":"344_CR3","doi-asserted-by":"publisher","first-page":"25","DOI":"10.1186\/s40537-017-0083-6","volume":"4","author":"S Sohangir","year":"2017","unstructured":"Sohangir S, Wang D. Improved sqrt-Cosine similarity measurement. J Big Data. 2017;4(1):25.","journal-title":"J Big Data"},{"issue":"7","key":"344_CR4","doi-asserted-by":"publisher","first-page":"1575","DOI":"10.1109\/TKDE.2013.19","volume":"26","author":"Y-S Lin","year":"2013","unstructured":"Lin Y-S, Jiang J-Y, Lee S-J. A similarity measure for text classification and clustering. IEEE Trans Knowl Data Eng. 2013;26(7):1575\u201390.","journal-title":"IEEE Trans Knowl Data Eng"},{"issue":"1","key":"344_CR5","doi-asserted-by":"publisher","first-page":"48","DOI":"10.1177\/0165551516677946","volume":"44","author":"S Xu","year":"2018","unstructured":"Xu S. Bayesian Na\u00efve Bayes classifiers to text classification. J Inform Sci. 2018;44(1):48\u201359.","journal-title":"J Inform Sci"},{"issue":"1","key":"344_CR6","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1177\/0165551514550143","volume":"41","author":"N Sheydaei","year":"2015","unstructured":"Sheydaei N, Saraee M, Shahgholian A. A novel feature selection method for text classification using association rules and clustering. J Inform Sci. 2015;41(1):3\u201315.","journal-title":"J Inform Sci"},{"key":"344_CR7","doi-asserted-by":"publisher","unstructured":"Subhashini R, Kumar VJ. Evaluating the performance of similarity measures used in document clustering and information retrieval. In: 1st Int Conf integrated intelligent computing, Bangalore, 2010, p. 27\u201331. https:\/\/doi.org\/10.1109\/iciic.2010.42.","DOI":"10.1109\/iciic.2010.42"},{"issue":"1","key":"344_CR8","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1186\/s40537-020-00306-9","volume":"7","author":"AA Amer","year":"2020","unstructured":"Amer AA. 
On K-means clustering-based approach for DDBSs design. J Big Data. 2020;7(1):1\u201331.","journal-title":"J Big Data"},{"issue":"1","key":"344_CR9","doi-asserted-by":"publisher","first-page":"e03172","DOI":"10.1016\/j.heliyon.2020.e03172","volume":"6","author":"AA Amer","year":"2020","unstructured":"Amer AA, Mohamed MH, Asri K. ASGOP: An aggregated similarity-based greedy-oriented approach for relational DDBSs design. Heliyon. 2020;6(1):e03172.","journal-title":"Heliyon."},{"key":"344_CR10","first-page":"21","volume":"1","author":"L Nguyen","year":"2019","unstructured":"Nguyen L, Amer AA. Advanced cosine measures for collaborative filtering. Adapt Personalization (ADP). 2019;1:21\u201341.","journal-title":"Adapt Personalization (ADP)"},{"key":"344_CR11","doi-asserted-by":"crossref","unstructured":"Shahmirzadi O, Lugowski A, Younge K. Text similarity in vector space models: a comparative study. In 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA). 2019. IEEE.","DOI":"10.1109\/ICMLA.2019.00120"},{"key":"344_CR12","unstructured":"Strehl A, Ghosh J, Mooney R. Impact of similarity measures on web-page clustering. In Workshop on artificial intelligence for web search (AAAI 2000). 2000."},{"key":"344_CR13","doi-asserted-by":"crossref","unstructured":"White RW, Jose JM. A study of topic similarity measures. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval. 2004.","DOI":"10.1145\/1008992.1009100"},{"key":"344_CR14","unstructured":"Huang A. Similarity measures for text document clustering. In Proceedings of the sixth New Zealand computer science research student conference (NZCSRSC2008), Christchurch, New Zealand. 2008."},{"issue":"1","key":"344_CR15","doi-asserted-by":"publisher","first-page":"6","DOI":"10.1093\/llc\/fqt002","volume":"29","author":"RS Forsyth","year":"2014","unstructured":"Forsyth RS, Sharoff S. 
Document dissimilarity within and across languages: a benchmarking study. Literary Linguistic Comput. 2014;29(1):6\u201322.","journal-title":"Literary Linguistic Comput"},{"key":"344_CR16","doi-asserted-by":"crossref","unstructured":"Thompson VU, Panchev C, Oakes M. Performance evaluation of similarity measures on similar and dissimilar text retrieval. In 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K). IEEE. 2015.","DOI":"10.5220\/0005619105770584"},{"issue":"3","key":"344_CR17","doi-asserted-by":"publisher","first-page":"267","DOI":"10.1109\/TETC.2014.2330519","volume":"2","author":"A Fahad","year":"2014","unstructured":"Fahad A, et al. A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Topics Comput. 2014;2(3):267\u201379.","journal-title":"IEEE Trans Emerg Topics Comput"},{"key":"344_CR18","doi-asserted-by":"crossref","unstructured":"Aslam JA, Frost M. An information-theoretic measure for document similarity. In: Proc 26th SIGIR, Toronto. 2003. p. 449\u201350.","DOI":"10.1145\/860435.860545"},{"key":"344_CR19","volume-title":"R and data mining: examples and case studies","author":"Y Zhao","year":"2012","unstructured":"Zhao Y. R and data mining: examples and case studies. Cambridge: Academic Press; 2012."},{"issue":"2","key":"344_CR20","doi-asserted-by":"publisher","first-page":"7","DOI":"10.1145\/1328854.1328855","volume":"36","author":"S Tata","year":"2007","unstructured":"Tata S, Patel JM. Estimating the selectivity of tf-idf based Cosine similarity predicates. ACM Sigmod Record. 2007;36(2):7\u201312.","journal-title":"ACM Sigmod Record"},{"key":"344_CR21","first-page":"99","volume":"35","author":"A Bhattacharyya","year":"1943","unstructured":"Bhattacharyya A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull Calcutta Math Soc. 
1943;35:99\u2013109.","journal-title":"Bull Calcutta Math Soc."},{"key":"344_CR22","doi-asserted-by":"crossref","unstructured":"Schoenharl TW, Madey G. Evaluation of measurement techniques for the validation of agent-based simulations against streaming data. In International Conference on Computational Science. 2008. Springer.","DOI":"10.1007\/978-3-540-69389-5_3"},{"issue":"1","key":"344_CR23","doi-asserted-by":"publisher","first-page":"79","DOI":"10.1214\/aoms\/1177729694","volume":"22","author":"S Kullback","year":"1951","unstructured":"Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79\u201386.","journal-title":"Ann Math Stat"},{"key":"344_CR24","unstructured":"Kullback S. Information theory and statistics Wiley. New York, 1959."},{"key":"344_CR25","doi-asserted-by":"crossref","unstructured":"Jaccard P. The distribution of the flora in the alpine zone. 1. New phytologist, 1912. 11(2): p. 37\u201350.","DOI":"10.1111\/j.1469-8137.1912.tb05611.x"},{"key":"344_CR26","doi-asserted-by":"publisher","first-page":"141","DOI":"10.1613\/jair.2934","volume":"37","author":"PD Turney","year":"2010","unstructured":"Turney PD, Pantel P. From frequency to meaning: vector space models of semantics. J Artificial Intell Res. 2010;37:141\u201388.","journal-title":"J Artificial Intell Res"},{"key":"344_CR27","doi-asserted-by":"crossref","unstructured":"Al-Ghuribi SM, Alshomrani S. A simple study of webpage text classification algorithms for Arabic and English Languages. In 2013 International Conference on IT Convergence and Security (ICITCS). 2013. IEEE.","DOI":"10.1109\/ICITCS.2013.6717784"},{"key":"344_CR28","first-page":"34","volume":"4","author":"DB Patil","year":"2015","unstructured":"Patil DB, Dongre YV. A fuzzy approach for text mining. IJ Math Sci Comput. 
2015;4:34\u201343.","journal-title":"IJ Math Sci Comput"},{"issue":"5","key":"344_CR29","doi-asserted-by":"publisher","first-page":"513","DOI":"10.1016\/0306-4573(88)90021-0","volume":"24","author":"G Salton","year":"1988","unstructured":"Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Inf Process Manage. 1988;24(5):513\u201323.","journal-title":"Inf Process Manage"},{"key":"344_CR30","doi-asserted-by":"crossref","unstructured":"Jabalameli, M., A. Arman, and M. Nematbakhsh, Improving the efficiency of term weighting in set of dynamic documents. 2015. International Journal of Modern Education and Computer Science, 7, 42-47.","DOI":"10.5815\/ijmecs.2015.02.06"},{"key":"344_CR31","first-page":"163","volume-title":"A survey of text classification algorithms, in mining text data","author":"CC Aggarwal","year":"2012","unstructured":"Aggarwal CC, Zhai C. A survey of text classification algorithms, in mining text data. Boston: Springer; 2012. p. 163\u2013222."},{"issue":"6","key":"344_CR32","doi-asserted-by":"publisher","first-page":"818","DOI":"10.1177\/0165551518816302","volume":"45","author":"R Lakshmi","year":"2019","unstructured":"Lakshmi R, Baskar S. DIC-DOC-K-means: dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering. J Inform Sci. 2019;45(6):818\u201332.","journal-title":"J Inform Sci"},{"issue":"3","key":"344_CR33","doi-asserted-by":"publisher","first-page":"645","DOI":"10.1109\/TNN.2005.845141","volume":"16","author":"R Xu","year":"2005","unstructured":"Xu R, Wunsch D. Survey of clustering algorithms. IEEE Trans Neural Networks. 2005;16(3):645\u201378.","journal-title":"IEEE Trans Neural Networks"},{"key":"344_CR34","doi-asserted-by":"crossref","unstructured":"Khadija A. Almohsen, Huda Al-Jobori, \u201cRecommender Systems in Light of Big Data\u201d, International Journal of Electrical and Computer Engineering (IJECE), Vol. 5, No. 
6, December 2015, pp. 1553\u20131563, 2015.","DOI":"10.11591\/ijece.v5i6.pp1553-1563"},{"key":"344_CR35","doi-asserted-by":"publisher","first-page":"203","DOI":"10.1002\/asi.10170","volume":"54","author":"TC Hoad","year":"2003","unstructured":"Hoad TC, Zobel J. Methods for identifying versioned and plagiarized documents. JASIST. 2003;54:203\u201315.","journal-title":"JASIST."},{"key":"344_CR36","doi-asserted-by":"publisher","first-page":"6","DOI":"10.1186\/s40537-015-0020-5","volume":"2","author":"NK Nagwani","year":"2015","unstructured":"Nagwani NK. Summarizing large text collection using topic modelling and clustering based on MapReduce framework. J Big Data. 2015;2:6.","journal-title":"J Big Data"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-020-00344-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-020-00344-3\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-020-00344-3.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,13]],"date-time":"2021-09-13T23:15:15Z","timestamp":1631574915000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-020-00344-3"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,9,14]]},"references-count":36,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["344"],"URL":"https:\/\/doi.org\/10.1186\/s40537-020-00344-3","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,9,14]]},"assertion":[{"value":"24 April 
2020","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 August 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 September 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare that they have no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"74"}}