{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,6]],"date-time":"2026-01-06T13:45:35Z","timestamp":1767707135718,"version":"3.38.0"},"reference-count":31,"publisher":"SAGE Publications","issue":"4","license":[{"start":{"date-parts":[[2015,3,24]],"date-time":"2015-03-24T00:00:00Z","timestamp":1427155200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Journal of Information Science"],"published-print":{"date-parts":[[2015,8]]},"abstract":"<jats:p> Near duplicate data not only increase the cost of information processing in big data, but also increase decision time. Therefore, detecting and eliminating nearly identical information is vital to enhance overall business decisions. To identify near-duplicates in large-scale text data, the shingling algorithm has been widely used. This algorithm is based on occurrences of contiguous subsequences of tokens in two or more sets of information, such as in documents. In other words, if there is a slight variation among documents, the overall performance of the algorithm decreases. Therefore, to increase the efficiency and accuracy performances of the shingling algorithm, we propose a hybrid approach that embeds Jaro distance and statistical results of word usage frequency for fixing the ill-defined data. In a real text dataset, the proposed hybrid approach improved the shingling algorithm\u2019s accuracy performance by 27% on average and achieved above 90% common shingles. <\/jats:p>","DOI":"10.1177\/0165551515577912","type":"journal-article","created":{"date-parts":[[2015,3,25]],"date-time":"2015-03-25T05:51:58Z","timestamp":1427262718000},"page":"405-414","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":5,"title":["Detecting near-duplicate text documents with a hybrid approach"],"prefix":"10.1177","volume":"41","author":[{"given":"Cihan","family":"Varol","sequence":"first","affiliation":[{"name":"Sam Houston State University, Texas, USA"}]},{"given":"Sairam","family":"Hari","sequence":"additional","affiliation":[{"name":"Sam Houston State University, Texas, USA"}]}],"member":"179","published-online":{"date-parts":[[2015,3,24]]},"reference":[{"key":"bibr1-0165551515577912","first-page":"514","volume":"32","author":"Prasanna KJ","year":"2009","journal-title":"European Journal of Scientific Research"},{"key":"bibr2-0165551515577912","doi-asserted-by":"crossref","first-page":"192","DOI":"10.7763\/IJET.2012.V4.347","volume":"4","author":"Gupta T","year":"2012","journal-title":"International Journal of Advances in Engineering & Technology"},{"key":"bibr3-0165551515577912","first-page":"845","volume":"163","author":"Levenshtein VI","year":"1995","journal-title":"Doklady Akademii Nauk SSSR"},{"key":"bibr4-0165551515577912","unstructured":"Becchetti C, Ricotti LP. Speech recognition: Theory and C++ implementation, 1st ed. Chichester: John Wiley & Sons, 1999, p. 74."},{"key":"bibr5-0165551515577912","doi-asserted-by":"publisher","DOI":"10.1145\/146370.146380"},{"first-page":"398","volume-title":"Proceedings of the ACM SIGMOD annual conference","author":"Brin S","key":"bibr6-0165551515577912"},{"volume-title":"Proceedings of 2nd international conference in theory and practice of digital libraries","author":"Shiva KN","key":"bibr7-0165551515577912"},{"volume-title":"Plagiarism: Prevention, practice and policies conference","author":"Lyon C","key":"bibr8-0165551515577912"},{"key":"bibr9-0165551515577912","first-page":"1","volume":"1","author":"Lyon C","year":"2006","journal-title":"Plagiary: Cross-Disciplinary Studies in Plagiarism, Fabrication, and Falsification"},{"key":"bibr10-0165551515577912","first-page":"36","author":"Xiao C","year":"2011","journal-title":"ACM Transactions on Database Systems"},{"first-page":"204","volume-title":"Proceedings of workshop on Web databases","author":"Shiva KN","key":"bibr11-0165551515577912"},{"key":"bibr12-0165551515577912","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-45123-4_1"},{"first-page":"141","volume-title":"Proceedings of the 16th international World Wide Web conference","author":"Manku GS","key":"bibr13-0165551515577912"},{"first-page":"284","volume-title":"Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval","author":"Henzinger M","key":"bibr14-0165551515577912"},{"first-page":"273","volume-title":"Proceedings of the international conference on advanced science, engineering and information technology","author":"Das SN","key":"bibr15-0165551515577912"},{"first-page":"563","volume-title":"Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval (SIGIR 2008)","author":"Theobald M","key":"bibr16-0165551515577912"},{"key":"bibr17-0165551515577912","first-page":"325","volume":"5","author":"Tian ZP","year":"2001","journal-title":"International Journal on Digital Libraries"},{"first-page":"127","volume-title":"Proceedings of the 1995 ACM SIGMOD international conference on management of data","author":"Hernandez MA","key":"bibr18-0165551515577912"},{"key":"bibr19-0165551515577912","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2011.01.009"},{"key":"bibr20-0165551515577912","doi-asserted-by":"publisher","DOI":"10.1145\/506309.506311"},{"first-page":"419","volume-title":"Proceedings of the 33th annual international ACM SIGIR conference on research and development in information retrieval","author":"Hajishirzi H","key":"bibr21-0165551515577912"},{"volume-title":"Proceedings of 34th annual symposium on theory of computing","author":"Charikar M","key":"bibr22-0165551515577912"},{"first-page":"877","volume-title":"Proceedings of the 2008 Pacific\u2013Asia conference on knowledge discovery and data mining","author":"Gong C","key":"bibr23-0165551515577912"},{"key":"bibr24-0165551515577912","first-page":"514","volume":"32","author":"Kumar JP","year":"2009","journal-title":"European Journal of Scientific Research"},{"volume-title":"Proceedings of the 12th international Semantic Web conference (ISWC 2013)","author":"Cheatham M","key":"bibr25-0165551515577912"},{"key":"bibr26-0165551515577912","unstructured":"Word Frequency Data. Corpus of Contemporary American English, www.wordfrequency.info (accessed September 2014)."},{"key":"bibr27-0165551515577912","unstructured":"Lincoln A. The Gettysburg Address, www.abrahamlincolnonline.org\/lincoln\/speeches\/gettysburg.htm (accessed September 2014)."},{"key":"bibr28-0165551515577912","doi-asserted-by":"publisher","DOI":"10.1093\/comjnl\/20.2.141"},{"key":"bibr29-0165551515577912","doi-asserted-by":"publisher","DOI":"10.1002\/j.1538-7305.1950.tb00463.x"},{"key":"bibr30-0165551515577912","first-page":"46","volume":"7","author":"Ratcliff JW","year":"1988","journal-title":"Dr Dobb\u2019s Journal"},{"first-page":"233","volume-title":"Proceedings of the 23rd international conference on marchine learning","author":"Davis J","key":"bibr31-0165551515577912"}],"container-title":["Journal of Information Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/0165551515577912","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/0165551515577912","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/0165551515577912","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,3]],"date-time":"2025-03-03T20:16:53Z","timestamp":1741033013000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/0165551515577912"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,3,24]]},"references-count":31,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2015,8]]}},"alternative-id":["10.1177\/0165551515577912"],"URL":"https:\/\/doi.org\/10.1177\/0165551515577912","relation":{},"ISSN":["0165-5515","1741-6485"],"issn-type":[{"type":"print","value":"0165-5515"},{"type":"electronic","value":"1741-6485"}],"subject":[],"published":{"date-parts":[[2015,3,24]]}}}