{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T22:34:02Z","timestamp":1777761242057,"version":"3.51.4"},"reference-count":79,"publisher":"Emerald","issue":"4","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2011,2,22]]},"abstract":"<jats:p>One of the most important primitive data types in modern data processing is text. Text data are known to have a variety of inconsistencies (e.g., spelling mistakes and representational variations). For that reason, there exists a large body of literature related to approximate processing of text. This monograph focuses specifically on the problem of approximate string matching, where, given a set of strings S and a query string \u03c5, the goal is to find all strings s \u2208 S that have a user specified degree of similarity to \u03c5. Set S could be, for example, a corpus of documents, a set of web pages, or an attribute of a relational table. The similarity between strings is always defined with respect to a similarity function that is chosen based on the characteristics of the data and application at hand. This work presents a survey of indexing techniques and algorithms specifically designed for approximate string matching. We concentrate on inverted indexes, filtering techniques, and tree data structures that can be used to evaluate a variety of set based and edit based similarity functions. We focus on all-match and top-k flavors of selection and join queries, and discuss the applicability, advantages and disadvantages of each technique for every query type.<\/jats:p>","DOI":"10.1561\/1900000010","type":"journal-article","created":{"date-parts":[[2011,3,24]],"date-time":"2011-03-24T10:41:46Z","timestamp":1300963306000},"page":"267-402","source":"Crossref","is-referenced-by-count":11,"title":["Approximate String Processing"],"prefix":"10.1108","volume":"2","author":[{"given":"Marios","family":"Hadjieleftheriou","sequence":"first","affiliation":[{"name":"AT&T Labs - Research , 180 Park Ave, Florham Park, NJ, 07932,","place":["USA"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Divesh","family":"Srivastava","sequence":"additional","affiliation":[{"name":"AT&T Labs - Research , 180 Park Ave, Florham Park, NJ, 07932,","place":["USA"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"140","published-online":{"date-parts":[[2011,2,22]]},"reference":[{"key":"2025121805081706600_ref001","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"Journal of Molecular Biology"},{"key":"2025121805081706600_ref002","first-page":"144","article-title":"Pattern matching with swaps","volume-title":"IEEE Symposium on Foundations of Computer Science (FOCS)","author":"Amir","year":"1997"},{"key":"2025121805081706600_ref003","first-page":"199","article-title":"Approximating edit distance in near-linear time","volume-title":"Proceedings of ACM Symposium on Theory of Computing (STOC)","author":"Andoni","year":"2009"},{"key":"2025121805081706600_ref004","first-page":"40","article-title":"Transformation-based framework for record matching","volume-title":"Proceedings of International Conference on Data Engineering (ICDE)","author":"Arasu","year":"2008"},{"issue":"1","key":"2025121805081706600_ref005","doi-asserted-by":"crossref","first-page":"514","DOI":"10.14778\/1687627.1687686","article-title":"Learning string transformations from examples","volume":"2","author":"Arasu","year":"2009","journal-title":"Proceedings of the VLDB Endowment (PVLDB)"},{"key":"2025121805081706600_ref006","first-page":"918","article-title":"Efficient exact set-similarity joins","volume-title":"Proceedings of Very Large Data Bases (VLDB)","author":"Arasu","year":"2006"},{"key":"2025121805081706600_ref007","doi-asserted-by":"crossref","first-page":"127","DOI":"10.1007\/3-540-45655-4_15","article-title":"Dictionary look-up within small edit distance","volume-title":"Proceedings of the Annual International Conference on Computing and Combinatorics (COCOON)","author":"Arslan","year":"2002"},{"key":"2025121805081706600_ref008","volume-title":"Modern Information Retrieval","author":"Baeza-Yates","year":"1999"},{"key":"2025121805081706600_ref009","doi-asserted-by":"crossref","first-page":"550","DOI":"10.1109\/FOCS.2004.14","article-title":"Approximating edit distance efficiently","volume-title":"IEEE Symposium on Foundations of Computer Science (FOCS)","author":"Bar-Yossef","year":"2004"},{"key":"2025121805081706600_ref010","first-page":"131","article-title":"Scaling up all pairs similarity search","volume-title":"WWW","author":"Bayardo","year":"2007"},{"key":"2025121805081706600_ref011","first-page":"604","article-title":"Space-constrained gram-based indexing for efficient approximate string search","volume-title":"Proceedings of International Conference on Data Engineering (ICDE)","author":"Behm","year":"2009"},{"issue":"1","key":"2025121805081706600_ref012","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1007\/s00778-008-0098-x","article-title":"Swoosh: A generic approach to entity resolution","volume":"18","author":"Benjelloun","year":"2009","journal-title":"The VLDB Journal"},{"issue":"1\u20132","key":"2025121805081706600_ref013","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1016\/S0020-0190(00)00079-X","article-title":"Improved bounds for dictionary look-up with one error","volume":"75","author":"Brodal","year":"2000","journal-title":"Information Processing Letters"},{"key":"2025121805081706600_ref014","doi-asserted-by":"crossref","first-page":"65","DOI":"10.1007\/3-540-61258-0_6","article-title":"Approximate dictionary queries","volume-title":"Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM)","author":"Brodal","year":"1996"},{"key":"2025121805081706600_ref015","first-page":"208","article-title":"Compressed indexes for approximate string matching","volume-title":"Proceedings of the Annual European Symposium (ESA)","author":"Chan","year":"2006"},{"key":"2025121805081706600_ref016","doi-asserted-by":"crossref","first-page":"49","DOI":"10.1007\/11780441_6","article-title":"A linear size index for approximate pattern matching","volume-title":"Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM)","author":"Chan","year":"2006"},{"key":"2025121805081706600_ref017","first-page":"116","article-title":"Approximate string matching in sublinear expected time","volume":"1","author":"Chang","year":"1990","journal-title":"IEEE Symposium on Foundations of Computer Science (FOCS)"},{"key":"2025121805081706600_ref018","first-page":"313","article-title":"Robust and efficient fuzzy match for online data cleaning","volume-title":"Proceedings of ACM Management of Data (SIGMOD)","author":"Chaudhuri","year":"2003"},{"key":"2025121805081706600_ref019","first-page":"5","article-title":"A primitive operator for similarity joins in data cleaning","volume-title":"Proceedings of International Conference on Data Engineering (ICDE)","author":"Chaudhuri","year":"2006"},{"key":"2025121805081706600_ref020","first-page":"707","article-title":"Extending autocompletion to tolerate errors","volume-title":"Proceedings of ACM Management of Data (SIGMOD)","author":"Chaudhuri","year":"2009"},{"key":"2025121805081706600_ref021","first-page":"437","article-title":"Leveraging aggregate constraints for deduplication","volume-title":"Proceedings of ACM Management of Data (SIGMOD)","author":"Chaudhuri","year":"2007"},{"key":"2025121805081706600_ref022","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1007\/3-540-60044-2_33","article-title":"Fast approximate matching using suffix trees","volume-title":"Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM)","author":"Cobbs","year":"1995"},{"key":"2025121805081706600_ref023","first-page":"91","article-title":"Dictionary matching and indexing with errors and don\u2019t cares","volume-title":"Proceedings of ACM Symposium on Theory of Computing (STOC)","author":"Cole","year":"2004"},{"issue":"2","key":"2025121805081706600_ref024","doi-asserted-by":"crossref","first-page":"121","DOI":"10.1145\/356770.356776","article-title":"The ubiquitous B-tree","volume":"11","author":"Comer","year":"1979","journal-title":"ACM Computing Surveys"},{"key":"2025121805081706600_ref025","volume-title":"Introduction to Algorithms","author":"Cormen","year":"2001"},{"issue":"1","key":"2025121805081706600_ref026","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1186810.1186812","article-title":"The string edit distance matching problem with moves","volume":"3","author":"Cormode","year":"2007","journal-title":"ACM Transactions on Algorithms (TALG)"},{"key":"2025121805081706600_ref027","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1109\/ICDE.2002.994694","article-title":"Tailor: A record linkage tool box","volume-title":"Proceedings of International Conference on Data Engineering (ICDE)","author":"Elfeky","year":"2002"},{"issue":"1","key":"2025121805081706600_ref028","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TKDE.2007.250581","article-title":"Duplicate record detection: A survey","volume":"19","author":"Elmagarmid","year":"2007","journal-title":"IEEE Transactions on Knowledge and Data Engineering (TKDE)"},{"key":"2025121805081706600_ref029","first-page":"102","article-title":"Optimal aggregation algorithms for middleware","volume-title":"Proceedings of ACM Symposium on Principles of Database Systems (PODS)","author":"Fagin","year":"2001"},{"key":"2025121805081706600_ref030","doi-asserted-by":"crossref","first-page":"236","DOI":"10.1145\/301970.301973","article-title":"The string b-tree: A new data structure for string search in external memory and its applications","volume":"46","author":"Ferragina","year":"1999","journal-title":"Journal of the ACM (JACM)"},{"key":"2025121805081706600_ref031","first-page":"491","article-title":"Approximate string joins in a database (almost) for free","volume-title":"Proceedings of Very Large Data Bases (VLDB)","author":"Gravano","year":"2001"},{"issue":"3","key":"2025121805081706600_ref032","doi-asserted-by":"crossref","first-page":"113","DOI":"10.1016\/0020-0190(89)90188-9","article-title":"Simple and efficient string matching with k mismatches","volume":"33","author":"Grossi","year":"1989","journal-title":"Information Processing Letters"},{"key":"2025121805081706600_ref033","doi-asserted-by":"crossref","DOI":"10.1109\/ICDE.2008.4497435","article-title":"Fast indexes and algorithms for set similarity selection queries","volume-title":"Proceedings of International Conference on Data Engineering (ICDE)","author":"Hadjieleftheriou","year":"2008"},{"key":"2025121805081706600_ref034","first-page":"429","article-title":"Incremental maintenance of length normalized indexes for approximate string matching","volume-title":"Proceedings of ACM Management of Data (SIGMOD)","author":"Hadjieleftheriou","year":"2009"},{"key":"2025121805081706600_ref035","first-page":"9","article-title":"Data integration: The teenage years","volume-title":"Proceedings of Very Large Data Bases (VLDB)","author":"Halevy","year":"2006"},{"issue":"12","key":"2025121805081706600_ref036","doi-asserted-by":"crossref","first-page":"777","DOI":"10.1145\/362919.362934","article-title":"Implementation of the substring test by hashing","volume":"14","author":"Harrison","year":"1971","journal-title":"Communications of the ACM"},{"key":"2025121805081706600_ref037","doi-asserted-by":"crossref","first-page":"40","DOI":"10.1007\/978-3-540-73437-6_7","article-title":"Cache-oblivious index for approximate string matching","volume-title":"Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM)","author":"Hon","year":"2007"},{"key":"2025121805081706600_ref038","first-page":"371","article-title":"Efficient interactive fuzzy keyword search","volume-title":"WWW","author":"Ji","year":"2009"},{"key":"2025121805081706600_ref039","first-page":"240","article-title":"Two algorithms for approximate string matching in static texts","volume-title":"Proceedings of the Mathematical Foundations of Computer Science (MFCS)","author":"Jokinen","year":"1991"},{"issue":"2","key":"2025121805081706600_ref040","doi-asserted-by":"crossref","first-page":"249","DOI":"10.1147\/rd.312.0249","article-title":"Efficient randomized pattern-matching algorithms","volume":"31","author":"Karp","year":"1987","journal-title":"IBM Journal of Research and Development"},{"key":"2025121805081706600_ref041","doi-asserted-by":"crossref","first-page":"87","DOI":"10.1007\/BFb0029799","article-title":"Exact and approximation algorithms for the inversion distance between two chromosomes","volume-title":"Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM)","author":"Kececioglu","year":"1993"},{"issue":"1","key":"2025121805081706600_ref042","doi-asserted-by":"crossref","first-page":"128","DOI":"10.1006\/jcom.1998.0497","article-title":"Efficient algorithms for approximate string matching with swaps","volume":"15","author":"Kim","year":"1999","journal-title":"Journal of Complexity"},{"key":"2025121805081706600_ref043","volume-title":"Sorting and Searching","author":"Knuth","year":"1998"},{"key":"2025121805081706600_ref044","first-page":"1078","article-title":"Flexible string matching against large databases in practice","volume-title":"Proceedings of Very Large Data Bases (VLDB)","author":"Koudas","year":"2004"},{"key":"2025121805081706600_ref045","first-page":"1146","article-title":"Propagating updates in SPIDER","volume-title":"Proceedings of International Conference on Data Engineering (ICDE)","author":"Koudas","year":"2007"},{"issue":"3","key":"2025121805081706600_ref046","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1386118.1386125","article-title":"Efficient online index construction for text databases","volume":"33","author":"Lester","year":"2008","journal-title":"ACM Transactions on Database Systems (TODS)"},{"issue":"4","key":"2025121805081706600_ref047","first-page":"845","article-title":"Binary codes capable of correcting deletions, insertions and reversals (in Russian)","volume":"163","author":"Levenshtein","year":"1965","journal-title":"Doklady Akademii Nauk SSSR"},{"key":"2025121805081706600_ref048","first-page":"257","article-title":"Efficient merging and filtering algorithms for approximate string searches","volume-title":"Proceedings of International Conference on Data Engineering (ICDE)","author":"Li","year":"2008"},{"key":"2025121805081706600_ref049","first-page":"303","article-title":"Vgram: Improving performance of approximate queries on string collections using variable-length grams","volume-title":"Proceedings of Very Large Data Bases (VLDB)","author":"Li","year":"2007"},{"key":"2025121805081706600_ref050","volume-title":"Personal Communication","author":"Linderman","year":"2011"},{"issue":"4","key":"2025121805081706600_ref051","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1016\/0020-0190(94)00032-8","article-title":"An algorithm for approximate membership checking with application to password security","volume":"50","author":"Manber","year":"1994","journal-title":"Information Processing Letters"},{"issue":"1","key":"2025121805081706600_ref052","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1016\/0022-0000(80)90002-1","article-title":"A faster algorithm computing string edit distances","volume":"20","author":"Masek","year":"1980","journal-title":"Journal of Computer and System Sciences"},{"key":"2025121805081706600_ref053","volume-title":"Perceptrons: An Introduction to Computational Geometry","author":"Minsky","year":"1969"},{"issue":"1","key":"2025121805081706600_ref054","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1145\/375360.375365","article-title":"A guided tour to approximate string matching","volume":"33","author":"Navarro","year":"2001","journal-title":"ACM Computing Surveys"},{"issue":"3","key":"2025121805081706600_ref055","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","article-title":"A general method applicable to the search for similarities in the amino acid sequence of two proteins","volume":"48","author":"Needleman","year":"1970","journal-title":"Journal of Molecular Biology"},{"issue":"5","key":"2025121805081706600_ref056","doi-asserted-by":"crossref","first-page":"23","DOI":"10.1145\/1284320.1284322","article-title":"Low distortion embeddings for edit distance","volume":"54","author":"Ostrovsky","year":"2007","journal-title":"Journal of the ACM"},{"issue":"4","key":"2025121805081706600_ref057","doi-asserted-by":"crossref","first-page":"387","DOI":"10.1002\/spe.4380180407","article-title":"Fast approximate string matching","volume":"18","author":"Owolabi","year":"1988","journal-title":"Software \u2014 Practice & Experience"},{"key":"2025121805081706600_ref058","doi-asserted-by":"crossref","first-page":"197","DOI":"10.1007\/BFb0029806","article-title":"A fast filtration algorithm for the substring matching problem","volume-title":"Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM)","author":"Pevzner","year":"1993"},{"issue":"2","key":"2025121805081706600_ref059","doi-asserted-by":"crossref","first-page":"4","DOI":"10.1145\/1242471.1242472","article-title":"A taxonomy of suffix array construction algorithms","volume":"39","author":"Puglisi","year":"2007","journal-title":"ACM Computing Surveys"},{"key":"2025121805081706600_ref060","first-page":"743","article-title":"Efficient set joins on similarity predicates","volume-title":"Proceedings of ACM Management of Data (SIGMOD)","author":"Sarawagi","year":"2004"},{"key":"2025121805081706600_ref061","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","article-title":"Identification of common molecular subsequences","volume":"147","author":"Smith","year":"1981","journal-title":"Journal of Molecular Biology"},{"key":"2025121805081706600_ref062","volume-title":"String Searching Algorithms","author":"Stephen","year":"1998"},{"key":"2025121805081706600_ref063","first-page":"327","article-title":"On using q-gram locations in approximate string matching","volume-title":"Proceedings of the Annual European Symposium (ESA)","author":"Sutinen","year":"1995"},{"key":"2025121805081706600_ref064","first-page":"234","article-title":"Approximate pattern matching with samples","volume-title":"Proceedings of the International Symposium on Algorithms and Computation (ISAAC)","author":"Takaoka","year":"1994"},{"issue":"4","key":"2025121805081706600_ref065","doi-asserted-by":"crossref","first-page":"309","DOI":"10.1145\/357401.357404","article-title":"The string-to-string correction problem with block moves","volume":"2","author":"Tichy","year":"1984","journal-title":"ACM Transactions on Computer Systems (TOCS)"},{"issue":"1\u20133","key":"2025121805081706600_ref066","doi-asserted-by":"crossref","first-page":"100","DOI":"10.1016\/S0019-9958(85)80046-2","article-title":"Algorithms for approximate string matching","volume":"64","author":"Ukkonen","year":"1985","journal-title":"Information and Control"},{"issue":"1","key":"2025121805081706600_ref067","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1016\/0304-3975(92)90143-4","article-title":"Approximate string-matching with q-grams and maximal matches","volume":"92","author":"Ukkonen","year":"1992","journal-title":"Theoretical Computer Science"},{"key":"2025121805081706600_ref068","doi-asserted-by":"crossref","first-page":"228","DOI":"10.1007\/BFb0029808","article-title":"Approximate string matching over suffix trees","volume-title":"Proceedings of the Annual Symposium on Combinatorial Pattern Matching (CPM)","author":"Ukkonen","year":"1993"},{"key":"2025121805081706600_ref069","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1145\/1557670.1557677","article-title":"Efficient top-k algorithms for fuzzy search in string collections","volume-title":"Proceedings of the International Workshop on Keyword Search on Structured Data (KEYS)","author":"Vernica","year":"2009"},{"issue":"1","key":"2025121805081706600_ref070","doi-asserted-by":"crossref","first-page":"168","DOI":"10.1145\/321796.321811","article-title":"The string-to-string correction problem","volume":"21","author":"Wagner","year":"1974","journal-title":"Journal of the ACM"},{"key":"2025121805081706600_ref071","volume-title":"Managing Gigabytes: Compressing and Indexing Documents and Images","author":"Witten","year":"1994"},{"issue":"10","key":"2025121805081706600_ref072","doi-asserted-by":"crossref","first-page":"83","DOI":"10.1145\/135239.135244","article-title":"Fast text searching: Allowing errors","volume":"35","author":"Wu","year":"1992","journal-title":"Communications of the ACM (CACM)"},{"issue":"1","key":"2025121805081706600_ref073","doi-asserted-by":"crossref","first-page":"933","DOI":"10.14778\/1453856.1453957","article-title":"Ed-join: An efficient algorithm for similarity joins with edit distance constraints","volume":"1","author":"Xiao","year":"2008","journal-title":"Proceedings of the VLDB Endowment (PVLDB)"},{"key":"2025121805081706600_ref074","first-page":"916","article-title":"Top-k set similarity joins","volume-title":"Proceedings of International Conference on Data Engineering (ICDE)","author":"Xiao","year":"2009"},{"key":"2025121805081706600_ref075","first-page":"131","article-title":"Efficient similarity joins for near duplicate detection","volume-title":"WWW","author":"Xiao","year":"2008"},{"key":"2025121805081706600_ref076","first-page":"353","article-title":"Cost-based variable-length-gram selection for string collections to support approximate queries efficiently","volume-title":"Proceedings of ACM Management of Data (SIGMOD)","author":"Yang","year":"2008"},{"issue":"1","key":"2025121805081706600_ref077","first-page":"194","article-title":"Dictionary look-up with one error","volume":"25","author":"Yao","year":"1997"},{"key":"2025121805081706600_ref078","doi-asserted-by":"crossref","DOI":"10.1145\/1807167.1807266","article-title":"Bed-tree: An all-purpose index structure for string similarity search based on edit distance","volume-title":"Proceedings of ACM Management of Data (SIGMOD)","author":"Zhang","year":"2010"},{"issue":"2","key":"2025121805081706600_ref079","doi-asserted-by":"crossref","first-page":"6","DOI":"10.1145\/1132956.1132959","article-title":"Inverted files for text search engines","volume":"38","author":"Zobel","year":"2006","journal-title":"ACM Computing Surveys"}],"container-title":["Foundations and Trends in Databases"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/ftdbs\/article-pdf\/2\/4\/267\/11098093\/1900000010en.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/www.emerald.com\/ftdbs\/article-pdf\/2\/4\/267\/11098093\/1900000010en.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T14:11:39Z","timestamp":1777471899000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.emerald.com\/ftdbs\/article\/2\/4\/267\/1330360\/Approximate-String-Processing"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2011,2,22]]},"references-count":79,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2011,2,22]]}},"URL":"https:\/\/doi.org\/10.1561\/1900000010","relation":{},"ISSN":["1931-7883","1931-7891"],"issn-type":[{"value":"1931-7883","type":"print"},{"value":"1931-7891","type":"electronic"}],"subject":[],"published":{"date-parts":[[2011,2,22]]}}}