{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T11:58:26Z","timestamp":1759147106716},"reference-count":23,"publisher":"Association for Computing Machinery (ACM)","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2009,8]]},"abstract":"<jats:p>Many entity extraction techniques leverage large reference entity tables to identify entities in documents. Often, an entity is referenced in document collections differently from that in the reference entity tables. Therefore, we study the problem of determining whether or not a substring \"approximately\" matches with a reference entity. Similarity measures which exploit the correlation between candidate substrings and reference entities across a large number of documents are known to be more robust than traditional stand alone string-based similarity functions. However, such an approach has significant efficiency challenges. In this paper, we adopt a new architecture and propose new techniques to address these efficiency challenges. We mine document collections and expand a given reference entity table with variations of each of its entities. Thus, the problem of approximately matching an input string against reference entities reduces to that of exact match against the expanded reference table, which can be implemented efficiently. In an extensive experimental evaluation, we demonstrate the accuracy and scalability of our techniques.<\/jats:p>","DOI":"10.14778\/1687627.1687673","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"395-406","source":"Crossref","is-referenced-by-count":24,"title":["Mining document collections to facilitate accurate approximate entity matching"],"prefix":"10.14778","volume":"2","author":[{"given":"Surajit","family":"Chaudhuri","sequence":"first","affiliation":[{"name":"Microsoft Research, Redmond, WA"}]},{"given":"Venkatesh","family":"Ganti","sequence":"additional","affiliation":[{"name":"Microsoft Research, Redmond, WA"}]},{"given":"Dong","family":"Xin","sequence":"additional","affiliation":[{"name":"Microsoft Research, Redmond, WA"}]}],"member":"320","published-online":{"date-parts":[[2009,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453958"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/360825.360855"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/360825.360855"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390749.1390756"},{"key":"e_1_2_1_5_1","volume-title":"Introduction to Information Extraction Technology. IJCAI-99 Tutorial","author":"Appelt D. E.","year":"1999","unstructured":"D. E. Appelt and D. Israel . Introduction to Information Extraction Technology. IJCAI-99 Tutorial , 1999 . D. E. Appelt and D. Israel. Introduction to Information Extraction Technology. IJCAI-99 Tutorial, 1999."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2008.4497412"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687686"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/1454159.1454166"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376697"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.55"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.9"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1526709.1526731"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1014052.1014065"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.3115\/981574.981596"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1327452.1327492"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1066157.1066168"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/342009.335372"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1142473.1142599"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of ICML","author":"Lafferty J.","year":"2001","unstructured":"J. Lafferty , A. McCallum , and F. Pereira . Conditional random fields: probabilistic models for segmenting and labeling sequence data . In Proceedings of ICML , 2001 . J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, 2001."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.3115\/980691.980696"},{"key":"e_1_2_1_21_1","volume-title":"Foundations of statistical natural language processing","author":"Manning C.","year":"1999","unstructured":"C. Manning and H. Schu\u00fctze . Foundations of statistical natural language processing . In The MIT Press , 1999 . C. Manning and H. Schu\u00fctze. Foundations of statistical natural language processing. In The MIT Press, 1999."},{"key":"e_1_2_1_22_1","first-page":"68","volume-title":"IIWeb","author":"Michelson M.","year":"2007","unstructured":"M. Michelson and C. A. Knoblock . Mining heterogeneous transformations for record linkage . In IIWeb , pages 68 -- 73 . AAAI Press , 2007 . M. Michelson and C. A. Knoblock. Mining heterogeneous transformations for record linkage. In IIWeb, pages 68--73. AAAI Press, 2007."},{"key":"e_1_2_1_23_1","volume-title":"Pmi-ir versus lsa on toefl. CoRR, cs.LG\/0212033","author":"Turney P. D.","year":"2002","unstructured":"P. D. Turney . Mining the web for synonyms : Pmi-ir versus lsa on toefl. CoRR, cs.LG\/0212033 , 2002 . P. D. Turney. Mining the web for synonyms: Pmi-ir versus lsa on toefl. CoRR, cs.LG\/0212033, 2002."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1687627.1687673","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T11:27:56Z","timestamp":1672226876000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1687627.1687673"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,8]]},"references-count":23,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2009,8]]}},"alternative-id":["10.14778\/1687627.1687673"],"URL":"https:\/\/doi.org\/10.14778\/1687627.1687673","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2009,8]]}}}