{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,1]],"date-time":"2026-06-01T20:37:05Z","timestamp":1780346225447,"version":"3.54.1"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"1-2","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2010,9]]},"abstract":"<jats:p>\n            Rule-based information extraction from text is increasingly being used to populate databases and to support structured queries on unstructured text. Specification of suitable information extraction rules requires considerable skill and standard practice is to refine rules iteratively, with substantial effort. In this paper, we show that techniques developed in the context of data provenance, to determine the lineage of a tuple in a database, can be leveraged to assist in rule refinement. Specifically, given a set of extraction rules and correct and incorrect extracted data, we have developed a technique to suggest a ranked list of rule modifications that an expert rule specifier can consider. 
We implemented our technique in the\n            <jats:italic>SystemT<\/jats:italic>\n            information extraction system developed at IBM Research -- Almaden and experimentally demonstrate its effectiveness.\n          <\/jats:p>","DOI":"10.14778\/1920841.1920916","type":"journal-article","created":{"date-parts":[[2014,6,24]],"date-time":"2014-06-24T12:17:57Z","timestamp":1403612277000},"page":"588-597","source":"Crossref","is-referenced-by-count":25,"title":["Automatic rule refinement for information extraction"],"prefix":"10.14778","volume":"3","author":[{"given":"Bin","family":"Liu","sequence":"first","affiliation":[{"name":"University of Michigan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Laura","family":"Chiticariu","sequence":"additional","affiliation":[{"name":"IBM Research - Almaden"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Vivian","family":"Chu","sequence":"additional","affiliation":[{"name":"IBM Research - Almaden"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"H. V.","family":"Jagadish","sequence":"additional","affiliation":[{"name":"University of Michigan"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Frederick R.","family":"Reiss","sequence":"additional","affiliation":[{"name":"IBM Research - Almaden"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2010,9]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Framework (SQL\/Framework). Technical report. ISO\/IEC 9075--1","author":"Database","year":"2003","unstructured":"Database languages -- SQL -- Part 1: Framework (SQL\/Framework). Technical report. ISO\/IEC 9075--1:2003."},{"key":"e_1_2_1_2_1","unstructured":"The Enron corpus. 
www.cs.cmu.edu\/enron\/."},{"key":"e_1_2_1_3_1","unstructured":"Automatic Content Extraction 2005 Evaluation Dataset 2005."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/336597.336644"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.3115\/1119089.1119095"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CSIE.2009.857"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/358234.381162"},{"key":"e_1_2_1_8_1","volume-title":"RANLP","author":"Boguraev B.","year":"2003","unstructured":"B. Boguraev. Annotation-based Finite State Processing in a Large-Scale NLP Architecture. In RANLP, 2003."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1559845.1559901"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1561\/1900000006"},{"key":"e_1_2_1_11_1","volume-title":"ACL","author":"Chiticariu L.","year":"2010","unstructured":"L. Chiticariu, R. Krishnamurthy, Y. Li, S. Raghavan, F. Reiss, and S. Vaithyanathan. SystemT: An Algebraic Approach to Declarative Information Extraction. In ACL, 2010."},{"key":"e_1_2_1_12_1","volume-title":"JAPE: a Java Annotation Patterns Engine. Research Memorandum CS - 99 - 06","author":"Cunningham H.","year":"1999","unstructured":"H. Cunningham. JAPE: a Java Annotation Patterns Engine. 
Research Memorandum CS - 99 - 06, University of Sheffield, May 1999."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807253"},{"key":"e_1_2_1_14_1","volume-title":"Strategies for Natural language Processing.","author":"DeJong D.","year":"1982","unstructured":"D. DeJong. An Overview of the FRUMP System. In Strategies for Natural language Processing. 1982."},{"key":"e_1_2_1_15_1","volume-title":"ICML","author":"Freitag D.","year":"1998","unstructured":"D. Freitag. Multistrategy Learning for Information Extraction. In ICML, 1998."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2009.15"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1265530.1265535"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920869"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.3115\/1075671.1075701"},{"key":"e_1_2_1_20_1","volume-title":"On the Provenance of Non-Answers to Queries over Extracted Data. PVLDB, 1(1)","author":"Huang J.","year":"2008","unstructured":"J. Huang, T. Chen, A. Doan, and J. F. Naughton. On the Provenance of Non-Answers to Queries over Extracted Data. PVLDB, 1(1), 2008."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2009.138"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1519103.1519105"},{"key":"e_1_2_1_23_1","volume-title":"ICML","author":"Lafferty J.","year":"2001","unstructured":"J. Lafferty, A. McCallum, and F. Pereira. 
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In ICML, 2001."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.3115\/1072017.1072043"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/1613715.1613719"},{"key":"e_1_2_1_26_1","volume-title":"HLT-NAACL","author":"Peng F.","year":"2004","unstructured":"F. Peng and A. McCallum. Accurate Information Extraction from Research Papers Using Conditional Random Fields. In HLT-NAACL, 2004."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2008.4497502"},{"key":"e_1_2_1_28_1","volume-title":"KDD","author":"Riloff E.","year":"1993","unstructured":"E. Riloff. Automatically Constructing a Dictionary for Information Extraction Tasks. In KDD, 1993."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376718"},{"key":"e_1_2_1_30_1","volume-title":"VLDB","author":"Shen W.","year":"2007","unstructured":"W. Shen, A. Doan, J. F. Naughton, and R. Ramakrishnan. Declarative Information Extraction Using Datalog with Embedded Extraction Predicates. In VLDB, 2007."},{"key":"e_1_2_1_31_1","volume-title":"U. Mass.","author":"Soderland S. G.","year":"1996","unstructured":"S. G. Soderland. Learning Text Analysis Rules for Domain-specific Natural Language Processing. Technical report, U. 
Mass., 1996."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2006.47"},{"key":"e_1_2_1_33_1","volume-title":"ICML","author":"Thompson C.","year":"1999","unstructured":"C. Thompson, M. Califf, and R. Mooney. Active Learning for Natural Language Parsing and Information Extraction. In ICML, 1999."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.3115\/1119176.1119195"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/1614164.1614177"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.3115\/1219840.1219892"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/1920841.1920916","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T11:49:15Z","timestamp":1672228155000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/1920841.1920916"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2010,9]]},"references-count":36,"journal-issue":{"issue":"1-2","published-print":{"date-parts":[[2010,9]]}},"alternative-id":["10.14778\/1920841.1920916"],"URL":"https:\/\/doi.org\/10.14778\/1920841.1920916","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2010,9]]}}}