{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T09:00:05Z","timestamp":1775638805786,"version":"3.50.1"},"reference-count":21,"publisher":"Association for Computing Machinery (ACM)","issue":"5","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2016,1]]},"abstract":"<jats:p>Since regular expressions are often used to detect errors in sequences such as strings or date, it is natural to use them for data repair. Motivated by this, we propose a data repair method based on regular expression to make the input sequence data obey the given regular expression with minimal revision cost. The proposed method contains two steps, sequence repair and token value repair.<\/jats:p>\n          <jats:p>\n            For sequence repair, we propose the Regular-expression-based Structural Repair (RSR in short) algorithm. RSR algorithm is a dynamic programming algorithm that utilizes Nondeterministic Finite Automata (NFA) to calculate the edit distance between a prefix of the input string and a partial pattern regular expression with time complexity of\n            <jats:italic>O<\/jats:italic>\n            (\n            <jats:italic>nm<\/jats:italic>\n            <jats:sup>2<\/jats:sup>\n            ) and space complexity of\n            <jats:italic>O<\/jats:italic>\n            (\n            <jats:italic>mn<\/jats:italic>\n            ) where\n            <jats:italic>m<\/jats:italic>\n            is the edge number of NFA and\n            <jats:italic>n<\/jats:italic>\n            is the input string length. We also develop an optimization strategy to achieve higher performance for long strings. For token value repair, we combine the edit-distance-based method and associate rules by a unified argument for the selection of the proper method. 
Experimental results on both real and synthetic data show that the proposed method could repair the data effectively and efficiently.\n          <\/jats:p>","DOI":"10.14778\/2876473.2876478","type":"journal-article","created":{"date-parts":[[2016,2,1]],"date-time":"2016-02-01T14:10:31Z","timestamp":1454335831000},"page":"432-443","source":"Crossref","is-referenced-by-count":10,"title":["Repairing data through regular expressions"],"prefix":"10.14778","volume":"9","author":[{"given":"Zeyu","family":"Li","sequence":"first","affiliation":[{"name":"Harbin Institute of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hongzhi","family":"Wang","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wei","family":"Shao","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jianzhong","family":"Li","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hong","family":"Gao","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2016,1]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Summary of movielens datasets. http:\/\/files.grouplens.org\/datasets\/movielens\/ml-1m-README.txt."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/170035.170072"},{"key":"e_1_2_1_3_1","first-page":"487","volume-title":"VLDB","author":"Agrawal R.","year":"1994","unstructured":"R. Agrawal and R. Srikant. 
Fast algorithms for mining association rules in large databases. In VLDB, pages 487--499, 1994."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0304-3975(02)00737-5"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2746539.2746612"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2007.367920"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.14778\/1453856.1453980"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544886"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687690"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687627.1687691"},{"issue":"4","key":"e_1_2_1_11_1","first-page":"355","article-title":"Data quality and the bottom line","volume":"160","author":"Eckerson W.","year":"1992","unstructured":"W. Eckerson. Data quality and the bottom line. Journal of Radioanalytical & Nuclear Chemistry, 160(4):355--362, 1992.","journal-title":"Journal of Radioanalytical & Nuclear Chemistry"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544848"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/1718487.1718504"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536360.2536363"},{"key":"e_1_2_1_15_1","first-page":"140","volume-title":"KDD","author":"Lakshminarayan K.","year":"1996","unstructured":"K. Lakshminarayan, S. A. Harp, R. P. Goldman, and T. Samad. Imputation of missing data using machine learning techniques. 
In KDD, pages 140--145, 1996."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1807167.1807178"},{"issue":"4","key":"e_1_2_1_17_1","first-page":"245","article-title":"To err is human: Building a safer health system","volume":"7","author":"Medicine I. O.","year":"2000","unstructured":"I. O. Medicine, J. M. Corrigan, and M. S. Donaldson. To err is human: Building a safer health system. Institute of Medicine the National Academies, 7(4):245--246, 2000.","journal-title":"Institute of Medicine the National Academies"},{"key":"e_1_2_1_18_1","first-page":"381","volume-title":"VLDB","author":"Raman V.","year":"2001","unstructured":"V. Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In VLDB, pages 381--390, 2001."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/BMEI.2008.322"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/321796.321811"},{"key":"e_1_2_1_21_1","unstructured":"T. Warren. Using regular expressions for data cleansing and standardization. 
http:\/\/www.kimballgroup.com\/2009\/01\/using-regular-expressions-for-data-cleansing-and-standardization."}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/2876473.2876478","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:44:08Z","timestamp":1672224248000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/2876473.2876478"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,1]]},"references-count":21,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2016,1]]}},"alternative-id":["10.14778\/2876473.2876478"],"URL":"https:\/\/doi.org\/10.14778\/2876473.2876478","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2016,1]]}}}