{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,21]],"date-time":"2026-05-21T01:29:05Z","timestamp":1779326945686,"version":"3.51.4"},"reference-count":37,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2025,2,10]],"date-time":"2025-02-10T00:00:00Z","timestamp":1739145600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,2,10]]},"abstract":"<jats:p>\n                    String data is common in real-world datasets: 67.6% of values in a sample of 1.8 million real Excel spreadsheets from the web were represented as text. Automatically cleaning such string data can have a significant impact on users. Previous approaches are limited to error detection, require that the user provides annotations, examples, or constraints to fix the errors, and focus independently on syntactic errors or semantic errors in strings, but ignore that strings often contain both syntactic and semantic substrings. We introduce\n                    <jats:sc>DataVinci,<\/jats:sc>\n                    a fully unsupervised string data error detection and repair system.\n                    <jats:sc>DataVinci<\/jats:sc>\n                    learns regular-expression-based patterns that cover a majority of values in a column and reports values that do not satisfy such majority patterns as data errors.\n                    <jats:sc>DataVinci<\/jats:sc>\n                    can automatically derive edits to the data error based on the majority patterns and using row tuples associated with majority values as examples. To handle strings with both syntactic and semantic substrings,\n                    <jats:sc>DataVinci<\/jats:sc>\n                    uses an LLM to abstract (and re-concretize) portions of strings that are semantic. Because not all data columns can result in majority patterns, when available,\n                    <jats:sc>DataVinci<\/jats:sc>\n                    can leverage execution information from an existing data program (which uses the target data as input) to identify and correct data repairs that would not otherwise be identified.\n                    <jats:sc>DataVinci<\/jats:sc>\n                    outperforms eleven baseline systems on both data error detection and repair as demonstrated on four existing and new benchmarks.\n                  <\/jats:p>","DOI":"10.1145\/3709677","type":"journal-article","created":{"date-parts":[[2025,2,11]],"date-time":"2025-02-11T15:45:06Z","timestamp":1739288706000},"page":"1-26","source":"Crossref","is-referenced-by-count":2,"title":["<scp>DataVinci:<\/scp>\n                    Learning Syntactic and Semantic String Repairs"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9510-4512","authenticated-orcid":false,"given":"Mukul","family":"Singh","sequence":"first","affiliation":[{"name":"Microsoft, Redmond, Washington, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0713-6141","authenticated-orcid":false,"given":"Jos\u00e9","family":"Cambronero","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, Washington, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9226-9634","authenticated-orcid":false,"given":"Sumit","family":"Gulwani","sequence":"additional","affiliation":[{"name":"Microsoft Research, Redmond, Washington, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3727-3291","authenticated-orcid":false,"given":"Vu","family":"Le","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, Washington, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2130-7223","authenticated-orcid":false,"given":"Carina","family":"Negreanu","sequence":"additional","affiliation":[{"name":"Robin AI, Cambridge, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5559-5932","authenticated-orcid":false,"given":"Arjun","family":"Radhakrishna","sequence":"additional","affiliation":[{"name":"Microsoft, Redmond, Washington, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9182-597X","authenticated-orcid":false,"given":"Gust","family":"Verbruggen","sequence":"additional","affiliation":[{"name":"Microsoft, Keerbergen, Belgium"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,2,11]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-020-00617-6"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-24583-1_7"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASE51524.2021.9678525"},{"key":"e_1_2_1_4_1","volume-title":"Lin (Eds.)","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877--1901. https:\/\/proceedings.neurips.cc\/paper\/2020\/file\/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf"},{"key":"e_1_2_1_5_1","volume-title":"Derivatives of approximate regular expressions. Discrete Mathematics & Theoretical Computer Science","author":"Champarnaud Jean-Marc","year":"2013","unstructured":"Jean-Marc Champarnaud, Hadrien Jeanne, and Ludovic Mignot. 2013. Derivatives of approximate regular expressions. Discrete Mathematics & Theoretical Computer Science, Vol. 15, Automata, Logic and Semantics (2013)."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3622863"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3622863"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2912574"},{"key":"e_1_2_1_9_1","volume-title":"Introduction to Algorithms","author":"Cormen Thomas H.","unstructured":"Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms, Third Edition 3rd ed.). The MIT Press.","edition":"3"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319888"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196889"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3292500.3330993"},{"key":"e_1_2_1_13_1","unstructured":"Albert Q. Jiang Alexandre Sablayrolles Arthur Mensch Chris Bamford Devendra Singh Chaplot Diego de las Casas Florian Bressand Gianna Lengyel Guillaume Lample Lucile Saulnier L\u00e9lio Renard Lavaud Marie-Anne Lachaux Pierre Stock Teven Le Scao Thibaut Lavril Thomas Wang Timoth\u00e9e Lacroix and William El Sayed. 2023. Mistral 7B. arxiv: 2310.06825 [cs.CL] https:\/\/arxiv.org\/abs\/2310.06825"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994514"},{"key":"e_1_2_1_15_1","volume-title":"Suriya Gunasekar, and Yin Tat Lee.","author":"Li Yuanzhi","year":"2023","unstructured":"Yuanzhi Li, S\u00e9bastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023. Textbooks Are All You Need II: textbfphi-1.5 technical report. arXiv preprint arXiv:2309.05463 (2023)."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407801"},{"key":"e_1_2_1_17_1","unstructured":"Mohammad Mahdavi and Ziawasch Abedjan. 2021. Semi-Supervised Data Cleaning with Raha and Baran.. In CIDR."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324956"},{"key":"e_1_2_1_19_1","unstructured":"Microsoft. 2015. Program synthesis from input-output examples (PROSE). https:\/\/microsoft.github.io\/prose."},{"key":"e_1_2_1_20_1","volume-title":"Approximate matching of regular expressions. Bulletin of mathematical biology","author":"Myers Eugene W","year":"1989","unstructured":"Eugene W Myers and Webb Miller. 1989. Approximate matching of regular expressions. Bulletin of mathematical biology, Vol. 51, 1 (1989), 5--37."},{"key":"e_1_2_1_21_1","volume-title":"Can Foundation Models Wrangle Your Data? arXiv preprint arXiv:2205.09911","author":"Narayan Avanika","year":"2022","unstructured":"Avanika Narayan, Ines Chami, Laurel Orr, Simran Arora, and Christopher R\u00e9. 2022. Can Foundation Models Wrangle Your Data? arXiv preprint arXiv:2205.09911 (2022)."},{"key":"e_1_2_1_22_1","volume-title":"A guided tour to approximate string matching. ACM computing surveys (CSUR)","author":"Navarro Gonzalo","year":"2001","unstructured":"Gonzalo Navarro. 2001. A guided tour to approximate string matching. ACM computing surveys (CSUR), Vol. 33, 1 (2001), 31--88."},{"key":"e_1_2_1_23_1","unstructured":"OpenAI Josh Achiam Steven Adler Sandhini Agarwal Lama Ahmad Ilge Akkaya Florencia Leoni Aleman Diogo Almeida Janko Altenschmidt Sam Altman Shyamal Anadkat Red Avila Igor Babuschkin"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3276520"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19--1677"},{"key":"e_1_2_1_26_1","doi-asserted-by":"crossref","unstructured":"Abdulhakim Qahtan Nan Tang Mourad Ouzzani Yang Cao and Michael Stonebraker. 2020. Pattern functional dependencies for data cleaning. (2020).","DOI":"10.14778\/3377369.3377377"},{"key":"e_1_2_1_27_1","first-page":"1","article-title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, Vol. 21, 140 (2020), 1--67. http:\/\/jmlr.org\/papers\/v21\/20-074.html","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3485535"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3485535"},{"key":"e_1_2_1_30_1","first-page":"381","article-title":"Potter's wheel: An interactive data cleaning system","volume":"1","author":"Raman Vijayshankar","year":"2001","unstructured":"Vijayshankar Raman and Joseph M Hellerstein. 2001. Potter's wheel: An interactive data cleaning system. In VLDB, Vol. 1. 381--390.","journal-title":"VLDB"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_1_32_1","volume-title":"CORNET: A neurosymbolic approach to learning conditional table formatting rules by example. arXiv preprint arXiv:2208.06032","author":"Singh Mukul","year":"2022","unstructured":"Mukul Singh, Jos\u00e9 Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, Mohammad Raza, and Gust Verbruggen. 2022. CORNET: A neurosymbolic approach to learning conditional table formatting rules by example. arXiv preprint arXiv:2208.06032 (2022)."},{"key":"e_1_2_1_33_1","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic Sergey Edunov and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arxiv: 2307.09288 [cs.CL] https:\/\/arxiv.org\/abs\/2307.09288"},{"key":"e_1_2_1_34_1","volume-title":"\u0141 ukasz Kaiser, and Illia Polosukhin","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, \u0141 ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https:\/\/proceedings.neurips.cc\/paper\/2017\/file\/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3485477"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319855"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.685"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3709677","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3709677","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T18:20:33Z","timestamp":1774981233000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3709677"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,10]]},"references-count":37,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,2,10]]}},"alternative-id":["10.1145\/3709677"],"URL":"https:\/\/doi.org\/10.1145\/3709677","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,10]]}}}