{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T03:33:53Z","timestamp":1767929633426,"version":"3.49.0"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"PLDI","license":[{"start":{"date-parts":[[2023,6,6]],"date-time":"2023-06-06T00:00:00Z","timestamp":1686009600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"JSPS KAKENHI","award":["JP17H01720, JP18K19787, JP20H04162, JP20K20625, and JP22H03570."],"award-info":[{"award-number":["JP17H01720, JP18K19787, JP20H04162, JP20K20625, and JP22H03570."]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Program. Lang."],"published-print":{"date-parts":[[2023,6,6]]},"abstract":"<jats:p>While synthesizing and repairing regular expressions (regexes) based on Programming-by-Examples (PBE) methods have seen rapid progress in recent years, all existing works only support synthesizing or repairing regexes for membership testing, and the support for extraction is still an open problem. This paper fills the void by proposing the first PBE-based method for synthesizing and repairing regexes for extraction. Our work supports regexes that have real-world extensions such as backreferences and lookarounds. The extensions significantly affect the PBE-based synthesis and repair problem. In fact, we show that there are unsolvable instances of the problem if the synthesized regexes are not allowed to use the extensions, i.e., there is no regex without the extensions that correctly classify the given set of examples, whereas every problem instance is solvable if the extensions are allowed. This is in stark contrast to the case for the membership where every instance is guaranteed to have a solution expressible by a pure regex without the extensions. 
The main contribution of the paper is an algorithm to solve the PBE-based synthesis and repair problem for extraction. Our algorithm builds on existing methods for synthesizing and repairing regexes for membership testing, i.e., the enumerative search algorithms with SMT constraint solving. However, significant extensions are needed because the SMT constraints in the previous works are based on a non-deterministic semantics of regexes. Non-deterministic semantics is sound for membership but not for extraction, because which substrings are extracted depends on the deterministic behavior of actual regex engines. To address the issue, we propose a new SMT constraint generation method that respects the deterministic behavior of regex engines. For this, we first define a novel formal semantics of an actual regex engine as a deterministic big-step operational semantics, and use it as a basis to design the new SMT constraint generation method. The key idea to simulate the determinism in the formal semantics and the constraints is to consider continuations of regex matching and use them for disambiguation. We also propose two new search space pruning techniques called approximation-by-pure-regex and approximation-by-backreferences that make use of the extraction information in the examples. We have implemented the synthesis and repair algorithm in a tool called R3 (Repairing Regex for extRaction) and evaluated it on 50 regexes that contain real-world extensions. 
Our evaluation shows the effectiveness of the algorithm and that our new pruning techniques substantially prune the search space.<\/jats:p>","DOI":"10.1145\/3591287","type":"journal-article","created":{"date-parts":[[2023,6,6]],"date-time":"2023-06-06T20:06:24Z","timestamp":1686081984000},"page":"1633-1656","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Repairing Regular Expressions for Extraction"],"prefix":"10.1145","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9542-9234","authenticated-orcid":false,"given":"Nariyoshi","family":"Chida","sequence":"first","affiliation":[{"name":"NTT Social Informatics Laboratories, Japan \/ Waseda University, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5305-4916","authenticated-orcid":false,"given":"Tachio","family":"Terauchi","sequence":"additional","affiliation":[{"name":"Waseda University, Japan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,6,6]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0019-9958(78)90683-6"},{"key":"e_1_2_1_2_1","volume-title":"Angular: The modern web developer\u2019s platform..  https:\/\/angular.io\/","year":"2022","unstructured":"Angular. 2022. Angular: The modern web developer\u2019s platform. https:\/\/angular.io\/"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2014.344"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2016.2515587"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the Prague Stringology Conference 2017","author":"Berglund Martin","year":"2017","unstructured":"Martin Berglund and Brink van der Merwe. 2017. Regular Expressions with Backreferences Re-examined. 
In Proceedings of the Prague Stringology Conference 2017, Prague, Czech Republic, August 28-30, 2017, Jan Holub and Jan Zd\u00e1rek (Eds.). Department of Theoretical Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 30\u201341. http:\/\/www.stringology.org\/event\/2017\/p04.html"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/168304.168340"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/3385412.3385988"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3498707"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.4230\/LIPIcs.FSCD.2022.15"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/SP46214.2022.9833597"},{"key":"e_1_2_1_11_1","unstructured":"Russ Cox. 2007. Regular Expression Matching Can Be Simple And Fast (but is slow in Java Perl PHP Python Ruby ...). https:\/\/swtch.com\/~rsc\/regexp\/regexp1.html"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3236024.3236027"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3338906.3338909"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/1792734.1792766"},{"key":"e_1_2_1_15_1","volume-title":"Django: The Web framework for perfectionists with deadlines..  https:\/\/www.djangoproject.com\/","year":"2022","unstructured":"Django. 2022. 
Django: The Web framework for perfectionists with deadlines. https:\/\/www.djangoproject.com\/"},{"key":"e_1_2_1_16_1","unstructured":"ECMA International. 2022. ECMAScript\u00ae 2023 Language Specification. https:\/\/tc39.es\/ecma262\/multipage\/#sec-intro"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ic.2008.12.008"},{"key":"e_1_2_1_18_1","volume-title":"FOREST: An Interactive Multi-tree Synthesizer for Regular Expressions. In Tools and Algorithms for the Construction and Analysis of Systems","author":"Ferreira Margarida","year":"2021","unstructured":"Margarida Ferreira, Miguel Terra-Neves, Miguel Ventura, In\u00eas Lynce, and Ruben Martins. 2021. FOREST: An Interactive Multi-tree Synthesizer for Regular Expressions. In Tools and Algorithms for the Construction and Analysis of Systems, Jan Friso Groote and Kim Guldstrand Larsen (Eds.). Springer International Publishing, Cham. 152\u2013169. isbn:978-3-030-72016-2"},{"key":"e_1_2_1_19_1","volume-title":"Mastering Regular Expressions (3 ed.). O\u2019Reilly","author":"Friedl Jeffrey E. F.","unstructured":"Jeffrey E. F. Friedl. 2006. Mastering Regular Expressions (3 ed.). 
O\u2019Reilly, Beijing. isbn:978-0-596-52812-6 https:\/\/www.safaribooksonline.com\/library\/view\/mastering-regular-expressions\/0596528124\/"},{"key":"e_1_2_1_20_1","volume-title":"Greedy Regular Expression Matching","author":"Frisch Alain","unstructured":"Alain Frisch and Luca Cardelli. 2004. Greedy Regular Expression Matching. In Automata, Languages and Programming, Josep D\u00edaz, Juhani Karhum\u00e4ki, Arto Lepist\u00f6, and Donald Sannella (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg. 618\u2013629. isbn:978-3-540-27836-8"},{"key":"e_1_2_1_21_1","unstructured":"Google. [n. d.]. RE2. https:\/\/github.com\/google\/re2"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4684-2001-2_9"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3093335.2993244"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics","author":"Li Yunyao","unstructured":"Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and H. V. Jagadish. 2008. Regular Expression Learning for Information Extraction. 
In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Honolulu, Hawaii. 21\u201330. https:\/\/aclanthology.org\/D08-1003"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSE43902.2021.00111"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3324884.3416556"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3314221.3314645"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3355369.3355589"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASE.2019.00047"},{"key":"e_1_2_1_30_1","unstructured":"Chris O\u2019Hara. 2022. Validator.js. https:\/\/github.com\/validatorjs\/validator.js\/"},{"key":"e_1_2_1_31_1","unstructured":"OWASP. 2022. Input Validation Cheat Sheet. https:\/\/cheatsheetseries.owasp.org\/cheatsheets\/Input_Validation_Cheat_Sheet.html"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3360565"},{"key":"e_1_2_1_33_1","volume-title":"Suchanek","author":"Rebele Thomas","year":"2018","unstructured":"Thomas Rebele, Katerina Tzompanaki, and Fabian M. Suchanek. 2018. Adding Missing Words to Regular Expressions. In Advances in Knowledge Discovery and Data Mining, Dinh Phung, Vincent S. Tseng, Geoffrey I. Webb, Bao Ho, Mohadeseh Ganji, and Lida Rashidi (Eds.). 
Springer International Publishing, Cham. 67\u201379. isbn:978-3-319-93037-4"},{"key":"e_1_2_1_34_1","unstructured":"RegExLib. 2022. https:\/\/regexlib.com\/"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jal.2011.11.003"},{"key":"e_1_2_1_36_1","unstructured":"Amazon Web Services. 2022. Regex match rule statement. https:\/\/docs.aws.amazon.com\/waf\/latest\/developerguide\/waf-rule-statement-type-regex-match.html"},{"key":"e_1_2_1_37_1","unstructured":"Snort. 2022. Snort. https:\/\/www.snort.org\/"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3379597.3387464"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/3379337.3415900"}],"container-title":["Proceedings of the ACM on Programming 
Languages"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3591287","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3591287","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:47:20Z","timestamp":1750178840000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3591287"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,6]]},"references-count":39,"journal-issue":{"issue":"PLDI","published-print":{"date-parts":[[2023,6,6]]}},"alternative-id":["10.1145\/3591287"],"URL":"https:\/\/doi.org\/10.1145\/3591287","relation":{},"ISSN":["2475-1421"],"issn-type":[{"value":"2475-1421","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,6]]},"assertion":[{"value":"2023-06-06","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}