{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,11]],"date-time":"2026-05-11T22:53:27Z","timestamp":1778540007250,"version":"3.51.4"},"reference-count":54,"publisher":"Association for Computing Machinery (ACM)","issue":"OOPSLA","license":[{"start":{"date-parts":[[2018,10,24]],"date-time":"2018-10-24T00:00:00Z","timestamp":1540339200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-sa\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["CCF-1527923"],"award-info":[{"award-number":["CCF-1527923"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100006112","name":"Microsoft Research","doi-asserted-by":"publisher","award":["PhD Fellowship"],"award-info":[{"award-number":["PhD Fellowship"]}],"id":[{"id":"10.13039\/100006112","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100006112","name":"Microsoft","doi-asserted-by":"publisher","award":["Internship"],"award-info":[{"award-number":["Internship"]}],"id":[{"id":"10.13039\/100006112","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Program. Lang."],"published-print":{"date-parts":[[2018,10,24]]},"abstract":"<jats:p>\n            We address the problem of learning a\n            <jats:italic>syntactic profile<\/jats:italic>\n            for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios.\n          <\/jats:p>\n          <jats:p>Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words etc.), and provide no control over granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, that also allows for interactive refinement by requesting a desired number of clusters.<\/jats:p>\n          <jats:p>Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only \u223c 0.7s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as Flash Fill.<\/jats:p>","DOI":"10.1145\/3276520","type":"journal-article","created":{"date-parts":[[2018,10,24]],"date-time":"2018-10-24T11:57:18Z","timestamp":1540382238000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":21,"title":["FlashProfile: a framework for synthesizing data profiles"],"prefix":"10.1145","volume":"2","author":[{"given":"Saswat","family":"Padhi","sequence":"first","affiliation":[{"name":"University of California at Los Angeles, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Prateek","family":"Jain","sequence":"additional","affiliation":[{"name":"Microsoft Research Lab, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Daniel","family":"Perelman","sequence":"additional","affiliation":[{"name":"Microsoft, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Oleksandr","family":"Polozov","sequence":"additional","affiliation":[{"name":"Microsoft Research, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sumit","family":"Gulwani","sequence":"additional","affiliation":[{"name":"Microsoft, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Todd","family":"Millstein","sequence":"additional","affiliation":[{"name":"University of California at Los Angeles, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2018,10,24]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-015-0389-y"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/0890-5401(87)90052-6"},{"key":"e_1_2_2_3_1","volume-title":"Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007","author":"Arthur David","year":"2007","unstructured":"David Arthur and Sergei Vassilvitskii . 2007 . k-means++: The Advantages of Careful Seeding . In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007 , New Orleans, Louisiana, USA , January 7-9, 2007, Nikhil Bansal, Kirk Pruhs, and Clifford Stein (Eds.). SIAM, 1027\u20131035. http:\/\/dl.acm.org\/citation.cfm?id=1283383.1283494 David Arthur and Sergei Vassilvitskii. 2007. k-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, Nikhil Bansal, Kirk Pruhs, and Clifford Stein (Eds.). SIAM, 1027\u20131035. http:\/\/dl.acm.org\/citation.cfm?id=1283383.1283494"},{"key":"e_1_2_2_4_1","unstructured":"Ataccama. 2017. Ataccama One Platform. https:\/\/www.ataccama.com\/ .  Ataccama. 2017. Ataccama One Platform. https:\/\/www.ataccama.com\/ ."},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2330784.2331000"},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3062341.3062349"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2821650.2821667"},{"key":"e_1_2_2_8_1","volume-title":"Pattern Recognition and Machine Learning","author":"Bishop Christopher M.","unstructured":"Christopher M. Bishop . 2016. Pattern Recognition and Machine Learning . Springer New York . http:\/\/www.worldcat.org\/ oclc\/1005113608 Christopher M. Bishop. 2016. Pattern Recognition and Machine Learning. Springer New York. http:\/\/www.worldcat.org\/ oclc\/1005113608"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1010933404324"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1561\/1900000004"},{"key":"e_1_2_2_11_1","volume-title":"Grammatical inference: learning automata and grammars","author":"la Higuera Colin De","unstructured":"Colin De la Higuera . 2010. Grammatical inference: learning automata and grammars . Cambridge University Press . Colin De la Higuera. 2010. Grammatical inference: learning automata and grammars. Cambridge University Press."},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536222.2536253"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.5555\/3172077.3172115"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376759"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.5120\/11638-7118"},{"key":"e_1_2_2_16_1","unstructured":"Google. 2017. OpenRefine: A free open source powerful tool for working with messy data. http:\/\/openrefine.org\/ .  Google. 2017. OpenRefine: A free open source powerful tool for working with messy data. http:\/\/openrefine.org\/ ."},{"key":"e_1_2_2_17_1","volume-title":"Concrete Mathematics - A Foundation for Computer Science","author":"Graham Ronald L.","unstructured":"Ronald L. Graham , Donald E. Knuth , and Oren Patashnik . 1994. Concrete Mathematics - A Foundation for Computer Science , 2 nd Edition). Addison-Wesley . http:\/\/www.worldcat.org\/oclc\/992331503 Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. 1994. Concrete Mathematics - A Foundation for Computer Science, 2nd Edition). Addison-Wesley. http:\/\/www.worldcat.org\/oclc\/992331503","edition":"2"},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1926385.1926423"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1561\/2500000010"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.5555\/645921.673295"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1012801612483"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1561\/2200000037"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/1315451.1315455"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/331499.331504"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2007.367889"},{"key":"e_1_2_2_26_1","volume-title":"Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015","author":"Kini Dileep","year":"2015","unstructured":"Dileep Kini and Sumit Gulwani . 2015 . FlashNormalize: Programming by Examples for Text Normalization . In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015 , Buenos Aires, Argentina , July 25-31, 2015, Qiang Yang and Michael Wooldridge (Eds.). AAAI Press, 776\u2013783. http:\/\/ijcai.org\/Abstract\/15\/115 Dileep Kini and Sumit Gulwani. 2015. FlashNormalize: Programming by Examples for Text Normalization. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, Qiang Yang and Michael Wooldridge (Eds.). AAAI Press, 776\u2013783. http:\/\/ijcai.org\/Abstract\/15\/115"},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1609\/aimag.v30i4.2262"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2594291.2594333"},{"key":"e_1_2_2_29_1","first-page":"707","article-title":"Binary Codes Capable of Correcting Deletions, Insertions, and Reversals","volume":"10","author":"Levenshtein Vladimir I","year":"1966","unstructured":"Vladimir I Levenshtein . 1966 . Binary Codes Capable of Correcting Deletions, Insertions, and Reversals . In Soviet Physics Doklady , Vol. 10. 707 \u2013 710 . http:\/\/adsabs.harvard.edu\/abs\/1966SPhD...10..707L Vladimir I Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. In Soviet Physics Doklady, Vol. 10. 707\u2013710. http:\/\/adsabs.harvard.edu\/abs\/1966SPhD...10..707L","journal-title":"Soviet Physics Doklady"},{"key":"e_1_2_2_30_1","unstructured":"Yunyao Li Rajasekar Krishnamurthy Sriram Raghavan Shivakumar Vaithyanathan and H. V. Jagadish. 2008. Regular Expression Learning for Information Extraction. In 2008 Conference on Empirical Methods in Natural Language Processing EMNLP 2008 Proceedings of the Conference 25-27 October 2008 Honolulu Hawaii USA A meeting of SIGDAT a Special Interest Group of the ACL. ACL 21\u201330. http:\/\/www.aclweb.org\/anthology\/D08- 1003   Yunyao Li Rajasekar Krishnamurthy Sriram Raghavan Shivakumar Vaithyanathan and H. V. Jagadish. 2008. Regular Expression Learning for Information Extraction. In 2008 Conference on Empirical Methods in Natural Language Processing EMNLP 2008 Proceedings of the Conference 25-27 October 2008 Honolulu Hawaii USA A meeting of SIGDAT a Special Interest Group of the ACL. ACL 21\u201330. http:\/\/www.aclweb.org\/anthology\/D08- 1003"},{"key":"e_1_2_2_31_1","unstructured":"Henry Lieberman. 2001. Your wish is my command: Programming by example. Morgan Kaufmann.  Henry Lieberman. 2001. Your wish is my command: Programming by example. Morgan Kaufmann."},{"key":"e_1_2_2_32_1","volume-title":"Is Key Hurdle to Insights.","author":"Lohr Steve","year":"2014","unstructured":"Steve Lohr . 2014. For Big-Data Scientists, \u2018Janitor Work \u2019 Is Key Hurdle to Insights. New York Times 17 ( 2014 ). https: \/\/www.nytimes.com\/2014\/08\/18\/technology\/for- big- data- scientists- hurdle- to- insights- is- janitor- work.html Steve Lohr. 2014. For Big-Data Scientists, \u2018Janitor Work\u2019 Is Key Hurdle to Insights. New York Times 17 (2014). https: \/\/www.nytimes.com\/2014\/08\/18\/technology\/for- big- data- scientists- hurdle- to- insights- is- janitor- work.html"},{"key":"e_1_2_2_33_1","volume-title":"Proceedings of the fifth Berkeley symposium on mathematical statistics and probability","volume":"1","author":"James","unstructured":"James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations . In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability , Vol. 1 . Oakland, CA, USA., 281\u2013297. James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1. Oakland, CA, USA., 281\u2013297."},{"key":"e_1_2_2_34_1","volume-title":"Introduction to information retrieval","author":"Manning Christopher D.","unstructured":"Christopher D. Manning , Prabhakar Raghavan , and Hinrich Sch\u00fctze . 2008. Introduction to information retrieval . Cambridge University Press . Christopher D. Manning, Prabhakar Raghavan, and Hinrich Sch\u00fctze. 2008. Introduction to information retrieval. Cambridge University Press."},{"key":"e_1_2_2_35_1","volume-title":"Data Quality Assessment","author":"Maydanchik Arkady","unstructured":"Arkady Maydanchik . 2007. Data Quality Assessment . Technics Publications . https:\/\/technicspub.com\/ data- quality- assessment\/ Arkady Maydanchik. 2007. Data Quality Assessment. Technics Publications. https:\/\/technicspub.com\/ data- quality- assessment\/"},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2807442.2807459"},{"key":"e_1_2_2_37_1","unstructured":"Microsoft. 2017a. Azure Machine Learning By-Example Data Transform. https:\/\/www.youtube.com\/watch?v=9KG0Sc2B2KI .  Microsoft. 2017a. Azure Machine Learning By-Example Data Transform. https:\/\/www.youtube.com\/watch?v=9KG0Sc2B2KI ."},{"key":"e_1_2_2_38_1","unstructured":"Microsoft. 2017b. Data Transformations \"By Example\" in the Azure ML Workbench. https:\/\/blogs.technet.microsoft.com\/ machinelearning\/2017\/09\/25\/by- example- transformations- in- the- azure- machine- learning- workbench\/ .  Microsoft. 2017b. Data Transformations \"By Example\" in the Azure ML Workbench. https:\/\/blogs.technet.microsoft.com\/ machinelearning\/2017\/09\/25\/by- example- transformations- in- the- azure- machine- learning- workbench\/ ."},{"key":"e_1_2_2_39_1","unstructured":"Microsoft. 2017c. Microsoft SQL Server Data Tools (SSDT). https:\/\/docs.microsoft.com\/en- gb\/sql\/ssdt .  Microsoft. 2017c. Microsoft SQL Server Data Tools (SSDT). https:\/\/docs.microsoft.com\/en- gb\/sql\/ssdt ."},{"key":"e_1_2_2_40_1","unstructured":"Microsoft. 2017d. Program Synthesis using Examples SDK. https:\/\/microsoft.github.io\/prose\/ .  Microsoft. 2017d. Program Synthesis using Examples SDK. https:\/\/microsoft.github.io\/prose\/ ."},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1142\/9789812797919_0007"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2814270.2814310"},{"key":"e_1_2_2_43_1","volume-title":"Potter\u2019s Wheel: An Interactive Data Cleaning System. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases","author":"Raman Vijayshankar","year":"2001","unstructured":"Vijayshankar Raman and Joseph M. Hellerstein . 2001 . Potter\u2019s Wheel: An Interactive Data Cleaning System. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases , September 11-14, 2001 , Roma, Italy, Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass (Eds.). Morgan Kaufmann, 381\u2013390. http:\/\/www.vldb.org\/conf\/ 2001\/P381.pdf Vijayshankar Raman and Joseph M. Hellerstein. 2001. Potter\u2019s Wheel: An Interactive Data Cleaning System. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass (Eds.). Morgan Kaufmann, 381\u2013390. http:\/\/www.vldb.org\/conf\/2001\/P381.pdf"},{"key":"e_1_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2837614.2837671"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.14778\/2977797.2977807"},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/1168857.1168907"},{"key":"e_1_2_2_47_1","first-page":"1","article-title":"A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons","volume":"5","author":"S\u00f8rensen Thorvald","year":"1948","unstructured":"Thorvald S\u00f8rensen . 1948 . A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons . Biol. Skr. 5 (1948), 1 \u2013 34 . Thorvald S\u00f8rensen. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr. 5 (1948), 1\u201334.","journal-title":"Biol. Skr."},{"key":"e_1_2_2_48_1","volume-title":"Proc. Genetic Programming","author":"Svingen Borge","year":"1998","unstructured":"Borge Svingen . 1998 . Learning Regular Languages using Genetic Programming . Proc. Genetic Programming (1998), 374\u2013376. Borge Svingen. 1998. Learning Regular Languages using Genetic Programming. Proc. Genetic Programming (1998), 374\u2013376."},{"key":"e_1_2_2_49_1","first-page":"1035","article-title":"Solution of Incorrectly Formulated Problems and the Regularization Method","volume":"151","author":"Tikhonov Andrei N","year":"1963","unstructured":"Andrei N Tikhonov . 1963 . Solution of Incorrectly Formulated Problems and the Regularization Method . In Dokl. Akad. Nauk. , Vol. 151. 1035 \u2013 1038 . Andrei N Tikhonov. 1963. Solution of Incorrectly Formulated Problems and the Regularization Method. In Dokl. Akad. Nauk., Vol. 151. 1035\u20131038.","journal-title":"Dokl. Akad. Nauk."},{"key":"e_1_2_2_50_1","unstructured":"Trifacta. 2017. Trifacta Wrangler. https:\/\/www.trifacta.com\/products\/wrangler\/ .  Trifacta. 2017. Trifacta Wrangler. https:\/\/www.trifacta.com\/products\/wrangler\/ ."},{"key":"e_1_2_2_51_1","unstructured":"William E Winkler. 1999. The State of Record Linkage and Current Research Problems.  William E Winkler. 1999. The State of Record Linkage and Current Research Problems."},{"key":"e_1_2_2_52_1","volume-title":"Data Mining: Practical Machine Learning Tools and Techniques","author":"Witten Ian H","year":"2017","unstructured":"Ian H Witten , Eibe Frank , Mark A Hall , and Christopher J Pal . 2017 . Data Mining: Practical Machine Learning Tools and Techniques , 4 th Edition. Elsevier Science & amp; Technology. http:\/\/www.worldcat.org\/oclc\/1007085077 Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. 2017. Data Mining: Practical Machine Learning Tools and Techniques, 4th Edition. Elsevier Science &amp; Technology. http:\/\/www.worldcat.org\/oclc\/1007085077","edition":"4"},{"key":"e_1_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNN.2005.845141"},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-27694-1_13"}],"container-title":["Proceedings of the ACM on Programming Languages"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3276520","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3276520","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3276520","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T19:03:39Z","timestamp":1750273419000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3276520"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,10,24]]},"references-count":54,"journal-issue":{"issue":"OOPSLA","published-print":{"date-parts":[[2018,10,24]]}},"alternative-id":["10.1145\/3276520"],"URL":"https:\/\/doi.org\/10.1145\/3276520","relation":{},"ISSN":["2475-1421"],"issn-type":[{"value":"2475-1421","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,10,24]]},"assertion":[{"value":"2018-10-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}