{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,11]],"date-time":"2026-04-11T02:13:04Z","timestamp":1775873584154,"version":"3.50.1"},"reference-count":73,"publisher":"Association for Computing Machinery (ACM)","issue":"8","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2021,4]]},"abstract":"<jats:p>\n            <jats:italic>Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers?<\/jats:italic>\n            We answer this question by presenting RPT, a denoising autoencoder for\n            <jats:italic>tuple-to-X<\/jats:italic>\n            models (\"\n            <jats:italic>X<\/jats:italic>\n            \" could be tuple, token, label, JSON, and so on). RPT is pre-trained for a\n            <jats:italic>tuple-to-tuple<\/jats:italic>\n            model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), leading to a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, data annotation, etc. To complement RPT, we also discuss several appealing techniques such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question-answering for information extraction. 
In addition, we identify a series of research opportunities to advance the field of data preparation.\n          <\/jats:p>","DOI":"10.14778\/3457390.3457391","type":"journal-article","created":{"date-parts":[[2021,10,21]],"date-time":"2021-10-21T22:48:38Z","timestamp":1634856518000},"page":"1254-1261","source":"Crossref","is-referenced-by-count":48,"title":["RPT"],"prefix":"10.14778","volume":"14","author":[{"given":"Nan","family":"Tang","sequence":"first","affiliation":[{"name":"HBKU, Qatar"}]},{"given":"Ju","family":"Fan","sequence":"additional","affiliation":[{"name":"Renmin University, China"}]},{"given":"Fangyi","family":"Li","sequence":"additional","affiliation":[{"name":"Renmin University, China"}]},{"given":"Jianhong","family":"Tu","sequence":"additional","affiliation":[{"name":"Renmin University, China"}]},{"given":"Xiaoyong","family":"Du","sequence":"additional","affiliation":[{"name":"Renmin University, China"}]},{"given":"Guoliang","family":"Li","sequence":"additional","affiliation":[{"name":"Tsinghua University, China"}]},{"given":"Sam","family":"Madden","sequence":"additional","affiliation":[{"name":"MIT"}]},{"given":"Mourad","family":"Ouzzani","sequence":"additional","affiliation":[{"name":"HBKU, Qatar"}]}],"member":"320","published-online":{"date-parts":[[2021,10,21]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994518"},{"key":"e_1_2_1_2_1","unstructured":"Abt-Buy. [n.d.]. https:\/\/github.com\/anhaidgroup\/deepmatcher\/blob\/master\/Datasets.md#abt-buy."},{"key":"e_1_2_1_3_1","unstructured":"Amazon-Google. [n.d.]. https:\/\/github.com\/anhaidgroup\/deepmatcher\/blob\/master\/Datasets.md#amazon-google."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2500490"},{"key":"e_1_2_1_5_1","doi-asserted-by":"crossref","unstructured":"Nikita Bhutani Yoshihiko Suhara Wang-Chiew Tan Alon Y. Halevy and H. V. Jagadish. 2019. Open Information Extraction from Question-Answer Pairs. In NAACL-HLT Jill Burstein Christy Doran and Thamar Solorio (Eds.). 2294--2305.","DOI":"10.18653\/v1\/N19-1239"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1066157.1066175"},{"key":"e_1_2_1_7_1","volume-title":"David Petrou, Daniel Ramage, and Jason Roselander.","author":"Bonawitz Keith","year":"2019"},{"key":"e_1_2_1_8_1","unstructured":"Tom B. Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell Sandhini Agarwal Ariel Herbert-Voss Gretchen Krueger Tom Henighan Rewon Child Aditya Ramesh Daniel M. Ziegler Jeffrey Wu Clemens Winter Christopher Hesse Mark Chen Eric Sigler Mateusz Litwin Scott Gray Benjamin Chess Jack Clark Christopher Berner Sam McCandlish Alec Radford Ilya Sutskever and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NeurIPS."},{"key":"e_1_2_1_9_1","unstructured":"Ursin Brunner and Kurt Stockinger. 2020. Entity Matching with Transformer Architectures - A Step Forward in Data Integration. In EDBT. 463--473."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1519103.1519112"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389742"},{"key":"e_1_2_1_12_1","volume-title":"Hinton","author":"Chen Ting","year":"2020"},{"key":"e_1_2_1_13_1","volume-title":"TabFact: A Large-scale Dataset for Table-based Fact Verification. CoRR abs\/1909.02164","author":"Chen Wenhu","year":"2019"},{"key":"e_1_2_1_14_1","first-page":"73","article-title":"Siamese Neural Networks","volume":"2190","author":"Chicco Davide","year":"2021","journal-title":"An Overview. In Artificial Neural Networks - Third Edition. Methods in Molecular Biology"},{"key":"e_1_2_1_15_1","volume-title":"Peggy Cellier and Kurt Driessens (Eds.)","volume":"1168","author":"Christen Victor","year":"2019"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.14778\/2536258.2536262"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2749431"},{"key":"e_1_2_1_18_1","volume-title":"Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed K. Elmagarmid, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, and Nan Tang.","author":"Deng Dong","year":"2017"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.5555\/3430915.3442430"},{"key":"e_1_2_1_20_1","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. [n.d.]. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT Jill Burstein Christy Doran and Thamar Solorio (Eds.). 4171--4186."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.5555\/2401764"},{"key":"e_1_2_1_22_1","volume-title":"A Roadmap for a Rigorous Science of Interpretability. CoRR abs\/1702.08608","author":"Doshi-Velez Finale","year":"2017"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/3236187.3269461"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/1191547.1191739"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/2371176"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/1366102.1366103"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544848"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920867"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3329859.3329877"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3034786.3056124"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3444831.3444835"},{"key":"e_1_2_1_32_1","unstructured":"Shuang Hao Nan Tang Guoliang Li and Jian Li. 2017. Cleaning Relations Using Knowledge Bases. In ICDE. 933--944."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2915242"},{"key":"e_1_2_1_34_1","volume-title":"Record fusion: A learning approach. CoRR abs\/2006.10208","author":"Heidari Alireza","year":"2020"},{"key":"e_1_2_1_35_1","volume-title":"Thomas M\u00fcller, Francesco Piccinno, and Julian Martin Eisenschlos.","author":"Herzig Jonathan","year":"2020"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3310205"},{"key":"e_1_2_1_37_1","doi-asserted-by":"crossref","unstructured":"Matteo Interlandi and Nan Tang. 2015. Proof positive and negative in data cleaning. In ICDE. 18--29.","DOI":"10.1109\/ICDE.2015.7113269"},{"key":"e_1_2_1_38_1","unstructured":"Itunes-Amazon. [n.d.]. https:\/\/github.com\/anhaidgroup\/deepmatcher\/blob\/master\/Datasets.md#itunes-amazon."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.14778\/3067421.3067431"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00300"},{"key":"e_1_2_1_41_1","doi-asserted-by":"crossref","unstructured":"Jungo Kasai Kun Qian Sairam Gurajada Yunyao Li and Lucian Popa. 2019. Low-resource Deep Entity Resolution with Transfer and Active Learning. In ACL Anna Korhonen David R. Traum and Llu\u00eds M\u00e0rquez (Eds.). 5851--5861.","DOI":"10.18653\/v1\/P19-1586"},{"key":"e_1_2_1_42_1","unstructured":"Been Kim and Finale Doshi-Velez. 2017. Interpretable Machine Learning: The fuss the concrete and the questions. In ICML Tutorial."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994535"},{"key":"e_1_2_1_44_1","volume-title":"BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In ACL. 
7871--7880.","author":"Lewis Mike","year":"2020"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.14778\/3421424.3421431"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.14778\/1920841.1920916"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/MIS.2020.2988525"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407801"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324956"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3196926"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3377455"},{"key":"e_1_2_1_52_1","unstructured":"Ofir Press Noah A. Smith and Omer Levy. [n.d.]. Improving Transformer Models by Reordering their Sublayers. In ACL Dan Jurafsky Joyce Chai Natalie Schluter and Joel R. Tetreault (Eds.). 2996--3005."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.14778\/2856318.2856325"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.14778\/3377369.3377377"},{"key":"e_1_2_1_55_1","article-title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","volume":"21","author":"Raffel Colin","year":"2020","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_2_1_56_1","doi-asserted-by":"crossref","unstructured":"Pranav Rajpurkar Jian Zhang Konstantin Lopyrev and Percy Liang. 2016. SQuAD: 100 000+ Questions for Machine Comprehension of Text. In EMNLP Jian Su Xavier Carreras and Kevin Duh (Eds.). 2383--2392.","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939778"},{"key":"e_1_2_1_59_1","volume-title":"Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference. CoRR abs\/2001.07676","author":"Schick Timo","year":"2020"},{"key":"e_1_2_1_60_1","unstructured":"sigmod-2020-contest. [n.d.]. http:\/\/www.inf.uniroma3.it\/db\/sigmod2020contest\/task.html."},{"key":"e_1_2_1_61_1","volume-title":"Mourad Ouzzani, Nan Tang, and Shafiq R. Joty.","author":"Thirumuruganathan Saravanan","year":"2018"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.5555\/296635.296639"},{"key":"e_1_2_1_63_1","unstructured":"Arata Ugawa Akihiro Tamura Takashi Ninomiya Hiroya Takamura and Manabu Okumura. 2018. Neural Machine Translation Incorporating Named Entity. In COLING. 3240--3250."},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_2_1_65_1","unstructured":"Walmart-Amazon. [n.d.]. https:\/\/github.com\/anhaidgroup\/deepmatcher\/blob\/master\/Datasets.md#walmart-amazon."},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2610494"},{"key":"e_1_2_1_67_1","volume-title":"Xin Luna Dong, and Shuiwang Ji","author":"Wang Zhengyang","year":"2020"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389743"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2463706"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.14778\/1952376.1952378"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1145\/3298981"},{"key":"e_1_2_1_72_1","unstructured":"Pengcheng Yin Graham Neubig Wen-tau Yih and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In ACL. 8413--8426."},{"key":"e_1_2_1_73_1","unstructured":"Zhuosheng Zhang Junjie Yang and Hai Zhao. 2020. Retrospective Reader for Machine Reading Comprehension. arXiv:cs.CL\/2001.09694"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3457390.3457391","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,28]],"date-time":"2022-12-28T10:47:20Z","timestamp":1672224440000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3457390.3457391"}},"subtitle":["relational pre-trained transformer is almost all you need towards democratizing data preparation"],"short-title":[],"issued":{"date-parts":[[2021,4]]},"references-count":73,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2021,4]]}},"alternative-id":["10.14778\/3457390.3457391"],"URL":"https:\/\/doi.org\/10.14778\/3457390.3457391","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2021,4]]}}}