{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,30]],"date-time":"2026-05-30T00:32:55Z","timestamp":1780101175738,"version":"3.54.0"},"reference-count":73,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2021,5,18]],"date-time":"2021-05-18T00:00:00Z","timestamp":1621296000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM\/IMS Trans. Data Sci."],"published-print":{"date-parts":[[2021,8,31]]},"abstract":"<jats:p>Error detection is a crucial preliminary phase in any data analytics pipeline. Existing error detection techniques typically target specific types of errors. Moreover, most of these detection models either require user-defined rules or ample hand-labeled training examples. Therefore, in this article, we present TabReformer, a model that learns bidirectional encoder representations for tabular data. The proposed model consists of two main phases. In the first phase, TabReformer follows encoder architecture with multiple self-attention layers to model the dependencies between cells and capture tuple-level representations. Also, the model utilizes a Gaussian Error Linear Unit activation function with the Masked Data Model objective to achieve deeper probabilistic understanding. In the second phase, the model parameters are fine-tuned for the task of erroneous data detection. The model applies a data augmentation module to generate more erroneous examples to represent the minority class. The experimental evaluation considers a wide range of databases with different types of errors and distributions. The empirical results show that our solution can enhance the recall values by 32.95% on average compared with state-of-the-art techniques while reducing the manual effort by up to 48.86%.<\/jats:p>","DOI":"10.1145\/3447541","type":"journal-article","created":{"date-parts":[[2021,5,18]],"date-time":"2021-05-18T10:45:29Z","timestamp":1621334729000},"page":"1-29","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["TabReformer: Unsupervised Representation Learning for Erroneous Data Detection"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7580-5757","authenticated-orcid":false,"given":"Mona","family":"Nashaat","sequence":"first","affiliation":[{"name":"Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4908-9491","authenticated-orcid":false,"given":"Aindrila","family":"Ghosh","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"James","family":"Miller","sequence":"additional","affiliation":[{"name":"Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Shaikh","family":"Quader","sequence":"additional","affiliation":[{"name":"IBM Canada Software Lab, IBM Canada, Toronto, Ontario, Canada"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2021,5,18]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2018.2866863"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2013.6544847"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137631"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/SPW.2017.9"},{"key":"e_1_2_1_5_1","unstructured":"C. Pit\u2013Claudel Z. Mariet R. Harding and S. Madden. 2016. Outlier detection in heterogeneous datasets using automatic tuple expansion. 2016."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10618-020-00677-w"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","unstructured":"Y. Liu et al. 2019. Generative adversarial active learning for unsupervised outlier detection. IEEE Transactions on Knowledge and Data Engineering 2019 DOI:10.1109\/TKDE.2019.2905606.","DOI":"10.1109\/TKDE.2019.2905606"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3229867"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465327"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.14778\/3377369.3377379"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.14778\/3368289.3368293"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2912574"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994518"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3371425.3371427"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319888"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3357384.3358129"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994514"},{"key":"e_1_2_1_18_1","unstructured":"S. Krishnan M. J. Franklin K. Goldberg and E. Wu. 2017. BoostClean: Automated error detection and repair for machine learning. arXiv:1711.01299 [cs] 2017."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969239.2969304"},{"key":"e_1_2_1_21_1","unstructured":"J. Krantz and J. Kalita. 2018. Abstractive summarization using attentive neural techniques. arXiv:1810.08838 [cs] Oct. 2018."},{"key":"e_1_2_1_22_1","unstructured":"A. Sternberg J. Soares D. Carvalho and E. Ogasawara. 2017. A review on flight delay prediction. arXiv:1703.06118 [cs] 2017."},{"key":"e_1_2_1_23_1","volume-title":"BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs]","author":"Devlin J.","year":"2019","unstructured":"J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs], 2019."},{"key":"e_1_2_1_24_1","unstructured":"A. Adhikari A. Ram R. Tang and J. Lin. 2019. DocBERT: BERT for document classification. arXiv:1904.08398 [cs] Aug. 2019 [Online]. Available: http:\/\/arxiv.org\/abs\/1904.08398."},{"key":"e_1_2_1_25_1","volume-title":"ALBERT: A lite BERT for self-supervised learning of language representations. arXiv:1909.11942 [cs]","author":"Lan Z.","year":"2020","unstructured":"Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv:1909.11942 [cs], 2020 [Online]. Available: http:\/\/arxiv.org\/abs\/1909.11942."},{"key":"e_1_2_1_26_1","unstructured":"K. Ahmed N. S. Keskar and R. Socher. 2017. Weighted transformer network for machine translation. arXiv:1711.02132 [cs] Nov. 2017 [Online]. Available: http:\/\/arxiv.org\/abs\/1711.02132."},{"key":"e_1_2_1_27_1","first-page":"330","article-title":"Convolution neural network with active learning for information extraction of enterprise announcements. In Natural Language Processing and Chinese Computing","volume":"2018","author":"Fu L.","year":"2018","unstructured":"L. Fu, Z. Yin, Y. Liu, and J. Zhang. 2018. Convolution neural network with active learning for information extraction of enterprise announcements. In Natural Language Processing and Chinese Computing, Cham 2018, 330\u2013339.","journal-title":"Cham"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.websem.2018.11.004"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2899391"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2019.00149"},{"key":"e_1_2_1_31_1","unstructured":"E. K. Rezig M. Ouzzani W. G. Aref A. K. Elmagarmid and A. R. Mahmood. 2017. Pattern-driven data cleaning. ArXiv:1712.09437 [cs] 2017."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14778\/3377369.3377377"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-015-0389-y"},{"key":"e_1_2_1_34_1","doi-asserted-by":"crossref","unstructured":"E. D. Cubuk B. Zoph J. Shlens and Q. V. Le. 2019. RandAugment: Practical automated data augmentation with a reduced search space. arXiv:1909.13719 [cs] Nov. 2019.","DOI":"10.1109\/CVPRW50498.2020.00359"},{"key":"e_1_2_1_35_1","first-page":"113","article-title":"AutoAugment: Learning augmentation strategies from data","volume":"2019","author":"Cubuk E. D.","year":"2019","unstructured":"E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. 2019. AutoAugment: Learning augmentation strategies from data. Long Beach, CA, 2019, 113\u2013123.","journal-title":"Long Beach, CA"},{"key":"e_1_2_1_36_1","unstructured":"B. Zoph and Q. V. Le. 2017. Neural architecture search with reinforcement learning. arXiv:1611.01578 [cs] 2017 [Online]. Available: http:\/\/arxiv.org\/abs\/1611.01578."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461722"},{"key":"e_1_2_1_38_1","first-page":"6665","article-title":"Fast autoaugment. In Advances in Neural Information Processing Systems, Curran Associates","volume":"2019","author":"Lim S.","year":"2019","unstructured":"S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim. 2019. Fast autoaugment. In Advances in Neural Information Processing Systems, Curran Associates, Inc., 2019, 6665\u20136675.","journal-title":"Inc."},{"key":"e_1_2_1_39_1","volume-title":"DADA: Differentiable automatic data augmentation. arXiv:2003.03780 [cs]","author":"Li Y.","year":"2003","unstructured":"Y. Li, G. Hu, Y. Wang, T. Hospedales, N. M. Robertson, and Y. Yang. 2003. DADA: Differentiable automatic data augmentation. arXiv:2003.03780 [cs], 2020, [Online]. Available: http:\/\/arxiv.org\/abs\/2003.03780."},{"key":"e_1_2_1_40_1","unstructured":"D. Hendrycks and K. Gimpel. 2018. Gaussian error linear units (GELUs). arXiv:1606.08415 [cs] 2018 [Online]. Available: http:\/\/arxiv.org\/abs\/1606.08415."},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","unstructured":"J. Torres C. Vaca L. Ter\u00e1n and C. L. Abad. 2020. Seq2Seq models for recommending short text conversations. Expert Systems with Applications 150 2020 DOI:10.1016\/j.eswa.2020.113270.","DOI":"10.1016\/j.eswa.2020.113270"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1285"},{"key":"e_1_2_1_43_1","unstructured":"Y. Liu et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692 [cs] Jul. 2019 [Online]. Available: http:\/\/arxiv.org\/abs\/1907.11692."},{"key":"e_1_2_1_44_1","volume-title":"ALBERT: A Lite BERT for self-supervised learning of language representations. arXiv:1909.11942 [cs]","author":"Lan Z.","year":"2020","unstructured":"Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. 2020. ALBERT: A Lite BERT for self-supervised learning of language representations. arXiv:1909.11942 [cs], 2020 [Online]. Available: http:\/\/arxiv.org\/abs\/1909.11942."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-020-00305-w"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.5555\/3327144.3327181"},{"key":"e_1_2_1_47_1","unstructured":"D. Ulyanov A. Vedaldi and V. Lempitsky. 2017. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022 [cs] 2017 [Online]. Available: http:\/\/arxiv.org\/abs\/1607.08022."},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1177\/107769905303000401"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3345768.3355923"},{"key":"e_1_2_1_50_1","volume-title":"32nd Conference on Neural Information Processing Systems (NIPS'19)","author":"Dun P.","unstructured":"P. Dun, L. Zhu, and D. Zhao. 2019. Extending answer prediction for deep bi-directional transformers. In 32nd Conference on Neural Information Processing Systems (NIPS'19)."},{"key":"e_1_2_1_51_1","volume-title":"LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490","author":"Tan H.","year":"2019","unstructured":"H. Tan and M. Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019."},{"key":"e_1_2_1_52_1","volume-title":"Accessed","author":"Neutatz F.","year":"2020","unstructured":"F. Neutatz, M. Mahdavi, and Z. Abedjan. 2019. ED2: Two-stage active learning for error detection \u2013 technical report. arXiv:1908.06309 [cs, stat], Aug. 2019, Accessed: Apr. 17, 2020 [Online]. Available: http:\/\/arxiv.org\/abs\/1908.06309."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jbi.2020.103396"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3324956"},{"key":"e_1_2_1_55_1","volume-title":"Integrate","author":"Crane D.","unstructured":"D. Crane. \u201cThe Cost of Bad Data,\u201d Integrate, Inc, 201AD [Online]. Available: https:\/\/demand.integrate.com\/rs\/951-JPP-414\/images\/Integrate_TheCostofBadLeads_Whitepaper.pdf."},{"key":"e_1_2_1_56_1","volume-title":"Gartner, 2020","author":"Cearley D. W.","year":"2020","unstructured":"D. W. Cearley. 2020. Top 10 strategic technology trends for 2020, Gartner, 2020 [Online]. Available: https:\/\/www.gartner.com\/en\/publications\/top-tech-trends-2020."},{"key":"e_1_2_1_57_1","volume-title":"School of Information and Computer Sciences","author":"Dua D.","year":"2017","unstructured":"D. Dua and C. Graff. 2017. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2017."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/3299869.3319855"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-012-0507-8"},{"key":"e_1_2_1_60_1","volume-title":"Adam: A method for stochastic optimization. arXiv:1412.6980 [cs]","author":"Kingma D. P.","year":"2017","unstructured":"D. P. Kingma and J. Ba. 2017. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs], Jan. 2017 [Online]. Available: http:\/\/arxiv.org\/abs\/1412.6980."},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/IALP.2018.8629116"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.0824-7935.2004.t01-1-00228.x"},{"key":"e_1_2_1_63_1","doi-asserted-by":"crossref","unstructured":"S. Akcay A. Atapour-Abarghouei and T. P. Breckon. 2019. GANomaly: Semi-supervised anomaly detection via adversarial training. In Computer Vision \u2013 ACCV 2018 2019 622\u2013637.","DOI":"10.1007\/978-3-030-20893-6_39"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2794470"},{"key":"e_1_2_1_65_1","unstructured":"S. Eduardo and C. Sutton. 2016. Data cleaning using probabilistic models of integrity constraints. Neural Information Processing Systems."},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2019.2925014"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380568"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-20351-1_3"},{"key":"e_1_2_1_69_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683590"},{"key":"e_1_2_1_70_1","unstructured":"Q. Xie Z. Dai E. Hovy M.-T. Luong and Q. V. Le. 2020. Unsupervised Data Augmentation for Consistency Training. arXiv 1904.12848v6 [csLG] 2020 [Online]. Available: https:\/\/arxiv.org\/abs\/1904.12848."},{"key":"e_1_2_1_71_1","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, USA","author":"Zhang L.","year":"2019","unstructured":"L. Zhang, G.-J. Qi, L. Wang, and J. Luo. 2019. AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, USA, 2019, 2547\u20132555."},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.5555\/3122009.3242030"},{"key":"e_1_2_1_73_1","unstructured":"S. O. Arik and T. Pfister. 2020. TabNet: Attentive interpretable tabular learning. arXiv:1908.07442 [cs stat] Feb. 2020 [Online]. Available: http:\/\/arxiv.org\/abs\/1908.07442."}],"container-title":["ACM\/IMS Transactions on Data Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3447541","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3447541","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,16]],"date-time":"2026-04-16T13:56:08Z","timestamp":1776347768000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3447541"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,18]]},"references-count":73,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,8,31]]}},"alternative-id":["10.1145\/3447541"],"URL":"https:\/\/doi.org\/10.1145\/3447541","relation":{},"ISSN":["2691-1922"],"issn-type":[{"value":"2691-1922","type":"print"}],"subject":[],"published":{"date-parts":[[2021,5,18]]},"assertion":[{"value":"2020-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-01-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-05-18","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}