{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T02:20:28Z","timestamp":1773886828134,"version":"3.50.1"},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,12,8]],"date-time":"2023-12-08T00:00:00Z","timestamp":1701993600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100006374","name":"Agence Nationale de la Recherche","doi-asserted-by":"publisher","award":["ANR-21-CE23-0037"],"award-info":[{"award-number":["ANR-21-CE23-0037"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2023,12,8]]},"abstract":"<jats:p>Tabular data is becoming increasingly important in Natural Language Processing (NLP) tasks, such as Tabular Natural Language Inference (TNLI). Given a table and a hypothesis expressed in NL text, the goal is to assess if the former structured data supports or refutes the latter. In this work, we focus on the role played by the annotated data in training the inference model. We introduce a system, Tenet, for the automatic augmentation and generation of training examples for TNLI. Given the tables, existing approaches are either based on human annotators, and thus expensive, or on methods that produce simple examples that lack data variety and complex reasoning. Instead, our approach is built around the intuition that SQL queries are the right tool to achieve variety in the generated examples, both in terms of data variety and reasoning complexity. The first is achieved by evidence-queries that identify cell values over tables according to different data patterns. 
Once the data for the example is identified, semantic-queries describe the different ways such data can be identified with standard SQL clauses. These rich descriptions are then verbalized as text to create the annotated examples for the TNLI task. The same approach is also extended to create counterfactual examples, i.e., examples where the hypothesis is false, with a method based on injecting errors in the original (clean) table. For all steps, we introduce generic generation algorithms that take as input only the tables. For our experimental study, we use three datasets from the TNLI literature and two crafted by us on more complex tables. Tenet generates human-like examples, which lead to the effective training of several inference models with results comparable to those obtained by training the same models with manually-written examples.<\/jats:p>","DOI":"10.1145\/3626730","type":"journal-article","created":{"date-parts":[[2023,12,12]],"date-time":"2023-12-12T14:01:21Z","timestamp":1702389681000},"page":"1-27","source":"Crossref","is-referenced-by-count":6,"title":["Generation of Training Examples for Tabular Natural Language Inference"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-8869-6025","authenticated-orcid":false,"given":"Jean-Flavien","family":"Bussotti","sequence":"first","affiliation":[{"name":"EURECOM, Biot, France"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9947-8909","authenticated-orcid":false,"given":"Enzo","family":"Veltri","sequence":"additional","affiliation":[{"name":"University of Basilicata, Potenza, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5651-8584","authenticated-orcid":false,"given":"Donatello","family":"Santoro","sequence":"additional","affiliation":[{"name":"University of Basilicata, Potenza, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0651-4128","authenticated-orcid":false,"given":"Paolo","family":"Papotti","sequence":"additional","affiliation":[{"name":"EURECOM, Biot, 
France"}]}],"member":"320","published-online":{"date-parts":[[2023,12,12]]},"reference":[{"key":"e_1_2_2_1_1","first-page":"1","article-title":"The efficacy of round-trip translation for MT evaluation","volume":"14","author":"Aiken Milam","year":"2010","unstructured":"Milam Aiken and Mina Park. 2010. The efficacy of round-trip translation for MT evaluation. Translation Journal, Vol. 14, 1 (2010), 1--10.","journal-title":"Translation Journal"},{"key":"e_1_2_2_2_1","volume-title":"NeurIPS (Datasets and Benchmarks)","author":"Aly Rami","year":"2021","unstructured":"Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. In NeurIPS (Datasets and Benchmarks)."},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6233"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00544"},{"key":"e_1_2_2_5_1","volume-title":"International Journal of Machine Learning and Cybernetics","author":"Bayer Markus","year":"2022","unstructured":"Markus Bayer, Marc-Andr\u00e9 Kaufhold, Bj\u00f6rn Buchhold, Marcel Keller, J\u00f6rg Dallmeyer, and Christian Reuter. 2022b. Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers. International Journal of Machine Learning and Cybernetics (2022), 1--16."},{"key":"e_1_2_2_6_1","volume-title":"A Survey on Data Augmentation for Text Classification. Comput. Surveys (jun 2022)","author":"Bayer Markus","year":"2022","unstructured":"Markus Bayer, Marc-Andr\u00e9 Kaufhold, and Christian Reuter. 2022a. A Survey on Data Augmentation for Text Classification. Comput. Surveys (jun 2022)."},{"key":"e_1_2_2_7_1","volume-title":"Synthetic and natural noise both break neural machine translation. 
arXiv preprint arXiv:1711.02173","author":"Belinkov Yonatan","year":"2017","unstructured":"Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173 (2017)."},{"key":"e_1_2_2_8_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems Vol. 33 (2020) 1877--1901."},{"key":"e_1_2_2_9_1","volume-title":"Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification. arXiv preprint arXiv:2004.12239","author":"Chen Jiaao","year":"2020","unstructured":"Jiaao Chen, Zichao Yang, and Diyi Yang. 2020c. Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification. arXiv preprint arXiv:2004.12239 (2020)."},{"key":"e_1_2_2_10_1","doi-asserted-by":"crossref","unstructured":"Wenhu Chen Jianshu Chen Yu Su Zhiyu Chen and William Yang Wang. 2020a. Logical Natural Language Generation from Open-Domain Tables. In ACL. 7929--7942.","DOI":"10.18653\/v1\/2020.acl-main.708"},{"key":"e_1_2_2_11_1","unstructured":"Wenhu Chen Hongmin Wang Jianshu Chen Yunkai Zhang Hong Wang Shiyang Li Xiyou Zhou and William Yang Wang. 2020b. TabFact: A Large-scale Dataset for Table-based Fact Verification. In ICLR."},{"key":"e_1_2_2_12_1","volume-title":"AAAI","author":"Cho Hyunsoo","year":"2022","unstructured":"Hyunsoo Cho, Hyuhng Joon Kim, Junyeob Kim, Sang-Woo Lee, Sang goo Lee, Kang Min Yoo, and Taeuk Kim. 2022. Prompt-Augmented Linear Probing: Scaling Beyond The Limit of Few-shot In-Context Learners. In AAAI."},{"key":"e_1_2_2_13_1","volume-title":"Generating artificial texts as substitution or complement of training data. 
arXiv preprint arXiv:2110.13016","author":"Claveau Vincent","year":"2021","unstructured":"Vincent Claveau, Antoine Chaffin, and Ewa Kijak. 2021. Generating artificial texts as substitution or complement of training data. arXiv preprint arXiv:2110.13016 (2021)."},{"key":"e_1_2_2_14_1","unstructured":"Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http:\/\/archive.ics.uci.edu\/ml"},{"key":"e_1_2_2_15_1","doi-asserted-by":"crossref","unstructured":"Julian Eisenschlos Syrine Krichene and Thomas M\u00fcller. 2020. Understanding tables with intermediate pre-training. In EMNLP. 281--296.","DOI":"10.18653\/v1\/2020.findings-emnlp.27"},{"key":"e_1_2_2_16_1","volume-title":"Genaug: Data augmentation for finetuning text generators. arXiv preprint arXiv:2010.01794","author":"Feng Steven Y","year":"2020","unstructured":"Steven Y Feng, Varun Gangal, Dongyeop Kang, Teruko Mitamura, and Eduard Hovy. 2020. Genaug: Data augmentation for finetuning text generators. arXiv preprint arXiv:2010.01794 (2020)."},{"key":"e_1_2_2_17_1","volume-title":"SIGMOD","author":"Gkini Orest","year":"2021","unstructured":"Orest Gkini, Theofilos Belmpas, Georgia Koutrika, and Yannis E. Ioannidis. 2021. An In-Depth Benchmarking of Text-to-SQL Systems. In SIGMOD. ACM, 632--644."},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00482"},{"key":"e_1_2_2_19_1","volume-title":"INFOTABS: Inference on Tables as Semi-structured Data. In ACL. ACL, Online, 2309--2324.","author":"Gupta Vivek","year":"2020","unstructured":"Vivek Gupta, Maitrey Mehta, Pegah Nokhiz, and Vivek Srikumar. 2020. INFOTABS: Inference on Tables as Semi-structured Data. In ACL. 
ACL, Online, 2309--2324."},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.398"},{"key":"e_1_2_2_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3406601.3406618"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407841"},{"key":"e_1_2_2_23_1","doi-asserted-by":"crossref","unstructured":"George Katsogiannis-Meimarakis and Georgia Koutrika. 2021. A Deep Dive into Deep Learning Approaches for Text-to-SQL Systems. In SIGMOD. ACM 2846--2851.","DOI":"10.1145\/3448016.3457543"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.3115\/981311.981340"},{"key":"e_1_2_2_25_1","volume-title":"Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245","author":"Kumar Varun","year":"2020","unstructured":"Varun Kumar, Ashutosh Choudhary, and Eunah Cho. 2020. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245 (2020)."},{"key":"e_1_2_2_26_1","volume-title":"A closer look at feature space data augmentation for few-shot intent classification. arXiv preprint arXiv:1910.04176","author":"Kumar Varun","year":"2019","unstructured":"Varun Kumar, Hadrien Glaude, Cyprien de Lichy, and William Campbell. 2019. A closer look at feature space data augmentation for few-shot intent classification. arXiv preprint arXiv:1910.04176 (2019)."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.1910.13461"},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/2831360.2831369"},{"key":"e_1_2_2_29_1","doi-asserted-by":"crossref","unstructured":"Chen Liang Jonathan Berant Quoc Le Kenneth D. Forbus and Ni Lao. 2017. Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision. In ACL. 23--33.","DOI":"10.18653\/v1\/P17-1003"},{"key":"e_1_2_2_30_1","volume-title":"Data boost: Text data augmentation through reinforcement learning guided conditional generation. 
arXiv preprint arXiv:2012.02952","author":"Liu Ruibo","year":"2020","unstructured":"Ruibo Liu, Guangxuan Xu, Chenyan Jia, Weicheng Ma, Lili Wang, and Soroush Vosoughi. 2020. Data boost: Text data augmentation through reinforcement learning guided conditional generation. arXiv preprint arXiv:2012.02952 (2020)."},{"key":"e_1_2_2_31_1","volume-title":"Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)."},{"key":"e_1_2_2_32_1","volume-title":"Generating Training Data with Language Models: Towards Zero-Shot Language Understanding. CoRR","author":"Meng Yu","year":"2022","unstructured":"Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. 2022a. Generating Training Data with Language Models: Towards Zero-Shot Language Understanding. CoRR, Vol. abs\/2202.04538 (2022). arXiv:2202.04538 https:\/\/arxiv.org\/abs\/2202.04538"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2211.03044"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","unstructured":"Yu Meng Jiaming Shen Chao Zhang and Jiawei Han. 2018. Weakly-Supervised Hierarchical Text Classification. https:\/\/doi.org\/10.48550\/ARXIV.1812.11270","DOI":"10.48550\/ARXIV.1812.11270"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2021\/619"},{"key":"e_1_2_2_36_1","volume-title":"R2D2: Robust Data-to-Text with Replacement Detection. arXiv preprint arXiv:2205.12467","author":"Nan Linyong","year":"2022","unstructured":"Linyong Nan, Lorenzo Jaime Yu Flores, Yilun Zhao, Yixin Liu, Luke Benson, Weijin Zou, and Dragomir Radev. 2022. R2D2: Robust Data-to-Text with Replacement Detection. 
arXiv preprint arXiv:2205.12467 (2022)."},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1186\/s13326-018-0179-8"},{"key":"e_1_2_2_38_1","volume-title":"Zero-shot Fact Verification by Claim Generation","author":"Pan Liangming","unstructured":"Liangming Pan, Wenhu Chen, Wenhan Xiong, Min-Yen Kan, and William Yang Wang. 2021. Zero-shot Fact Verification by Claim Generation. In ACL. Association for Computational Linguistics, 476--483."},{"key":"e_1_2_2_39_1","volume-title":"QATCH: Benchmarking Table Representation Learning Models on Your Data. In NeurIPS (Datasets and Benchmarks).","author":"Papicchio Simone","year":"2023","unstructured":"Simone Papicchio, Paolo Papotti, and Luca Cagliero. 2023. QATCH: Benchmarking Table Representation Learning Models on Your Data. In NeurIPS (Datasets and Benchmarks)."},{"key":"e_1_2_2_40_1","doi-asserted-by":"crossref","unstructured":"Ankur P. Parikh Xuezhi Wang Sebastian Gehrmann Manaal Faruqui Bhuwan Dhingra Diyi Yang and Dipanjan Das. 2020. ToTTo: A Controlled Table-To-Text Generation Dataset. In EMNLP. ACL 1173--1186.","DOI":"10.18653\/v1\/2020.emnlp-main.89"},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3366424.3383552"},{"key":"e_1_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-2124"},{"key":"e_1_2_2_43_1","volume-title":"Natural Language Engineering","volume":"3","author":"Reiter Ehud","year":"2002","unstructured":"Ehud Reiter and Robert Dale. 2002. Building Applied Natural Language Generation Systems. Natural Language Engineering, Vol. 3 (03 2002)."},{"key":"e_1_2_2_44_1","doi-asserted-by":"crossref","unstructured":"Anish Das Sarma Aditya G. Parameswaran Hector Garcia-Molina and Jennifer Widom. 2010. Synthesizing view definitions from data. In ICDT. ACM 89--103.","DOI":"10.1145\/1804669.1804683"},{"key":"e_1_2_2_45_1","volume-title":"A simple but tough-to-beat data augmentation approach for natural language understanding and generation. 
arXiv preprint arXiv:2009.13818","author":"Shen Dinghan","year":"2020","unstructured":"Dinghan Shen, Mingzhi Zheng, Yelong Shen, Yanru Qu, and Weizhu Chen. 2020. A simple but tough-to-beat data augmentation approach for natural language understanding and generation. arXiv preprint arXiv:2009.13818 (2020)."},{"key":"e_1_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3589280"},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2211.06193"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3137628.3137648"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.14778\/3554821.3554896"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00041"},{"key":"e_1_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3520164"},{"key":"e_1_2_2_52_1","doi-asserted-by":"crossref","unstructured":"Bailin Wang Richard Shin Xiaodong Liu Oleksandr Polozov and Matthew Richardson. 2020. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In ACL. 7567--7578.","DOI":"10.18653\/v1\/2020.acl-main.677"},{"key":"e_1_2_2_53_1","doi-asserted-by":"crossref","unstructured":"Congcong Wang and David Lillis. 2019. Classification for Crisis-Related Tweets Leveraging Word Embeddings and Data Augmentation.. In TREC.","DOI":"10.6028\/NIST.SP.1250.incident-CS-UCD"},{"key":"e_1_2_2_54_1","volume-title":"SemEval-2021 task 9: Fact verification and evidence finding for tabular data in scientific documents (SEM-TAB-FACTS). arXiv preprint arXiv:2105.13995","author":"Wang Nancy XR","year":"2021","unstructured":"Nancy XR Wang, Diwakar Mahajan, Marina Danilevsky, and Sara Rosenthal. 2021. SemEval-2021 task 9: Fact verification and evidence finding for tabular data in scientific documents (SEM-TAB-FACTS). 
arXiv preprint arXiv:2105.13995 (2021)."},{"key":"e_1_2_2_55_1","volume-title":"SIGMOD","author":"Weir Nathaniel","year":"2020","unstructured":"Nathaniel Weir, Prasetya Utama, Alex Galakatos, Andrew Crotty, Amir Ilkhechi, Shekar Ramaswamy, Rohin Bhushan, Nadja Geisler, Benjamin H\u00e4ttasch, Steffen Eger, U\u011fur \u00c7etintemel, and Carsten Binnig. 2020. DBPal: A Fully Pluggable NL2SQL Training Pipeline. In SIGMOD. ACM, 2347--2361."},{"key":"e_1_2_2_56_1","volume-title":"PODS","author":"Weiss Yaacov Y.","year":"2017","unstructured":"Yaacov Y. Weiss and Sara Cohen. 2017. Reverse Engineering SPJ-Queries from Examples. In PODS. ACM, 151--166."},{"key":"e_1_2_2_57_1","doi-asserted-by":"crossref","unstructured":"Sam Wiseman Stuart Shieber and Alexander Rush. 2017. Challenges in Data-to-Document Generation. In EMNLP. ACL 2253--2263.","DOI":"10.18653\/v1\/D17-1239"},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/2996453"},{"key":"e_1_2_2_59_1","first-page":"6256","article-title":"Unsupervised data augmentation for consistency training","volume":"33","author":"Xie Qizhe","year":"2020","unstructured":"Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. 2020. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, Vol. 33 (2020), 6256--6268.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_2_60_1","unstructured":"Pengcheng Yin Graham Neubig Wen-tau Yih and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In ACL. 8413--8426."},{"key":"e_1_2_2_61_1","volume-title":"Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In EMNLP. 
3911--3921.","author":"Yu Tao","year":"2018","unstructured":"Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In EMNLP. 3911--3921."},{"key":"e_1_2_2_62_1","doi-asserted-by":"crossref","unstructured":"Meihui Zhang Hazem Elmeleegy Cecilia M. Procopiuc and Divesh Srivastava. 2013. Reverse Engineering Complex Join Queries. In SIGMOD. ACM 809--820.","DOI":"10.1145\/2463676.2465320"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3626730","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3626730","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T13:00:12Z","timestamp":1755867612000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3626730"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,8]]},"references-count":62,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,12,8]]}},"alternative-id":["10.1145\/3626730"],"URL":"https:\/\/doi.org\/10.1145\/3626730","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,12,8]]}}}