{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T17:59:43Z","timestamp":1772906383162,"version":"3.50.1"},"reference-count":62,"publisher":"Association for Computing Machinery (ACM)","issue":"13","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2023,9]]},"abstract":"<jats:p>Recent publications suggest using natural language analysis on database schema elements to guide tuning and profiling efforts. The underlying hypothesis is that state-of-the-art language processing methods, so-called language models, are able to extract information on data properties from schema text.<\/jats:p>\n          <jats:p>This paper examines that hypothesis in the context of data correlation analysis: is it possible to find column pairs with correlated data by analyzing their names via language models? First, the paper introduces a novel benchmark for data correlation analysis, created by analyzing thousands of Kaggle data sets (and available for download). Second, it uses that data to study the ability of language models to predict correlation, based on column names. The analysis covers different language models, various correlation metrics, and a multitude of accuracy metrics. It pinpoints factors that contribute to successful predictions, such as the length of column names as well as the ratio of words. Finally, the study analyzes the impact of column types on prediction performance. 
The results show that schema text can be a useful source of information and inform future research efforts, targeted at NLP-enhanced database tuning and data profiling.<\/jats:p>","DOI":"10.14778\/3625054.3625066","type":"journal-article","created":{"date-parts":[[2023,12,4]],"date-time":"2023-12-04T17:09:42Z","timestamp":1701709782000},"page":"4310-4323","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["Can Large Language Models Predict Data Correlations from Column Names?"],"prefix":"10.14778","volume":"16","author":[{"given":"Immanuel","family":"Trummer","sequence":"first","affiliation":[{"name":"Cornell Database Group, Ithaca, NY, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,12,4]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2014.6816721"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1017\/S0269888900005476"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2304.09433"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.5301\/JBM.2008.2127"},{"key":"e_1_2_1_5_1","volume-title":"BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. In VLDB. 668--679","author":"Brown PG","year":"2003","unstructured":"PG Brown and PJ Haas. 2003. BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. In VLDB. 668--679. 
http:\/\/dl.acm.org\/citation.cfm?id=1315509"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007604"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1186\/s12864-019-6413-7"},{"key":"e_1_2_1_8_1","unstructured":"Aakanksha Chowdhery Sharan Narang Jacob Devlin Maarten Bosma Gaurav Mishra Adam Roberts Paul Barham Hyung Won Chung Charles Sutton Sebastian Gehrmann Parker Schuh Kensen Shi Sasha Tsvyashchenko Joshua Maynez Abhishek Rao Parker Barnes Yi Tay Noam Shazeer Vinodkumar Prabhakaran Emily Reif Nan Du Ben Hutchinson Reiner Pope James Bradbury Jacob Austin Michael Isard Guy Gur-Ari Pengcheng Yin Toju Duke Anselm Levskaya Sanjay Ghemawat Sunipa Dev Henryk Michalewski Xavier Garcia Vedant Misra Kevin Robinson Liam Fedus Denny Zhou Daphne Ippolito David Luan Hyeontaek Lim Barret Zoph Alexander Spiridonov Ryan Sepassi David Dohan Shivani Agrawal Mark Omernick Andrew M. Dai Thanumalayan Sankaranarayana Pillai Marie Pellat Aitor Lewkowycz Erica Moreira Rewon Child Oleksandr Polozov Katherine Lee Zongwei Zhou Xuezhi Wang Brennan Saeta Mark Diaz Orhan Firat Michele Catasta Jason Wei Kathy Meier-Hellstern Douglas Eck Jeff Dean Slav Petrov and Noah Fiedel. 2022. PaLM: Scaling Language Modeling with Pathways. CoRR abs\/2204.0 (2022) 1--87. 
arXiv:2204.02311 http:\/\/arxiv.org\/abs\/2204.02311"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1561\/1900000004"},{"key":"e_1_2_1_10_1","volume-title":"Kenton Lee, and Kristina Toutanova.","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL. 4171--4186. 
arXiv:1810.04805"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11023-020-09548-1"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.2307\/1907752"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.3760\/cma.j.issn.04124081.2010.02.006"},{"key":"e_1_2_1_14_1","unstructured":"Wonseok Hwang Jingyeung Yim Seunghyun Park and Minjoon Seo. 2019. (SQLNova) A Comprehensive Exploration on WikiSQL with Table-Aware Word Contextualization."},{"key":"e_1_2_1_15_1","unstructured":"IBM. 2021. IBM Infosphere Information Analyzer."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1007568.1007641"},{"key":"e_1_2_1_17_1","unstructured":"Informatica. 2021. Informatica Data Profiling Solutions."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00010"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407841"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415520"},{"key":"e_1_2_1_21_1","volume-title":"CHORUS: Foundation Models for Unified Data Discovery and Exploration. CoRR","author":"Kayali Moe","year":"2023","unstructured":"Moe Kayali, Anton Lykov, Ilias Fountalis, Nikolaos Vasiloglou, Dan Olteanu, and Dan Suciu. 2023. CHORUS: Foundation Models for Unified Data Discovery and Exploration. CoRR (2023). 
arXiv:2306.09610 http:\/\/arxiv.org\/abs\/2306.09610"},{"key":"e_1_2_1_22_1","volume-title":"ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.","author":"Lan Zhenzhong","year":"2019","unstructured":"Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. (2019), 1--17. arXiv:1909.11942 http:\/\/arxiv.org\/abs\/1909.11942"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-015-0412-3"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.14778\/2850583.2850594"},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","unstructured":"Fei Li and HV Jagadish. 2014. NaLIR: an interactive natural language interface for querying relational databases. In SIGMOD. 709--712.","DOI":"10.1145\/2588555.2594519"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2949741.2949744"},{"key":"e_1_2_1_27_1","volume-title":"RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs\/1907.1, 1","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. 
RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs\/1907.1, 1 (2019), 1--13. arXiv:1907.11692 https:\/\/arxiv.org\/abs\/1907.11692"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.14778\/3342263.3342644"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-006-0030-1"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574258"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2590989.2590995"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.14778\/2824032.2824086"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.14778\/2752939.2752946"},{"key":"e_1_2_1_34_1","unstructured":"Thorsten Papenbrock and Felix Naumann. 2017. A hybrid approach for efficient unique column combination discovery. In BTW. 195--204."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389727"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3457330"},{"key":"e_1_2_1_37_1","first-page":"444","article-title":"Theil's Forecast Accuracy Coefficient","volume":"10","author":"Publications Sage","year":"2018","unstructured":"Sage Publications. 2018. Theil's Forecast Accuracy Coefficient : Clarification. 10, 4 (2018), 444--446.","journal-title":"Clarification."},{"key":"e_1_2_1_38_1","first-page":"444","article-title":"Theil's Forecast Accuracy Coefficient","volume":"10","author":"Publications Sage","year":"2018","unstructured":"Sage Publications. 2018. Theil's Forecast Accuracy Coefficient : Clarification. 
10, 4 (2018), 444--446.","journal-title":"Clarification."},{"key":"e_1_2_1_39_1","volume-title":"12th International Workshop on the Web and Databases (WebDB), Providence, Rhode Island WebDB","author":"Rostin Alexandra","year":"2009","unstructured":"Alexandra Rostin, Oliver Albrecht, Jana Bauckmann, Felix Naumann, and Ulf Leser. 2009. A machine learning approach to foreign key discovery. 12th International Workshop on the Web and Databases (WebDB), Providence, Rhode Island WebDB (2009), 1--6. http:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.150.2150&rep=rep1&type=pdf"},{"key":"e_1_2_1_40_1","doi-asserted-by":"crossref","unstructured":"Sebastian Ruder Matthew E Peters Swabha Swayamdipta and Thomas Wolf. 2019. Transfer Learning in Natural Language Processing. In ACL: Tutorials. 15--18.","DOI":"10.18653\/v1\/N19-5004"},{"key":"e_1_2_1_41_1","first-page":"1209","article-title":"ATHENA: An ontology-driven system for natural language querying over relational data stores","volume":"9","author":"Saha Diptikalyan","year":"2016","unstructured":"Diptikalyan Saha, Avrilia Floratou, Karthik Sankaranarayanan, Umar Farooq Minhas, Ashish R Mittal, and Fatma Ozcan. 2016. 
ATHENA: An ontology-driven system for natural language querying over relational data stores. VLDB 9, 12 (2016), 1209--1220.","journal-title":"VLDB"},{"key":"e_1_2_1_42_1","unstructured":"Victor Sanh Lysandre Debut Julien Chaumond and Thomas Wolf. 2019. Distil-BERT a distilled version of BERT: smaller faster cheaper and lighter. (2019) 2--6. arXiv:1910.01108 http:\/\/arxiv.org\/abs\/1910.01108"},{"key":"e_1_2_1_43_1","doi-asserted-by":"crossref","unstructured":"PG G Selinger MM M Astrahan D D Chamberlin R A Lorie and T G Price. 1979. Access path selection in a relational database management system. In SIGMOD. 23--34. http:\/\/dl.acm.org\/citation.cfm?id=582095.582099","DOI":"10.1145\/582095.582099"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.14778\/3407790.3407858"},{"key":"e_1_2_1_45_1","unstructured":"Talend. 2021. Talend Data Explorer."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.14778\/3457390.3457391"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.14778\/3447689.3447706"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3450980.3450984"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.14778\/3551793.3551841"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517843"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.14778\/3554821.3554896"},{"key":"e_1_2_1_52_1","unstructured":"Immanuel Trummer. 2022. Towards NLP-Enhanced Data Profiling Tools. In CIDR. 1--1. 
https:\/\/www.cidrdb.org\/cidr2022\/papers\/a55-trummer.pdf"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3464389"},{"key":"e_1_2_1_54_1","doi-asserted-by":"crossref","unstructured":"Kostas Tzoumas Amol Deshpande and Christian S Jensen. 2011. Lightweight graphical models for selectivity estimation without independence assumptions. In VLDB. 852--863.","DOI":"10.14778\/3402707.3402724"},{"key":"e_1_2_1_55_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5999--6009. arXiv:1706.03762"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4419-9863-7_372"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.14778\/3484224.3484236"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_2_1_59_1","doi-asserted-by":"crossref","unstructured":"Lucas Woltmann Claudio Hartmann Maik Thiele and Dirk Habich. 2019. Cardinality estimation with local deep learning models. In aiDM. 1--8.","DOI":"10.1145\/3329859.3329875"},{"key":"e_1_2_1_60_1","unstructured":"Xiaojun Xu Chang Liu and Dawn Song. 2017. 
SQLNet: generating structured queries from natural language without reinforcement Learning. 1--13. arXiv:1711.04436 http:\/\/arxiv.org\/abs\/1711.04436"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/d18-1425"},{"key":"e_1_2_1_62_1","volume-title":"Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. CoRR abs\/1709.0, 1","author":"Zhong Victor","year":"2017","unstructured":"Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. CoRR abs\/1709.0, 1 (2017), 1--12. 
arXiv:1709.00103 http:\/\/arxiv.org\/abs\/1709.00103"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3625054.3625066","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,12,4]],"date-time":"2023-12-04T17:12:00Z","timestamp":1701709920000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3625054.3625066"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9]]},"references-count":62,"journal-issue":{"issue":"13","published-print":{"date-parts":[[2023,9]]}},"alternative-id":["10.14778\/3625054.3625066"],"URL":"https:\/\/doi.org\/10.14778\/3625054.3625066","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2023,9]]},"assertion":[{"value":"2023-12-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}