{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,9]],"date-time":"2026-07-09T16:16:57Z","timestamp":1783613817554,"version":"3.55.0"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"3","funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2023YFB4503600"],"award-info":[{"award-number":["2023YFB4503600"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"NSF of China","doi-asserted-by":"crossref","award":["62232009"],"award-info":[{"award-number":["62232009"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2025,6,17]]},"abstract":"<jats:p>Automatic dialect translation reduces the complexity of database migration, which is crucial for applications interacting with multiple database systems. However, rule-based translation tools (e.g., SQLGlot, jOOQ, SQLines) are labor-intensive to develop and often (1) fail to translate certain operations, (2) produce incorrect translations due to rule deficiencies, and (3) generate translations compatible with some database versions but not the others.<\/jats:p>\n                  <jats:p>\n                    In this paper, we investigate the problem of automating dialect translation with large language models (LLMs). There are three main challenges. First, queries often involve lengthy content (e.g., excessive column values) and multiple syntax elements that require translation, increasing the risk of LLM hallucination. Second, database dialects have diverse syntax trees and specifications, making it difficult for cross-dialect syntax matching. Third, dialect translation often involves complex many-to-one relationships between source and target operations, making it impractical to translate each operation in isolation. To address these challenges, we propose an automatic dialect translation system CrackSQL. First, we propose\n                    <jats:italic toggle=\"yes\">Functionality-based Query Processing<\/jats:italic>\n                    that segments the query by functionality syntax trees and simplifies the query via (i) customized function normalization and (ii) translation-irrelevant query abstraction. Second, we design a\n                    <jats:italic toggle=\"yes\">Cross-Dialect Syntax Embedding Model<\/jats:italic>\n                    to generate embeddings by the syntax trees and specifications (of certain version), enabling accurate query syntax matching. Third, we propose a\n                    <jats:italic toggle=\"yes\">Local-to-Global Dialect Translation<\/jats:italic>\n                    strategy, which restricts LLM-based translation and validation on operations that cause local failures, iteratively extending these operations until translation succeeds. Experiments show CrackSQL significantly outperforms existing methods (e.g., by up to 77.42%). The code is available at https:\/\/github.com\/weAIDB\/CrackSQL.\n                  <\/jats:p>","DOI":"10.1145\/3725278","type":"journal-article","created":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T21:23:29Z","timestamp":1750281809000},"page":"1-26","source":"Crossref","is-referenced-by-count":7,"title":["Cracking SQL Barriers: An LLM-based Dialect Translation System"],"prefix":"10.1145","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0009-0000-8862-7753","authenticated-orcid":false,"given":"Wei","family":"Zhou","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-0773-1902","authenticated-orcid":false,"given":"Yuyang","family":"Gao","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2285-7836","authenticated-orcid":false,"given":"Xuanhe","family":"Zhou","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1398-0621","authenticated-orcid":false,"given":"Guoliang","family":"Li","sequence":"additional","affiliation":[{"name":"Tsinghua University, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,6,18]]},"reference":[{"key":"e_1_2_2_1_1","unstructured":"Code-Llama. (Model). https:\/\/huggingface.co\/codellama\/CodeLlama-7b-Instruct-hf Last accessed on 2024--10."},{"key":"e_1_2_2_2_1","unstructured":"GPT-4o. (Model). https:\/\/openai.com\/index\/hello-gpt-4o\/ Last accessed on 2024--10."},{"key":"e_1_2_2_3_1","unstructured":"jOOQ. (Tool). https:\/\/www.jooq.org\/ Last accessed on 2024--10."},{"key":"e_1_2_2_4_1","unstructured":"Llama3.1. (Model). https:\/\/huggingface.co\/meta-llama\/Llama-3.1--8B-Instruct Last accessed on 2024--10."},{"key":"e_1_2_2_5_1","unstructured":"Ora2PG. (Tool). https:\/\/ora2pg.darold.net\/ Last accessed on 2024--10."},{"key":"e_1_2_2_6_1","unstructured":"SOAR. (Tool). https:\/\/github.com\/xiaomi\/soar\/ Last accessed on 2024--10."},{"key":"e_1_2_2_7_1","unstructured":"SQLGlot. (Tool). https:\/\/sqlglot.com\/sqlglot.html Last accessed on 2024--10."},{"key":"e_1_2_2_8_1","unstructured":"SQLines. (Tool). https:\/\/www.sqlines.com\/ Last accessed on 2024--10."},{"key":"e_1_2_2_9_1","unstructured":"StackExchange. (Website). https:\/\/archive.org\/details\/stackexchange Last accessed on 2024--10."},{"key":"e_1_2_2_10_1","unstructured":"TPC-DS Benchmark. (TPC). https:\/\/www.tpc.org\/tpcds"},{"key":"e_1_2_2_11_1","unstructured":"TPC-H Benchmark. (TPC). https:\/\/www.tpc.org\/tpch"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.3390\/E24040551"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICACCI.2016.7732441"},{"key":"e_1_2_2_14_1","doi-asserted-by":"crossref","unstructured":"Tao Bai Jinqi Luo Jun Zhao BihanWen and QianWang. 2021. Recent Advances in Adversarial Training for Adversarial Robustness. In IJCAI. 4312--4321.","DOI":"10.24963\/ijcai.2021\/591"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3183713.3190662"},{"key":"e_1_2_2_16_1","volume-title":"Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark. arXiv preprint arXiv:2406.01607","author":"Cao Hongliu","year":"2024","unstructured":"Hongliu Cao. 2024. Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark. arXiv preprint arXiv:2406.01607 (2024)."},{"key":"e_1_2_2_17_1","first-page":"28","article-title":"On Using an Online, Automatic and Non-Intrusive Approach for Rewriting SQL Queries","volume":"5","author":"de Ara\u00fajo Arlino H. M.","year":"2014","unstructured":"Arlino H. M. de Ara\u00fajo, Jos\u00e9 Maria Monteiro, Jos\u00e9 Ant\u00f4nio Fernandes de Mac\u00eado, J\u00falio A. Tavares, Angelo Brayner, and S\u00e9rgio Lifschitz. 2014. On Using an Online, Automatic and Non-Intrusive Approach for Rewriting SQL Queries. J. Inf. Data Manag. 5, 1 (2014), 28--39. https:\/\/sol.sbc.org.br\/journals\/index.php\/jidm\/article\/view\/1517","journal-title":"J. Inf. Data Manag."},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1476793.1476928"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.14778\/2732240.2732246"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00214"},{"key":"e_1_2_2_21_1","volume-title":"Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. CoRR abs\/2403.14608","author":"Han Zeyu","year":"2024","unstructured":"Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. 2024. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. CoRR abs\/2403.14608 (2024)."},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.14778\/3401960.3401970"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.14778\/3659437.3659449"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.14778\/3681954.3682003"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.14778\/3685800.3685838"},{"key":"e_1_2_2_26_1","volume-title":"Fei Huang, Reynold Cheng, and Yongbin Li.","author":"Li Jinyang","year":"2023","unstructured":"Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chen-Chuan Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs. In NeurIPS."},{"key":"e_1_2_2_27_1","volume-title":"Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023","author":"Li Jinyang","year":"2023","unstructured":"Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chen-Chuan Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023. Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http:\/\/papers.nips.cc\/paper_files\/paper\/2023\/hash\/ 83fc8fab1710363050bbd1d4b8cc0021-Abstract-Datasets_and_Benchmarks.html"},{"key":"e_1_2_2_28_1","unstructured":"Raymond Li Loubna Ben Allal Yangtian Zi Niklas Muennighoff Denis Kocetkov Chenghao Mou Marc Marone Christopher Akiki Jia Li Jenny Chim Qian Liu Evgenii Zheltonozhskii Terry Yue Zhuo Thomas Wang Olivier Dehaene Mishig Davaadorj Joel Lamy-Poirier Jo\u00e3o Monteiro Oleh Shliazhko Nicolas Gontier Nicholas Meade Armel Zebaze Ming-Ho Yee Logesh Kumar Umapathi Jian Zhu Benjamin Lipkin Muhtasham Oblokulov Zhiruo Wang Rudra Murthy V Jason Stillerman Siva Sankalp Patel Dmitry Abulkhanov Marco Zocca Manan Dey Zhihan Zhang Nour Moustafa-Fahmy Urvashi Bhattacharyya Wenhao Yu Swayam Singh Sasha Luccioni Paulo Villegas Maxim Kunakov Fedor Zhdanov Manuel Romero Tony Lee Nadav Timor Jennifer Ding Claire Schlesinger Hailey Schoelkopf Jan Ebert Tri Dao Mayank Mishra Alex Gu Jennifer Robinson Carolyn Jane Anderson Brendan Dolan- Gavitt Danish Contractor Siva Reddy Daniel Fried Dzmitry Bahdanau Yacine Jernite Carlos Mu\u00f1oz Ferrandis Sean Hughes Thomas Wolf Arjun Guha Leandro von Werra and Harm de Vries. 2023. StarCoder: may the source be with you! arXiv Preprint (2023). https:\/\/arxiv.org\/abs\/2305.06161"},{"key":"e_1_2_2_29_1","volume-title":"LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency. CoRR abs\/2404.12872","author":"Li Zhaodonghui","year":"2024","unstructured":"Zhaodonghui Li, Haitao Yuan, Huiming Wang, Gao Cong, and Lidong Bing. 2024. LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency. CoRR abs\/2404.12872 (2024)."},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2403.09060"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/50202.50204"},{"key":"e_1_2_2_32_1","unstructured":"Edward Ma. 2019. NLP Augmentation. https:\/\/github.com\/makcedward\/nlpaug."},{"key":"e_1_2_2_33_1","volume-title":"Large language models: A survey. arXiv preprint arXiv:2402.06196","author":"Minaee Shervin","year":"2024","unstructured":"Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. 2024. Large language models: A survey. arXiv preprint arXiv:2402.06196 (2024)."},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2210.07316"},{"key":"e_1_2_2_35_1","unstructured":"Hannes M\u00fchleisen and Mark Raasveldt. 2025. Runtime-Extensible Parsers. CIDR (2025)."},{"key":"e_1_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/3663742.3663973"},{"key":"e_1_2_2_37_1","first-page":"1","article-title":"Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code","volume":"82","author":"Pan Rangeet","year":"2024","unstructured":"Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2024. Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code. In ICSE. ACM, 82:1--82:13.","journal-title":"ICSE. ACM"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/130283.130294"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","unstructured":"Praveen Seshadri Joseph M. Hellerstein Hamid Pirahesh T. Y. Cliff Leung Raghu Ramakrishnan Divesh Srivastava Peter J. Stuckey and S. Sudarshan. 1996. Cost-Based Optimization for Magic: Algebra and Implementation. In SIGMOD H. V. Jagadish and Inderpal Singh Mumick (Eds.). ACM Press 435--446. doi:10.1145\/233269.233360","DOI":"10.1145\/233269.233360"},{"key":"e_1_2_2_40_1","volume-title":"14th Conference on Innovative Data Systems Research, CIDR 2024","author":"Singh Vikramank Y.","year":"2024","unstructured":"Vikramank Y. Singh, Kapil Vaidya, Vinayshekhar Bannihatti Kumar, Sopan Khosla, Balakrishnan Narayanaswamy, Rashmi Gangadharaiah, and Tim Kraska. 2024. Panda: Performance Debugging for Databases using LLM Agents. In 14th Conference on Innovative Data Systems Research, CIDR 2024, Chaminade, HI, USA, January 14--17, 2024. www.cidrdb.org. https:\/\/www.cidrdb.org\/cidr2024\/papers\/p6-singh.pdf"},{"key":"e_1_2_2_41_1","article-title":"Visualizing data using t-SNE","volume":"9","author":"der Maaten Laurens Van","year":"2008","unstructured":"Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, 11 (2008).","journal-title":"Journal of machine learning research"},{"key":"e_1_2_2_42_1","unstructured":"Shenzhi Wang Yaowei Zheng Guoyin Wang Shiji Song and Gao Huang. 2024. Llama3.1--8B-Chinese-Chat. doi:10. 57967\/hf\/2779"},{"key":"e_1_2_2_43_1","volume-title":"WeTune: Automatic Discovery and Verification of Query Rewrite Rules. In SIGMOD Conference. 94--107","author":"Zhou Zhou","year":"2022","unstructured":"ZhaoguoWang, Zhou Zhou, Yicun Yang, Haoran Ding, Gansen Hu, Ding Ding, Chuzhe Tang, Haibo Chen, and Jinyang Li. 2022. WeTune: Automatic Discovery and Verification of Query Rewrite Rules. In SIGMOD Conference. 94--107."},{"key":"e_1_2_2_44_1","volume-title":"CrackSQL: A Hybrid SQL Dialect Translation System Powered by Large Language Models. arXiv Preprint","author":"Zhou Wei","year":"2025","unstructured":"Wei Zhou, Yuyang Gao, Xuanhe Zhou, and Guoliang Li. 2025. CrackSQL: A Hybrid SQL Dialect Translation System Powered by Large Language Models. arXiv Preprint (2025). https:\/\/arxiv.org\/abs\/2504.00882"},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.14778\/3675034.3675035"},{"key":"e_1_2_2_46_1","volume-title":"TRAP: Tailored Robustness Assessement for Index Advisors via Adversarial Perturbation. In ICDE. to appear.","author":"Zhou Wei","year":"2024","unstructured":"Wei Zhou, Chen Lin, Xuanhe Zhou, Guoliang Li, and Tianqing Wang. 2024. TRAP: Tailored Robustness Assessement for Index Advisors via Adversarial Perturbation. In ICDE. to appear."},{"key":"e_1_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.14778\/3485450.3485456"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.14778\/3675034.3675043"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3725278","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T18:54:28Z","timestamp":1774983268000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3725278"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,17]]},"references-count":48,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,6,17]]}},"alternative-id":["10.1145\/3725278"],"URL":"https:\/\/doi.org\/10.1145\/3725278","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,6,17]]}}}