{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T21:23:39Z","timestamp":1770240219290,"version":"3.49.0"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"FSE","license":[{"start":{"date-parts":[[2024,7,12]],"date-time":"2024-07-12T00:00:00Z","timestamp":1720742400000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["2106420"],"award-info":[{"award-number":["2106420"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Softw. Eng."],"published-print":{"date-parts":[[2024,7,12]]},"abstract":"<jats:p>\n                    SQL is the most commonly used front-end language for data-intensive scalable computing (DISC) applications due to its broad presence in new and legacy workflows and shallow learning curve. However, DISC-backed SQL introduces several layers of abstraction that significantly reduce the visibility and transparency of workflows, making it challenging for developers to find and fix errors in a query. When a query returns incorrect outputs, it takes a non-trivial effort to comprehend every stage of the query execution and find the root cause among the input data and complex SQL query. 
We aim to bring the benefits of\n                    <jats:italic toggle=\"yes\">step-through interactive debugging<\/jats:italic>\n                    to DISC-powered SQL with D\n                    <jats:sc>e<\/jats:sc>\n                    SQL.\n                  <\/jats:p>\n                  <jats:p>\n                    Due to the declarative nature of SQL, there are no ordered atomic statements to place a breakpoint to monitor the flow of data. D\n                    <jats:sc>e<\/jats:sc>\n                    SQL\u2019s\n                    <jats:italic toggle=\"yes\">automated query decomposition<\/jats:italic>\n                    breaks a SQL query into its constituent subqueries, offering natural locations for setting breakpoints and monitoring intermediate data. However, due to advanced query optimization and translation in DISC systems, a user query rarely matches the physical execution, making it challenging to associate subqueries with their intermediate data. D\n                    <jats:sc>e<\/jats:sc>\n                    SQL performs fine-grained taint analysis to dynamically map the subqueries to their intermediate data, while also recognizing subqueries removed by the optimizers. For such subqueries, D\n                    <jats:sc>e<\/jats:sc>\n                    SQL efficiently regenerates the intermediate data from a nearby subquery\u2019s data. On the popular TPC-DS benchmark, D\n                    <jats:sc>e<\/jats:sc>\n                    SQL provides a complete debugging view in 13% less time than the original job time while incurring an average overhead of 10% in addition to retaining Apache Spark\u2019s scalability. 
In a user study comprising 15 participants engaged in two debugging tasks, we find that participants utilizing D\n                    <jats:sc>e<\/jats:sc>\n                    SQL identify the root cause behind a wrong query output in 74% less time than the de-facto, manual debugging.\n                  <\/jats:p>","DOI":"10.1145\/3643761","type":"journal-article","created":{"date-parts":[[2024,7,12]],"date-time":"2024-07-12T10:22:09Z","timestamp":1720779729000},"page":"767-788","source":"Crossref","is-referenced-by-count":2,"title":["DeSQL: Interactive Debugging of SQL in Data-Intensive Scalable Computing"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-4655-6912","authenticated-orcid":false,"given":"Sabaat","family":"Haroon","sequence":"first","affiliation":[{"name":"Virginia Tech, Blacksburg, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6036-4733","authenticated-orcid":false,"given":"Chris","family":"Brown","sequence":"additional","affiliation":[{"name":"Virginia Tech, Blacksburg, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8007-8662","authenticated-orcid":false,"given":"Muhammad Ali","family":"Gulzar","sequence":"additional","affiliation":[{"name":"Virginia Tech, Blacksburg, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,7,12]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1145\/2380116.2380144"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/2723372.2742797"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","unstructured":"Herman Banken Erik Meijer and Georgios Gousios. 2018. Debugging Data Flows in Reactive Programs. Proceedings of the 40th International Conference on Software Engineering (ICSE \u201918). Association for Computing Machinery New York NY USA 752-763. 
https:\/\/doi.org\/10.1145\/3180155.3180156 10.1145\/3180155.3180156","DOI":"10.1145\/3180155.3180156"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","unstructured":"Leilani Battle Danyel Fisher Robert DeLine Mike Barnett Badrish Chandramouli and Jonathan Goldstein. 2016. Making Sense of Temporal Queries with Interactive Visualization. Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI \u201916). Association for Computing Machinery New York NY USA 5433-5443. https:\/\/doi.org\/10.1145\/2858036.2858408 10.1145\/2858036.2858408","DOI":"10.1145\/2858036.2858408"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","unstructured":"Laura Chiticariu Wang-Chiew Tan and Gaurav Vijayvargiya. 2005. DBNotes: A Post-It System for Relational Databases Based on Provenance. Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD \u201905). Association for Computing Machinery New York NY USA 942-944. https:\/\/doi.org\/10.1145\/1066157.1066296 10.1145\/1066157.1066296","DOI":"10.1145\/1066157.1066296"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.14778\/2994509.2994530"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","unstructured":"James Clause Wanchun Li and Alessandro Orso. 2007. Dytan: A Generic Dynamic Taint Analysis Framework. Proceedings of the 2007 International Symposium on Software Testing and Analysis (ISSTA\u201907). Association for Computing Machinery New York NY USA 471-480. https:\/\/doi.org\/10.1145\/1273463.1273490 10.1145\/1273463.1273490","DOI":"10.1145\/1273463.1273490"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","unstructured":"Bertty Contreras-Rojas Jorge-Arnulfo Quian\u00e9-Ruiz Zoi Kaoudi and Saravanan Thirumuruganathan. 2019. TagSniff: Simplified Big Data Debugging for Dataflow Jobs. Proceedings of the ACM Symposium on Cloud Computing (SoCC \u201919). Association for Computing Machinery New York NY USA 453-464. 
https:\/\/doi.org\/10.1145\/3357223.3362738 10.1145\/3357223.3362738","DOI":"10.1145\/3357223.3362738"},{"key":"e_1_3_1_10_2","unstructured":"Yingwei Cui and Jennifer Widom. 2001. Lineage Tracing for General Data Warehouse Transformations. Proceedings of the 27th International Conference on Very Large Data Bases (VLDB \u201901). Morgan Kaufmann Publishers Inc. San Francisco CA USA 471-480."},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/357775.357777"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/1327452.1327492"},{"key":"e_1_3_1_13_2","unstructured":"Rodrigo Fonseca George Porter Randy H. Katz and Scott Shenker. 2007. X-Trace: A Pervasive Network Tracing Framework. 4th USENIX Symposium on Networked Systems Design & Implementation (NSDI 07). USENIX Association Cambridge MA. https:\/\/www.usenix.org\/conference\/nsdi-07\/x-trace-pervasive-network-tracing-framework"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","unstructured":"Sneha Gathani Peter Lim and Leilani Battle. 2020. Debugging Database Queries: A Survey of Tools Techniques and Users. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI \u201920). Association for Computing Machinery New York NY USA 1\u201316. https:\/\/doi.org\/10.1145\/3313831.3376485 10.1145\/3313831.3376485","DOI":"10.1145\/3313831.3376485"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","unstructured":"Boris Glavic and Gustavo Alonso. 2009. Provenance for Nested Subqueries. Proceedings of the 12th International Conference on Extending Database Technology (EDBT \u201909). Association for Computing Machinery New York NY USA 482-493. https:\/\/doi.org\/10.1145\/1516360.1516472 10.1145\/1516360.1516472","DOI":"10.1145\/1516360.1516472"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","unstructured":"Torsten Grust Fabian Kliebhan Jan Rittinger and Tom Schreiber. 2011. True Language-Level SQL Debugging. 
Proceedings of the 14th International Conference on Extending Database Technology (EDBT\/ICDT \u201911). Association for Computing Machinery New York NY USA 562-565. https:\/\/doi.org\/10.1145\/1951365.1951441 10.1145\/1951365.1951441","DOI":"10.1145\/1951365.1951441"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","unstructured":"Muhammad Ali Gulzar Matteo Interlandi Xueyuan Han Mingda Li Tyson Condie and Miryung Kim. 2017. Automated Debugging in Data-Intensive Scalable Computing. Proceedings of the 2017 Symposium on Cloud Computing (SoCC \u201917). Association for Computing Machinery New York NY USA 520-534. https:\/\/doi.org\/10.1145\/3127479.3131624 10.1145\/3127479.3131624","DOI":"10.1145\/3127479.3131624"},{"key":"e_1_3_1_18_2","doi-asserted-by":"crossref","unstructured":"Muhammad Ali Gulzar Matteo Interlandi Seunghyun Yoo Sai Deep Tetali Tyson Condie Todd D. Millstein and Miryung Kim. 2016. BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark. 2016 IEEE\/ACM 38th International Conference on Software Engineering (ICSE) (2016). 784-795.","DOI":"10.1145\/2884781.2884813"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","unstructured":"Muhammad Ali Gulzar and Miryung Kim. 2021. OptDebug: Fault-Inducing Operation Isolation for Dataflow Applications. Proceedings of the ACM Symposium on Cloud Computing (SoCC \u201921). Association for Computing Machinery New York NY USA 520-534. https:\/\/doi.org\/10.1145\/3472883.3487016 10.1145\/3472883.3487016","DOI":"10.1145\/3472883.3487016"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","unstructured":"Sabaat Haroon. 2024. DeSQL Artifacts. https:\/\/doi.org\/10.5281\/zenodo.11069504 10.5281\/zenodo.11069504","DOI":"10.5281\/zenodo.11069504"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-11217-1_23"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.14778\/2850583.2850595"},{"key":"e_1_3_1_23_2","unstructured":"Keeptool. accessed 2023. Keeptool. 
https:\/\/keeptool.com\/en\/ Accessed on March 21 2023"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","unstructured":"YongChul Kwon Magdalena Balazinska Bill Howe and Jerome Rolia. 2012. SkewTune: Mitigating Skew in Mapreduce Applications. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (Scottsdale Arizona USA) (SIGMOD \u201912). Association for Computing Machinery New York NY USA 25-36. https:\/\/doi.org\/10.1145\/2213836.2213840 10.1145\/2213836.2213840","DOI":"10.1145\/2213836.2213840"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415528"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","unstructured":"Zhengjie Miao Sudeepa Roy and Jun Yang. 2019. Explaining Wrong Queries Using Small Examples. Proceedings of the 2019 International Conference on Management of Data (Amsterdam Netherlands) (SIGMOD \u201919). Association for Computing Machinery New York NY USA 503-520. https:\/\/doi.org\/10.1145\/3299869.3319866 10.1145\/3299869.3319866","DOI":"10.1145\/3299869.3319866"},{"key":"e_1_3_1_27_2","unstructured":"Microsoft. accessed 2023. Transact-SQL Debugger. https:\/\/learn.microsoft.com\/en-us\/sql\/ssms\/scripting\/transact-sql-debugger?view=sql-server-ver16 Accessed on March 21 2023."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","unstructured":"Davide Mottin Alice Marascu Senjuti Basu Roy Gautam Das Themis Palpanas and Yannis Velegrakis. 2014. >IQR: An Interactive Query Relaxation System for the Empty-Answer Problem. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (Snowbird Utah USA) (SIGMOD \u201914). Association for Computing Machinery New York NY USA 1095-1098. https:\/\/doi.org\/10.1145\/2588555.2594512 10.1145\/2588555.2594512","DOI":"10.1145\/2588555.2594512"},{"key":"e_1_3_1_29_2","unstructured":"Michi Mutsuzaki Martin Theobald Ander de Keijzer Jennifer Widom Parag Agrawal Omar Benjelloun Anish Das Sarma Raghotham Murthy and Tomoe Sugihara. 
2007. Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS (Demo). Conference on Innovative Data Systems Research"},{"key":"e_1_3_1_30_2","unstructured":"Raghunath Othayoth Nambiar and Meikel Poess. 2006. The Making of TPC-DS. Proceedings of the 32nd International Conference on Very Large Data Bases (Seoul Korea) (VLDB \u201906). VLDB Endowment 1049-1058."},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.14778\/3402755.3402758"},{"key":"e_1_3_1_32_2","unstructured":"Postgres Professional. 2023. Postgres Pro Demo Database. https:\/\/postgrespro.com\/community\/demodb"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.14778\/3199517.3199522"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","unstructured":"Sudeepa Roy and Dan Suciu. 2014. A Formal Approach to Finding Explanations for Database Queries. Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (Snowbird Utah USA) (SIGMOD \u201914). Association for Computing Machinery New York NY USA 1579-1590. https:\/\/doi.org\/10.1145\/2588555.2588578 10.1145\/2588555.2588578","DOI":"10.1145\/2588555.2588578"},{"key":"e_1_3_1_35_2","unstructured":"Ron Savage. 2023. SQL-2003 BNF Grammar. https:\/\/ronsavage.github.io\/SQL\/sql-2003-2.bnf.html Accessed on 09-28 2023."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3231712"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","unstructured":"Jason Teoh Muhammad Ali Gulzar and Miryung Kim. 2020. Influence-Based Provenance for Dataflow Applications with Taint Propagation. Proceedings of the 11th ACM Symposium on Cloud Computing (Virtual Event USA) (SoCC\u201920). Association for Computing Machinery New York NY USA 372-386. 
https:\/\/doi.org\/10.1145\/3419111.3421292 10.1145\/3419111.3421292","DOI":"10.1145\/3419111.3421292"},{"key":"e_1_3_1_38_2","doi-asserted-by":"crossref","unstructured":"Ashish Thusoo Joydeep Sen Sarma Namit Jain Zheng Shao Prasad Chakka Ning Zhang Suresh Anthon Hao Liu and Raghotham Murthy. 2010. Hive - A Petabyte Scale Data Warehouse using Hadoop. In ICDE Feifei Li Mirella M. Moro Shahram Ghandeharizadeh Jayant R. Haritsa Gerhard Weikum Michael J. Carey Fabio Casati Edward Y. Chang Ioana Manolescu Sharad Mehrotra Umeshwar Dayal and Vassilis J. Tsotras (Eds.). IEEE 996\u20131005. http:\/\/infolab.stanford.edu\/~ragho\/hive-icde2010.pdf","DOI":"10.1109\/ICDE.2010.5447738"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.14778\/3415478.3415517"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.14778\/2536354.2536356"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","unstructured":"Chengxu Yang Yuanchun Li Mengwei Xu Zhenpeng Chen Yunxin Liu Gang Huang and Xuanzhe Liu. 2021. TaintStream: Fine-Grained Taint Tracking for Big Data Platforms through Dynamic Code Translation. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Athens Greece) (ESEC\/FSE 2021). Association for Computing Machinery New York NY USA 806-817. 
https:\/\/doi.org\/10.1145\/3468264.3468532 10.1145\/3468264.3468532","DOI":"10.1145\/3468264.3468532"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/2934664"}],"container-title":["Proceedings of the ACM on Software Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3643761","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3643761","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3643761","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T07:58:48Z","timestamp":1770191928000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3643761"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,7,12]]},"references-count":41,"journal-issue":{"issue":"FSE","published-print":{"date-parts":[[2024,7,12]]}},"alternative-id":["10.1145\/3643761"],"URL":"https:\/\/doi.org\/10.1145\/3643761","relation":{},"ISSN":["2994-970X"],"issn-type":[{"value":"2994-970X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,7,12]]}}}