{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,19]],"date-time":"2026-03-19T02:21:48Z","timestamp":1773886908174,"version":"3.50.1"},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,10]]},"abstract":"<jats:p>\n            GPU-accelerated relational query execution engines have parallelized the execution of a pipeline, a sequence of operators. For the parallelization, the engines evenly partition the tuples in a table that will be scanned by the pipeline's first operator (a scan), and each thread executes the pipeline for the tuples in a partition. However, this approach leads to load imbalances since an operator returns a varying number of output tuples per input tuple, particularly under non-uniform data distributions such as skewed join key values. The load imbalances are classified into\n            <jats:italic>intra- and inter-warp load imbalances<\/jats:italic>\n            (intra-WLIs and inter-WLIs) since 1) threads are grouped into warps and 2) every thread in a warp evaluates the same operator for an input tuple concurrently following a\n            <jats:italic>single-instruction-multiple-thread<\/jats:italic>\n            manner. In contrast, threads in different warps can evaluate different operators concurrently. Although load balancing techniques have been proposed, however, they fail to solve the load imbalances on various workloads. In this paper, we propose a query execution engine, Themis, named after the deity of fairness, which symbolizes balanced workloads within our context. Themis minimizes intra-WLIs and inter-WLIs across various workloads. First, Themis minimizes intra-WLIs by redistributing tuples between the threads in a warp and making the threads evaluate an operator only when all of them hold inputs. Second, Themis mitigates the inter-WLIs by redistributing the tuples of warps with heavy workloads to idle warps. To check whether a warp's workload is heavy, we propose a method to approximate the sizes of warps' workloads. Based on these approximations, Themis adaptively adjusts the threshold for determining a warp's workload as heavy. In a recent benchmark JCC-H, which introduces skewed join key distributions to TPC-H, Themis significantly alleviates the inter-WLIs and intra-WLIs, outperforming the runner-up by up to 379x.\n          <\/jats:p>","DOI":"10.14778\/3705829.3705856","type":"journal-article","created":{"date-parts":[[2025,2,28]],"date-time":"2025-02-28T23:21:06Z","timestamp":1740784866000},"page":"426-438","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Themis: A GPU-Accelerated Relational Query Execution Engine"],"prefix":"10.14778","volume":"18","author":[{"given":"Kijae","family":"Hong","sequence":"first","affiliation":[{"name":"POSTECH"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kyoungmin","family":"Kim","sequence":"additional","affiliation":[{"name":"EPFL"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Young-Koo","family":"Lee","sequence":"additional","affiliation":[{"name":"Kyunghee University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yang-Sae","family":"Moon","sequence":"additional","affiliation":[{"name":"Kangwon National University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sourav S","family":"Bhowmick","sequence":"additional","affiliation":[{"name":"Nanyang Technological University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wook-Shin","family":"Han","sequence":"additional","affiliation":[{"name":"GSAI, POSTECH"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,2,28]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"GPU Computing Gems Jade Edition","author":"Alcantara Dan A","unstructured":"Dan A Alcantara, Vasily Volkov, Shubhabrata Sengupta, Michael Mitzenmacher, John D Owens, and Nina Amenta. 2012. Building an efficient hash table on the GPU. In GPU Computing Gems Jade Edition. Elsevier, 39--53."},{"key":"e_1_2_1_2_1","volume-title":"Performance Evaluation and Benchmarking for the Analytics Era: 9th TPC Technology Conference, TPCTC 2017","author":"Boncz Peter","year":"2018","unstructured":"Peter Boncz, Angelos-Christos Anatiotis, and Steffen Kl\u00e4be. 2018. JCC-H: adding join crossing correlations with skew to TPC-H. In Performance Evaluation and Benchmarking for the Analytics Era: 9th TPC Technology Conference, TPCTC 2017, Munich, Germany, August 28, 2017, Revised Selected Papers 9. Springer, 103--119."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-018-0512-y"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.14778\/3632093.3632107"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-017-0476-3"},{"key":"e_1_2_1_7_1","unstructured":"Funke. 2022. Github repository of DogQC. https:\/\/github.com\/Henning1\/dogqc"},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the 2018 International Conference on Management of Data. 1603--1618","author":"Funke Henning","year":"2018","unstructured":"Henning Funke, Sebastian Bre\u00df, Stefan Noll, Volker Markl, and Jens Teubner. 2018. Pipelined query processing in coprocessor environments. In Proceedings of the 2018 International Conference on Management of Data. 1603--1618."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.14778\/3380750.3380758"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3389699"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2020.3035564"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376670"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3208040.3208062"},{"key":"e_1_2_1_14_1","volume-title":"2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 27--40","author":"Hong Changwan","year":"2017","unstructured":"Changwan Hong, Aravind Sukumaran-Rajam, Jinsung Kim, and P Sadayappan. 2017. MultiGraph: Efficient graph processing on GPUs. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 27--40."},{"key":"e_1_2_1_15_1","volume-title":"Dissecting the NVIDIA volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826","author":"Jia Zhe","year":"2018","unstructured":"Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. 2018. Dissecting the NVIDIA volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826 (2018)."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079079.3079080"},{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the 2022 International Conference on Management of Data. 2506--2508","author":"Justen David","year":"2022","unstructured":"David Justen. 2022. Cost-efficiency and Performance Robustness in Serverless Data Exchange. In Proceedings of the 2022 International Conference on Management of Data. 2506--2508."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830796"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2015.15"},{"key":"e_1_2_1_20_1","volume-title":"2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 524--533","author":"Khorasani Farzad","year":"2016","unstructured":"Farzad Khorasani, Bryan Rowe, Rajiv Gupta, and Laxmi N Bhuyan. 2016. Eliminating intra-warp load imbalance in irregular nested patterns via collaborative task engagement. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 524--533."},{"key":"e_1_2_1_21_1","volume-title":"Kyoungmin Kim and Wook-Shin Han","author":"Lee Yang-Sae Moon Young-Koo","year":"2024","unstructured":"Young-Koo Lee Yang-Sae Moon Sourav S Bhowmick Kijae Hong, Kyoungmin Kim and Wook-Shin Han. 2024. The full paper of Themis. https:\/\/drive.google.com\/drive\/folders\/1FoDxT4uS25iezcEgOY2ZMMHMQD0N0ej5"},{"key":"e_1_2_1_22_1","volume-title":"Proceedings of the Conference on Innovative Data Systems Research (CIDR).","author":"Kossmann Jan","year":"2022","unstructured":"Jan Kossmann, Daniel Lindner, Felix Naumann, and Thorsten Papenbrock. 2022. Workload-driven, lazy discovery of data dependencies for query optimization. In Proceedings of the Conference on Innovative Data Systems Research (CIDR)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-021-00708-y"},{"key":"e_1_2_1_24_1","volume-title":"Proceedings of the 14th international workshop on data management on new hardware. 1--8.","author":"Lang Harald","year":"2018","unstructured":"Harald Lang, Andreas Kipf, Linnea Passing, Peter Boncz, Thomas Neumann, and Alfons Kemper. 2018. Make the most out of your SIMD investments: counter control flow divergence in compiled query pipelines. In Proceedings of the 14th international workshop on data management on new hardware. 1--8."},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/2588555.2610507"},{"key":"e_1_2_1_26_1","volume-title":"2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 135--142","author":"Li Zhihao","year":"2017","unstructured":"Zhihao Li, Haipeng Jia, and Yunquan Zhang. 2017. HartSift: A high-accuracy and real-time SIFT based on GPU. In 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 135--142."},{"key":"e_1_2_1_27_1","volume-title":"Network motif discovery: A GPU approach","author":"Lin Wenqing","year":"2016","unstructured":"Wenqing Lin, Xiaokui Xiao, Xing Xie, and Xiao-Li Li. 2016. Network motif discovery: A GPU approach. IEEE transactions on knowledge and data engineering 29, 3 (2016), 513--528."},{"key":"e_1_2_1_28_1","volume-title":"Advances in Databases and Information Systems: 7th East European Conference, ADBIS 2003, Dresden, Germany, September 3--6, 2003. Proceedings 7. Springer, 236--252","author":"Morzy Miko\u0142aj","year":"2003","unstructured":"Miko\u0142aj Morzy, Tadeusz Morzy, Alexandros Nanopoulos, and Yannis Manolopoulos. 2003. Hierarchical bitmap index: An efficient and scalable indexing technique for set-valued attributes. In Advances in Databases and Information Systems: 7th East European Conference, ADBIS 2003, Dresden, Germany, September 3--6, 2003. Proceedings 7. Springer, 236--252."},{"key":"e_1_2_1_29_1","volume-title":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 123--133","author":"Nisa Israt","year":"2019","unstructured":"Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, Richard Vuduc, and Ponnuswamy Sadayappan. 2019. Load-balanced sparse mttkrp on gpus. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 123--133."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3296957.3173180"},{"key":"e_1_2_1_31_1","unstructured":"NVIDIA. 2020. NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1. https:\/\/images.nvidia.com\/aem-dam\/en-zz\/Solutions\/geforce\/ampere\/pdf\/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf"},{"key":"e_1_2_1_32_1","unstructured":"NVIDIA. 2022. CUDA C++ Programming Guide. https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html#global-memory-5-x"},{"key":"e_1_2_1_33_1","first-page":"50","article-title":"The star schema benchmark (SSB)","volume":"200","author":"Neil Patrick E","year":"2007","unstructured":"Patrick E ONeil, Elizabeth J ONeil, and Xuedong Chen. 2007. The star schema benchmark (SSB). Pat 200, 0 (2007), 50.","journal-title":"Pat"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.14778\/3425879.3425890"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2915224"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.14778\/3007328.3007336"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1007\/s00778-020-00621-w"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3318464.3380595"},{"key":"e_1_2_1_39_1","volume-title":"2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 466--475","author":"Shen Yijie","year":"2020","unstructured":"Yijie Shen, Jin Xiong, and Dejun Jiang. 2020. SrSpark: Skew-resilient spark based on adaptive parallel processing. In 2020 IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 466--475."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.14778\/3357377.3357379"},{"key":"e_1_2_1_41_1","unstructured":"TPC. 2022. TPC-H. https:\/\/www.tpc.org\/tpch\/"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3464389"},{"key":"e_1_2_1_43_1","doi-asserted-by":"crossref","first-page":"905","DOI":"10.14778\/3574245.3574272","article-title":"SkinnerMT: Parallelizing for Efficiency and Robustness in Adaptive Query Processing on Multicore Platforms","volume":"16","author":"Wei Ziyun","year":"2022","unstructured":"Ziyun Wei and Immanuel Trummer. 2022. SkinnerMT: Parallelizing for Efficiency and Robustness in Adaptive Query Processing on Multicore Platforms. Proceedings of the VLDB Endowment 16, 4 (2022), 905--917.","journal-title":"Proceedings of the VLDB Endowment"},{"key":"e_1_2_1_44_1","volume-title":"2012 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE, 107--118","author":"Wu Haicheng","year":"2012","unstructured":"Haicheng Wu, Gregory Diamos, Srihari Cadambi, and Sudhakar Yalamanchili. 2012. Kernel weaver: Automatically fusing database primitives for efficient gpu computation. In 2012 45th Annual IEEE\/ACM International Symposium on Microarchitecture. IEEE, 107--118."},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2581122.2544166"},{"key":"e_1_2_1_46_1","volume-title":"International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures","volume":"10","author":"Wu Haicheng","year":"2014","unstructured":"Haicheng Wu, Daniel Zinn, Molham Aref, and Sudhakar Yalamanchili. 2014. Multipredicate join algorithms for accelerating relational graph processing on GPUs. In International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures, Vol. 10."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476214"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3705829.3705856","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,28]],"date-time":"2025-02-28T23:24:57Z","timestamp":1740785097000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3705829.3705856"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,10]]},"references-count":46,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,10]]}},"alternative-id":["10.14778\/3705829.3705856"],"URL":"https:\/\/doi.org\/10.14778\/3705829.3705856","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,10]]},"assertion":[{"value":"2025-02-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}