{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,23]],"date-time":"2026-03-23T23:09:41Z","timestamp":1774307381561,"version":"3.50.1"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"12","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2024,8]]},"abstract":"<jats:p>The popularity of heterogeneous CPU-GPU processing has increased considerably in recent years. To efficiently utilize heterogeneous resources, data processing systems depend on an appropriate workload placement strategy to assign the right amount of compute to the right processor. However, finding an optimal placement strategy is not trivial due to various complex and conflicting tradeoffs related to the characteristics of processors, the nature of the workload, and data locality. In addition, placement decisions impact workload runtime and performance cost, and also depend on the availability of potentially different implementations for CPUs and GPUs, which adds extra complexity in such heterogeneous environments. In this tutorial, we review and compare state-of-the-art strategies for workload placement on heterogeneous CPU-GPU architectures, along with runtime prediction techniques and methods to support multi-device code. We also discuss open issues and identify potentially promising future research directions.<\/jats:p>","DOI":"10.14778\/3685800.3685845","type":"journal-article","created":{"date-parts":[[2024,11,8]],"date-time":"2024-11-08T17:25:21Z","timestamp":1731086721000},"page":"4241-4244","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Workload Placement on Heterogeneous CPU-GPU Systems"],"prefix":"10.14778","volume":"17","author":[{"given":"Marcos N. L.","family":"Carvalho","sequence":"first","affiliation":[{"name":"UPC, NKUA, Athena RC"}]},{"given":"Alkis","family":"Simitsis","sequence":"additional","affiliation":[{"name":"Athena Research Center"}]},{"given":"Anna","family":"Queralt","sequence":"additional","affiliation":[{"name":"UPC, BSC"}]},{"given":"Oscar","family":"Romero","sequence":"additional","affiliation":[{"name":"UPC"}]}],"member":"320","published-online":{"date-parts":[[2024,11,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"crossref","unstructured":"N. Boeschen et al. 2022. Gacco-a gpu-accelerated oltp dbms. In ACM SIGMOD.","DOI":"10.1145\/3514221.3517876"},{"key":"e_1_2_1_2_1","doi-asserted-by":"crossref","unstructured":"A. Kamatar et al. 2020. Locality-aware scheduling for scalable heterogeneous environments. In IEEE\/ACM ROSS.","DOI":"10.1109\/ROSS51935.2020.00011"},{"key":"e_1_2_1_3_1","doi-asserted-by":"crossref","unstructured":"A. K. Ziogas et al. 2021. NPBench: A benchmarking suite for high-performance NumPy. In ACM ICS.","DOI":"10.1145\/3447818.3460360"},{"key":"e_1_2_1_4_1","doi-asserted-by":"crossref","unstructured":"A. Shanbhag et al. 2020. A study of the fundamental performance characteristics of GPUs and CPUs for database analytics. In ACM SIGMOD.","DOI":"10.1145\/3318464.3380595"},{"key":"e_1_2_1_5_1","doi-asserted-by":"crossref","unstructured":"A. Shanbhag et al. 2022. Tile-based lightweight integer compression in GPU. In ACM SIGMOD.","DOI":"10.1145\/3514221.3526132"},{"key":"e_1_2_1_6_1","doi-asserted-by":"crossref","unstructured":"B. He et al. 2009. Relational query coprocessing on graphics processors. ACM TODS 34 4 (2009).","DOI":"10.1145\/1620585.1620588"},{"key":"e_1_2_1_7_1","volume-title":"Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems","author":"P\u00e9rez B.","year":"2021","unstructured":"B. P\u00e9rez et al. 2021. Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems. Elsevier JPDC 157 (2021)."},{"key":"e_1_2_1_8_1","doi-asserted-by":"crossref","unstructured":"B. Yogatama et al. 2022. Orchestrating data placement and query execution in heterogeneous CPU-GPU DBMS. PVLDB 15 11 (2022).","DOI":"10.14778\/3551793.3551809"},{"key":"e_1_2_1_9_1","doi-asserted-by":"crossref","unstructured":"Cui et al. 2024. CGgraph: An Ultra-fast Graph Processing System on Modern Commodity CPU-GPU Co-processor. PVLDB (2024).","DOI":"10.14778\/3648160.3648179"},{"key":"e_1_2_1_10_1","doi-asserted-by":"crossref","unstructured":"C. Chen et al. 2018. GFlink: An in-memory computing architecture on heterogeneous CPU-GPU clusters for big data. IEEE TPDS 29 6 (2018).","DOI":"10.1109\/TPDS.2018.2794343"},{"key":"e_1_2_1_11_1","doi-asserted-by":"crossref","unstructured":"C. Lutz et al. 2020. Pump up the volume: Processing large data on GPUs with fast interconnects. In ACM SIGMOD.","DOI":"10.1145\/3318464.3389705"},{"key":"e_1_2_1_12_1","volume-title":"Parla: A python orchestration system for heterogeneous architectures","author":"Lee H.","year":"2022","unstructured":"H. Lee et al. 2022. Parla: A python orchestration system for heterogeneous architectures. In IEEE SC."},{"key":"e_1_2_1_13_1","unstructured":"H. Nicholson et al. 2023. HetCache: Synergising NVMe Storage and GPU-acceleration for Memory-Efficient Analytics. CIDR (2023)."},{"key":"e_1_2_1_14_1","doi-asserted-by":"crossref","unstructured":"H. Zhang et al. 2020. Learning-driven interference-aware workload parallelization for streaming applications in heterogeneous cluster. IEEE TPDS 32 1 (2020).","DOI":"10.1109\/TPDS.2020.3008725"},{"key":"e_1_2_1_15_1","unstructured":"J. Ren et al. 2021. {Zero-offload}: Democratizing {billion-scale} model training. In USENIX ATC."},{"key":"e_1_2_1_16_1","volume-title":"Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv","author":"Abadi M.","year":"2016","unstructured":"M. Abadi et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv (2016)."},{"key":"e_1_2_1_17_1","doi-asserted-by":"crossref","unstructured":"M. Gowanlock et al. 2019. Accelerating the unacceleratable: Hybrid CPU\/GPU algorithms for memory-bound database primitives. In ACM DaMoN.","DOI":"10.1145\/3329785.3329926"},{"key":"e_1_2_1_18_1","volume-title":"NL Carvalho et al","author":"M.","year":"2024","unstructured":"M. NL Carvalho et al. 2024. Performance Analysis of Distributed GPU-Accelerated Task-Based Workflows. EDBT (2024)."},{"key":"e_1_2_1_19_1","doi-asserted-by":"crossref","unstructured":"M. Xekalaki et al. 2022. Enabling Transparent Acceleration of Big Data Frameworks Using Heterogeneous Hardware. PVLDB 15 13 (2022).","DOI":"10.14778\/3565838.3565842"},{"key":"e_1_2_1_20_1","volume-title":"Readys: A reinforcement learning based strategy for heterogeneous dynamic scheduling","author":"Grinsztajn N.","year":"2021","unstructured":"N. Grinsztajn et al. 2021. Readys: A reinforcement learning based strategy for heterogeneous dynamic scheduling. In IEEE CLUSTER."},{"key":"e_1_2_1_21_1","doi-asserted-by":"crossref","unstructured":"P. Chrysogelos et al. 2019. HetExchange: Encapsulating heterogeneous CPU-GPU parallelism in JIT compiled engines. PVLDB 12 5 (2019).","DOI":"10.14778\/3303753.3303760"},{"key":"e_1_2_1_22_1","unstructured":"R. Appuswamy et al. 2017. The case for heterogeneous HTAP. In CIDR."},{"key":"e_1_2_1_23_1","volume-title":"Placeto: Efficient progressive device placement optimization. In NeurIPS.","author":"Addanki R.","year":"2018","unstructured":"R. Addanki et al. 2018. Placeto: Efficient progressive device placement optimization. In NeurIPS."},{"key":"e_1_2_1_24_1","doi-asserted-by":"crossref","unstructured":"R. Lee et al. 2021. The art of balance: a RateupDB\u2122 experience of building a CPU\/GPU hybrid database product. PVLDB 14 12 (2021).","DOI":"10.14778\/3476311.3476378"},{"key":"e_1_2_1_25_1","doi-asserted-by":"crossref","unstructured":"S. Bre\u00df et al. 2014. Ocelot\/hype: Optimized data processing on heterogeneous hardware. PVLDB 7 13 (2014).","DOI":"10.14778\/2733004.2733042"},{"key":"e_1_2_1_26_1","doi-asserted-by":"crossref","unstructured":"S. Bre\u00df et al. 2016. Robust query processing in co-processor-accelerated databases. In ACM SIGMOD.","DOI":"10.1145\/2882903.2882936"},{"key":"e_1_2_1_27_1","volume-title":"Ben-Nun et al","author":"T.","year":"2019","unstructured":"T. Ben-Nun et al. 2019. Stateful dataflow multigraphs: A data-centric model for performance portability on heterogeneous architectures. In IEEE SC."},{"key":"e_1_2_1_28_1","volume-title":"EDBT Workshops.","author":"Karnagel T.","year":"2015","unstructured":"T. Karnagel et al. 2015. Local vs. Global Optimization: Operator Placement Strategies in Heterogeneous Environments.. In EDBT Workshops."},{"key":"e_1_2_1_29_1","doi-asserted-by":"crossref","unstructured":"T. Karnagel et al. 2017. Adaptive work placement for query processing on heterogeneous computing resources. PVLDB 10 7 (2017).","DOI":"10.14778\/3067421.3067423"},{"key":"e_1_2_1_30_1","doi-asserted-by":"crossref","unstructured":"T. Park et al. 2023. Orchestrating Large-Scale SpGEMMs using Dynamic Block Distribution and Data Transfer Minimization on Heterogeneous Systems. In ICDE.","DOI":"10.1109\/ICDE55515.2023.00189"},{"key":"e_1_2_1_31_1","doi-asserted-by":"crossref","unstructured":"V. Ravi et al. 2012. Scheduling concurrent applications on a cluster of cpu-gpu nodes. In IEEE\/ACM CCGRID.","DOI":"10.1109\/CCGrid.2012.78"},{"key":"e_1_2_1_32_1","doi-asserted-by":"crossref","unstructured":"V. Rosenfeld et al. 2022. Query processing on heterogeneous CPU\/GPU systems. ACM CSUR 55 1 (2022).","DOI":"10.1145\/3485126"},{"key":"e_1_2_1_33_1","doi-asserted-by":"crossref","unstructured":"Z. Li et al. 2021. Efficient algorithms for task mapping on heterogeneous CPU\/GPU platforms for fast completion time. Elsevier J. Syst. Arch. 114 (2021).","DOI":"10.1016\/j.sysarc.2020.101936"},{"key":"e_1_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Z. Li et al. 2023. CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel. In ACM ICPP.","DOI":"10.1145\/3605573.3605647"},{"key":"e_1_2_1_35_1","volume-title":"A static task partitioning approach for heterogeneous systems using OpenCL","author":"Grewe D.","unstructured":"D. Grewe and MFP O'Boyle. 2011. A static task partitioning approach for heterogeneous systems using OpenCL. In Springer CC."},{"key":"e_1_2_1_36_1","volume-title":"IEEE ICDE Workshops.","author":"Lee S.","unstructured":"S. Lee and S. Park. 2021. Performance analysis of big data ETL process over CPU-GPU heterogeneous architectures. In IEEE ICDE Workshops."},{"key":"e_1_2_1_37_1","doi-asserted-by":"crossref","unstructured":"S. Mittal and J. S. Vetter. 2015. A survey of CPU-GPU heterogeneous computing techniques. ACM CSUR 47 4 (2015).","DOI":"10.1145\/2788396"},{"key":"e_1_2_1_38_1","unstructured":"NVIDIA. 2024. NVIDIA H200. https:\/\/www.nvidia.com\/en-eu\/data-center\/h200\/"},{"key":"e_1_2_1_39_1","volume-title":"VLDB PhD Workshop.","author":"Pirk H.","year":"2012","unstructured":"H. Pirk. 2012. Efficient cross-device query processing. In VLDB PhD Workshop."},{"key":"e_1_2_1_40_1","doi-asserted-by":"crossref","unstructured":"Y. Wen and M. FP O'Boyle. 2017. Merge or separate? Multi-job scheduling for OpenCL kernels on CPU\/GPU platforms. In ACM GPGPU.","DOI":"10.1145\/3038228.3038235"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3685800.3685845","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,31]],"date-time":"2024-12-31T05:29:19Z","timestamp":1735622959000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3685800.3685845"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8]]},"references-count":40,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2024,8]]}},"alternative-id":["10.14778\/3685800.3685845"],"URL":"https:\/\/doi.org\/10.14778\/3685800.3685845","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2024,8]]},"assertion":[{"value":"2024-11-08","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}