{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,2]],"date-time":"2025-11-02T09:38:04Z","timestamp":1762076284615,"version":"build-2065373602"},"reference-count":38,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,3,1]],"date-time":"2023-03-01T00:00:00Z","timestamp":1677628800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-sa\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,6,30]]},"abstract":"<jats:p>Heterogeneous programming models are becoming increasingly popular to support the ever-evolving hardware architectures, especially for new and emerging specialized accelerators optimizing specific tasks. While such programs provide performance portability of the existing applications across various heterogeneous architectures to some extent, short-running device kernels can affect an application performance due to overheads of data transfer, synchronization, and kernel launch. While in applications with one or two short-running kernels the overhead can be negligible, it can be noticeable when these short-running kernels dominate the overall number of kernels in an application, as it is the case in graph-based neural network models, where there are several small memory-bound nodes alongside few large compute-bound nodes.<\/jats:p>\n          <jats:p>\n            To reduce the overhead, combining several kernels into a single, more optimized kernel is an active area of research. However, this task can be time-consuming and error-prone given the huge set of potential combinations. This can push programmers to seek a tradeoff between (a) task-specific kernels with low overhead but hard to maintain and (b) smaller modular kernels with higher overhead but easier to maintain. While there are DSL-based approaches, such as those provided for machine learning frameworks, which offer the possibility of such a fusion, they are limited to a particular domain and exploit specific knowledge of that domain and, as a consequence, are hard to port elsewhere. This study explores the feasibility of a user-driven\n            <jats:italic>kernel fusion<\/jats:italic>\n            through an extension to the SYCL API to address the automation of kernel fusion. The proposed solution requires programmers to define the subgraph regions that are potentially suitable for fusion without any modification to the kernel code or the function signature. We evaluate the performance benefit of our approach on common neural networks and study the performance improvement in detail.\n          <\/jats:p>","DOI":"10.1145\/3571284","type":"journal-article","created":{"date-parts":[[2022,11,18]],"date-time":"2022-11-18T12:01:26Z","timestamp":1668772886000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["User-driven Online Kernel Fusion for SYCL"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3302-8339","authenticated-orcid":false,"given":"V\u00edctor","family":"P\u00e9rez","sequence":"first","affiliation":[{"name":"Codeplay Software Ltd., Scotland, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1918-3911","authenticated-orcid":false,"given":"Lukas","family":"Sommer","sequence":"additional","affiliation":[{"name":"Codeplay Software Ltd., Scotland, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1726-4662","authenticated-orcid":false,"given":"Victor","family":"Lom\u00fcller","sequence":"additional","affiliation":[{"name":"Codeplay Software Ltd., Scotland, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1142-3039","authenticated-orcid":false,"given":"Kumudha","family":"Narasimhan","sequence":"additional","affiliation":[{"name":"Codeplay Software Ltd., Scotland, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3520-9598","authenticated-orcid":false,"given":"Mehdi","family":"Goli","sequence":"additional","affiliation":[{"name":"Codeplay Software Ltd., Scotland, UK"}]}],"member":"320","published-online":{"date-parts":[[2023,3]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Intel Corporation. 2019. OpenVINO toolkit. Retrieved from: https:\/\/software.intel.com\/en-us\/openvino-toolkit."},{"key":"e_1_3_2_3_2","unstructured":"Nvidia Corporation. 2022. NVIDIA CUDA programming model. Retrieved from: http:\/\/www.nvidia.com\/CUDA."},{"key":"e_1_3_2_4_2","first-page":"265","volume-title":"12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard et\u00a0al. 2016. TensorFlow: A system for Large-Scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916). 265\u2013283."},{"key":"e_1_3_2_5_2","volume-title":"International Workshop on OpenCL","author":"Alpay Aksel","year":"2020","unstructured":"Aksel Alpay and Vincent Heuveline. 2020. SYCL beyond OpenCL: The architecture, current state and future direction of hipSYCL. In International Workshop on OpenCL."},{"key":"e_1_3_2_6_2","article-title":"ONNX: Open neural network exchange","author":"Bai Junjie","year":"2019","unstructured":"Junjie Bai, Fang Lu, Ke Zhang et\u00a0al. 2019. ONNX: Open neural network exchange. GitHub Repository (2019).","journal-title":"GitHub Repository"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3318170.3318183"},{"key":"e_1_3_2_8_2","first-page":"1","volume-title":"International Workshop on OpenCL","author":"Burns Rod","year":"2019","unstructured":"Rod Burns, John Lawson, Duncan McBain, and Daniel Soutar. 2019. Accelerated neural networks on OpenCL devices using SYCL-DNN. In International Workshop on OpenCL. 1\u20134."},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3318170.3318183"},{"key":"e_1_3_2_10_2","article-title":"Accelerated neural networks on OpenCL devices using SYCL-DNN","volume":"1904","author":"Burns Rod","year":"2019","unstructured":"Rod Burns, John W. Lawson, Duncan McBain, and Daniel Soutar. 2019. Accelerated neural networks on OpenCL devices using SYCL-DNN. CoRR abs\/1904.04174 (2019).","journal-title":"CoRR"},{"key":"e_1_3_2_11_2","first-page":"578","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze et\u00a0al. 2018. TVM: An automated End-to-End optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201918). 578\u2013594."},{"key":"e_1_3_2_12_2","unstructured":"ONNX Runtime developers. 2022. ONNX Runtime. Retrieved from: https:\/\/onnxruntime.ai\/."},{"key":"e_1_3_2_13_2","doi-asserted-by":"crossref","unstructured":"Anastasios Doumoulakis Ronan Keryell and Kenneth O\u2019Brien. 2017. SYCL C++ and OpenCL interoperability experimentation with triSYCL.","DOI":"10.1145\/3078155.3078188"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-015-1483-z"},{"key":"e_1_3_2_15_2","article-title":"Eigen: A C++ linear algebra library","author":"Guennebaud Gael","year":"2014","unstructured":"Gael Guennebaud, Benoit Jacob et\u00a0al. 2014. Eigen: A C++ linear algebra library. Retrieved from: http:\/\/eigen. tuxfamily. org.","journal-title":"Retrieved from: http:\/\/eigen. tuxfamily. org."},{"key":"e_1_3_2_16_2","article-title":"Deep residual learning for image recognition","volume":"1512","author":"He Kaiming","year":"2015","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR abs\/1512.03385 (2015).","journal-title":"CoRR"},{"key":"e_1_3_2_17_2","unstructured":"Intel. 2022. DPC++ compiler. Retrieved from: https:\/\/github.com\/intel\/llvm."},{"key":"e_1_3_2_18_2","unstructured":"Intel. 2022. Intel oneAPI Math Kernel Library (oneMKL). Retrieved from: https:\/\/docs.oneapi.com\/versions\/latest\/onemkl\/index.html."},{"key":"e_1_3_2_19_2","unstructured":"Intel. 2022. OneAPI Deep Neural Network Library (oneDNN). Retrieved from: https:\/\/01.org\/onednn."},{"key":"e_1_3_2_20_2","unstructured":"Intel. 2022. OneDPL: oneAPI DPC++ Library. Retrieved from: https:\/\/github.com\/oneapi-src\/oneDPL."},{"key":"e_1_3_2_21_2","article-title":"SPIR-V specification","author":"Kessenich John","year":"2021","unstructured":"John Kessenich, Boaz Ouriel, and Raun Krisch. 2021. SPIR-V specification. Khronos Group (2021).","journal-title":"Khronos Group"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-57675-2_39"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3388333.3388669"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/LLVMHPCHiPar51896.2020.00010"},{"key":"e_1_3_2_25_2","unstructured":"John W. Lawson Mehdi Goli Duncan McBain Daniel Soutar and Louis Sugy. 2019. Cross-platform performance portability using highly parametrized SYCL kernels."},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/CGO53902.2022.9741270"},{"key":"e_1_3_2_27_2","unstructured":"Codeplay Software Ltd.2022. ComputeCPP compiler. Retrieved from: https:\/\/developer.codeplay.com\/products\/computecpp\/ce\/home."},{"key":"e_1_3_2_28_2","unstructured":"Codeplay Software Ltd.2022. SYCL-BLAS: An implementation of BLAS using the SYCL open standard. Retrieved from: https:\/\/github.com\/CodeplaySoftware\/SYCL-BLAS."},{"key":"e_1_3_2_29_2","unstructured":"Codeplay Software Ltd.2022. The SYCL-DNN neural network acceleration library. Retrieved from: https:\/\/github.com\/CodeplaySoftware\/SYCL-DNN."},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2019.8661188"},{"key":"e_1_3_2_31_2","article-title":"models","author":"(ONNX) Open Neural Network Exchange","year":"2022","unstructured":"Open Neural Network Exchange (ONNX). 2022. models. GitHub repository. Retrieved from: https:\/\/github.com\/onnx\/models.","journal-title":"GitHub repository"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2019.8661176"},{"key":"e_1_3_2_33_2","article-title":"Glow: Graph lowering compiler techniques for neural networks","author":"Rotem Nadav","year":"2018","unstructured":"Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, Roman Levenstein et\u00a0al. 2018. Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907 (2018).","journal-title":"arXiv preprint arXiv:1805.00907"},{"key":"e_1_3_2_34_2","unstructured":"Maria Rovatsou Lee Howes and Ronan Keryell. 2021. SYCL 2020 specification (revision 4). (2021)."},{"key":"e_1_3_2_35_2","volume-title":"3rd International Conference on Learning Representations","author":"Simonyan Karen","year":"2015","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations. Retrieved from: http:\/\/arxiv.org\/abs\/1409.1556."},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2010.69"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2014.21"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2009.10924"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2103.05288"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3571284","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3571284","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:08:22Z","timestamp":1750183702000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3571284"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3]]},"references-count":38,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,6,30]]}},"alternative-id":["10.1145\/3571284"],"URL":"https:\/\/doi.org\/10.1145\/3571284","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2023,3]]},"assertion":[{"value":"2022-05-31","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-10-24","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}