{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,17]],"date-time":"2025-10-17T14:31:27Z","timestamp":1760711487336,"version":"3.41.0"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"2","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2025,6,30]]},"abstract":"<jats:p>With the rising demand for computational power and the increasing variety of computational scenarios, considerable interest has emerged in transforming existing CUDA programs into more general-purpose OpenCL programs, enabling them to run across diverse hardware platforms. However, manual methods, typically designed for specific applications, lack flexibility. Current automated conversion techniques also face considerable challenges, particularly in handling diverse programming interfaces, memory management, and so on, and are insufficient for converting large-scale, complex CUDA projects. In this article, we propose a novel source-to-source program transformation framework, TransCL, which automates the conversion of CUDA programs in four key aspects: source code, execution model, programming model, and memory model. To achieve this, we abstract a set of conversion rules aligned with the latest CUDA standards, develop a transcoder, implement an OpenCL-compatible programming interface library, and establish a memory mapping mechanism between CUDA and OpenCL. Experiments demonstrate that TransCL provides a high level of automation in converting CUDA-based applications and is effective in handling large, complex projects such as TensorFlow. Moreover, the converted AI framework successfully conducted model training for the first time. The experiment also validates that the converted program can execute correctly across multiple platforms and demonstrate good performance.<\/jats:p>","DOI":"10.1145\/3718987","type":"journal-article","created":{"date-parts":[[2025,2,20]],"date-time":"2025-02-20T10:47:56Z","timestamp":1740048476000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-5732-2207","authenticated-orcid":false,"given":"Changqing","family":"Shi","sequence":"first","affiliation":[{"name":"Software College, Nankai University","place":["Tianjin, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-9778-9575","authenticated-orcid":false,"given":"Yufei","family":"Sun","sequence":"additional","affiliation":[{"name":"Nankai University","place":["Tianjin, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0008-1786-2519","authenticated-orcid":false,"given":"Rui","family":"Chen","sequence":"additional","affiliation":[{"name":"Nankai University","place":["Tianjin, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-9499-1299","authenticated-orcid":false,"given":"Jiahao","family":"Wang","sequence":"additional","affiliation":[{"name":"Nankai University","place":["Tianjin, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-7580-0998","authenticated-orcid":false,"given":"Qiang","family":"Guo","sequence":"additional","affiliation":[{"name":"Haihe Laboratory","place":["Tianjin, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0349-1100","authenticated-orcid":false,"given":"Chunye","family":"Gong","sequence":"additional","affiliation":[{"name":"National University of Defense Technology","place":["Changsha, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9763-7435","authenticated-orcid":false,"given":"Yicheng","family":"Sui","sequence":"additional","affiliation":[{"name":"Nankai University","place":["Tianjin, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-2841-3152","authenticated-orcid":false,"given":"Yutong","family":"Jin","sequence":"additional","affiliation":[{"name":"Nankai University","place":["Tianjin, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6729-925X","authenticated-orcid":false,"given":"Yuzhi","family":"Zhang","sequence":"additional","affiliation":[{"name":"Nankai University","place":["Tianjin, China"]},{"name":"Haihe Laboratory","place":["Tianjin, China"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,6,28]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro Greg S. Corrado Andy Davis Jeffrey Dean Matthieu Devin et\u00a0al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. Retrieved Dec 12 2024 from https:\/\/github.com\/tensorflow\/tensorflow"},{"key":"e_1_3_1_3_2","volume-title":"A Software Library Containing BLAS Functions Written in OpenCL","author":"Devices Inc. Advanced Micro","year":"2017","unstructured":"Inc. Advanced Micro Devices. 2017. A Software Library Containing BLAS Functions Written in OpenCL. Retrieved October 19, 2023 from https:\/\/github.com\/clMathLibraries\/clBLAS"},{"key":"e_1_3_1_4_2","volume-title":"Heterogeneous-Compute Interface for Portability","author":"Devices Inc Advanced Micro","year":"2024","unstructured":"Inc Advanced Micro Devices. 2024. Heterogeneous-Compute Interface for Portability. Retrieved June 14, 2024 from https:\/\/github.com\/ROCm\/hcc"},{"key":"e_1_3_1_5_2","volume-title":"HIP Runtime API Reference","author":"Devices Inc Advanced Micro","year":"2024","unstructured":"Inc Advanced Micro Devices. 2024. HIP Runtime API Reference. Retrieved June 17, 2024 from https:\/\/rocm.docs.amd.com\/projects\/HIP\/en\/latest\/doxygen\/html\/index.html"},{"key":"e_1_3_1_6_2","volume-title":"HIPCC Documentation","author":"Devices Inc Advanced Micro","year":"2024","unstructured":"Inc Advanced Micro Devices. 2024. HIPCC Documentation. Retrieved June 19, 2024 from https:\/\/rocm.docs.amd.com\/projects\/HIPCC\/en\/latest\/"},{"key":"e_1_3_1_7_2","volume-title":"HIPIFY: Convert CUDA to Portable C++ Code","author":"Devices Inc Advanced Micro","year":"2024","unstructured":"Inc Advanced Micro Devices. 2024. HIPIFY: Convert CUDA to Portable C++ Code. Retrieved June 15, 2024 from https:\/\/github.com\/ROCm\/HIPIFY"},{"key":"e_1_3_1_8_2","first-page":"1","volume-title":"Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing","author":"Barrachina Sergio","year":"2008","unstructured":"Sergio Barrachina, Maribel Castillo, Francisco D. Igual, Rafael Mayo, and Enrique S. Quintana-Orti. 2008. Evaluation and tuning of the level 3 CUBLAS for graphics processors. In Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 1\u20138."},{"key":"e_1_3_1_9_2","first-page":"1","volume-title":"Proceedings of the International Workshop on OpenCL 2013 & 2014","author":"Cao Chongxiao","year":"2014","unstructured":"Chongxiao Cao, Jack Dongarra, Peng Du, Mark Gates, Piotr Luszczek, and Stanimire Tomov. 2014. clMAGMA: High performance dense linear algebra with OpenCL. In Proceedings of the International Workshop on OpenCL 2013 & 2014. 1\u20139."},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_3_1_11_2","unstructured":"Sharan Chetlur Cliff Woolley Philippe Vandermersch Jonathan Cohen John Tran Bryan Catanzaro and Evan Shelhamer. 2014. cudnn: A GPU-accelerated library of primitives for deep neural networks. Retrieved Dec 12 2024 from https:\/\/docs.nvidia.com\/deeplearning\/cudnn\/latest"},{"key":"e_1_3_1_12_2","volume-title":"An OpenCL Library for Generating Random Numbers in Parallel","author":"Ciglari\u010d Tadej","year":"2020","unstructured":"Tadej Ciglari\u010d. 2020. An OpenCL Library for Generating Random Numbers in Parallel. Retrieved July 1, 2020 from https:\/\/github.com\/bstatcomp\/RandomCL"},{"volume-title":"Libclang Tutorial","year":"2024","key":"e_1_3_1_13_2","unstructured":"Clang. 2024. Libclang Tutorial. Retrieved June 20, 2024 from https:\/\/clang.llvm.org\/docs\/LibClang.html"},{"volume-title":"LibTooling","year":"2024","key":"e_1_3_1_14_2","unstructured":"Clang. 2024. LibTooling. Retrieved June 20, 2024 from https:\/\/clang.llvm.org\/docs\/LibTooling.html"},{"key":"e_1_3_1_15_2","first-page":"632","volume-title":"Proceedings of the International Conference on Computational Science","author":"Falgout Robert D.","year":"2002","unstructured":"Robert D. Falgout and Ulrike Meier Yang. 2002. hypre: A library of high performance preconditioners. In Proceedings of the International Conference on Computational Science. Springer, 632\u2013641."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1007\/s42514-020-00039-4"},{"key":"e_1_3_1_17_2","first-page":"1","volume-title":"Proceedings of the 4th International Workshop on OpenCL","author":"Gu Junli","year":"2016","unstructured":"Junli Gu, Yibing Liu, Yuan Gao, and Maohua Zhu. 2016. OpenCL caffe: Accelerating and enabling a cross platform machine learning framework. In Proceedings of the 4th International Workshop on OpenCL. 1\u20135."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.cpc.2010.12.052"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_20_2","unstructured":"Jehandad Khan Paul Fultz Artem Tamazov Daniel Lowell Chao Liu Michael Melesse Murali Nandhimandalam Kamil Nasyrov Ilya Perminov Tejash Shah et\u00a0al. 2019. MIOpen: An open source library for deep learning primitives. Retrieved Dec. 12 2024 from https:\/\/github.com\/ROCm\/MIOpen"},{"volume-title":"Sycl Integrates Opencl Devices with Modern c++","year":"2019","key":"e_1_3_1_21_2","unstructured":"Khronos. 2019. Sycl Integrates Opencl Devices with Modern c++. The Khronos Group."},{"volume-title":"The OpenCL Specification","year":"2021","key":"e_1_3_1_22_2","unstructured":"Khronos. 2021. The OpenCL Specification. Retrieved August 24, 2021 from https:\/\/registry.khronos.org\/OpenCL\/specs\/opencl-1.2.pdf"},{"volume-title":"SYCL\u2122 2020 Specification","year":"2023","key":"e_1_3_1_23_2","unstructured":"Khronos. 2023. SYCL\u2122 2020 Specification. Retrieved October 19, 2023 from https:\/\/registry.khronos.org\/SYCL\/specs\/sycl-2020\/html\/sycl-2020.html"},{"volume-title":"Industry Support for OpenCL","year":"2024","key":"e_1_3_1_24_2","unstructured":"Khronos. 2024. Industry Support for OpenCL. Retrieved November 19, 2024 from https:\/\/www.khronos.org\/opencl\/"},{"volume-title":"OpenCL 3.0","year":"2024","key":"e_1_3_1_25_2","unstructured":"Khronos. 2024. OpenCL 3.0. Retrieved November 19, 2024 from https:\/\/en.wikipedia.org\/wiki\/OpenCL#cite_note-191"},{"key":"e_1_3_1_26_2","first-page":"1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Kim Junghyun","year":"2015","unstructured":"Junghyun Kim, Thanh Tuan Dao, Jaehoon Jung, Jinyoung Joo, and Jaejin Lee. 2015. Bridging OpenCL and CUDA: A comparative analysis and translation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1\u201312."},{"key":"e_1_3_1_27_2","first-page":"285","volume-title":"Proceedings of the 2019 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201919)","author":"Kim Yonghae","year":"2019","unstructured":"Yonghae Kim and Hyesoon Kim. 2019. Translating CUDA to OpenCL for hardware generation using neural machine translation. In Proceedings of the 2019 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201919). IEEE, 285\u2013286."},{"key":"e_1_3_1_28_2","volume-title":"CMake Documentation 3.13","author":"Kitware Inc.","year":"2018","unstructured":"Inc. Kitware. 2018. CMake Documentation 3.13. Retrieved June 23, 2018 from https:\/\/cmake.org\/cmake\/help\/v3.13\/"},{"key":"e_1_3_1_29_2","volume-title":"A Software Library Containing Sparse Functions Written in OpenCL","author":"Knox Kent","year":"2016","unstructured":"Kent Knox. 2016. A Software Library Containing Sparse Functions Written in OpenCL. Retrieved September 1, 2016 from https:\/\/github.com\/clMathLibraries\/clSPARSE"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11280-020-00778-y"},{"key":"e_1_3_1_31_2","first-page":"1","volume-title":"Proceedings of the BSD Conference","volume":"5","author":"Lattner Chris","year":"2008","unstructured":"Chris Lattner. 2008. LLVM and Clang: Next generation compiler technology. In Proceedings of the BSD Conference. Vol. 5, 1\u201320."},{"key":"e_1_3_1_32_2","doi-asserted-by":"crossref","first-page":"75","DOI":"10.1109\/CGO.2004.1281665","volume-title":"Proceedings of the International Symposium on Code Generation and Optimization, 2004. CGO 2004.","author":"Lattner Chris","year":"2004","unstructured":"Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization, 2004. CGO 2004. IEEE, 75\u201386."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1007\/s42514-022-00095-y"},{"key":"e_1_3_1_34_2","first-page":"300","volume-title":"Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems","author":"Martinez Gabriel","year":"2011","unstructured":"Gabriel Martinez, Mark Gardner, and Wu-chun Feng. 2011. CU2CL: A CUDA-to-OpenCL translator for multi-and many-core architectures. In Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems. IEEE, 300\u2013307."},{"key":"e_1_3_1_35_2","unstructured":"Duane Merrill. 2015. CUDA UnBound (CUB) Library. Retrieved July 1 2024 from https:\/\/nvlabs.github.io\/cub"},{"key":"e_1_3_1_36_2","volume-title":"A Software Library Containing FFT Functions Written in OpenCL","author":"Natarajan Bragadeesh","year":"2016","unstructured":"Bragadeesh Natarajan. 2016. A Software Library Containing FFT Functions Written in OpenCL. Retrieved September 1, 2016 from https:\/\/github.com\/clMathLibraries\/clFFT"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3204919.3204924"},{"volume-title":"The API Reference Guide for cuFFT","year":"2024","key":"e_1_3_1_38_2","unstructured":"NVIDIA. 2024. The API Reference Guide for cuFFT. Retrieved July 1, 2024 from https:\/\/docs.nvidia.com\/cuda\/cufft\/index.html"},{"volume-title":"The API Reference Guide for cuRAND","year":"2024","key":"e_1_3_1_39_2","unstructured":"NVIDIA. 2024. The API Reference Guide for cuRAND. Retrieved July 1, 2024 from https:\/\/docs.nvidia.com\/cuda\/curand\/index.html"},{"volume-title":"The API Reference Guide for cuSOLVER","year":"2024","key":"e_1_3_1_40_2","unstructured":"NVIDIA. 2024. The API Reference Guide for cuSOLVER. Retrieved July 1, 2024 from https:\/\/docs.nvidia.com\/cuda\/cusolver\/index.html"},{"volume-title":"The API Reference Guide for cuSPARSE","year":"2024","key":"e_1_3_1_41_2","unstructured":"NVIDIA. 2024. The API Reference Guide for cuSPARSE. Retrieved July 1, 2024 from https:\/\/docs.nvidia.com\/cuda\/cusparse\/index.html"},{"volume-title":"CUDA Toolkit Documentation","year":"2024","key":"e_1_3_1_42_2","unstructured":"NVIDIA. 2024. CUDA Toolkit Documentation. Retrieved June 15, 2024 from https:\/\/docs.nvidia.com\/cuda\/index.html"},{"volume-title":"NVIDIA CUDA Compiler Driver NVCC","year":"2024","key":"e_1_3_1_43_2","unstructured":"NVIDIA. 2024. NVIDIA CUDA Compiler Driver NVCC. Retrieved May 21, 2024 from https:\/\/docs.nvidia.com\/cuda\/cuda-compiler-driver-nvcc\/index.html"},{"volume-title":"AI and Compute","year":"2018","key":"e_1_3_1_44_2","unstructured":"OpenAI. 2018. AI and Compute. Retrieved October 19, 2023 from https:\/\/openai.com\/blog\/ai-and-compute"},{"volume-title":"OctaneRender\u2122 - The World\u2019s First and Fastest GPU-accelerated, Unbiased, Physically Correct Renderer","year":"2023","key":"e_1_3_1_45_2","unstructured":"OTOY. 2023. OctaneRender\u2122 - The World\u2019s First and Fastest GPU-accelerated, Unbiased, Physically Correct Renderer. Retrieved October 19, 2023 from https:\/\/home.otoy.com\/render\/octane-render\/"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3318170.3318176"},{"key":"e_1_3_1_47_2","unstructured":"Hugh Perkins. 2016. Cltorch: A hardware-agnostic backend for the torch deep neural network library based on OpenCL. Retrieved Dec 12 2024 from https:\/\/github.com\/hughperkins\/cltorch"},{"key":"e_1_3_1_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3078155.3078156"},{"key":"e_1_3_1_49_2","volume-title":"CUDA by Example: An Introduction to General-Purpose GPU Programming","author":"Sanders Jason","year":"2010","unstructured":"Jason Sanders and Edward Kandrot. 2010. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional."},{"key":"e_1_3_1_50_2","article-title":"oclCUB: An OpenCL parallel computing library for deep learning operators","author":"Shi Changqing","year":"2024","unstructured":"Changqing Shi, Yufei Sun, Yicheng Sui, Yuqiao Chen, Haotian Wang, and Yuzhi Zhang. 2024. oclCUB: An OpenCL parallel computing library for deep learning operators. CCF Transactions on High Performance Computing 6, 3 (2024), 319\u2013329.","journal-title":"CCF Transactions on High Performance Computing"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/MCSE.2010.69"},{"volume-title":"An Automatic Language Transformation Framework for CUDA to OpenCL","year":"2024","key":"e_1_3_1_52_2","unstructured":"TransCL. 2024. An Automatic Language Transformation Framework for CUDA to OpenCL. Retrieved July 5, 2024 from https:\/\/github.com\/SCCQ\/TransCL\/"},{"key":"e_1_3_1_53_2","article-title":"Attention is all you need","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems .","journal-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems"},{"key":"e_1_3_1_54_2","volume-title":"The Definitive Guide to GCC","author":"Hagen William Von","year":"2011","unstructured":"William Von Hagen. 2011. The Definitive Guide to GCC. Apress."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3718987","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,28]],"date-time":"2025-06-28T11:52:57Z","timestamp":1751111577000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3718987"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,28]]},"references-count":53,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6,30]]}},"alternative-id":["10.1145\/3718987"],"URL":"https:\/\/doi.org\/10.1145\/3718987","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2025,6,28]]},"assertion":[{"value":"2024-10-02","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-01-22","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-28","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}