{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T05:05:25Z","timestamp":1750309525002,"version":"3.41.0"},"reference-count":38,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2025,2,11]],"date-time":"2025-02-11T00:00:00Z","timestamp":1739232000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Parallel Comput."],"published-print":{"date-parts":[[2025,3,31]]},"abstract":"<jats:p>Tensor transposition is a fundamental operation in tensor calculations with various applications. However, a naive implementation that copies each element from the source tensor to the transposed position in the target tensor requires double space, making it unsuitable for large-scale tensors on memory-limited accelerators, like Graphic Processing Units (GPUs). In this article, we propose an algorithm and its implementation, called EITHOT, for In-place Transposition of High Order Tensors on GPUs, which requires only 5% additional memory at most for large high order tensors. To achieve this, EITHOT uses a newly proposed method, called permutation decomposition, to factorize a transposition of a high-order tensor into a sequence of low-order tensor transpositions. Then, based on the estimated extra memory requirements, EITHOT divides a large tensor into smaller tensors and transposes each smaller tensor separately. Finally, the transposed smaller tensors are combined to form the desired result. The GPU implementation optimizes memory access performance using the cooperative groups programming model. Our experiments demonstrate that EITHOT delivers competitive performance compared to the state-of-the-art out-of-place GPU implementations. Furthermore, EITHOT can handle nearly double the size of tensors compared to out-of-place methods, making it suitable for various transpositions of N-order tensors.<\/jats:p>","DOI":"10.1145\/3711871","type":"journal-article","created":{"date-parts":[[2025,1,9]],"date-time":"2025-01-09T11:40:16Z","timestamp":1736422816000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["EITHOT: Efficient In-place Transposition of High Order Tensors on GPUs"],"prefix":"10.1145","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-7582-2762","authenticated-orcid":false,"given":"Chun-Yu","family":"Wu","sequence":"first","affiliation":[{"name":"National Tsing-Hua University Department of Computer Science, Hsinchu, Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-3675-5099","authenticated-orcid":false,"given":"Chih-Chieh","family":"Tu","sequence":"additional","affiliation":[{"name":"National Tsing-Hua University Department of Computer Science, Hsinchu Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-0620-6442","authenticated-orcid":false,"given":"Kai-Jung","family":"Cheng","sequence":"additional","affiliation":[{"name":"National Tsing-Hua University Department of Computer Science, Hsinchu Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3940-4478","authenticated-orcid":false,"given":"Che-Rung","family":"Lee","sequence":"additional","affiliation":[{"name":"National Tsing-Hua University Department of Computer Science, Hsinchu Taiwan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,2,11]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"2022. NVIDIA Multi-Instance GPU User Guide. Retrieved from https:\/\/docs.nvidia.com\/datacenter\/tesla\/mig-user-guide\/index.html."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2016.05.302"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/1186785.1186794"},{"issue":"2","key":"e_1_3_1_5_2","doi-asserted-by":"crossref","first-page":"135","DOI":"10.1145\/567806.567807","article-title":"An updated set of basic linear algebra subprograms (BLAS)","volume":"28","author":"Blackford L. Susan","year":"2002","unstructured":"L. Susan Blackford, Antoine Petitet, Roldan Pozo, Karin Remington, R. Clint Whaley, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, et\u00a0al. 2002. An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software 28, 2 (2002), 135\u2013151.","journal-title":"ACM Transactions on Mathematical Software"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/2692916.2555253"},{"key":"e_1_3_1_7_2","volume-title":"Proceedings of the 1st International Forum on Next-Generation Multicore\/Manycore Technologies.","author":"El-Moursy Ali","year":"2008","unstructured":"Ali El-Moursy, Ahmed El-Mahdy, and Hisham El-Shishiny. 2008. An efficient in-place 3D transpose for multicore processors with software managed memory hierarchy. In Proceedings of the 1st International Forum on Next-Generation Multicore\/Manycore Technologies.Association for Computing Machinery, New York, NY, USA, Article 10, 6 pages. DOI:10.1145\/1463768.1463781"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","unstructured":"Muhammad Elsayed Ali Saleh El-shehaby and Mohamed S. Abougabal. 2015. NDPA: A generalized efficient parallel in-place N-Dimensional Permutation Algorithm. Alexandria Engineering Journal 54 3 (2015) 473\u2013480. 10.1016\/j.aej.2015.03.024","DOI":"10.1016\/j.aej.2015.03.024"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.3389\/fams.2022.806549"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","unstructured":"Fred Gustavson Lars Karlsson and Bo K\u00e5gstr\u00f6m. 2012. Parallel and cache-efficient in-place matrix storage format conversion. 38 3 (2012) 32 pages. DOI:10.1145\/2168773.2168775","DOI":"10.1145\/2168773.2168775"},{"key":"e_1_3_1_11_2","unstructured":"Fred Gehrung Gustavson and John A. Gunnels. 2014. Method and structure for cache aware transposition via rectangular subsections. (2014)."},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.5071"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2015.2412549"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","unstructured":"So Hirata. 2003. Tensor contraction engine: Abstraction and automated parallel implementation of configuration-interaction coupled-cluster and many-body perturbation theories. The Journal of Physical Chemistry A 107 46 (01 Nov 2003) 9887\u20139897. 10.1021\/jp034596z","DOI":"10.1021\/jp034596z"},{"key":"e_1_3_1_15_2","unstructured":"Namgyu Ho Sangmin Bae Taehyeon Kim Hyunjik Jo Yireun Kim Tal Schuster Adam Fisch James Thorne and Se-Young Yun. 2024. Block Transformer: Global-to-Local Language Modeling for Fast Inference. (2024). arxiv:cs.CL\/2406.02657https:\/\/arxiv.org\/abs\/2406.02657"},{"key":"e_1_3_1_16_2","unstructured":"Antti-Pekka Hynninen and Dmitry I. Lyakh. 2017. cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs. (2017). arxiv:cs.MS\/1705.01598"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-015-0366-5"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1137\/07070111X"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","unstructured":"Dmitry I. Lyakh. 2015. An efficient tensor transpose algorithm for multicore CPU Intel Xeon Phi and NVidia Tesla GPU. Computer Physics Communications 189 (2015) 84\u201391. 10.1016\/j.cpc.2014.12.013","DOI":"10.1016\/j.cpc.2014.12.013"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1137\/16M108968X"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1137\/16M108968X"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.14778\/3425879.3425883"},{"key":"e_1_3_1_23_2","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.","author":"Narayanan Deepak","year":"2021","unstructured":"Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. 2021. Efficient large-scale language model training on GPU clusters using megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.Association for Computing Machinery, New York, NY, USA, Article 58, 15 pages. DOI:10.1145\/3458817.3476209"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1103\/revmodphys.80.1083"},{"key":"e_1_3_1_25_2","unstructured":"Alexander Novikov Dmitry Podoprikhin Anton Osokin and Dmitry Vetrov. 2015. Tensorizing Neural Networks. (2015). arxiv:cs.LG\/1509.06569"},{"key":"e_1_3_1_26_2","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.","author":"Rajbhandari Samyam","year":"2020","unstructured":"Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.IEEE Press, Article 20, 16 pages."},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/hipc.2016.031"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2014.06.002"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2024.3374513"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3157733"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/2935323.2935328"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","unstructured":"Paul Springer Tong Su and Paolo Bientinesi. 2017. HPTT: A high-performance tensor transposition C++ library.Proceedings of the 4th ACM SIGPLAN International Workshop on Libraries Languages and Compilers for Array Programming. Association for Computing Machinery New York NY USA 56\u201362. DOI:10.1145\/3091966.3091968","DOI":"10.1145\/3091966.3091968"},{"key":"e_1_3_1_33_2","doi-asserted-by":"crossref","first-page":"207","DOI":"10.1145\/2555243.2555266","volume-title":"Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.","author":"Sung I-Jui","year":"2014","unstructured":"I-Jui Sung, Juan G\u00f3mez-Luna, Jos\u00e9 Mar\u00eda Gonz\u00e1lez-Linares, Nicol\u00e1s Guil, and Wen-Mei W. Hwu. 2014. In-place transposition of rectangular matrices on accelerators. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.Association for Computing Machinery, New York, NY, USA, 207\u2013218. DOI:10.1145\/2555243.2555266"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jco.2009.02.008"},{"key":"e_1_3_1_35_2","first-page":"3930","volume-title":"Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing","author":"Ulicny Matej","year":"2021","unstructured":"Matej Ulicny, Vladimir A. Krylov, and Rozenn Dahyot. 2021. Tensor reordering for CNN compression. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. 3930\u20133934. DOI:10.1109\/ICASSP39728.2021.9413944"},{"key":"e_1_3_1_36_2","doi-asserted-by":"crossref","first-page":"447","DOI":"10.1007\/3-540-47969-4_30","volume-title":"Computer Vision \u2014 ECCV 2002.","author":"Vasilescu M. Alex O.","year":"2002","unstructured":"M. Alex O. Vasilescu and Demetri Terzopoulos. 2002. Multilinear analysis of image ensembles: TensorFaces. In Computer Vision \u2014 ECCV 2002.Anders Heyden, Gunnar Sparr, Mads Nielsen, and Peter Johansen (Eds.), Springer, Berlin, 447\u2013460."},{"key":"e_1_3_1_37_2","first-page":"578","volume-title":"Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium","author":"Vedurada J.","year":"2018","unstructured":"J. Vedurada, A. Suresh, A. S. Rajam, J. Kim, C. Hong, A. Panyala, S. Krishnamoorthy, V. K. Nandivada, R. K. Srivastava, and P. Sadayappan. 2018. TTLG - An efficient tensor transposition library for GPUs. In Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium. 578\u2013588. DOI:10.1109\/IPDPS.2018.00067"},{"key":"e_1_3_1_38_2","first-page":"558","volume-title":"Proceedings of the 2023 IEEE 39th International Conference on Data Engineering","author":"Wang Qiange","year":"2023","unstructured":"Qiange Wang, Xin Ai, Yanfeng Zhang, Jing Chen, and Ge Yu. 2023. HyTGraph: GPU-accelerated graph processing with hybrid transfer management. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering. 558\u2013571. DOI: DOI:10.1109\/ICDE55515.2023.00049"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1002\/wcms.82"}],"container-title":["ACM Transactions on Parallel Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3711871","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3711871","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:18:09Z","timestamp":1750295889000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3711871"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,11]]},"references-count":38,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,3,31]]}},"alternative-id":["10.1145\/3711871"],"URL":"https:\/\/doi.org\/10.1145\/3711871","relation":{},"ISSN":["2329-4949","2329-4957"],"issn-type":[{"type":"print","value":"2329-4949"},{"type":"electronic","value":"2329-4957"}],"subject":[],"published":{"date-parts":[[2025,2,11]]},"assertion":[{"value":"2023-12-12","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-30","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-02-11","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}