{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T11:59:29Z","timestamp":1774353569379,"version":"3.50.1"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,7,22]],"date-time":"2023-07-22T00:00:00Z","timestamp":1689984000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2021YFB0300203"],"award-info":[{"award-number":["2021YFB0300203"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"name":"GHfund D","award":["202302044639"],"award-info":[{"award-number":["202302044639"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,9,30]]},"abstract":"<jats:p>Fast Fourier transform (FFT) is widely used in computing applications in large-scale parallel programs, and data communication is the main performance bottleneck of FFT and seriously affects its parallel efficiency. To tackle this problem, we propose a new large-scale FFT framework, MFFT, which optimizes parallel FFT with a new mixed-precision optimization technique, adopting the \u201chigh precision computation, low precision communication\u201d strategy. To enable \u201clow precision communication\u201d, we propose a shared-exponent floating-point number compression technique, which reduces the volume of data communication, while maintaining higher accuracy. In addition, we apply a two-phase normalization technique to further reduce the round-off error. 
Based on the mixed-precision MFFT framework, we apply several optimization techniques to improve the performance, such as streaming of GPU kernels, MPI message combination, kernel optimization, and memory optimization. We evaluate MFFT on a system with 4,096 GPUs. The results show that shared-exponent MFFT is 1.23 \u00d7 faster than that of double-precision MFFT on average, and double-precision MFFT achieves performance 3.53\u00d7 and 9.48\u00d7 on average higher than open source library 2Decomp&amp;FFT (CPU-based version) and heFFTe (AMD GPU-based version), respectively. The parallel efficiency of double-precision MFFT increased from 53.2% to 78.1% compared with 2Decomp&amp;FFT, and shared-exponent MFFT further increases the parallel efficiency to 83.8%.<\/jats:p>","DOI":"10.1145\/3605148","type":"journal-article","created":{"date-parts":[[2023,6,19]],"date-time":"2023-06-19T16:18:15Z","timestamp":1687191495000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4719-1049","authenticated-orcid":false,"given":"Yuwen","family":"Zhao","sequence":"first","affiliation":[{"name":"Institute of Software, Chinese Academy of Sciences, China and University of Chinese Academy of Sciences, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7344-7493","authenticated-orcid":false,"given":"Fangfang","family":"Liu","sequence":"additional","affiliation":[{"name":"Institute of Software, Chinese Academy of Sciences, China and State Key Laboratory of Computer Science, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1795-4498","authenticated-orcid":false,"given":"Wenjing","family":"Ma","sequence":"additional","affiliation":[{"name":"Institute of Software, Chinese Academy of Sciences, China and State Key Laboratory of Computer Science, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6326-9926","authenticated-orcid":false,"given":"Huiyuan","family":"Li","sequence":"additional","affiliation":[{"name":"Institute of Software, Chinese Academy of Sciences, China and State Key Laboratory of Computer Science, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0723-8364","authenticated-orcid":false,"given":"Yuanchi","family":"Peng","sequence":"additional","affiliation":[{"name":"Institute of Software, Chinese Academy of Sciences, China and University of Chinese Academy of Sciences, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-7841-7946","authenticated-orcid":false,"given":"Cui","family":"Wang","sequence":"additional","affiliation":[{"name":"Institute of Software, Chinese Academy of Sciences, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,7,22]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"2018. NVIDIA APEX.https:\/\/github.com\/NVIDIA\/apex."},{"key":"e_1_3_1_3_2","unstructured":"2019. CUFFT library. https:\/\/docs.nvidia.com\/pdf\/CUFFT_Library.pdf."},{"key":"e_1_3_1_4_2","unstructured":"2021. rocFFT Documentation. https:\/\/rocfft.readthedocs.io\/en\/rocm-4.2.0\/."},{"key":"e_1_3_1_5_2","unstructured":"2022. heFFTe.https:\/\/bitbucket.org\/icl\/heffte."},{"key":"e_1_3_1_6_2","unstructured":"2022. Large-scale atomic\/molecular massively parallel simulator. https:\/\/lammps.sandia.gov\/."},{"key":"e_1_3_1_7_2","unstructured":"2022. VkFFT. 
https:\/\/github.com\/DTolm\/VkFFT."},{"key":"e_1_3_1_8_2","doi-asserted-by":"crossref","first-page":"293","DOI":"10.1145\/1274971.1275011","volume-title":"Proceedings of the 21st Annual International Conference on Supercomputing","author":"Ali Ayaz","year":"2007","unstructured":"Ayaz Ali, Lennart Johnsson, and Jaspal Subhlok. 2007. Scheduling FFT computation on SMP and multicore systems. In Proceedings of the 21st Annual International Conference on Supercomputing. 293\u2013301."},{"key":"e_1_3_1_9_2","doi-asserted-by":"crossref","first-page":"209","DOI":"10.1016\/j.commatsci.2014.02.027","article-title":"Validation of a numerical method based on fast fourier transforms for heterogeneous thermoelastic materials by comparison with analytical solutions","volume":"87","author":"Anglin B. S.","year":"2014","unstructured":"B. S. Anglin, R. A. Lebensohn, and A. D. Rollett. 2014. Validation of a numerical method based on fast fourier transforms for heterogeneous thermoelastic materials by comparison with analytical solutions. Computational Materials Science 87 (2014), 209\u2013217.","journal-title":"Computational Materials Science"},{"key":"e_1_3_1_10_2","first-page":"262","volume-title":"International Conference on Computational Science","author":"Ayala Alan","year":"2020","unstructured":"Alan Ayala, Stanimire Tomov, Azzam Haidar, and Jack Dongarra. 2020. heFFTe: Highly efficient FFT for exascale. In International Conference on Computational Science. Springer, 262\u2013275."},{"key":"e_1_3_1_11_2","doi-asserted-by":"crossref","first-page":"12","DOI":"10.1109\/ExaMPI49596.2019.00007","volume-title":"2019 IEEE\/ACM Workshop on Exascale MPI (ExaMPI)","author":"Ayala Alan","year":"2019","unstructured":"Alan Ayala, Stanimire Tomov, Xi Luo, Hejer Shaeik, Azzam Haidar, George Bosilca, and Jack Dongarra. 2019. Impacts of multi-GPU MPI collective communications on large FFT computation. In 2019 IEEE\/ACM Workshop on Exascale MPI (ExaMPI). 
IEEE, 12\u201318."},{"key":"e_1_3_1_12_2","unstructured":"Alan Ayala Stanimire Tomov Piotr Luszczek S\u00e9bastien Cayrols Gerald Ragghianti and Jack Dongarra. 2022. Analysis of the Communication and Computation Cost of FFT Libraries towards Exascale. Technical Report ICL-UT-22-07. https:\/\/icl.utk.edu\/files\/publications\/2022\/icl-utk-1558-2022.pdf."},{"key":"e_1_3_1_13_2","doi-asserted-by":"crossref","unstructured":"George E. P. Box and Mervin E. Muller. 1958. A note on the generation of random normal deviates. Annals of Mathematical Statistics 29 (1958) 610\u2013611.","DOI":"10.1214\/aoms\/1177706645"},{"key":"e_1_3_1_14_2","unstructured":"S\u00e9bastien Cayrols Jiali Li George Bosilca Stanimire Tomov Alan Ayala and Jack Dongarra. 2022. Mixed precision and approximate 3D FFTs: Speed for accuracy trade-off with GPU-aware MPI and run-time data compression. Technical Report ICL-UT-22-04. https:\/\/icl.utk.edu\/files\/publications\/2022\/icl-utk-1558-2022.pdf."},{"key":"e_1_3_1_15_2","volume-title":"The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201918). ACM Student Research Poster, Dallas, TX","author":"Cheng Xiaohe","year":"2018","unstructured":"Xiaohe Cheng, Anumeena Sorna, Eduardo D\u2019Azevedo, Kwai Wong, and Stanimire Tomov. 2018. Accelerating 2D FFT: Exploit GPU tensor cores through mixed-precision. In The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC\u201918). ACM Student Research Poster, Dallas, TX."},{"key":"e_1_3_1_16_2","first-page":"1","volume-title":"2010 ACM\/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Doi Jun","year":"2010","unstructured":"Jun Doi and Yasushi Negishi. 2010. Overlapping methods of all-to-all communication and FFT algorithms for torus-connected massively parallel supercomputers. 
In 2010 ACM\/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis(SC\u201910). IEEE, 1\u20139."},{"issue":"8","key":"e_1_3_1_17_2","doi-asserted-by":"crossref","first-page":"257","DOI":"10.1145\/2038037.1941589","article-title":"Auto-tuning of fast Fourier transform on graphics processors","volume":"46","author":"Dotsenko Yuri","year":"2011","unstructured":"Yuri Dotsenko, Sara S. Baghsorkhi, Brandon Lloyd, and Naga K. Govindaraju. 2011. Auto-tuning of fast Fourier transform on graphics processors. ACM SIGPLAN Notices 46, 8 (2011), 257\u2013266.","journal-title":"ACM SIGPLAN Notices"},{"key":"e_1_3_1_18_2","first-page":"488","volume-title":"26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","author":"Durrani Sultan","year":"2021","unstructured":"Sultan Durrani, Muhammad Saad Chughtai, Abdul Dakkak, Wen-mei Hwu, and Lawrence Rauchwerger. 2021. FFT blitz: The tensor cores strike back. In 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 488\u2013489."},{"issue":"1","key":"e_1_3_1_19_2","doi-asserted-by":"crossref","first-page":"153","DOI":"10.1016\/j.cpc.2013.08.028","article-title":"A decomposition method with minimum communication amount for parallelization of multi-dimensional FFTs","volume":"185","author":"Duy Truong Vinh Truong","year":"2014","unstructured":"Truong Vinh Truong Duy and Taisuke Ozaki. 2014. A decomposition method with minimum communication amount for parallelization of multi-dimensional FFTs. Computer Physics Communications 185, 1 (2014), 153\u2013164.","journal-title":"Computer Physics Communications"},{"key":"e_1_3_1_20_2","unstructured":"David Elam and Cesar Iovescu. A block floating point implementation for an n-point FFT on the TMS320C55x DSP. Application report September 2003. TMS320C5000 Software Applications. 
https:\/\/www.ti.com\/lit\/an\/spra948\/spra948.pdf."},{"issue":"2","key":"e_1_3_1_21_2","doi-asserted-by":"crossref","first-page":"216","DOI":"10.1109\/JPROC.2004.840301","article-title":"The design and implementation of FFTW3","volume":"93","author":"Frigo Matteo","year":"2005","unstructured":"Matteo Frigo and Steven G. Johnson. 2005. The design and implementation of FFTW3. Proc. IEEE 93, 2 (2005), 216\u2013231.","journal-title":"Proc. IEEE"},{"key":"e_1_3_1_22_2","article-title":"AccFFT: A library for distributed-memory FFT on CPU and GPU architectures","author":"Gholami Amir","year":"2016","unstructured":"Amir Gholami, Judith Hill, Dhairya Malhotra, and George Biros. 2016. AccFFT: A library for distributed-memory FFT on CPU and GPU architectures. arXiv:1506.07933v3 (2016).","journal-title":"arXiv:1506.07933v3"},{"key":"e_1_3_1_23_2","first-page":"1","volume-title":"SC\u201908: 2008 ACM\/IEEE Conference on Supercomputing","author":"Govindaraju Naga K.","year":"2008","unstructured":"Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. 2008. High performance discrete Fourier transforms on graphics processors. In SC\u201908: 2008 ACM\/IEEE Conference on Supercomputing. IEEE, 1\u201312."},{"key":"e_1_3_1_24_2","unstructured":"Ruobing Han Yang You and James Demmel. 2019. Auto-precision scaling for distributed deep learning. CoRR abs\/1911.08907 (2019). arXiv:1911.08907 http:\/\/arxiv.org\/abs\/1911.08907."},{"key":"e_1_3_1_25_2","doi-asserted-by":"crossref","unstructured":"Wei Hu Xinming Qin Qingcai Jiang Junshi Chen Hong An Weile Jia Fang Li Xin Liu Dexun Chen Fangfang Liu Yuwen Zhao and Jinlong Yang. 2021. High performance computing of DGDFT for tens of thousands of atoms using millions of cores on Sunway TaihuLight. 
Science Bulletin 66 2 (2021) 111\u2013119.","DOI":"10.1016\/j.scib.2020.06.025"},{"key":"e_1_3_1_26_2","first-page":"1","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis (SC20)","author":"Jia Weile","year":"2020","unstructured":"Weile Jia, Han Wang, Mohan Chen, Denghui Lu, Lin Lin, Roberto Car, E. Weinan, and Linfeng Zhang. 2020. Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC20). IEEE, 1\u201314."},{"key":"e_1_3_1_27_2","doi-asserted-by":"crossref","first-page":"717","DOI":"10.1109\/SC.2018.00060","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis (SC18)","author":"Joubert Wayne","year":"2018","unstructured":"Wayne Joubert, Deborah Weighill, David Kainer, Sharlee Climer, Amy Justice, Kjiersten Fagnan, and Daniel Jacobson. 2018. Attacking the opioid epidemic: Determining the epistatic and pleiotropic genetic architectures for chronic pain and opioid addiction. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC18). IEEE, 717\u2013730."},{"key":"e_1_3_1_28_2","first-page":"1","volume-title":"2021 IEEE International Conference on Cluster Computing (CLUSTER)","author":"Li Binrui","year":"2021","unstructured":"Binrui Li, Shenggan Cheng, and James Lin. 2021. tcFFT: A fast half-precision FFT library for NVIDIA tensor cores. In 2021 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 1\u201311."},{"key":"e_1_3_1_29_2","article-title":"tcFFT: Accelerating half-precision FFT through tensor cores","author":"Li Binrui","year":"2021","unstructured":"Binrui Li, Shenggan Cheng, and James Lin. 2021. tcFFT: Accelerating half-precision FFT through tensor cores. 
arXiv:2104.11471v1 (2021).","journal-title":"arXiv:2104.11471v1"},{"key":"e_1_3_1_30_2","first-page":"1","volume-title":"Cray user Group 2010 Conference","author":"Li Ning","year":"2010","unstructured":"Ning Li and Sylvain Laizet. 2010. 2DECOMP & FFT - a highly scalable 2d decomposition library and FFT interface. In Cray user Group 2010 Conference. 1\u201313."},{"key":"e_1_3_1_31_2","first-page":"1","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Li Zhihao","year":"2019","unstructured":"Zhihao Li, Haipeng Jia, Yunquan Zhang, Tun Chen, Liang Yuan, Luning Cao, and Xiao Wang. 2019. AutoFFT: A template-based FFT codes auto-generation framework for ARM and X86 CPUs. In International Conference for High Performance Computing, Networking, Storage and Analysis. 1\u201315."},{"issue":"8","key":"e_1_3_1_32_2","doi-asserted-by":"crossref","first-page":"1925","DOI":"10.1109\/TPDS.2020.2977629","article-title":"Automatic generation of high-performance FFT kernels on arm and X86 CPUs","volume":"31","author":"Li Zhihao","year":"2020","unstructured":"Zhihao Li, Haipeng Jia, Yunquan Zhang, Tun Chen, Liang Yuan, and Richard Vuduc. 2020. Automatic generation of high-performance FFT kernels on arm and X86 CPUs. IEEE Transactions on Parallel and Distributed Systems 31, 8 (2020), 1925\u20131941.","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"issue":"12","key":"e_1_3_1_33_2","doi-asserted-by":"crossref","first-page":"2674","DOI":"10.1109\/TVCG.2014.2346458","article-title":"Fixed-rate compressed floating-point arrays","volume":"20","author":"Lindstrom Peter","year":"2014","unstructured":"Peter Lindstrom. 2014. Fixed-rate compressed floating-point arrays. 
IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 2674\u20132683.","journal-title":"IEEE Transactions on Visualization and Computer Graphics"},{"issue":"6","key":"e_1_3_1_34_2","doi-asserted-by":"crossref","first-page":"989","DOI":"10.1007\/s11390-014-1484-z","article-title":"Memory efficient two-pass 3D FFT algorithm for Intel\u00ae Xeon PhiTM coprocessor","volume":"29","author":"Liu Yiqun","year":"2014","unstructured":"Yiqun Liu, Yan Li, Yunquan Zhang, and Xianyi Zhang. 2014. Memory efficient two-pass 3D FFT algorithm for Intel\u00ae Xeon PhiTM coprocessor. Journal of Computer Science and Technology 29, 6 (2014), 989\u20131002.","journal-title":"Journal of Computer Science and Technology"},{"key":"e_1_3_1_35_2","first-page":"1","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Liu Yong","year":"2021","unstructured":"Yong Liu, Xin Liu, Fang Li, Haohuan Fu, Yuling Yang, Jiawei Song, Pengpeng Zhao, Zhen Wang, Dajia Peng, Huarong Chen, Chu Guo, and Heliang Huang. 2021. Closing the \u201cquantum supremacy\u201d gap: Achieving real-time simulation of a random quantum circuit using a new sunway supercomputer. In International Conference for High Performance Computing, Networking, Storage and Analysis. 1\u201312."},{"key":"e_1_3_1_36_2","unstructured":"Paulius Micikevicius Sharan Narang Jonah Alben Gregory F. Diamos Erich Elsen David Garc\u00eda Boris Ginsburg Michael Houston Oleksii Kuchaiev Ganesh Venkatesh and Hao Wu. 2017. Mixed Precision Training. CoRR abs\/1710.03740 (2017). arXiv:1710.03740 http:\/\/arxiv.org\/abs\/1710.03740."},{"key":"e_1_3_1_37_2","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1145\/2159430.2159437","volume-title":"5th Annual Workshop on General Purpose Processing with Graphics Processing Units","author":"Nukada Akira","year":"2012","unstructured":"Akira Nukada, Yutaka Maruyama, and Satoshi Matsuoka. 2012. 
High performance 3-D FFT using multiple CUDA GPUs. In 5th Annual Workshop on General Purpose Processing with Graphics Processing Units. 57\u201363."},{"key":"e_1_3_1_38_2","first-page":"1","volume-title":"Conference on High Performance Computing Networking, Storage and Analysis","author":"Nukada Akira","year":"2009","unstructured":"Akira Nukada and Satoshi Matsuoka. 2009. Auto-tuning 3-D FFT library for CUDA GPUs. In Conference on High Performance Computing Networking, Storage and Analysis. IEEE, 1\u201310."},{"key":"e_1_3_1_39_2","first-page":"1","volume-title":"2008 ACM\/IEEE Conference on Supercomputing (SC\u201908)","author":"Nukada Akira","year":"2008","unstructured":"Akira Nukada, Yasuhiko Ogata, Toshio Endo, and Satoshi Matsuoka. 2008. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In 2008 ACM\/IEEE Conference on Supercomputing (SC\u201908). IEEE, 1\u201311."},{"key":"e_1_3_1_40_2","first-page":"1","volume-title":"International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912)","author":"Nukada Akira","year":"2012","unstructured":"Akira Nukada, Kento Sato, and Satoshi Matsuoka. 2012. Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer. In International Conference on High Performance Computing, Networking, Storage and Analysis (SC\u201912). IEEE, 1\u201310."},{"issue":"4","key":"e_1_3_1_41_2","first-page":"C192\u2013C209","article-title":"P3DFFT: A framework for parallel computations of fourier transforms in three dimensions","volume":"34","author":"Pekurovsky Dmitry","year":"2012","unstructured":"Dmitry Pekurovsky. 2012. P3DFFT: A framework for parallel computations of fourier transforms in three dimensions. 
SIAM Journal on Scientific Computing 34, 4 (2012), C192\u2013C209.","journal-title":"SIAM Journal on Scientific Computing"},{"issue":"3","key":"e_1_3_1_42_2","first-page":"C213\u2013C236","article-title":"PFFT: An extension of FFTW to massively parallel architectures","volume":"35","author":"Pippig Michael","year":"2013","unstructured":"Michael Pippig. 2013. PFFT: An extension of FFTW to massively parallel architectures. SIAM Journal on Scientific Computing 35, 3 (2013), C213\u2013C236.","journal-title":"SIAM Journal on Scientific Computing"},{"key":"e_1_3_1_43_2","volume-title":"fftMPI, a Library for Performing 2D and 3D FFTs in Parallel","author":"Plimpton Steven","year":"2018","unstructured":"Steven Plimpton, Axel Kohlmeyer, Paul Coffman, and Phil Blood. 2018. fftMPI, a Library for Performing 2D and 3D FFTs in Parallel. Technical Report. Sandia National Lab. (SNL-NM), Albuquerque, NM."},{"key":"e_1_3_1_44_2","doi-asserted-by":"crossref","unstructured":"Markus P\u00fcschel Jos\u00e9 M. F. Moura Jeremy R. Johnson David Padua Manuela M. Veloso Bryan W. Singer Jianxin Xiong Franz Franchetti Aca Ga\u010di\u0107 Yevgen Voronenko Kang Chen Robert W. Johnson and Nicholas Rizzolo. 2005. SPIRAL: Code Generation for DSP Transforms. Proc. IEEE 93 2 (2005) 232\u2013275. 10.1109\/JPROC.2004.840306","DOI":"10.1109\/JPROC.2004.840306"},{"key":"e_1_3_1_45_2","first-page":"1","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201919)","author":"Ravikumar Kiran","year":"2019","unstructured":"Kiran Ravikumar, David Appelhans, and P. K. Yeung. 2019. GPU acceleration of extreme scale pseudo-spectral simulations of turbulence using asynchronism. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC\u201919). 
1\u201322."},{"key":"e_1_3_1_46_2","first-page":"181","volume-title":"19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","author":"Song Sukhyun","year":"2014","unstructured":"Sukhyun Song and Jeffrey K. Hollingsworth. 2014. Designing and auto-tuning parallel 3-D FFT for computation-communication overlap. In 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 181\u2013192."},{"key":"e_1_3_1_47_2","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1109\/HiPCW.2018.8634417","volume-title":"2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW)","author":"Sorna Anumeena","year":"2018","unstructured":"Anumeena Sorna, Xiaohe Cheng, Eduardo D\u2019azevedo, Kwai Won, and Stanimire Tomov. 2018. Optimizing the fast fourier transform using mixed precision on tensor core hardware. In 2018 IEEE 25th International Conference on High Performance Computing Workshops (HiPCW). IEEE, 3\u20137."},{"key":"e_1_3_1_48_2","unstructured":"Daisuke Takahashi. 2014. FFTE: A fast fourier transform package. http:\/\/www.ffte.jp\/ (2014)."},{"key":"e_1_3_1_49_2","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1007\/978-3-319-06486-4_7","volume-title":"High-Performance Computing on the Intel\u00ae Xeon Phi\u2122","author":"Wang Endong","year":"2014","unstructured":"Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. 2014. Intel math kernel library. In High-Performance Computing on the Intel\u00ae Xeon Phi\u2122. Springer, 167\u2013188."},{"issue":"1","key":"e_1_3_1_50_2","doi-asserted-by":"crossref","first-page":"003","DOI":"10.1088\/1674-4527\/21\/1\/3","article-title":"A hybrid fast multipole method for cosmological n-body simulations","volume":"21","author":"Wang Qiao","year":"2021","unstructured":"Qiao Wang. 2021. A hybrid fast multipole method for cosmological n-body simulations. 
Research in Astronomy and Astrophysics 21, 1 (2021), 003.","journal-title":"Research in Astronomy and Astrophysics"},{"issue":"11","key":"e_1_3_1_51_2","doi-asserted-by":"crossref","first-page":"281","DOI":"10.1088\/1674-4527\/21\/11\/281","article-title":"PhotoNs-GPU: A GPU accelerated cosmological simulation code","volume":"21","author":"Wang Qiao","year":"2021","unstructured":"Qiao Wang and Chen Meng. 2021. PhotoNs-GPU: A GPU accelerated cosmological simulation code. Research in Astronomy and Astrophysics 21, 11 (2021), 281.","journal-title":"Research in Astronomy and Astrophysics"},{"issue":"10","key":"e_1_3_1_52_2","first-page":"3184","article-title":"General implementation of 1-D FFT on the Sunway 26010 processor","volume":"31","author":"Zhao Yuwen","year":"2020","unstructured":"Yuwen Zhao, Yulong Ao, Chao Yang, Fangfang Liu, Wanwang Yin, and Rongfen Lin. 2020. General implementation of 1-D FFT on the Sunway 26010 processor. Journal of Software 31, 10 (2020), 3184\u20133196.","journal-title":"Journal of Software"},{"key":"e_1_3_1_53_2","first-page":"1","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Ziogas Alexandros Nikolaos","year":"2019","unstructured":"Alexandros Nikolaos Ziogas, Tal Ben-Nun, Guillermo Indalecio Fern\u00e1ndez, Timo Schneider, Mathieu Luisier, and Torsten Hoefler. 2019. A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations. In International Conference for High Performance Computing, Networking, Storage and Analysis. 
1\u201313."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3605148","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3605148","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:03:54Z","timestamp":1750291434000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3605148"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,22]]},"references-count":52,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,9,30]]}},"alternative-id":["10.1145\/3605148"],"URL":"https:\/\/doi.org\/10.1145\/3605148","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,7,22]]},"assertion":[{"value":"2022-11-29","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-05-30","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-07-22","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}