{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,21]],"date-time":"2025-09-21T07:14:13Z","timestamp":1758438853183,"version":"3.44.0"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"3","funder":[{"name":"JST, CREST","award":["JPMJCR22M2"],"award-info":[{"award-number":["JPMJCR22M2"]}]},{"DOI":"10.13039\/501100009427","name":"Telecommunications Advancement Foundation","doi-asserted-by":"crossref","award":["KJ25030011"],"award-info":[{"award-number":["KJ25030011"]}],"id":[{"id":"10.13039\/501100009427","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:p>The 3D Non-Local Means (NLM) algorithm has become a crucial preprocessing technique for 3D image datasets due to its effectiveness in denoising while preserving fine details. This method has been proven to be highly efficient in high-demand tasks within industrial applications such as medical imaging and remote sensing. The 3D NLM algorithm computes the filtered value for each voxel by calculating the weighted average of all voxels within a 3D search window, where the weights are determined by the similarity between pairs of 3D template windows. Therefore, the computational burden becomes significant, especially in embedded GPUs with limited computational power and memory resources. To address this issue, we propose an efficient GPU parallel kernel to minimize redundant computations and memory accesses. The kernel integrates three nested reuse strategies to handle redundant computations in three dimensions: for columns, we leverage the fast data exchange mechanism to reuse column computation results via on-chip registers; for rows, we use a sliding window strategy, utilizing GPU global memory as an intermediary to store and reuse similarity values between filtered rows; and for channels, we introduce a zigzag scanning strategy that enables simultaneous computation across multiple channels and employs on-chip registers to facilitate channel computation reuse. Experimental results demonstrate that our kernel achieves an average speedup of 7.7x on the embedded Jetson AGX Xavier platform across a range of 3D image datasets compared to existing methods, showcasing exceptional performance.<\/jats:p>","DOI":"10.1145\/3744909","type":"journal-article","created":{"date-parts":[[2025,6,16]],"date-time":"2025-06-16T07:17:43Z","timestamp":1750058263000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["3D GNLM: Efficient 3D Non-Local Means Kernel with Nested Reuse Strategies for Embedded GPUs"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-6933-6491","authenticated-orcid":false,"given":"Xiang","family":"Li","sequence":"first","affiliation":[{"name":"Nanjing University","place":["Nanjing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4447-0480","authenticated-orcid":false,"given":"Qiong","family":"Chang","sequence":"additional","affiliation":[{"name":"School of Computing, Institute of Science Tokyo","place":["Meguro, Japan"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1753-7317","authenticated-orcid":false,"given":"Yun","family":"Li","sequence":"additional","affiliation":[{"name":"Nanjing University","place":["Nanjing, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3038-7678","authenticated-orcid":false,"given":"Jun","family":"Miyazaki","sequence":"additional","affiliation":[{"name":"School of Computing, Institute of Science Tokyo","place":["Meguro, Japan"]}]}],"member":"320","published-online":{"date-parts":[[2025,9,19]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.1002\/mp.14024"},{"key":"e_1_3_1_3_2","doi-asserted-by":"crossref","unstructured":"Hossein Arabi and Habib Zaidi. 2021. Non-local mean denoising using multiple PET reconstructions. Annals of Nuclear Medicine 35 2 (2021) 176\u2013186.","DOI":"10.1007\/s12149-020-01550-y"},{"key":"e_1_3_1_4_2","doi-asserted-by":"crossref","unstructured":"Satyakam Baraha Ajit Kumar Sahoo and Sowjanya Modalavalasa. 2022. A systematic review on recent developments in nonlocal and variational methods for SAR image despeckling. Signal Processing 196 C Article 108521 (2022).","DOI":"10.1016\/j.sigpro.2022.108521"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2005.38"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.5201\/ipol.2011.bcm_nlm"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2023.03.004"},{"key":"e_1_3_1_8_2","doi-asserted-by":"crossref","unstructured":"Kaixin Chen Xiao Lin Xing Hu Jiayao Wang Han Zhong and Linhua Jiang. 2020. An enhanced adaptive non-local means algorithm for Rician noise reduction in magnetic resonance brain images. BMC Medical Imaging 20 1 Article 2 (2020).","DOI":"10.1186\/s12880-019-0407-4"},{"issue":"4","key":"e_1_3_1_9_2","first-page":"S425","article-title":"BrainWeb: Online interface to a 3D MRI simulated brain database","volume":"5","author":"Cocosco Chris A.","year":"1997","unstructured":"Chris A. Cocosco, V. Kollokian, R. K. -S Kwan, and A. C. Evans. 1997. BrainWeb: Online interface to a 3D MRI simulated brain database. NeuroImage 5, 4 (1997), S425.","journal-title":"NeuroImage"},{"key":"e_1_3_1_10_2","unstructured":"Nvidia Corporation. 2021. CUDA C Programming Guide. (2021). [Online]. Available: https:\/\/docs.nvidia.com\/cuda\/archive\/11.2.0\/cuda-c-programming-guide\/. Accessed: Feb. 22 2025."},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMI.2007.906087"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.3390\/rs12061006"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2018.2850222"},{"issue":"1","key":"e_1_3_1_14_2","first-page":"523862","article-title":"3D data denoising via nonlocal means filter by using parallel GPU strategies","volume":"2014","author":"Cuomo Salvatore","year":"2014","unstructured":"Salvatore Cuomo, Pasquale De Michele, and Francesco Piccialli. 2014. 3D data denoising via nonlocal means filter by using parallel GPU strategies. Computational and Mathematical Methods in Medicine 2014, 1 (2014), 523862.","journal-title":"Computational and Mathematical Methods in Medicine"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/3PGCIC.2015.77"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11554-020-00945-4"},{"key":"e_1_3_1_17_2","doi-asserted-by":"crossref","unstructured":"Manoj Diwakar Pardeep Kumar and Amit Kumar Singh. 2020. CT image denoising using NLM and its method noise thresholding. Multimedia Tools and Applications 79 21\u201322 (2020) 14449\u201314464.","DOI":"10.1007\/s11042-018-6897-1"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1186\/s42492-019-0016-7"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ejmp.2021.07.028"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.5201\/ipol.2022.346"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2019.09.003"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11554-016-0566-2"},{"key":"e_1_3_1_23_2","unstructured":"Khronos Group. 2023. OpenCL Specification. (2023). [Online]. Available: https:\/\/registry.khronos.org\/OpenCL\/specs\/3.0-unified\/pdf\/OpenCL_API.pdf. Accessed: Feb. 22 2025."},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.5114\/pjr.2023.130815"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.mri.2016.04.008"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICTA53157.2021.9661666"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.3233\/APC200098"},{"issue":"1","key":"e_1_3_1_28_2","first-page":"921303","article-title":"GPU-based block-wise nonlocal means denoising for 3D ultrasound images","volume":"2013","author":"Li Liu","year":"2013","unstructured":"Liu Li, Wenguang Hou, Xuming Zhang, and Mingyue Ding. 2013. GPU-based block-wise nonlocal means denoising for 3D ultrasound images. Computational and Mathematical Methods in Medicine 2013, 1 (2013), 921303.","journal-title":"Computational and Mathematical Methods in Medicine"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00727"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3689339"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2021.3084813"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2005.859509"},{"key":"e_1_3_1_33_2","first-page":"495","volume-title":"Proceedings of the 2013 Federated Conference on Computer Science and Information Systems (FedCSIS)","author":"Palma Giuseppe","year":"2013","unstructured":"Giuseppe Palma, Marco Comerci, Bruno Alfano, Salvatore Cuomo, Pasquale De Michele, Francesco Piccialli, and Pasquale Borrelli. 2013. 3D non-local means denoising via multi-GPU. In Proceedings of the 2013 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE, 495\u2013498."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.3390\/rs14235933"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.dsp.2016.07.017"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-13-9042-5_66"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neunet.2020.07.025"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00813"},{"key":"e_1_3_1_39_2","first-page":"225","volume-title":"Proceedings of the International Conference on Aerospace System Science and Engineering","author":"Xu Meng","year":"2021","unstructured":"Meng Xu, Han Pan, Xia Wu, and Zhongliang Jing. 2021. Hyperspectral and multispectral image fusion via regularization on non-local structure tensor total variation. In Proceedings of the International Conference on Aerospace System Science and Engineering. Springer, 225\u2013238."},{"key":"e_1_3_1_40_2","unstructured":"Hao Zhang Feng Li Shilong Liu Lei Zhang Hang Su Jun Zhu Lionel M. Ni and Heung-Yeung Shum. 2023. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.1002\/mp.12097"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1088\/0031-9155\/61\/3\/1332"},{"key":"e_1_3_1_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00068"},{"key":"e_1_3_1_44_2","doi-asserted-by":"crossref","unstructured":"Xiang Li Qiong Chang Yun Li and Jun Miyazaki. 2025. Efficient parallel implementation of non-local means algorithm on GPU. In Proceedings of the 17th Workshop on General Purpose Processing Using GPU (GPGPU\u201925). Association for Computing Machinery 55\u201361.","DOI":"10.1145\/3725798.3725807"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3744909","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,20]],"date-time":"2025-09-20T00:49:57Z","timestamp":1758329397000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3744909"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,19]]},"references-count":43,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,9,30]]}},"alternative-id":["10.1145\/3744909"],"URL":"https:\/\/doi.org\/10.1145\/3744909","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2025,9,19]]},"assertion":[{"value":"2025-03-10","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-09","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}