{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,24]],"date-time":"2025-10-24T16:48:49Z","timestamp":1761324529023,"version":"3.41.0"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,7,19]],"date-time":"2023-07-19T00:00:00Z","timestamp":1689724800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2021YFB0301300"],"award-info":[{"award-number":["2021YFB0301300"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Major Program of Guangdong Basic and Applied Research","award":["2019B030302002"],"award-info":[{"award-number":["2019B030302002"]}]},{"name":"Guangdong Province Special Support Program for Cultivating High-level Talents","award":["2021TQ06X160"],"award-info":[{"award-number":["2021TQ06X160"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,9,30]]},"abstract":"<jats:p>The tremendous success of convolutional neural network (CNN) has made it ubiquitous in many fields of human endeavor. Many applications such as biomedical analysis and scientific data analysis involve analyzing volumetric data. This spawns huge demand for 3D-CNN. Although accelerators such as GPU may provide higher throughput on deep learning applications, they may not be available in all scenarios. 
CPU, especially many-core CPU with non-uniform memory access (NUMA) architecture, remains an attractive choice for deep learning inference in many scenarios.<\/jats:p><jats:p>In this article, we propose a distributed inference solution for 3D-CNN that targets the emerging ARM many-core CPU platform. A hierarchical partition approach is proposed to accelerate 3D-CNN inference by exploiting the characteristics of memory and cache on ARM many-core CPUs. Based on the hierarchical model partition approach, other optimization techniques such as NUMA-aware thread scheduling and optimization of 3D-img2row convolution are designed to exploit the potential of ARM many-core CPUs for 3D-CNN. We evaluate our proposed inference solution with several classic 3D-CNNs: C3D, 3D-resnet34, 3D-resnet50, 3D-vgg11, and P3D. Our experimental results show that our solution can boost the performance of 3D-CNN inference and achieve much better scalability, with a negligible fluctuation in accuracy. When employing our 3D-CNN inference solution on ACL libraries, it can outperform naive ACL implementations by 11\u00d7 to 50\u00d7 on ARM many-core processors. 
When employing our 3D-CNN inference solution on NCNN libraries, it can outperform the naive NCNN implementations by 5.2\u00d7 to 14.2\u00d7 on ARM many-core processor.<\/jats:p>","DOI":"10.1145\/3605149","type":"journal-article","created":{"date-parts":[[2023,6,18]],"date-time":"2023-06-18T08:06:45Z","timestamp":1687075605000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1417-3012","authenticated-orcid":false,"given":"Jiazhi","family":"Jiang","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-0280-2757","authenticated-orcid":false,"given":"Zijian","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5582-1031","authenticated-orcid":false,"given":"Dan","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4707-9492","authenticated-orcid":false,"given":"Jiangsu","family":"Du","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7943-3172","authenticated-orcid":false,"given":"Lin","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9318-5715","authenticated-orcid":false,"given":"Ziguan","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5315-3375","authenticated-orcid":false,"given":"Yutong","family":"Lu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Sun Yat-sen University, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,7,19]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"2022. ARM Compute Library. Retrieved from https:\/\/github.com\/ARM-software\/ComputeLibrary."},{"key":"e_1_3_1_3_2","unstructured":"2022. Tencent NCNN. Retrieved from https:\/\/github.com\/Tencent\/ncnn."},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/3320060"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-23678-5_39"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-018-00625-8"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472456.3475737"},{"key":"e_1_3_1_8_2","article-title":"cuDNN: Efficient primitives for deep learning","author":"Chetlur Sharan","year":"2014","unstructured":"Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).","journal-title":"arXiv preprint arXiv:1410.0759"},{"key":"e_1_3_1_9_2","first-page":"815","volume-title":"International Conference on Machine Learning","author":"Cho Minsik","year":"2017","unstructured":"Minsik Cho and Daniel Brand. 2017. MEC: Memory-efficient convolution for deep neural network. In International Conference on Machine Learning. 
PMLR, 815\u2013824."},{"key":"e_1_3_1_10_2","first-page":"1337","volume-title":"International Conference on Machine Learning","author":"Coates Adam","year":"2013","unstructured":"Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, and Andrew Ng. 2013. Deep learning with COTS HPC systems. In International Conference on Machine Learning. PMLR, 1337\u20131345."},{"issue":"7","key":"e_1_3_1_11_2","first-page":"1665","article-title":"Model parallelism optimization for distributed inference via decoupled CNN structure","volume":"32","author":"Du Jiangsu","year":"2020","unstructured":"Jiangsu Du, Xin Zhu, Minghua Shen, Yunfei Du, Yutong Lu, Nong Xiao, and Xiangke Liao. 2020. Model parallelism optimization for distributed inference via decoupled CNN structure. IEEE Trans. Parallel Distrib. Syst. 32, 7 (2020), 1665\u20131676.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11432-016-5588-7"},{"key":"e_1_3_1_13_2","first-page":"3154","volume-title":"IEEE International Conference on Computer Vision Workshops","author":"Hara Kensho","year":"2017","unstructured":"Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. 2017. Learning spatio-temporal features with 3D residual networks for action recognition. In IEEE International Conference on Computer Vision Workshops. 3154\u20133160."},{"key":"e_1_3_1_14_2","first-page":"533","volume-title":"IEEE 39th International Conference on Computer Design (ICCD\u201921)","author":"Hu Zhenbo","year":"2021","unstructured":"Zhenbo Hu, Xiangyu Zou, Wen Xia, Yuhong Zhao, Weizhe Zhang, and Donglei Wu. 2021. Smart-DNN: Efficiently reducing the memory requirements of running deep neural networks on resource-constrained platforms. In IEEE 39th International Conference on Computer Design (ICCD\u201921). 
IEEE, 533\u2013541."},{"key":"e_1_3_1_15_2","first-page":"1019","volume-title":"IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA\/BDCloud\/SocialCom\/SustainCom\u201921)","author":"Huang Xiandong","year":"2021","unstructured":"Xiandong Huang, Qinglin Wang, Shuyu Lu, Ruochen Hao, Songzhu Mei, and Jie Liu. 2021. NUMA-aware FFT-based convolution on ARMv8 many-core CPUs. In IEEE International Conference on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA\/BDCloud\/SocialCom\/SustainCom\u201921). IEEE, 1019\u20131026."},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3545008.3545022"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2022.102954"},{"key":"e_1_3_1_18_2","volume-title":"IEEE\/CVF International Conference on Computer Vision (ICCV) Workshops","author":"Kopuklu Okan","year":"2019","unstructured":"Okan Kopuklu, Neslihan Kose, Ahmet Gunduz, and Gerhard Rigoll. 2019. Resource efficient 3D convolutional neural networks. In IEEE\/CVF International Conference on Computer Vision (ICCV) Workshops."},{"key":"e_1_3_1_19_2","article-title":"ImageNet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012).","journal-title":"Adv. Neural Inf. Process. 
Syst."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/2508834.2513149"},{"issue":"3","key":"e_1_3_1_21_2","first-page":"580","article-title":"FeatherCNN: Fast inference computation with TensorGEMM on ARM architectures","volume":"31","author":"Lan Haidong","year":"2019","unstructured":"Haidong Lan, Jintao Meng, Christian Hundt, Bertil Schmidt, Minwen Deng, Xiaoning Wang, Weiguo Liu, Yu Qiao, and Shengzhong Feng. 2019. FeatherCNN: Fast inference computation with TensorGEMM on ARM architectures. IEEE Trans. Parallel Distrib. Syst. 31, 3 (2019), 580\u2013594.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.435"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3472456.3472496"},{"key":"e_1_3_1_24_2","first-page":"893","volume-title":"24th International Conference on Architectural Support for Programming Languages and Operating Systems","author":"Luo Qinyi","year":"2019","unstructured":"Qinyi Luo, Jinkun Lin, Youwei Zhuo, and Xuehai Qian. 2019. Hop: Heterogeneity-aware decentralized training. In 24th International Conference on Architectural Support for Programming Languages and Operating Systems. 893\u2013907."},{"key":"e_1_3_1_25_2","first-page":"1396","volume-title":"Design, Automation & Test in Europe Conference & Exhibition (DATE\u201917)","author":"Mao Jiachen","year":"2017","unstructured":"Jiachen Mao, Xiang Chen, Kent W. Nixon, Christopher Krieger, and Yiran Chen. 2017. MoDNN: Local distributed mobile computing system for deep neural network. In Design, Automation & Test in Europe Conference & Exhibition (DATE\u201917). IEEE, 1396\u20131401."},{"key":"e_1_3_1_26_2","volume-title":"IEEE\/CVF International Conference on Computer Vision Workshops","author":"Materzynska Joanna","year":"2019","unstructured":"Joanna Materzynska, Guillaume Berger, Ingo Bax, and Roland Memisevic. 2019. 
The Jester dataset: A large-scale video dataset of human gestures. In IEEE\/CVF International Conference on Computer Vision Workshops."},{"key":"e_1_3_1_27_2","doi-asserted-by":"crossref","first-page":"819","DOI":"10.1109\/SC.2018.00068","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Mathuriya Amrita","year":"2018","unstructured":"Amrita Mathuriya, Deborah Bard, Peter Mendygral, Lawrence Meadows, James Arnemann, Lei Shao, Siyu He, Tuomas K\u00e4rn\u00e4, Diana Moise, Simon J. Pennycook et\u00a0al. 2018. CosmoFlow: Using deep learning to learn the universe at scale. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 819\u2013829."},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2021.102041"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2021.3071762"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.sysarc.2019.101635"},{"key":"e_1_3_1_31_2","first-page":"1","volume-title":"27th ACM Symposium on Operating Systems Principles","author":"Narayanan Deepak","year":"2019","unstructured":"Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, Phillip B. Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized pipeline parallelism for DNN training. In 27th ACM Symposium on Operating Systems Principles. 1\u201315."},{"issue":"7","key":"e_1_3_1_32_2","first-page":"1641","article-title":"The case for strong scaling in deep learning: Training large 3D CNNs with hybrid parallelism","volume":"32","author":"Oyama Yosuke","year":"2020","unstructured":"Yosuke Oyama, Naoya Maruyama, Nikoli Dryden, Erin McCarthy, Peter Harrington, Jan Balewski, Satoshi Matsuoka, Peter Nugent, and Brian Van Essen. 2020. The case for strong scaling in deep learning: Training large 3D CNNs with hybrid parallelism. IEEE Trans. Parallel Distrib. Syst. 
32, 7 (2020), 1641\u20131652.","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"e_1_3_1_33_2","first-page":"369","volume-title":"Science and Information Conference","author":"Popovych Sergiy","year":"2019","unstructured":"Sergiy Popovych, Davit Buniatyan, Aleksandar Zlateski, Kai Li, and H. Sebastian Seung. 2019. PZNet: Efficient 3D ConvNet inference on manycore CPUs. In Science and Information Conference. Springer, 369\u2013383."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.5555\/3433701.3433763"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.3390\/s20185097"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.510"},{"key":"e_1_3_1_37_2","unstructured":"Zhang Xianyi Wang Qian and Zaheer Chothia. 2012. OpenBLAS. Retrieved from http:\/\/xianyi.github.io\/OpenBLAS."},{"key":"e_1_3_1_38_2","first-page":"191","volume-title":"IEEE 18th International Symposium on Biomedical Imaging (ISBI\u201921)","author":"Yang Jiancheng","year":"2021","unstructured":"Jiancheng Yang, Rui Shi, and Bingbing Ni. 2021. MedMNIST classification decathlon: A lightweight AutoML benchmark for medical image analysis. In IEEE 18th International Symposium on Biomedical Imaging (ISBI\u201921). IEEE, 191\u2013195."},{"key":"e_1_3_1_39_2","first-page":"1","volume-title":"International Conference for High Performance Computing, Networking, Storage and Analysis","author":"Yang Weiling","year":"2021","unstructured":"Weiling Yang, Jianbin Fang, Dezun Dong, Xing Su, and Zheng Wang. 2021. LIBSHALOM: Optimizing small and irregular-shaped matrix multiplications on ARMv8 multi-cores. In International Conference for High Performance Computing, Networking, Storage and Analysis. 1\u201314."},{"key":"e_1_3_1_40_2","first-page":"801","volume-title":"IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201916)","author":"Zlateski Aleksandar","year":"2016","unstructured":"Aleksandar Zlateski, Kisuk Lee, and H. 
Sebastian Seung. 2016. ZNN\u2013A fast and scalable algorithm for training 3D convolutional networks on multi-core and many-core shared memory machines. In IEEE International Parallel and Distributed Processing Symposium (IPDPS\u201916). IEEE, 801\u2013811."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3605149","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3605149","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:03:54Z","timestamp":1750291434000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3605149"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,19]]},"references-count":39,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,9,30]]}},"alternative-id":["10.1145\/3605149"],"URL":"https:\/\/doi.org\/10.1145\/3605149","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2023,7,19]]},"assertion":[{"value":"2022-11-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-06-12","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-07-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}