{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T22:11:59Z","timestamp":1766268719760,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":54,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,6,27]],"date-time":"2022-06-27T00:00:00Z","timestamp":1656288000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Samsung Research Funding & Incubation Center","award":["SRFC-TD2003-01"],"award-info":[{"award-number":["SRFC-TD2003-01"]}]},{"name":"IITP ITRC (Institute of Information & Communications Technology Planning & Evaluation Information Technology Research Center)","award":["IITP-2021-0-02048"],"award-info":[{"award-number":["IITP-2021-0-02048"]}]},{"name":"Samsung Research Funding & Incubation Center","award":["SRFC-TB1803-03"],"award-info":[{"award-number":["SRFC-TB1803-03"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,6,27]]},"DOI":"10.1145\/3498361.3538940","type":"proceedings-article","created":{"date-parts":[[2022,6,16]],"date-time":"2022-06-16T16:21:53Z","timestamp":1655396513000},"page":"222-234","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["mGEMM"],"prefix":"10.1145","author":[{"given":"Jongseok","family":"Park","sequence":"first","affiliation":[{"name":"Seoul National University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kyungmin","family":"Bin","sequence":"additional","affiliation":[{"name":"Seoul National University"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kyunghan","family":"Lee","sequence":"additional","affiliation":[{"name":"Seoul National 
University"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,6,27]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Low-memory gemm-based convolution algorithms for deep neural networks. &lt;u&gt;arXiv preprint arXiv:1709.03395&lt;\/u&gt","author":"Anderson A.","year":"2017","unstructured":"Anderson , A. , Vasudevan , A. , Keane , C. , and Gregg , D . Low-memory gemm-based convolution algorithms for deep neural networks. &lt;u&gt;arXiv preprint arXiv:1709.03395&lt;\/u&gt ; ( 2017 ). Anderson, A., Vasudevan, A., Keane, C., and Gregg, D. Low-memory gemm-based convolution algorithms for deep neural networks. &lt;u&gt;arXiv preprint arXiv:1709.03395&lt;\/u&gt; (2017)."},{"volume-title":"for Armv8-A architecture profile.&lt;\/u&gt","author":"Arm Architecture Reference Manual","key":"e_1_3_2_1_2_1","unstructured":"ARM. &lt;u&gt; Arm Architecture Reference Manual Armv8 , for Armv8-A architecture profile.&lt;\/u&gt ; ARM. ARM. &lt;u&gt;Arm Architecture Reference Manual Armv8, for Armv8-A architecture profile.&lt;\/u&gt; ARM."},{"key":"e_1_3_2_1_3_1","volume-title":"Inference engine for cpus, gpus and npus. https:\/\/github.com\/ARM-software\/armnn","author":"Armnn","year":"2021","unstructured":"ARM. Armnn : Inference engine for cpus, gpus and npus. https:\/\/github.com\/ARM-software\/armnn , 2021 . ARM. Armnn : Inference engine for cpus, gpus and npus. https:\/\/github.com\/ARM-software\/armnn, 2021."},{"key":"e_1_3_2_1_4_1","volume-title":"Collection of software functions for the arm family of cpus and gpus. https:\/\/github.com\/ARM-software\/ComputeLibrary","author":"Compute","year":"2021","unstructured":"ARM. Compute library : Collection of software functions for the arm family of cpus and gpus. https:\/\/github.com\/ARM-software\/ComputeLibrary , 2021 . ARM. Compute library: Collection of software functions for the arm family of cpus and gpus. 
https:\/\/github.com\/ARM-software\/ComputeLibrary, 2021."},{"key":"e_1_3_2_1_5_1","volume-title":"High performance convolutional neural networks for document processing. In &lt;u&gt;Tenth International Workshop on Frontiers in Handwriting Recognition&lt;\/u&gt","author":"Chellapilla K.","year":"2006","unstructured":"Chellapilla , K. , Puri , S. , and Simard , P . High performance convolutional neural networks for document processing. In &lt;u&gt;Tenth International Workshop on Frontiers in Handwriting Recognition&lt;\/u&gt ; ( 2006 ), Suvisoft . Chellapilla, K., Puri, S., and Simard, P. High performance convolutional neural networks for document processing. In &lt;u&gt;Tenth International Workshop on Frontiers in Handwriting Recognition&lt;\/u&gt; (2006), Suvisoft."},{"key":"e_1_3_2_1_6_1","first-page":"815","volume-title":"Mec: Memory-efficient convolution for deep neural network. In &lt;u&gt;Proceedings of ICML&lt;\/u&gt","author":"Cho M.","year":"2017","unstructured":"Cho , M. , and Brand , D . Mec: Memory-efficient convolution for deep neural network. In &lt;u&gt;Proceedings of ICML&lt;\/u&gt ; ( 2017 ), pp. 815 -- 824 . Cho, M., and Brand, D. Mec: Memory-efficient convolution for deep neural network. In &lt;u&gt;Proceedings of ICML&lt;\/u&gt; (2017), pp. 815--824."},{"key":"e_1_3_2_1_7_1","volume-title":"Openmp: an industry standard api for shared-memory programming. &lt;u&gt","author":"Dagum L.","year":"1998","unstructured":"Dagum , L. , and Menon , R . Openmp: an industry standard api for shared-memory programming. &lt;u&gt ; IEEE Computational Science and Engineering 5&lt;\/u&gt;, 1 ( 1998 ), 46--55. Dagum, L., and Menon, R. Openmp: an industry standard api for shared-memory programming. &lt;u&gt;IEEE Computational Science and Engineering 5&lt;\/u&gt;, 1 (1998), 46--55."},{"key":"e_1_3_2_1_8_1","volume-title":"A set of level 3 basic linear algebra subprograms. &lt;u&gt;ACM Transactions on Mathematical Software (TOMS) 16&lt;\/u&gt;, 1","author":"Dongarra J. 
J.","year":"1990","unstructured":"Dongarra , J. J. , Du Croz , J. , Hammarling , S. , and Duff , I. S . A set of level 3 basic linear algebra subprograms. &lt;u&gt;ACM Transactions on Mathematical Software (TOMS) 16&lt;\/u&gt;, 1 ( 1990 ), 1--17. Dongarra, J. J., Du Croz, J., Hammarling, S., and Duff, I. S. A set of level 3 basic linear algebra subprograms. &lt;u&gt;ACM Transactions on Mathematical Software (TOMS) 16&lt;\/u&gt;, 1 (1990), 1--17."},{"key":"e_1_3_2_1_9_1","volume-title":"The indirect convolution algorithm. &lt;u&gt;arXiv preprint arXiv:1907.02129&lt;\/u&gt","author":"Dukhan M.","year":"2019","unstructured":"Dukhan , M. The indirect convolution algorithm. &lt;u&gt;arXiv preprint arXiv:1907.02129&lt;\/u&gt ; ( 2019 ). Dukhan, M. The indirect convolution algorithm. &lt;u&gt;arXiv preprint arXiv:1907.02129&lt;\/u&gt; (2019)."},{"key":"e_1_3_2_1_10_1","volume-title":"Accelerating tensorflow lite with xnnpack integration. https:\/\/blog.tensorflow.org\/2020\/07\/accelerating-tensorflow-lite-xnnpack-integration.html","author":"Dukhan M.","year":"2020","unstructured":"Dukhan , M. Accelerating tensorflow lite with xnnpack integration. https:\/\/blog.tensorflow.org\/2020\/07\/accelerating-tensorflow-lite-xnnpack-integration.html , 2020 . Dukhan, M. Accelerating tensorflow lite with xnnpack integration. https:\/\/blog.tensorflow.org\/2020\/07\/accelerating-tensorflow-lite-xnnpack-integration.html, 2020."},{"key":"e_1_3_2_1_11_1","first-page":"830","volume-title":"Anatomy of high-performance deep learning convolutions on simd architectures. In &lt;u&gt","author":"Georganas E.","year":"2018","unstructured":"Georganas , E. , Avancha , S. , Banerjee , K. , Kalamkar , D. , Henry , G. , Pabst , H. , and Heinecke , A . Anatomy of high-performance deep learning convolutions on simd architectures. In &lt;u&gt ; Proceedings of IEEE SC '18&lt;\/u&gt; ( 2018 ), pp. 830 -- 841 . Georganas, E., Avancha, S., Banerjee, K., Kalamkar, D., Henry, G., Pabst, H., and Heinecke, A. 
Anatomy of high-performance deep learning convolutions on simd architectures. In &lt;u&gt;Proceedings of IEEE SC'18&lt;\/u&gt; (2018), pp. 830--841."},{"key":"e_1_3_2_1_12_1","volume-title":"Xnnpack: Highly optimized library of floating-point neural network inference operators for arm, webassembly, and x86 platforms. https:\/\/github.com\/google\/XNNPACK","author":"Google","year":"2021","unstructured":"Google . Xnnpack: Highly optimized library of floating-point neural network inference operators for arm, webassembly, and x86 platforms. https:\/\/github.com\/google\/XNNPACK , 2021 . Google. Xnnpack: Highly optimized library of floating-point neural network inference operators for arm, webassembly, and x86 platforms. https:\/\/github.com\/google\/XNNPACK, 2021."},{"key":"e_1_3_2_1_13_1","volume-title":"Anatomy of high-performance matrix multiplication. &lt;u&gt;ACM Transactions on Mathematical Software (TOMS) 34&lt;\/u&gt;, 3","author":"Goto K.","year":"2008","unstructured":"Goto , K. , and Geijn , R. A. V. D. Anatomy of high-performance matrix multiplication. &lt;u&gt;ACM Transactions on Mathematical Software (TOMS) 34&lt;\/u&gt;, 3 ( 2008 ), 1--25. Goto, K., and Geijn, R. A. V. D. Anatomy of high-performance matrix multiplication. &lt;u&gt;ACM Transactions on Mathematical Software (TOMS) 34&lt;\/u&gt;, 3 (2008), 1--25."},{"key":"e_1_3_2_1_14_1","first-page":"51","volume-title":"A family of highperformance matrix multiplication algorithms. In &lt;u&gt;Proceedings of the International Conference on Computational Science (ICCS)&lt;\/u&gt","author":"Gunnels J. A.","year":"2001","unstructured":"Gunnels , J. A. , Henry , G. M. , and Van De Geijn , R. A. A family of highperformance matrix multiplication algorithms. In &lt;u&gt;Proceedings of the International Conference on Computational Science (ICCS)&lt;\/u&gt ; ( 2001 ), Springer , pp. 51 -- 60 . Gunnels, J. A., Henry, G. M., and Van De Geijn, R. A. A family of highperformance matrix multiplication algorithms. 
In &lt;u&gt;Proceedings of the International Conference on Computational Science (ICCS)&lt;\/u&gt; (2001), Springer, pp. 51--60."},{"key":"e_1_3_2_1_15_1","first-page":"2515","volume-title":"Memory-optimal direct convolutions for maximizing classification accuracy in embedded applications. In &lt;u&gt;Proceedings of ICML&lt;\/u&gt","author":"Gural A.","year":"2019","unstructured":"Gural , A. , and Murmann , B . Memory-optimal direct convolutions for maximizing classification accuracy in embedded applications. In &lt;u&gt;Proceedings of ICML&lt;\/u&gt ; ( 2019 ), pp. 2515 -- 2524 . Gural, A., and Murmann, B. Memory-optimal direct convolutions for maximizing classification accuracy in embedded applications. In &lt;u&gt;Proceedings of ICML&lt;\/u&gt; (2019), pp. 2515--2524."},{"key":"e_1_3_2_1_16_1","first-page":"204","volume-title":"Latency and throughput characterization of convolutional neural networks for mobile computer vision. In &lt;u&gt;Proceedings of the ACM Multimedia Systems Conference (MMSys)&lt;\/u&gt","author":"Hanhirova J.","year":"2018","unstructured":"Hanhirova , J. , K\u00e4m\u00e4r\u00e4inen , T. , Sepp\u00e4l\u00e4 , S. , Siekkinen , M. , Hirvisalo , V. , and Yl\u00e4-J\u00e4\u00e4ski , A. Latency and throughput characterization of convolutional neural networks for mobile computer vision. In &lt;u&gt;Proceedings of the ACM Multimedia Systems Conference (MMSys)&lt;\/u&gt ; ( 2018 ), pp. 204 -- 215 . Hanhirova, J., K\u00e4m\u00e4r\u00e4inen, T., Sepp\u00e4l\u00e4, S., Siekkinen, M., Hirvisalo, V., and Yl\u00e4-J\u00e4\u00e4ski, A. Latency and throughput characterization of convolutional neural networks for mobile computer vision. In &lt;u&gt;Proceedings of the ACM Multimedia Systems Conference (MMSys)&lt;\/u&gt; (2018), pp. 204--215."},{"key":"e_1_3_2_1_17_1","first-page":"770","volume-title":"Deep residual learning for image recognition. In &lt;u&gt","author":"He K.","year":"2016","unstructured":"He , K. , Zhang , X. , Ren , S. , and Sun , J . 
Deep residual learning for image recognition. In &lt;u&gt ; Proceedings of IEEE CVPR &lt;\/u&gt; ( 2016 ), pp. 770 -- 778 . He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In &lt;u&gt;Proceedings of IEEE CVPR&lt;\/u&gt; (2016), pp. 770--778."},{"key":"e_1_3_2_1_18_1","first-page":"981","volume-title":"Libxsmm: Accelerating small matrix multiplications by runtime code generation. In &lt;u&gt","author":"Heinecke A.","year":"2016","unstructured":"Heinecke , A. , Henry , G. , Hutchinson , M. , and Pabst , H . Libxsmm: Accelerating small matrix multiplications by runtime code generation. In &lt;u&gt ; Proceedings of IEEE SC '16&lt;\/u&gt; ( 2016 ), pp. 981 -- 991 . Heinecke, A., Henry, G., Hutchinson, M., and Pabst, H. Libxsmm: Accelerating small matrix multiplications by runtime code generation. In &lt;u&gt;Proceedings of IEEE SC'16&lt;\/u&gt; (2016), pp. 981--991."},{"key":"e_1_3_2_1_19_1","volume-title":"Mobilenets: Efficient convolutional neural networks for mobile vision applications. &lt;u&gt;arXiv preprint arXiv:1704.04861&lt;\/u&gt","author":"Howard A. G.","year":"2017","unstructured":"Howard , A. G. , Zhu , M. , Chen , B. , Kalenichenko , D. , Wang , W. , Weyand , T. , Andreetto , M. , and Adam , H . Mobilenets: Efficient convolutional neural networks for mobile vision applications. &lt;u&gt;arXiv preprint arXiv:1704.04861&lt;\/u&gt ; ( 2017 ). Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. &lt;u&gt;arXiv preprint arXiv:1704.04861&lt;\/u&gt; (2017)."},{"key":"e_1_3_2_1_20_1","volume-title":"Evaluating fft-based algorithms for strided convolutions on armv8 architectures. &lt;u&gt;Performance Evaluation 152&lt;\/u&gt","author":"Huang X.","year":"2021","unstructured":"Huang , X. , Wang , Q. , Lu , S. , Hao , R. , Mei , S. , and Liu , J . 
Evaluating fft-based algorithms for strided convolutions on armv8 architectures. &lt;u&gt;Performance Evaluation 152&lt;\/u&gt ; ( 2021 ), 102248. Huang, X., Wang, Q., Lu, S., Hao, R., Mei, S., and Liu, J. Evaluating fft-based algorithms for strided convolutions on armv8 architectures. &lt;u&gt;Performance Evaluation 152&lt;\/u&gt; (2021), 102248."},{"key":"e_1_3_2_1_21_1","unstructured":"Intel. Intel oneapi math kernel library (mkl): Library of optimized math routines for science engineering and financial applications. https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/tools\/oneapi\/components\/onemkl 2021.  Intel. Intel oneapi math kernel library (mkl): Library of optimized math routines for science engineering and financial applications. https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/tools\/oneapi\/components\/onemkl 2021."},{"key":"e_1_3_2_1_22_1","first-page":"2117","volume-title":"Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In &lt;u&gt","author":"Jain S. D.","year":"2017","unstructured":"Jain , S. D. , Xiong , B. , and Grauman , K . Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In &lt;u&gt ; Proceedings of IEEE CVPR &lt;\/u&gt; ( 2017 ), pp. 2117 -- 2126 . Jain, S. D., Xiong, B., and Grauman, K. Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In &lt;u&gt;Proceedings of IEEE CVPR&lt;\/u&gt; (2017), pp. 2117--2126."},{"key":"e_1_3_2_1_23_1","volume-title":"Caffe: Convolutional architecture for fast feature embedding. &lt;u&gt;arXiv preprint arXiv:1408.5093&lt;\/u&gt","author":"Jia Y.","year":"2014","unstructured":"Jia , Y. , Shelhamer , E. , Donahue , J. , Karayev , S. , Long , J. , Girshick , R. , Guadarrama , S. , and Darrell , T . Caffe: Convolutional architecture for fast feature embedding. 
&lt;u&gt;arXiv preprint arXiv:1408.5093&lt;\/u&gt ; ( 2014 ). Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. &lt;u&gt;arXiv preprint arXiv:1408.5093&lt;\/u&gt; (2014)."},{"key":"e_1_3_2_1_24_1","volume-title":"Imagenet classification with deep convolutional neural networks. &lt;u&gt;Advances in Neural Information Processing Systems 25&lt;\/u&gt","author":"Krizhevsky A.","year":"2012","unstructured":"Krizhevsky , A. , Sutskever , I. , and Hinton , G. E . Imagenet classification with deep convolutional neural networks. &lt;u&gt;Advances in Neural Information Processing Systems 25&lt;\/u&gt ; ( 2012 ), 1097--1105. Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. &lt;u&gt;Advances in Neural Information Processing Systems 25&lt;\/u&gt; (2012), 1097--1105."},{"key":"e_1_3_2_1_25_1","first-page":"4013","volume-title":"Fast algorithms for convolutional neural networks. In &lt;u&gt","author":"Lavin A.","year":"2016","unstructured":"Lavin , A. , and Gray , S . Fast algorithms for convolutional neural networks. In &lt;u&gt ; Proceedings of IEEE CVPR &lt;\/u&gt; ( 2016 ), pp. 4013 -- 4021 . Lavin, A., and Gray, S. Fast algorithms for convolutional neural networks. In &lt;u&gt;Proceedings of IEEE CVPR&lt;\/u&gt; (2016), pp. 4013--4021."},{"key":"e_1_3_2_1_26_1","first-page":"1025","volume-title":"Optimizing cnn model inference on cpus. In &lt;u&gt;Proceedings of the USENIX Annual Technical Conference (USENIX ATC)&lt;\/u&gt","author":"Liu Y.","year":"2019","unstructured":"Liu , Y. , Wang , Y. , Yu , R. , Li , M. , Sharma , V. , and Wang , Y . Optimizing cnn model inference on cpus. In &lt;u&gt;Proceedings of the USENIX Annual Technical Conference (USENIX ATC)&lt;\/u&gt ; ( 2019 ), pp. 1025 -- 1040 . Liu, Y., Wang, Y., Yu, R., Li, M., Sharma, V., and Wang, Y. 
Optimizing cnn model inference on cpus. In &lt;u&gt;Proceedings of the USENIX Annual Technical Conference (USENIX ATC)&lt;\/u&gt; (2019), pp. 1025--1040."},{"key":"e_1_3_2_1_27_1","volume-title":"Analytical modeling is enough for high-performance blis. &lt;u&gt;ACM Transactions on Mathematical Software (TOMS) 43&lt;\/u&gt;, 2","author":"Low T. M.","year":"2016","unstructured":"Low , T. M. , Igual , F. D. , Smith , T. M. , and Quintana-Orti , E. S. Analytical modeling is enough for high-performance blis. &lt;u&gt;ACM Transactions on Mathematical Software (TOMS) 43&lt;\/u&gt;, 2 ( 2016 ), 1--18. Low, T. M., Igual, F. D., Smith, T. M., and Quintana-Orti, E. S. Analytical modeling is enough for high-performance blis. &lt;u&gt;ACM Transactions on Mathematical Software (TOMS) 43&lt;\/u&gt;, 2 (2016), 1--18."},{"key":"e_1_3_2_1_28_1","volume-title":"Fast training of convolutional networks through ffts. &lt;u&gt;arXiv preprint arXiv:1312.5851&lt;\/u&gt","author":"Mathieu M.","year":"2013","unstructured":"Mathieu , M. , Henaff , M. , and LeCun , Y. Fast training of convolutional networks through ffts. &lt;u&gt;arXiv preprint arXiv:1312.5851&lt;\/u&gt ; ( 2013 ). Mathieu, M., Henaff, M., and LeCun, Y. Fast training of convolutional networks through ffts. &lt;u&gt;arXiv preprint arXiv:1312.5851&lt;\/u&gt; (2013)."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"crossref","unstructured":"Mogers N. Radu V. Li L. Turner J. O'Boyle M. and Dubach C. Automatic generation of specialized direct convolutions for mobile gpus. In &lt;u&gt;Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit (GPGPU)&lt;\/u&gt; (2020) pp. 41--50.  Mogers N. Radu V. Li L. Turner J. O'Boyle M. and Dubach C. Automatic generation of specialized direct convolutions for mobile gpus. In &lt;u&gt;Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit (GPGPU)&lt;\/u&gt; (2020) pp. 
41--50.","DOI":"10.1145\/3366428.3380771"},{"key":"e_1_3_2_1_30_1","unstructured":"Monsoon-solutions. High Voltage Power Monitor. http:\/\/www.msoon.com\/LabEquipment\/PowerMonitor\/ 2019.  Monsoon-solutions. High Voltage Power Monitor. http:\/\/www.msoon.com\/LabEquipment\/PowerMonitor\/ 2019."},{"key":"e_1_3_2_1_31_1","first-page":"89","volume":"42","author":"Nethercote N.","year":"2007","unstructured":"Nethercote , N. , and Seward , J. Valgrind : A framework for heavyweight dynamic binary instrumentation. &lt;u&gt; ACM SIGPLAN Notices 4 2&lt;\/u&gt;, 6 ( 2007 ), 89--100. Nethercote, N., and Seward, J. Valgrind: A framework for heavyweight dynamic binary instrumentation. &lt;u&gt;ACM SIGPLAN Notices 42&lt;\/u&gt;, 6 (2007), 89--100.","journal-title":"ACM SIGPLAN Notices"},{"key":"e_1_3_2_1_32_1","unstructured":"Nvidia. Comparison of convolution methods for GPUs. http:\/\/ska-sdp.org\/sites\/default\/files\/attachments\/nvidia-sdp-directconvolution.pdf 2020.  Nvidia. Comparison of convolution methods for GPUs. http:\/\/ska-sdp.org\/sites\/default\/files\/attachments\/nvidia-sdp-directconvolution.pdf 2020."},{"key":"e_1_3_2_1_33_1","volume-title":"cublas: Industry standard blas apis highly optimized for nvidia gpus. https:\/\/developer.nvidia.com\/cublas","author":"Nvidia","year":"2021","unstructured":"Nvidia . cublas: Industry standard blas apis highly optimized for nvidia gpus. https:\/\/developer.nvidia.com\/cublas , 2021 . Nvidia. cublas: Industry standard blas apis highly optimized for nvidia gpus. https:\/\/developer.nvidia.com\/cublas, 2021."},{"key":"e_1_3_2_1_34_1","volume-title":"Optimizing Convolutional Layers. https:\/\/docs.nvidia.com\/deeplearning\/performance\/pdf\/Optimizing-Convolutional-Layers-User-Guide.pdf","author":"Nvidia","year":"2021","unstructured":"Nvidia . Optimizing Convolutional Layers. https:\/\/docs.nvidia.com\/deeplearning\/performance\/pdf\/Optimizing-Convolutional-Layers-User-Guide.pdf , 2021 . Nvidia. 
Optimizing Convolutional Layers. https:\/\/docs.nvidia.com\/deeplearning\/performance\/pdf\/Optimizing-Convolutional-Layers-User-Guide.pdf, 2021."},{"key":"e_1_3_2_1_35_1","first-page":"358","volume-title":"Accelerate non-unit stride convolutions with winograd algorithms. In &lt;u&gt;Proceedings of the IEEE Asia and South Pacific Design Automation Conference (ASP-DAC)&lt;\/u&gt","author":"Pan J.","year":"2021","unstructured":"Pan , J. , and Chen , D . Accelerate non-unit stride convolutions with winograd algorithms. In &lt;u&gt;Proceedings of the IEEE Asia and South Pacific Design Automation Conference (ASP-DAC)&lt;\/u&gt ; ( 2021 ), pp. 358 -- 364 . Pan, J., and Chen, D. Accelerate non-unit stride convolutions with winograd algorithms. In &lt;u&gt;Proceedings of the IEEE Asia and South Pacific Design Automation Conference (ASP-DAC)&lt;\/u&gt; (2021), pp. 358--364."},{"key":"e_1_3_2_1_36_1","volume-title":"Efficient memory management for deep neural net inference. &lt;u&gt;arXiv preprint arXiv:2001.03288&lt;\/u&gt","author":"Pisarchyk Y.","year":"2020","unstructured":"Pisarchyk , Y. , and Lee , J . Efficient memory management for deep neural net inference. &lt;u&gt;arXiv preprint arXiv:2001.03288&lt;\/u&gt ; ( 2020 ). Pisarchyk, Y., and Lee, J. Efficient memory management for deep neural net inference. &lt;u&gt;arXiv preprint arXiv:2001.03288&lt;\/u&gt; (2020)."},{"key":"e_1_3_2_1_37_1","unstructured":"Redmon J. Darknet: Open source neural networks in c. http:\/\/pjreddie.com\/darknet\/ 2013--2016.  Redmon J. Darknet: Open source neural networks in c. http:\/\/pjreddie.com\/darknet\/ 2013--2016."},{"key":"e_1_3_2_1_38_1","volume-title":"Yolov3: An incremental improvement. &lt;u&gt;arXiv preprint arXiv:1804.02767&lt;\/u&gt","author":"Redmon J.","year":"2018","unstructured":"Redmon , J. , and Farhadi , A . Yolov3: An incremental improvement. &lt;u&gt;arXiv preprint arXiv:1804.02767&lt;\/u&gt ; ( 2018 ). Redmon, J., and Farhadi, A. Yolov3: An incremental improvement. 
&lt;u&gt;arXiv preprint arXiv:1804.02767&lt;\/u&gt; (2018)."},{"key":"e_1_3_2_1_39_1","first-page":"4510","volume-title":"-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In &lt;u&gt","author":"Sandler M.","year":"2018","unstructured":"Sandler , M. , Howard , A. , Zhu , M. , Zhmoginov , A. , and Chen , L . -C. Mobilenetv2: Inverted residuals and linear bottlenecks. In &lt;u&gt ; Proceedings of IEEE CVPR &lt;\/u&gt; ( 2018 ), pp. 4510 -- 4520 . Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In &lt;u&gt;Proceedings of IEEE CVPR&lt;\/u&gt; (2018), pp. 4510--4520."},{"key":"e_1_3_2_1_40_1","volume-title":"Very deep convolutional networks for large-scale image recognition. &lt;u&gt;arXiv preprint arXiv:1409.1556&lt;\/u&gt","author":"Simonyan K.","year":"2014","unstructured":"Simonyan , K. , and Zisserman , A . Very deep convolutional networks for large-scale image recognition. &lt;u&gt;arXiv preprint arXiv:1409.1556&lt;\/u&gt ; ( 2014 ). Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. &lt;u&gt;arXiv preprint arXiv:1409.1556&lt;\/u&gt; (2014)."},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"crossref","unstructured":"Smith T. M. Van De Geijn R. Smelyanskiy M. Hammond J. R. and Van Zee F. G. Anatomy of high-performance many-threaded matrix multiplication. In &lt;u&gt;Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS)&lt;\/u&gt; (2014) pp. 1049--1059.  Smith T. M. Van De Geijn R. Smelyanskiy M. Hammond J. R. and Van Zee F. G. Anatomy of high-performance many-threaded matrix multiplication. In &lt;u&gt;Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS)&lt;\/u&gt; (2014) pp. 1049--1059.","DOI":"10.1109\/IPDPS.2014.110"},{"key":"e_1_3_2_1_42_1","volume-title":"The arm scalable vector extension. 
&lt;u&gt","author":"Stephens N.","year":"2017","unstructured":"Stephens , N. , Biles , S. , Boettcher , M. , Eapen , J. , Eyole , M. , Gabrielli , G. , The arm scalable vector extension. &lt;u&gt ; IEEE Micro 37&lt;\/u&gt;, 2 ( 2017 ), 26--39. Stephens, N., Biles, S., Boettcher, M., Eapen, J., Eyole, M., Gabrielli, G., et al. The arm scalable vector extension. &lt;u&gt;IEEE Micro 37&lt;\/u&gt;, 2 (2017), 26--39."},{"key":"e_1_3_2_1_43_1","first-page":"2820","volume-title":"Mnasnet: Platform-aware neural architecture search for mobile. In &lt;u&gt","author":"Tan M.","year":"2019","unstructured":"Tan , M. , Chen , B. , Pang , R. , Vasudevan , V. , Sandler , M. , Howard , A. , and Le , Q. V . Mnasnet: Platform-aware neural architecture search for mobile. In &lt;u&gt ; Proceedings of IEEE CVPR &lt;\/u&gt; ( 2019 ), pp. 2820 -- 2828 . Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. Mnasnet: Platform-aware neural architecture search for mobile. In &lt;u&gt;Proceedings of IEEE CVPR&lt;\/u&gt; (2019), pp. 2820--2828."},{"key":"e_1_3_2_1_44_1","first-page":"6105","volume-title":"Efficientnet: Rethinking model scaling for convolutional neural networks. In &lt;u&gt;Proceedings of ICML&lt;\/u&gt","author":"Tan M.","year":"2019","unstructured":"Tan , M. , and Le , Q . Efficientnet: Rethinking model scaling for convolutional neural networks. In &lt;u&gt;Proceedings of ICML&lt;\/u&gt ; ( 2019 ), pp. 6105 -- 6114 . Tan, M., and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In &lt;u&gt;Proceedings of ICML&lt;\/u&gt; (2019), pp. 6105--6114."},{"key":"e_1_3_2_1_45_1","first-page":"1066","volume-title":"Tensile: Auto-tuning gemm gpu assembly for all problem sizes. In &lt;u&gt;Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)&lt;\/u&gt","author":"Tanner D. E.","year":"2018","unstructured":"Tanner , D. E. Tensile: Auto-tuning gemm gpu assembly for all problem sizes. 
In &lt;u&gt;Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)&lt;\/u&gt ; ( 2018 ), pp. 1066 -- 1075 . Tanner, D. E. Tensile: Auto-tuning gemm gpu assembly for all problem sizes. In &lt;u&gt;Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)&lt;\/u&gt; (2018), pp. 1066--1075."},{"key":"e_1_3_2_1_46_1","first-page":"1","volume-title":"Parallel convolution algorithm using implicit matrix multiplication on multi-core cpus. In &lt;u&gt;Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN)&lt;\/u&gt","author":"Wang Q.","year":"2019","unstructured":"Wang , Q. , Mei , S. , Liu , J. , and Gong , C . Parallel convolution algorithm using implicit matrix multiplication on multi-core cpus. In &lt;u&gt;Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN)&lt;\/u&gt ; ( 2019 ), pp. 1 -- 7 . Wang, Q., Mei, S., Liu, J., and Gong, C. Parallel convolution algorithm using implicit matrix multiplication on multi-core cpus. In &lt;u&gt;Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN)&lt;\/u&gt; (2019), pp. 1--7."},{"key":"e_1_3_2_1_47_1","volume-title":"High-throughput cnn inference on embedded arm big. little multicore processors. &lt;u&gt","author":"Wang S.","year":"2019","unstructured":"Wang , S. , Ananthanarayanan , G. , Zeng , Y. , Goel , N. , Pathania , A. , and Mitra , T . High-throughput cnn inference on embedded arm big. little multicore processors. &lt;u&gt ;IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) 39&lt;\/u&gt;, 10 ( 2019 ), 2254--2267. Wang, S., Ananthanarayanan, G., Zeng, Y., Goel, N., Pathania, A., and Mitra, T. High-throughput cnn inference on embedded arm big. little multicore processors. 
&lt;u&gt;IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) 39&lt;\/u&gt;, 10 (2019), 2254--2267."},{"key":"e_1_3_2_1_48_1","first-page":"50","volume":"37","author":"Wang S.","year":"2020","unstructured":"Wang , S. , Pathania , A. , and Mitra , T. Neural network inference on mobile socs. &lt;u&gt; IEEE Design & Test 3 7&lt;\/u&gt;, 5 ( 2020 ), 50--57. Wang, S., Pathania, A., and Mitra, T. Neural network inference on mobile socs. &lt;u&gt;IEEE Design & Test 37&lt;\/u&gt;, 5 (2020), 50--57.","journal-title":"IEEE Design & Test"},{"key":"e_1_3_2_1_49_1","volume-title":"Automated empirical optimizations of software and the atlas project. &lt;u&gt;Parallel computing 27&lt;\/u&gt;, 1-2","author":"Whaley R. C.","year":"2001","unstructured":"Whaley , R. C. , Petitet , A. , and Dongarra , J. J . Automated empirical optimizations of software and the atlas project. &lt;u&gt;Parallel computing 27&lt;\/u&gt;, 1-2 ( 2001 ), 3--35. Whaley, R. C., Petitet, A., and Dongarra, J. J. Automated empirical optimizations of software and the atlas project. &lt;u&gt;Parallel computing 27&lt;\/u&gt;, 1-2 (2001), 3--35."},{"key":"e_1_3_2_1_50_1","volume-title":"Openblas: An optimized blas library. https:\/\/www.openblas.net\/","author":"Xianyi Z.","year":"2021","unstructured":"Xianyi , Z. Openblas: An optimized blas library. https:\/\/www.openblas.net\/ , 2021 . Xianyi, Z. Openblas: An optimized blas library. https:\/\/www.openblas.net\/, 2021."},{"key":"e_1_3_2_1_51_1","volume-title":"Evolving cnn-lstm models for time series prediction using enhanced grey wolf optimizer. &lt;u&gt","author":"Xie H.","year":"2020","unstructured":"Xie , H. , Zhang , L. , and Lim , C. P . Evolving cnn-lstm models for time series prediction using enhanced grey wolf optimizer. &lt;u&gt ; IEEE Access 8&lt;\/u&gt; ( 2020 ), 161519--161541. Xie, H., Zhang, L., and Lim, C. P. Evolving cnn-lstm models for time series prediction using enhanced grey wolf optimizer. 
&lt;u&gt;IEEE Access 8&lt;\/u&gt; (2020), 161519--161541."},{"key":"e_1_3_2_1_52_1","first-page":"5776","volume-title":"High performance zero-memory overhead direct convolutions. In &lt;u&gt;Proceedings of ICML&lt;\/u&gt","author":"Zhang J.","year":"2018","unstructured":"Zhang , J. , Franchetti , F. , and Low , T. M . High performance zero-memory overhead direct convolutions. In &lt;u&gt;Proceedings of ICML&lt;\/u&gt ; ( 2018 ), pp. 5776 -- 5785 . Zhang, J., Franchetti, F., and Low, T. M. High performance zero-memory overhead direct convolutions. In &lt;u&gt;Proceedings of ICML&lt;\/u&gt; (2018), pp. 5776--5785."},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"crossref","unstructured":"Zhang L. L. Han S. Wei J. Zheng N. Cao T. Yang Y. and Liu Y. Nn-meter: Towards accurate latency prediction of deep-learning model inference on diverse edge devices. In &lt;u&gt;Proceedings of the 19th Annual International Conference on Mobile Systems Applications and Services&lt;\/u&gt; (2021) pp. 81--93.  Zhang L. L. Han S. Wei J. Zheng N. Cao T. Yang Y. and Liu Y. Nn-meter: Towards accurate latency prediction of deep-learning model inference on diverse edge devices. In &lt;u&gt;Proceedings of the 19th Annual International Conference on Mobile Systems Applications and Services&lt;\/u&gt; (2021) pp. 81--93.","DOI":"10.1145\/3458864.3467882"},{"key":"e_1_3_2_1_54_1","first-page":"6795","volume-title":"High performance depthwise and pointwise convolutions on mobile devices. In &lt;u&gt;Proceedings of the AAAI Conference on Artificial Intelligence&lt;\/u&gt","author":"Zhang P.","year":"2020","unstructured":"Zhang , P. , Lo , E. , and Lu , B . High performance depthwise and pointwise convolutions on mobile devices. In &lt;u&gt;Proceedings of the AAAI Conference on Artificial Intelligence&lt;\/u&gt ; ( 2020 ), vol. 34 , pp. 6795 -- 6802 . Zhang, P., Lo, E., and Lu, B. High performance depthwise and pointwise convolutions on mobile devices. 
In &lt;u&gt;Proceedings of the AAAI Conference on Artificial Intelligence&lt;\/u&gt; (2020), vol. 34, pp. 6795--6802."}],"event":{"name":"MobiSys '22: The 20th Annual International Conference on Mobile Systems, Applications and Services","sponsor":["SIGMOBILE ACM Special Interest Group on Mobility of Systems, Users, Data and Computing","SIGOPS ACM Special Interest Group on Operating Systems"],"location":"Portland Oregon","acronym":"MobiSys '22"},"container-title":["Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3498361.3538940","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3498361.3538940","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:10:04Z","timestamp":1750183804000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3498361.3538940"}},"subtitle":["low-latency convolution with minimal memory overhead optimized for mobile devices"],"short-title":[],"issued":{"date-parts":[[2022,6,27]]},"references-count":54,"alternative-id":["10.1145\/3498361.3538940","10.1145\/3498361"],"URL":"https:\/\/doi.org\/10.1145\/3498361.3538940","relation":{},"subject":[],"published":{"date-parts":[[2022,6,27]]},"assertion":[{"value":"2022-06-27","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}