{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,10]],"date-time":"2026-01-10T01:57:33Z","timestamp":1768010253901,"version":"3.49.0"},"reference-count":32,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T00:00:00Z","timestamp":1767916800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T00:00:00Z","timestamp":1767916800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100005765","name":"Universidade de Lisboa","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100005765","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Cluster Comput"],"published-print":{"date-parts":[[2026,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>This work presents an open, efficient (fast) CUDA convolution neural network inference implementation specialized in some layers of popular nets like ResNet, VGG, and GoogLeNet. The proposed algorithm implements convolution directly instead of preprocessing with image to columns. Algorithm parameters are selected to meet constraints on global and shared memory access bandwidth, register usage, shared memory usage, and instructions per clock. Parallel arithmetic operations and memory access are achieved with several parallel blocks per streaming processor. Results comparing with state-of-the-art implementations are presented.<\/jats:p>","DOI":"10.1007\/s10586-025-05895-9","type":"journal-article","created":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T19:10:15Z","timestamp":1767985815000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Open CUDA convolution neural network inference implementation"],"prefix":"10.1007","volume":"29","author":[{"given":"Paulo","family":"Lopes","sequence":"first","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2026,1,9]]},"reference":[{"key":"5895_CR1","unstructured":"Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012)"},{"key":"5895_CR2","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770\u2013778 (2016)","DOI":"10.1109\/CVPR.2016.90"},{"key":"5895_CR3","unstructured":"Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)"},{"key":"5895_CR4","doi-asserted-by":"crossref","unstructured":"Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1\u20139 (2015)","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"5895_CR5","doi-asserted-by":"crossref","unstructured":"Kim, H., Nam, H., Jung, W., Lee, J.: Performance analysis of CNN frameworks for GPUs. In: 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 55\u201364 (2017). IEEE","DOI":"10.1109\/ISPASS.2017.7975270"},{"key":"5895_CR6","unstructured":"Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., Shelhamer, E.: cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)"},{"key":"5895_CR7","doi-asserted-by":"crossref","unstructured":"Shatnawi, A., Al-Bdour, G., Al-Qurran, R., Al-Ayyoub, M.: A comparative study of open source deep learning frameworks. In: 2018 9th International Conference on Information and Communication Systems (ICICS), pp. 72\u201377 (2018)","DOI":"10.1109\/IACS.2018.8355444"},{"key":"5895_CR8","unstructured":"NVIDIA: NVIDIA CUDA Deep Neural Network library (2024). https:\/\/developer.nvidia.com\/cudnn Accessed 2024-07-19"},{"key":"5895_CR9","unstructured":"NVIDIA: CUDA Templates for Linear Algebra Subroutines (2024). https:\/\/github.com\/NVIDIA\/cutlass Accessed 2024-07-19"},{"key":"5895_CR10","unstructured":"Bikshandi, G., Shah, J.: A case study in CUDA kernel fusion: Implementing flashattention-2 on NVIDIA Hopper architecture using the cuTLASS library. arXiv preprint arXiv:2312.11918 (2023)"},{"key":"5895_CR11","unstructured":"Lopes, P.A.C.: Open Efficient CUDA Convolution Neural Network Inference Implementation Source Code (2024). https:\/\/github.com\/paclopes\/cuDconv"},{"key":"5895_CR12","unstructured":"NVIDIA: NVIDIA CUDA Basic Linear Algebra Subroutine library (2024). https:\/\/developer.nvidia.com\/cublas Accessed 2024-07-19"},{"key":"5895_CR13","unstructured":"MAGMA: Matrix Algebra on GPU and Multi-core Architectures (MAGMA) (2024). https:\/\/icl.utk.edu\/magma\/ Accessed 2024-07-19"},{"issue":"5\u20136","key":"5895_CR14","doi-asserted-by":"publisher","first-page":"232","DOI":"10.1016\/j.parco.2009.12.005","volume":"36","author":"S Tomov","year":"2010","unstructured":"Tomov, S., Dongarra, J., Baboulin, M.: Towards dense linear algebra for hybrid GPU accelerated manycore systems. Parallel Computing 36(5\u20136), 232\u2013240 (2010). https:\/\/doi.org\/10.1016\/j.parco.2009.12.005","journal-title":"Parallel Computing"},{"key":"5895_CR15","doi-asserted-by":"crossref","unstructured":"Fatahalian, K., Sugerman, J., Hanrahan, P.: Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In: Proceedings of the ACM SIGGRAPH\/EUROGRAPHICS Conference on Graphics Hardware, pp. 133\u2013137 (2004)","DOI":"10.1145\/1058129.1058148"},{"key":"5895_CR16","doi-asserted-by":"crossref","unstructured":"Volkov, V., Demmel, J.W.: Benchmarking GPUs to tune dense linear algebra. In: SC\u201908: Proceedings of the 2008 ACM\/IEEE Conference on Supercomputing, pp. 1\u201311 (2008). IEEE","DOI":"10.1109\/SC.2008.5214359"},{"issue":"11","key":"5895_CR17","doi-asserted-by":"publisher","first-page":"2045","DOI":"10.1109\/TPDS.2011.311","volume":"23","author":"J Kurzak","year":"2012","unstructured":"Kurzak, J., Tomov, S., Dongarra, J.: Autotuning GEMM kernels for the Fermi GPU. IEEE Transactions on Parallel and Distributed Systems 23(11), 2045\u20132057 (2012)","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"issue":"2","key":"5895_CR18","doi-asserted-by":"publisher","first-page":"1459","DOI":"10.1007\/s10586-021-03494-y","volume":"25","author":"M Jord\u00e0","year":"2022","unstructured":"Jord\u00e0, M., Valero-Lara, P., Pe\u00f1a, A.J.: cuConv: CUDA implementation of convolution for CNN inference. Cluster Computing 25(2), 1459\u20131473 (2022)","journal-title":"Cluster Computing"},{"key":"5895_CR19","doi-asserted-by":"publisher","first-page":"495","DOI":"10.1016\/j.procs.2017.05.138","volume":"108","author":"J Dongarra","year":"2017","unstructured":"Dongarra, J., Hammarling, S., Higham, N.J., Relton, S.D., Valero-Lara, P., Zounon, M.: The design and performance of batched BLAS on modern high-performance computing systems. Procedia Computer Science 108, 495\u2013504 (2017)","journal-title":"Procedia Computer Science"},{"key":"5895_CR20","doi-asserted-by":"crossref","unstructured":"Brown, C., Abdelfattah, A., Tomov, S., Dongarra, J.: Design, optimization, and benchmarking of dense linear algebra algorithms on AMD GPUs. In: 2020 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1\u20137 (2020). IEEE","DOI":"10.1109\/HPEC43674.2020.9286214"},{"issue":"4","key":"5895_CR21","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3412380","volume":"46","author":"B Barabasz","year":"2020","unstructured":"Barabasz, B., Anderson, A., Soodhalter, K.M., Gregg, D.: Error analysis and improving the accuracy of Winograd convolution for deep neural networks. ACM Transactions on Mathematical Software 46(4), 1\u201333 (2020)","journal-title":"ACM Transactions on Mathematical Software"},{"key":"5895_CR22","doi-asserted-by":"crossref","unstructured":"Yan, D., Wang, W., Chu, X.: Optimizing batched Winograd convolution on GPUs. In: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 32\u201344 (2020)","DOI":"10.1145\/3332466.3374520"},{"key":"5895_CR23","doi-asserted-by":"crossref","unstructured":"Liu, J., Yang, D., Lai, J.: Optimizing Winograd-based convolution with tensor cores. In: Proceedings of the 50th International Conference on Parallel Processing, pp. 1\u201310 (2021)","DOI":"10.1145\/3472456.3472473"},{"key":"5895_CR24","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Zhang, P., Xu, Z., Yan, B., Wang, Q.: Im2col-Winograd: An efficient and flexible fused-Winograd convolution for nhwc format on GPUs. In: Proceedings of the 53rd International Conference on Parallel Processing, pp. 1072\u20131081 (2024)","DOI":"10.1145\/3673038.3673039"},{"issue":"7930","key":"5895_CR25","doi-asserted-by":"publisher","first-page":"47","DOI":"10.1038\/s41586-022-05172-4","volume":"610","author":"A Fawzi","year":"2022","unstructured":"Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., Ruiz, F.J., Schrittwieser, J., Swirszcz, G.: Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610(7930), 47\u201353 (2022)","journal-title":"Nature"},{"key":"5895_CR26","doi-asserted-by":"crossref","unstructured":"Andri, R., Bussolino, B., Cipolletta, A., Cavigelli, L., Wang, Z.: Going further with Winograd convolutions: Tap-wise quantization for efficient inference on 4x4 tiles. In: 2022 55th IEEE\/ACM International Symposium on Microarchitecture (MICRO), pp. 582\u2013598 (2022). IEEE","DOI":"10.1109\/MICRO56248.2022.00048"},{"key":"5895_CR27","doi-asserted-by":"crossref","unstructured":"Chen, Y.-T., Ou, Y.-F., Huang, C.-T.: A Winograd-based highly-parallel convolution engine for 8-bit CNN acceleration. In: 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), pp. 395\u2013398 (2022). IEEE","DOI":"10.1109\/AICAS54282.2022.9869911"},{"issue":"1","key":"5895_CR28","doi-asserted-by":"publisher","first-page":"72","DOI":"10.1109\/TPDS.2016.2549523","volume":"28","author":"X Mei","year":"2016","unstructured":"Mei, X., Chu, X.: Dissecting GPU memory hierarchy through microbenchmarking. IEEE Trans. Parallel Distrib. Syst. 28(1), 72\u201386 (2016)","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"key":"5895_CR29","unstructured":"Jia, Z., Maggioni, M., Staiger, B., Scarpazza, D.P.: Dissecting the NVIDIA Volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826 (2018)"},{"key":"5895_CR30","doi-asserted-by":"crossref","unstructured":"Raihan, M.A., Goli, N., Aamodt, T.M.: Modeling deep learning accelerator enabled GPUs. In: 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 79\u201392 (2019). IEEE","DOI":"10.1109\/ISPASS.2019.00016"},{"key":"5895_CR31","doi-asserted-by":"crossref","unstructured":"Abdelkhalik, H., Arafa, Y., Santhi, N., Badawy, A.-H.A.: Demystifying the NVIDIA Ampere architecture through microbenchmarking and instruction-level analysis. In: 2022 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1\u20138 (2022). IEEE","DOI":"10.1109\/HPEC55821.2022.9926299"},{"key":"5895_CR32","unstructured":"NVIDIA: Cuda C programming guide (2024). https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/ Accessed 2024-07-19"}],"container-title":["Cluster Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10586-025-05895-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10586-025-05895-9","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10586-025-05895-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T19:10:18Z","timestamp":1767985818000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10586-025-05895-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1,9]]},"references-count":32,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,4]]}},"alternative-id":["5895"],"URL":"https:\/\/doi.org\/10.1007\/s10586-025-05895-9","relation":{},"ISSN":["1386-7857","1573-7543"],"issn-type":[{"value":"1386-7857","type":"print"},{"value":"1573-7543","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1,9]]},"assertion":[{"value":"10 March 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"26 September 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 December 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 January 2026","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors have no relevant financial or non-financial interests to disclose.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflicts of Interest"}},{"value":"The authors declare no competing interests.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"105"}}