{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,24]],"date-time":"2026-04-24T02:03:23Z","timestamp":1776996203517,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":40,"publisher":"ACM","license":[{"start":{"date-parts":[[2019,8,5]],"date-time":"2019-08-05T00:00:00Z","timestamp":1564963200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2019,8,5]]},"DOI":"10.1145\/3337821.3337839","type":"proceedings-article","created":{"date-parts":[[2019,7,25]],"date-time":"2019-07-25T12:34:36Z","timestamp":1564058076000},"page":"1-10","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":30,"title":["A Unified Optimization Approach for CNN Model Inference on Integrated GPUs"],"prefix":"10.1145","author":[{"given":"Leyuan","family":"Wang","sequence":"first","affiliation":[{"name":"Amazon Web Services, East Palo Alto, CA, USA"}]},{"given":"Zhi","family":"Chen","sequence":"additional","affiliation":[{"name":"Amazon Web Services, East Palo Alto, CA, USA"}]},{"given":"Yizhi","family":"Liu","sequence":"additional","affiliation":[{"name":"Amazon Web Services, East Palo Alto, CA, USA"}]},{"given":"Yao","family":"Wang","sequence":"additional","affiliation":[{"name":"Amazon Web Services East Palo Alto, CA, USA"}]},{"given":"Lianmin","family":"Zheng","sequence":"additional","affiliation":[{"name":"Shanghai Jiaotong University, Shanghai, China"}]},{"given":"Mu","family":"Li","sequence":"additional","affiliation":[{"name":"Amazon Web Services, East Palo Alto, CA, USA"}]},{"given":"Yida","family":"Wang","sequence":"additional","affiliation":[{"name":"Amazon Web Services, East Palo Alto, CA, USA"}]}],"member":"320","published-online":{"date-parts":[[2019,8,5]]},"reference":[{"key":"e_1_3_2_1_1_1","first-page":"635","volume-title":"Int. Symp. on High-Performance Computer Architecture","author":"Anglada Marti","year":"2018","unstructured":"Marti Anglada , Enrique de Lucas , Joan-Manuel Parcerisa , Juan L. Arag\u00f3n , and Antonio Gonz\u00e1lez . Early visibility resolution for removing ineffectual computations in the graphics pipeline. In 25th . Int. Symp. on High-Performance Computer Architecture , pages 635 -- 646 , 2018 . Marti Anglada, Enrique de Lucas, Joan-Manuel Parcerisa, Juan L. Arag\u00f3n, and Antonio Gonz\u00e1lez. Early visibility resolution for removing ineffectual computations in the graphics pipeline. In 25th. Int. Symp. on High-Performance Computer Architecture, pages 635--646, 2018."},{"key":"e_1_3_2_1_2_1","volume-title":"https:\/\/www.arm.com\/why-arm\/technologies\/compute-library. {Online","author":"ARM COMPUTE","year":"2019","unstructured":"ARM COMPUTE LIBRARY. https:\/\/www.arm.com\/why-arm\/technologies\/compute-library. {Online ; accessed 13- Mar- 2019 }. ARM COMPUTE LIBRARY. https:\/\/www.arm.com\/why-arm\/technologies\/compute-library. {Online; accessed 13-Mar-2019}."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2017.2752706"},{"key":"e_1_3_2_1_4_1","unstructured":"Sean Baxter. Moderngpu: Patterns and behaviors for GPU computing. http:\/\/moderngpu.github.io\/moderngpu 2013--2016.  Sean Baxter. Moderngpu: Patterns and behaviors for GPU computing. http:\/\/moderngpu.github.io\/moderngpu 2013--2016."},{"key":"e_1_3_2_1_5_1","first-page":"578","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen , Thierry Moreau , Ziheng Jiang , Lianmin Zheng , Eddie Yan , Haichen Shen , Meghan Cowan , Leyuan Wang , Yuwei Hu , Luis Ceze , Carlos Guestrin , and Arvind Krishnamurthy . TVM : An automated end-to-end optimizing compiler for deep learning . In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) , pages 578 -- 594 , 2018 . Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578--594, 2018."},{"key":"e_1_3_2_1_6_1","first-page":"3389","volume-title":"Advances in Neural Information Processing Systems 31","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen , Lianmin Zheng , Eddie Yan , Ziheng Jiang , Thierry Moreau , Luis Ceze , Carlos Guestrin , and Arvind Krishnamurthy . Learning to optimize tensor programs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors , Advances in Neural Information Processing Systems 31 , pages 3389 -- 3400 . Curran Associates, Inc. , 2018 . Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3389--3400. Curran Associates, Inc., 2018."},{"key":"e_1_3_2_1_7_1","volume-title":"cudnn: Efficient primitives for deep learning. arXiv preprint arXiv.1410.0759","author":"Chetlur Sharan","year":"2014","unstructured":"Sharan Chetlur , Cliff Woolley , Philippe Vandermersch , Jonathan Cohen , John Tran , Bryan Catanzaro , and Evan Shelhamer . cudnn: Efficient primitives for deep learning. arXiv preprint arXiv.1410.0759 , 2014 . Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv.1410.0759, 2014."},{"key":"e_1_3_2_1_8_1","volume-title":"https:\/\/01.org\/cldnn. {Online","author":"Deep Neural Compute Library","year":"2019","unstructured":"Compute Library for Deep Neural Networks (clDNN). https:\/\/01.org\/cldnn. {Online ; accessed 11- Apr- 2019 }. Compute Library for Deep Neural Networks (clDNN). https:\/\/01.org\/cldnn. {Online; accessed 11-Apr-2019}."},{"key":"e_1_3_2_1_9_1","volume-title":"OpenVINO Toolkit Release Notes. https:\/\/software.intel.com\/en-us\/articles\/OpenVINO-RelNotes. {Online","author":"Deuermeyer Deanne","year":"2019","unstructured":"Deanne Deuermeyer and Andrey Z . OpenVINO Toolkit Release Notes. https:\/\/software.intel.com\/en-us\/articles\/OpenVINO-RelNotes. {Online ; accessed 3- Jan- 2019 }. Deanne Deuermeyer and Andrey Z. OpenVINO Toolkit Release Notes. https:\/\/software.intel.com\/en-us\/articles\/OpenVINO-RelNotes. {Online; accessed 3-Jan-2019}."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3276496"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.24"},{"key":"e_1_3_2_1_12_1","volume-title":"Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs\/1510.00149","author":"Han Song","year":"2015","unstructured":"Song Han , Huizi Mao , and William J. Dally . Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs\/1510.00149 , 2015 . Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs\/1510.00149, 2015."},{"key":"e_1_3_2_1_13_1","first-page":"851","volume-title":"Parallel prefix sum (scan) with CUDA","author":"Harris Mark","year":"2007","unstructured":"Mark Harris , Shubhabrata Sengupta , and John D. Owens . Parallel prefix sum (scan) with CUDA . In Hubert Nguyen, editor, GPU Gems 3, chapter 39, pages 851 -- 876 . Addison Wesley , August 2007 . Mark Harris, Shubhabrata Sengupta, and John D. Owens. Parallel prefix sum (scan) with CUDA. In Hubert Nguyen, editor, GPU Gems 3, chapter 39, pages 851--876. Addison Wesley, August 2007."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/7902.7903"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079079.3079105"},{"key":"e_1_3_2_1_17_1","volume-title":"Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv.1704.04861","author":"Howard Andrew G","year":"2017","unstructured":"Andrew G Howard , Menglong Zhu , Bo Chen , Dmitry Kalenichenko , Weijun Wang , Tobias Weyand , Marco Andreetto , and Hartwig Adam . Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv.1704.04861 , 2017 . Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv.1704.04861, 2017."},{"key":"e_1_3_2_1_18_1","volume-title":"Document Revision 26)","author":"Howes Lee","year":"2014","unstructured":"Lee Howes and Aaftab Munshi . The OpenCL Specification (Version 2.0 , Document Revision 26) , October 2014 . http:\/\/www.khronos.org\/registry\/cl\/specs\/opencl-2.0.pdf. Lee Howes and Aaftab Munshi. The OpenCL Specification (Version 2.0, Document Revision 26), October 2014. http:\/\/www.khronos.org\/registry\/cl\/specs\/opencl-2.0.pdf."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3081333.3081360"},{"key":"e_1_3_2_1_20_1","volume-title":"Squeezenet: Alexnet-level accuracy with 50x fewer parameters and &lt","author":"Iandola Forrest N","year":"2016","unstructured":"Forrest N Iandola , Song Han , Matthew W Moskewicz , Khalid Ashraf , William J Dally , and Kurt Keutzer . Squeezenet: Alexnet-level accuracy with 50x fewer parameters and &lt ; 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016 . Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and &lt; 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016."},{"key":"e_1_3_2_1_21_1","volume-title":"A performance comparison of cuda and opencl. arXiv preprint arXiv:1005.2581","author":"Karimi Kamran","year":"2010","unstructured":"Kamran Karimi , Neil G Dickson , and Firas Hamze . A performance comparison of cuda and opencl. arXiv preprint arXiv:1005.2581 , 2010 . Kamran Karimi, Neil G Dickson, and Firas Hamze. A performance comparison of cuda and opencl. arXiv preprint arXiv:1005.2581, 2010."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.62"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/2959355.2959378"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2018.00059"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"e_1_3_2_1_26_1","volume-title":"2019 USENIX Annual Technical Conference (USENIX ATC 19)","author":"Liu Yizhi","year":"2019","unstructured":"Yizhi Liu , Yao Wang , Ruofei Yu , Mu Li , Vin Sharma , and Yida Wang . Optimizing CNN model inference on CPUs . In 2019 USENIX Annual Technical Conference (USENIX ATC 19) , Renton, WA , 2019 . USENIX Association. Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang. Optimizing CNN model inference on CPUs. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), Renton, WA, 2019. USENIX Association."},{"key":"e_1_3_2_1_27_1","volume-title":"Embedded binarized neural networks. arXiv preprint arXiv:1709.02260","author":"McDanel Bradley","year":"2017","unstructured":"Bradley McDanel , Surat Teerapittayanon , and HT Kung . Embedded binarized neural networks. arXiv preprint arXiv:1709.02260 , 2017 . Bradley McDanel, Surat Teerapittayanon, and HT Kung. Embedded binarized neural networks. arXiv preprint arXiv:1709.02260, 2017."},{"key":"e_1_3_2_1_28_1","volume-title":"Cub: Flexible library of cooperative threadblock primitives and other utilities for CUDA kernel programming. https:\/\/nvlabs.github.io\/cub\/","author":"Merrill Duane","year":"2013","unstructured":"Duane Merrill . Cub: Flexible library of cooperative threadblock primitives and other utilities for CUDA kernel programming. https:\/\/nvlabs.github.io\/cub\/ , 2013 --2016. Duane Merrill. Cub: Flexible library of cooperative threadblock primitives and other utilities for CUDA kernel programming. https:\/\/nvlabs.github.io\/cub\/, 2013--2016."},{"key":"e_1_3_2_1_29_1","volume-title":"August","author":"NVIDIA Corporation","year":"2014","unstructured":"NVIDIA Corporation . NVIDIA CUDA C programming guide. PG-02829-001_v6.5 , August 2014 . NVIDIA Corporation. NVIDIA CUDA C programming guide. PG-02829-001_v6.5, August 2014."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2008.917757"},{"key":"e_1_3_2_1_31_1","first-page":"833","volume-title":"Automation Test in Europe Conference Exhibition (DATE)","author":"Preu\u00dfer T. B.","year":"2018","unstructured":"T. B. Preu\u00dfer , G. Gambardella , N. Fraser , and M. Blott . Inference of quantized neural networks on heterogeneous all-programmable devices. In 2018 Design , Automation Test in Europe Conference Exhibition (DATE) , pages 833 -- 838 , 2018 . T. B. Preu\u00dfer, G. Gambardella, N. Fraser, and M. Blott. Inference of quantized neural networks on heterogeneous all-programmable devices. In 2018 Design, Automation Test in Europe Conference Exhibition (DATE), pages 833--838, 2018."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3107953"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2499370.2462176"},{"key":"e_1_3_2_1_34_1","volume-title":"Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767","author":"Redmon Joseph","year":"2018","unstructured":"Joseph Redmon and Ali Farhadi . Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 , 2018 . Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018."},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/1280094.1280110"},{"key":"e_1_3_2_1_36_1","volume-title":"Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626","author":"Tan Mingxing","year":"2018","unstructured":"Mingxing Tan , Bo Chen , Ruoming Pang , Vijay Vasudevan , and Quoc V Le . Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626 , 2018 . Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018."},{"key":"e_1_3_2_1_37_1","volume-title":"https:\/\/developer.nvidia.com\/tensorrt. {Online","author":"NVIDIA","year":"2019","unstructured":"NVIDIA TensorRT. https:\/\/developer.nvidia.com\/tensorrt. {Online ; accessed 11- Apr- 2019 }. NVIDIA TensorRT. https:\/\/developer.nvidia.com\/tensorrt. {Online; accessed 11-Apr-2019}."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.3867"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2019.00048"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00716"}],"event":{"name":"ICPP 2019: 48th International Conference on Parallel Processing","location":"Kyoto Japan","acronym":"ICPP 2019","sponsor":["University of Tsukuba University of Tsukuba"]},"container-title":["Proceedings of the 48th International Conference on Parallel Processing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3337821.3337839","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3337821.3337839","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:54:25Z","timestamp":1750204465000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3337821.3337839"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,8,5]]},"references-count":40,"alternative-id":["10.1145\/3337821.3337839","10.1145\/3337821"],"URL":"https:\/\/doi.org\/10.1145\/3337821.3337839","relation":{},"subject":[],"published":{"date-parts":[[2019,8,5]]},"assertion":[{"value":"2019-08-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}