{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:32:33Z","timestamp":1750221153140,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":21,"publisher":"ACM","license":[{"start":{"date-parts":[[2018,1,23]],"date-time":"2018-01-23T00:00:00Z","timestamp":1516665600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"European Regional Development Fund","award":["NORTE-01-0145-FEDER-000020"],"award-info":[{"award-number":["NORTE-01-0145-FEDER-000020"]}]},{"name":"Funda\u00e7\u00e3o para a Ci\u00eancia e a Tecnologia","award":["PD\/BD\/105804\/2014"],"award-info":[{"award-number":["PD\/BD\/105804\/2014"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2018,1,23]]},"DOI":"10.1145\/3183767.3183777","type":"proceedings-article","created":{"date-parts":[[2018,3,19]],"date-time":"2018-03-19T12:53:23Z","timestamp":1521464003000},"page":"32-38","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Impact of Vectorization Over 16-bit Data-Types on GPUs"],"prefix":"10.1145","author":[{"given":"Lu\u00eds","family":"Reis","sequence":"first","affiliation":[{"name":"University of Porto, Portugal, INESC-TEC, Portugal"}]},{"given":"Ricardo","family":"Nobre","sequence":"additional","affiliation":[{"name":"University of Porto, Portugal, INESC-TEC, Portugal"}]},{"given":"Jo\u00e3o M. P.","family":"Cardoso","sequence":"additional","affiliation":[{"name":"University of Porto, Portugal, INESC-TEC, Portugal"}]}],"member":"320","published-online":{"date-parts":[[2018,1,23]]},"reference":[{"volume-title":"IEEE Standard for Floating-Point Arithmetic","year":"2008","key":"e_1_3_2_1_1_1","unstructured":"2008. 
IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008 (Aug 2008), 1--70."},{"key":"e_1_3_2_1_2_1","unstructured":"2017. ROCm a New Era in Open GPU Computing. (2017). https:\/\/rocm.github.io\/"},{"key":"e_1_3_2_1_3_1","unstructured":"Advanced Micro Devices Inc. 2015. AMD OpenCL\u2122 Optimization Guide. http:\/\/developer.amd.com\/wordpress\/media\/2013\/12\/AMD_OpenCL_Programming_Optimization_Guide2.pdf."},{"key":"e_1_3_2_1_4_1","unstructured":"Advanced Micro Devices Inc. 2017. \"Vega\" Instruction Set Architecture - Reference Guide. http:\/\/developer.amd.com\/wordpress\/media\/2017\/08\/Vega_Shader_ISA_28July2017.pdf."},{"key":"e_1_3_2_1_5_1","unstructured":"AMD. 2017. Radeon's next-generation Vega architecture. (2017). https:\/\/radeon.com\/_downloads\/vega-whitepaper-11.6.17.pdf"},{"key":"e_1_3_2_1_6_1","unstructured":"Krishnaraj Bhat. 2017. clpeak: A tool which profiles OpenCL devices to find their peak capacities. (2017). https:\/\/rocm.github.io\/"},{"key":"e_1_3_2_1_7_1","unstructured":"Intel Corporation. 2015. The Compute Architecture of Intel Processor Graphics Gen9. (2015). 
https:\/\/software.intel.com\/sites\/default\/files\/managed\/c5\/9a\/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf"},{"key":"e_1_3_2_1_8_1","volume-title":"Accessed: November 8th","author":"NVIDIA Corporation","year":"2017","unstructured":"NVIDIA Corporation. 2017. Programming Guide:: CUDA Toolkit Documentation. http:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html. (22 Sept. 2017). v9.0.176, Accessed: November 8th, 2017."},{"key":"e_1_3_2_1_9_1","volume-title":"Low precision arithmetic for deep learning. CoRR abs\/1412.7024","author":"Courbariaux Matthieu","year":"2014","unstructured":"Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2014. Low precision arithmetic for deep learning. CoRR abs\/1412.7024 (2014). http:\/\/arxiv.org\/abs\/1412.7024"},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1964179.1964196"},{"key":"e_1_3_2_1_11_1","volume-title":"Proceedings of the 32nd International Conference on International Conference on Machine Learning -Volume 37 (ICML'15)","author":"Gupta Suyog","year":"2015","unstructured":"Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep Learning with Limited Numerical Precision. In Proceedings of the 32nd International Conference on International Conference on Machine Learning -Volume 37 (ICML'15). JMLR.org, 1737--1746. 
http:\/\/dl.acm.org\/citation.cfm?id=3045118.3045303"},{"key":"e_1_3_2_1_12_1","volume-title":"Accessed: November 11th","author":"Harris Mark","year":"2016","unstructured":"Mark Harris. 2016. Mixed-Precision Programming with CUDA 8 | Parallel Forall. https:\/\/devblogs.nvidia.com\/parallelforall\/mixed-precision-programming-cuda-8\/. (Dec. 2016). Accessed: November 11th, 2017."},{"volume-title":"2017 IEEE High Performance Extreme Computing Conference (HPEC). 1--7.","author":"Ho N. M.","key":"e_1_3_2_1_13_1","unstructured":"N. M. Ho and W. F. Wong. 2017. Exploiting half precision arithmetic in Nvidia GPUs. In 2017 IEEE High Performance Extreme Computing Conference (HPEC). 1--7."},{"key":"e_1_3_2_1_14_1","volume-title":"CUDA Pro Tip: Increase Performance with Vectorized Memory Access | Parallel Forall. https:\/\/devblogs.nvidia.com\/parallelforall\/cuda-pro-tip-increase-performance-with-vectorized-memory-access\/. (10","author":"Luitjens Justin","year":"2017","unstructured":"Justin Luitjens. 2017. 
CUDA Pro Tip: Increase Performance with Vectorized Memory Access | Parallel Forall. https:\/\/devblogs.nvidia.com\/parallelforall\/cuda-pro-tip-increase-performance-with-vectorized-memory-access\/. (10 May 2017). Accessed: November 11th, 2017."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628087"},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.68"},{"key":"e_1_3_2_1_17_1","volume-title":"Mixed Precision Training. CoRR abs\/1710.03740","author":"Micikevicius Paulius","year":"2017","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory F. Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2017. Mixed Precision Training. CoRR abs\/1710.03740 (2017). http:\/\/arxiv.org\/abs\/1710.03740"},{"key":"e_1_3_2_1_18_1","unstructured":"Christian Rau. 2017. half: IEEE 754-based half-precision floating point library. (2017). http:\/\/half.sourceforge.net\/"},{"key":"e_1_3_2_1_19_1","unstructured":"Louis-Noel Pouchet and Scott Grauer-Gray. 2012. PolyBench\/GPU: Implementation of PolyBench codes for GPU processing. (2012). http:\/\/web.cs.ucla.edu\/~pouchet\/software\/polybench\/GPU\/index.html"},{"key":"e_1_3_2_1_20_1","unstructured":"SiSoftware. 2017. FP16 GPGPU Image Processing Performance & Quality. (2017). 
http:\/\/www.sisoftware.eu\/2017\/04\/14\/fp16-gpgpu-image-processing-performance-quality\/"},{"key":"e_1_3_2_1_21_1","volume-title":"Accelerating Deep Convolutional Networks using low-precision and sparsity. CoRR abs\/1610.00324","author":"Venkatesh Ganesh","year":"2016","unstructured":"Ganesh Venkatesh, Eriko Nurvitadhi, and Debbie Marr. 2016. Accelerating Deep Convolutional Networks using low-precision and sparsity. CoRR abs\/1610.00324 (2016). http:\/\/arxiv.org\/abs\/1610.00324"}],"event":{"name":"PARMA-DITAM '18: 9th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and 7th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms","sponsor":["HiPEAC Network of Excellence"],"location":"Manchester United Kingdom","acronym":"PARMA-DITAM '18"},"container-title":["Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing 
Platforms"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3183767.3183777","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3183767.3183777","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:08:29Z","timestamp":1750208909000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3183767.3183777"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,1,23]]},"references-count":21,"alternative-id":["10.1145\/3183767.3183777","10.1145\/3183767"],"URL":"https:\/\/doi.org\/10.1145\/3183767.3183777","relation":{},"subject":[],"published":{"date-parts":[[2018,1,23]]},"assertion":[{"value":"2018-01-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}