{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,31]],"date-time":"2026-01-31T10:45:03Z","timestamp":1769856303879,"version":"3.49.0"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2016,3,28]],"date-time":"2016-03-28T00:00:00Z","timestamp":1459123200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100000923","name":"Australian Research Council","doi-asserted-by":"crossref","award":["DP110104628"],"award-info":[{"award-number":["DP110104628"]}],"id":[{"id":"10.13039\/501100000923","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2016,4,5]]},"abstract":"<jats:p>\n            Existing vectorization techniques are ineffective for loops that exhibit little loop-level parallelism but some limited superword-level parallelism (SLP). We show that effectively vectorizing such loops requires partial vector operations to be executed correctly and efficiently, where the degree of partial SIMD parallelism is smaller than the SIMD datapath width. We present a simple yet effective SLP compiler technique called P\n            <jats:sc>aver<\/jats:sc>\n            (PArtial VEctorizeR), formulated and implemented in LLVM as a generalization of the traditional SLP algorithm, to optimize such partially vectorizable loops. The key idea is to maximize SIMD utilization by widening vector instructions used while minimizing the overheads caused by memory access, packing\/unpacking, and\/or masking operations, without introducing new memory errors or new numeric exceptions. For a set of 9 C\/C++\/Fortran applications with partial SIMD parallelism, P\n            <jats:sc>aver<\/jats:sc>\n            achieves significantly better kernel and whole-program speedups than LLVM on both Intel\u2019s AVX and ARM\u2019s NEON.\n          <\/jats:p>","DOI":"10.1145\/2886101","type":"journal-article","created":{"date-parts":[[2016,3,28]],"date-time":"2016-03-28T12:53:25Z","timestamp":1459169605000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":38,"title":["A Compiler Approach for Exploiting Partial SIMD Parallelism"],"prefix":"10.1145","volume":"13","author":[{"given":"Hao","family":"Zhou","sequence":"first","affiliation":[{"name":"UNSW Australia\/NUDT, China"}]},{"given":"Jingling","family":"Xue","sequence":"additional","affiliation":[{"name":"UNSW Australia, NSW, Australia"}]}],"member":"320","published-online":{"date-parts":[[2016,3,28]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"N-Body Simulation. Retrieved","author":"Aarseth Sverre","year":"2016","unstructured":"Sverre Aarseth . 2015. N-Body Simulation. Retrieved February 9, 2016 , from http:\/\/www.ast.cam.ac.uk\/research\/nbody. Sverre Aarseth. 2015. N-Body Simulation. Retrieved February 9, 2016, from http:\/\/www.ast.cam.ac.uk\/research\/nbody."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.38"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.5555\/586554.586555"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/305138.305231"},{"key":"e_1_2_1_5_1","volume-title":"Implemented cost model for masked load\/store operations. Retrieved","author":"Demikhovsky Elena","year":"2016","unstructured":"Elena Demikhovsky . 2015. Implemented cost model for masked load\/store operations. Retrieved February 9, 2016 , from http:\/\/lists.llvm.org\/pipermail\/llvm-commits\/Week-of-Mon-20150119\/254753.html Elena Demikhovsky. 2015. Implemented cost model for masked load\/store operations. Retrieved February 9, 2016, from http:\/\/lists.llvm.org\/pipermail\/llvm-commits\/Week-of-Mon-20150119\/254753.html"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/996841.996853"},{"key":"e_1_2_1_7_1","volume-title":"Instruction Tables: Lists of Instruction Latencies, Throughputs and Micro-Operation Breakdowns for Intel, AMD and VIA CPUs. Retrieved","author":"Fog Agner","year":"2014","unstructured":"Agner Fog . 2014 . Instruction Tables: Lists of Instruction Latencies, Throughputs and Micro-Operation Breakdowns for Intel, AMD and VIA CPUs. Retrieved February 9, 2016, from http:\/\/www.agner.org\/optimize\/instruction_tables.pdf. Agner Fog. 2014. Instruction Tables: Lists of Instruction Latencies, Throughputs and Micro-Operation Breakdowns for Intel, AMD and VIA CPUs. Retrieved February 9, 2016, from http:\/\/www.agner.org\/optimize\/instruction_tables.pdf."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/2523721.2523767"},{"key":"e_1_2_1_9_1","volume-title":"Utilizing Full Vectors and Use of Option -Qopt-Assume-Safe-Padding. Retrieved","author":"Green Ronald W.","year":"2016","unstructured":"Ronald W. Green . 2012. Utilizing Full Vectors and Use of Option -Qopt-Assume-Safe-Padding. Retrieved February 9, 2016 , from https:\/\/software.intel.com\/en-us\/articles\/utilizing-full-vectors. Ronald W. Green. 2012. Utilizing Full Vectors and Use of Option -Qopt-Assume-Safe-Padding. Retrieved February 9, 2016, from https:\/\/software.intel.com\/en-us\/articles\/utilizing-full-vectors."},{"key":"e_1_2_1_10_1","volume-title":"Proceedings of the 2003 International Conference on Parallel Processing. 615--624","author":"Huang Q.","unstructured":"Q. Huang , J. Xue , and X. Vera . 2003. Code tiling for improving the cache performance of PDE solvers . In Proceedings of the 2003 International Conference on Parallel Processing. 615--624 . Q. Huang, J. Xue, and X. Vera. 2003. Code tiling for improving the cache performance of PDE solvers. In Proceedings of the 2003 International Conference on Parallel Processing. 615--624."},{"key":"e_1_2_1_11_1","first-page":"248966","article-title":"Intel\u00ae 64 and IA-32 Architectures Optimization Reference Manual","year":"2014","unstructured":"Intel. 2014 . Intel\u00ae 64 and IA-32 Architectures Optimization Reference Manual . Number 248966 - 248030 . Retrieved February 9, 2016, from http:\/\/www.intel.com\/content\/www\/us\/en\/architecture-and-technology\/64-ia-32-architectures-optimization-manual.html. Intel. 2014. Intel\u00ae 64 and IA-32 Architectures Optimization Reference Manual. Number 248966-030. Retrieved February 9, 2016, from http:\/\/www.intel.com\/content\/www\/us\/en\/architecture-and-technology\/64-ia-32-architectures-optimization-manual.html.","journal-title":"Number"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.5555\/2523721.2523770"},{"key":"e_1_2_1_13_1","volume-title":"Automatic SIMD Vectorization of SSA-Based Control Flow Graphs","author":"Karrenberg Ralf","unstructured":"Ralf Karrenberg . 2015. Automatic SIMD Vectorization of SSA-Based Control Flow Graphs . Springer Vieweg . Ralf Karrenberg. 2015. Automatic SIMD Vectorization of SSA-Based Control Flow Graphs. Springer Vieweg."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/2190025.2190061"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2145816.2145824"},{"key":"e_1_2_1_16_1","volume-title":"Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201913)","author":"Kong M.","unstructured":"M. Kong , R. Veras , K. Stock , F. Franchetti , L.-N. Pouchet , and P. Sadayappan . 2013. When polyhedral transformations meet SIMD code generation . In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201913) . ACM, New York, NY, 127--138. M. Kong, R. Veras, K. Stock, F. Franchetti, L.-N. Pouchet, and P. Sadayappan. 2013. When polyhedral transformations meet SIMD code generation. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI\u201913). ACM, New York, NY, 127--138."},{"key":"e_1_2_1_17_1","volume-title":"Ueberhuber","author":"Kral Stefan","year":"2003","unstructured":"Stefan Kral , Franz Franchetti , Juergen Lorenz , and Christoph W . Ueberhuber . 2003 . SIMD vectorization of straight line FFT code. In Euro-Par 2003 Parallel Processing. Lecture Notes in Computer Science, Vol. 2790 . Springer , 251--260. Stefan Kral, Franz Franchetti, Juergen Lorenz, and Christoph W. Ueberhuber. 2003. SIMD vectorization of straight line FFT code. In Euro-Par 2003 Parallel Processing. Lecture Notes in Computer Science, Vol. 2790. Springer, 251--260."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/349299.349320"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the 11th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201902)","author":"Larsen Samuel","unstructured":"Samuel Larsen , Emmett Witchel , and Saman P. Amarasinghe . 2002. Increasing and detecting memory address congruence . In Proceedings of the 11th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201902) . IEEE, Los Alamitos, CA, 18--29. Samuel Larsen, Emmett Witchel, and Saman P. Amarasinghe. 2002. Increasing and detecting memory address congruence. In Proceedings of the 11th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201902). IEEE, Los Alamitos, CA, 18--29."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2254064.2254106"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT\u201911)","author":"Maleki Saeed","unstructured":"Saeed Maleki , Yaoqing Gao , Maria J. Garzar\u00e1n , Tommy Wong , and David A. Padua . 2011. An evaluation of vectorizing compilers . In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT\u201911) . IEEE, Los Alamitos, CA, 372--382. Saeed Maleki, Yaoqing Gao, Maria J. Garzar\u00e1n, Tommy Wong, and David A. Padua. 2011. An evaluation of vectorizing compilers. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques (PACT\u201911). IEEE, Los Alamitos, CA, 372--382."},{"key":"e_1_2_1_22_1","unstructured":"Mantevo. 2015. The Mantevo Benchmark Suite. Available at http:\/\/mantevo.org.  Mantevo. 2015. The Mantevo Benchmark Suite. Available at http:\/\/mantevo.org."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1995896.1995938"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/2190025.2190062"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1133981.1133997"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/2150976.2151014"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the 13th Annual IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201915)","author":"Porpodas Vasileios","unstructured":"Vasileios Porpodas , Alberto Magni , and Timothy M. Jones . 2015. PSLP: Padded SLP automatic vectorization . In Proceedings of the 13th Annual IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201915) . IEEE, Los Alamitos, CA, 190--201. Vasileios Porpodas, Alberto Magni, and Timothy M. Jones. 2015. PSLP: Padded SLP automatic vectorization. In Proceedings of the 13th Annual IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201915). IEEE, Los Alamitos, CA, 190--201."},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the 2013 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201913)","author":"Ren Bin","unstructured":"Bin Ren , Tomi Poutanen , Todd Mytkowicz , Wolfram Schulte , Gagan Agrawal , and James R. Larus . 2013. SIMD parallelization of applications that traverse irregular data structures . In Proceedings of the 2013 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201913) . IEEE, Los Alamitos, CA, 1--10. Bin Ren, Tomi Poutanen, Todd Mytkowicz, Wolfram Schulte, Gagan Agrawal, and James R. Larus. 2013. SIMD parallelization of applications that traverse irregular data structures. In Proceedings of the 2013 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201913). IEEE, Los Alamitos, CA, 1--10."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/1133981.1133996"},{"key":"e_1_2_1_30_1","volume-title":"Proceedings of GCC Developers\u2019 Summit (GCC Developers\u2019 Summit\u201907)","author":"Rosen Ira","year":"2007","unstructured":"Ira Rosen , Dorit Nuzman , and Ayal Zaks . 2007 . Loop-aware SLP in GCC . In Proceedings of GCC Developers\u2019 Summit (GCC Developers\u2019 Summit\u201907) . 131--142. Ira Rosen, Dorit Nuzman, and Ayal Zaks. 2007. Loop-aware SLP in GCC. In Proceedings of GCC Developers\u2019 Summit (GCC Developers\u2019 Summit\u201907). 131--142."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.5555\/1299042.1299055"},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques (PACT\u201902)","author":"Shin Jaewook","unstructured":"Jaewook Shin , Jacqueline Chame , and Mary W. Hall . 2002. Compiler-controlled caching in superword register files for multimedia extension architectures . In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques (PACT\u201902) . IEEE, Los Alamitos, CA, 45--55. Jaewook Shin, Jacqueline Chame, and Mary W. Hall. 2002. Compiler-controlled caching in superword register files for multimedia extension architectures. In Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques (PACT\u201902). IEEE, Los Alamitos, CA, 45--55."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2005.33"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1007559022013"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.5555\/2523721.2523769"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2009.18"},{"key":"e_1_2_1_37_1","volume-title":"C-Ray Raytracing Benchmark Results. Retrieved","author":"Tsiombikas John","year":"2016","unstructured":"John Tsiombikas . 2015. C-Ray Raytracing Benchmark Results. Retrieved February 9, 2016 , from http:\/\/www.futuretech.blinkenlights.nl\/c-ray.html. John Tsiombikas. 2015. C-Ray Raytracing Benchmark Results. Retrieved February 9, 2016, from http:\/\/www.futuretech.blinkenlights.nl\/c-ray.html."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/76263.76337"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/1088149.1088172"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.5555\/353939"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10936-7_20"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2854038.2854054"},{"key":"e_1_2_1_43_1","volume-title":"Supercompilers for Parallel and Vector Computers","author":"Zima Hans","unstructured":"Hans Zima and Barbara Chapman . 1991. Supercompilers for Parallel and Vector Computers . ACM , New York, NY . Hans Zima and Barbara Chapman. 1991. Supercompilers for Parallel and Vector Computers. ACM, New York, NY."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2886101","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2886101","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:38:50Z","timestamp":1750221530000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2886101"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,3,28]]},"references-count":43,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2016,4,5]]}},"alternative-id":["10.1145\/2886101"],"URL":"https:\/\/doi.org\/10.1145\/2886101","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,3,28]]},"assertion":[{"value":"2015-08-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-03-28","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}