{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,29]],"date-time":"2025-08-29T10:47:32Z","timestamp":1756464452352,"version":"3.41.0"},"reference-count":28,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2015,12,8]],"date-time":"2015-12-08T00:00:00Z","timestamp":1449532800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Irish Software Engineering Research Centre"},{"DOI":"10.13039\/501100001602","name":"Science Foundation Ireland","doi-asserted-by":"crossref","award":["12\/IA\/1381 and 10\/CE\/I1855"],"award-info":[{"award-number":["12\/IA\/1381 and 10\/CE\/I1855"]}],"id":[{"id":"10.13039\/501100001602","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2016,1,7]]},"abstract":"<jats:p>Automatically exploiting short vector instructions sets (SSE, AVX, NEON) is a critically important task for optimizing compilers. Vector instructions typically work best on data that is contiguous in memory, and operating on non-contiguous data requires additional work to gather and scatter the data. There are several varieties of non-contiguous access, including interleaved data access. An existing approach used by GCC generates extremely efficient code for loops with power-of-2 interleaving factors (strides). In this paper we propose a generalization of this approach that produces similar code for any compile-time constant interleaving factor. In addition, we propose several novel program transformations, which were made possible by our generalized representation of the problem. Experiments show that our approach achieves significant speedups for both power-of-2 and non--power-of-2 interleaving factors. Our vectorization approach results in mean speedups over scalar code of 1.77x on Intel SSE and 2.53x on Intel AVX2 in real-world benchmarking on a selection of BLAS Level 1 routines. On the same benchmark programs, GCC 5.0 achieves mean improvements of 1.43x on Intel SSE and 1.30x on Intel AVX2. In synthetic benchmarking on Intel SSE, our maximum improvement on data movement is over 4x for gathering operations and over 6x for scattering operations versus scalar code.<\/jats:p>","DOI":"10.1145\/2838735","type":"journal-article","created":{"date-parts":[[2015,12,10]],"date-time":"2015-12-10T14:22:10Z","timestamp":1449757330000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["Automatic Vectorization of Interleaved Data Revisited"],"prefix":"10.1145","volume":"12","author":[{"given":"Andrew","family":"Anderson","sequence":"first","affiliation":[{"name":"Lero, Trinity College Dublin, Dublin, Ireland"}]},{"given":"Avinash","family":"Malik","sequence":"additional","affiliation":[{"name":"University of Auckland, Auckland, New Zealand"}]},{"given":"David","family":"Gregg","sequence":"additional","affiliation":[{"name":"Lero, Trinity College Dublin, Dublin, Ireland"}]}],"member":"320","published-online":{"date-parts":[[2015,12,8]]},"reference":[{"key":"e_1_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/69558.75700"},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1016\/0096-0551(90)90006-B"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.38"},{"key":"e_1_2_2_4_1","doi-asserted-by":"crossref","unstructured":"Albert Cohen Sylvain Girbal and Olivier Temam. 2004. A polyhedral approach to ease the composition of program transformations. In Euro-Par. 292--303.  Albert Cohen Sylvain Girbal and Olivier Temam. 2004. A polyhedral approach to ease the composition of program transformations. In Euro-Par. 292--303.","DOI":"10.1007\/978-3-540-27866-5_38"},{"key":"e_1_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/996841.996853"},{"volume-title":"Compiler Construction","author":"Fireman Liza","key":"e_1_2_2_6_1","unstructured":"Liza Fireman , Erez Petrank , and Ayal Zaks . 2007. New algorithms for SIMD alignment . In Compiler Construction . Springer , 1--15. Liza Fireman, Erez Petrank, and Ayal Zaks. 2007. New algorithms for SIMD alignment. In Compiler Construction. Springer, 1--15."},{"key":"e_1_2_2_7_1","volume-title":"Proceedings of the 1993 IEEE Vehicle Navigation and Information Systems Conference. 7.","author":"Franchetti Franz","year":"1993","unstructured":"Franz Franchetti and Markus Puschel . 1993 . A SIMD vectorizing compiler for digital signal processing algorithms . In Proceedings of the 1993 IEEE Vehicle Navigation and Information Systems Conference. 7. Franz Franchetti and Markus Puschel. 1993. A SIMD vectorizing compiler for digital signal processing algorithms. In Proceedings of the 1993 IEEE Vehicle Navigation and Information Systems Conference. 7."},{"key":"e_1_2_2_8_1","doi-asserted-by":"crossref","unstructured":"Franz Franchetti and Markus P\u00fcschel. 2008. Generating SIMD vectorized permutations. In CC. 116--131.   Franz Franchetti and Markus P\u00fcschel. 2008. Generating SIMD vectorized permutations. In CC. 116--131.","DOI":"10.1007\/978-3-540-78791-4_8"},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/131080.131089"},{"key":"e_1_2_2_10_1","doi-asserted-by":"crossref","unstructured":"Ralf Karrenberg and Sebastian Hack. 2011. Whole-function vectorization. In CGO. 141--150.   Ralf Karrenberg and Sebastian Hack. 2011. Whole-function vectorization. In CGO. 141--150.","DOI":"10.1109\/CGO.2011.5764682"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1070891.1065931"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/349299.349320"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/355841.355847"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/40.372347"},{"key":"e_1_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2254064.2254106"},{"key":"e_1_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.68"},{"key":"e_1_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1133981.1133997"},{"volume-title":"How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures","author":"Paoloni Gabriele","key":"e_1_2_2_18_1","unstructured":"Gabriele Paoloni . 2010. How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures . Intel Corporation . Retrieved from http:\/\/www.intel.com\/content\/dam\/www\/public\/us\/en\/documents\/white-papers\/ia-32-ia-64-benchmark-code-execution-paper.pdf. Gabriele Paoloni. 2010. How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures. Intel Corporation. Retrieved from http:\/\/www.intel.com\/content\/dam\/www\/public\/us\/en\/documents\/white-papers\/ia-32-ia-64-benchmark-code-execution-paper.pdf."},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2189750.2151014"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1133255.1133996"},{"key":"e_1_2_2_21_1","volume-title":"Discrete Mathematics and Its Applications","author":"Rosen Kenneth","unstructured":"Kenneth Rosen . 2011. Discrete Mathematics and Its Applications ( 7 th ed.). McGraw-Hill . Kenneth Rosen. 2011. Discrete Mathematics and Its Applications (7th ed.). McGraw-Hill.","edition":"7"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/2687355"},{"key":"e_1_2_2_23_1","volume-title":"Hall","author":"Shin Jaewook","year":"2002","unstructured":"Jaewook Shin , Jacqueline Chame , and Mary W . Hall . 2002 . Compiler-controlled caching in superword register files for multimedia extension architectures. In IEEE PACT. 45--55. Jaewook Shin, Jacqueline Chame, and Mary W. Hall. 2002. Compiler-controlled caching in superword register files for multimedia extension architectures. In IEEE PACT. 45--55."},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2005.33"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2003.1223637"},{"volume-title":"High-Performance Computing on the Intel\u00ae Xeon Phi","author":"Wang Endong","key":"e_1_2_2_26_1","unstructured":"Endong Wang , Qing Zhang , Bo Shen , Guangyong Zhang , Xiaowei Lu , Qing Wu , and Yajuan Wang . 2014. Intel math kernel library . In High-Performance Computing on the Intel\u00ae Xeon Phi . Springer , 167--188. Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. 2014. Intel math kernel library. In High-Performance Computing on the Intel\u00ae Xeon Phi. Springer, 167--188."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8191(00)00087-9"},{"volume-title":"High Performance Compilers for Parallel Computing","author":"Wolfe Michael Joseph","key":"e_1_2_2_28_1","unstructured":"Michael Joseph Wolfe . 1996. High Performance Compilers for Parallel Computing . Addison-Wesley . Michael Joseph Wolfe. 1996. High Performance Compilers for Parallel Computing. Addison-Wesley."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2838735","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2838735","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T05:43:31Z","timestamp":1750225411000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2838735"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,12,8]]},"references-count":28,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2016,1,7]]}},"alternative-id":["10.1145\/2838735"],"URL":"https:\/\/doi.org\/10.1145\/2838735","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2015,12,8]]},"assertion":[{"value":"2015-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-12-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}