{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,3]],"date-time":"2026-06-03T06:27:30Z","timestamp":1780468050220,"version":"3.54.1"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"3","funder":[{"name":"Key Program of the National Natural Science Foundation of China","award":["62032001"],"award-info":[{"award-number":["62032001"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62472436"],"award-info":[{"award-number":["62472436"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Foundation of Key Laboratory","award":["WDZC20245250110"],"award-info":[{"award-number":["WDZC20245250110"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:p>While Single Instruction Multiple Data (SIMD) units are widely employed in processors for neural networks, signal processing, and high-performance computing, they suffer from expensive shuffle operations dedicated to data alignment. In fact, shuffle operations only change the layout of data and ideally should be done entirely within memory.<\/jats:p>\n          <jats:p>To this end, we propose Shuffle SRAM in this article, which can shuffle multiple data elements simultaneously across SRAM banks. The key idea is exploiting inter-bank word line wise data movement to shuffle data in parallel, where all data elements on the same word line of SRAM can be shuffled simultaneously, achieving a high level of parallelism. Through suitable data layout preparation and proper control, Shuffle SRAM efficiently supports a wide range of commonly used shuffle operations. 
Our evaluation results show that the Shuffle SRAM can reap performance benefits of 14.3\u00d7 for data reorganization only applications and 1.97\u00d7 for data reorganization + computation applications over conventional shuffle architecture on general-purpose processors. With Shuffle SRAM, the state-of-the-art vector processor can obtain 2.58\u00d7 energy efficiency. Compared with traditional SRAM, Shuffle SRAM only increases 3.5% additional area overhead.<\/jats:p>","DOI":"10.1145\/3743136","type":"journal-article","created":{"date-parts":[[2025,6,5]],"date-time":"2025-06-05T07:12:13Z","timestamp":1749107533000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["In-SRAM Parallel Data Shuffle"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0003-8054-9131","authenticated-orcid":false,"given":"Chaoyang","family":"Jia","sequence":"first","affiliation":[{"name":"National University of Defense Technology","place":["Changsha, China"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2384-5359","authenticated-orcid":false,"given":"Zhang","family":"Dunbo","sequence":"additional","affiliation":[{"name":"National University of Defense Technology","place":["Changsha, China"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-9456-8695","authenticated-orcid":false,"given":"Qingjie","family":"Lang","sequence":"additional","affiliation":[{"name":"National University of Defense Technology","place":["Changsha, China"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-5154-0086","authenticated-orcid":false,"given":"Ruoxi","family":"Wang","sequence":"additional","affiliation":[{"name":"National University of Defense Technology","place":["Changsha, 
China"]}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9043-2998","authenticated-orcid":false,"given":"Li","family":"Shen","sequence":"additional","affiliation":[{"name":"Department of Computing Science, National University of Defense Technology","place":["Changsha, China"]}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,9,17]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Menachem Adelman Robert Valentine Barukh Ziv Amit Gradstein Simon Rubanovich Zeev Sperber Mark J. Charney Christopher J. Hughes Alexander F. Heinecke Evangelos Georganas Binh Pham and Intel Corporation. 2021. Matrix Transpose and Multiply. 20210405974."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2017.21"},{"key":"e_1_3_1_4_2","unstructured":"ANDES Technology. 2020. AndesCore NX27V Processor. Retrieved November 03 2021 from http:\/\/https:\/\/www.andestech.com\/en\/products-solutions\/andescore-processors\/riscv-nx27v\/\/"},{"key":"e_1_3_1_5_2","unstructured":"Kumar Chellapilla S. Puri and P. Simard. 2006. High performance convolutional neural networks for document processing. Tenth International Workshop on Frontiers in Handwriting Recognition (2006)."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2013.129"},{"key":"e_1_3_1_7_2","volume-title":"2018 IEEE International Solid - State Circuits Conference - (ISSCC\u201918)","author":"Chen W. H.","year":"2018","unstructured":"W. H. Chen, K. X. Li, W. Y. Lin, K. H. Hsu, and M. F. Chang. 2018. A 65nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors. In 2018 IEEE International Solid - State Circuits Conference - (ISSCC\u201918)."},{"key":"e_1_3_1_8_2","unstructured":"National center for biotechnology information. 2025. 
PubChem Patent Summary for US-7631025-B2 Method and apparatus for rearranging data between multiple registers. Retrieved June 18 2025 from https:\/\/pubchem.ncbi.nlm.nih.gov\/patent\/US-7631025-B2"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00025"},{"key":"e_1_3_1_10_2","volume-title":"2018 ACM\/IEEE 45th Annual International Symposium on Computer Architecture (ISCA\u201918)","author":"Eckert C.","year":"2018","unstructured":"C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, David Blaauw, and R. Das. 2018. Neural cache: Bit-serial in-cache acceleration of deep neural networks. In 2018 ACM\/IEEE 45th Annual International Symposium on Computer Architecture (ISCA\u201918)."},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCC\/SmartCity\/DSS.2018.00059"},{"key":"e_1_3_1_12_2","volume-title":"2018 ACM\/IEEE 45th Annual International Symposium on Computer Architecture (ISCA\u201918)","author":"Fowers J.","year":"2018","unstructured":"J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, and D. Burger. 2018. A configurable cloud-scale DNN processor for real-time AI. In 2018 ACM\/IEEE 45th Annual International Symposium on Computer Architecture (ISCA\u201918)."},{"key":"e_1_3_1_13_2","volume-title":"Computer Architecture - A Quantitative Approach","author":"Hennessy John","year":"2007","unstructured":"John Hennessy and David Patterson. 2007. Computer Architecture - A Quantitative Approach."},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2010.5416631"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/40.918001"},{"issue":"2","key":"e_1_3_1_16_2","doi-asserted-by":"crossref","first-page":"p.52\u201363","DOI":"10.1145\/1028176.1006736","article-title":"The vector-thread architecture","volume":"32","author":"Krashinsky R.","year":"2004","unstructured":"R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanovic. 2004. 
The vector-thread architecture. Computer Architecture News 32, 2 (2004), p.52\u201363.","journal-title":"Computer Architecture News"},{"key":"e_1_3_1_17_2","volume-title":"International Symposium on Computer Architecture","author":"Lee Y.","year":"2011","unstructured":"Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, and K. Asanovi\u0107. 2011. Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators. In International Symposium on Computer Architecture."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/HOTCHIPS.2019.8875654"},{"key":"e_1_3_1_19_2","volume-title":"International Symposium on Computer Architecture","author":"Lin Y.","year":"2006","unstructured":"Y. Lin, H. Lee, M. Woh, Y. Harel, S. Mahlke, T. Mudge, C. Chakrabarti, and K. Flautner. 2006. SODA: A low-power architecture for software radio. In International Symposium on Computer Architecture."},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISVLSI.2016.71"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/NAS.2011.29"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","unstructured":"Nathan Binkert Bradford Beckmann Gabriel Black Steven K. Reinhardt Ali Saidi Arkaprava Basu Joel Hestness Derek R. Hower Tushar Krishna Somayeh Sardashti Rathijit Sen Korey Sewell Muhammad Shoaib Nilay Vaish Mark D. Hill and David A. Wood. 2011. The gem5 simulator. SIGARCH Comput. Archit. News 39 2 (May 2011) 1\u20137. 10.1145\/2024716.2024718","DOI":"10.1145\/2024716.2024718"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE.2019.8714915"},{"key":"e_1_3_1_24_2","volume-title":"2019 IEEE International Symposium on Circuits and Systems (ISCAS\u201919)","author":"Malkowsky S.","year":"2019","unstructured":"S. Malkowsky, H. Prabhu, L. Liu, O. Edfors, and V. Owall. 2019. A programmable 16-Lane SIMD ASIP for massive MIMO. 
In 2019 IEEE International Symposium on Circuits and Systems (ISCAS\u201919)."},{"key":"e_1_3_1_25_2","volume-title":"ARCS","author":"Raghavan Praveen","year":"2007","unstructured":"Praveen Raghavan, Satyakiran Munaga, Estela Rey Ramos, Andy Lambrechts, Murali Jayapala, Francky Catthoor, and Diederik Verkest. 2007. A customized cross-bar for data-shuffling in domain-specific SIMD processors. In ARCS."},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2017.35"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2017.35"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","unstructured":"R. Sun P. Liu J. Xue S. Yang J. Qian and R. Ying. 2020. BAX: A bundle adjustment accelerator with decoupled access\/execute architecture for visual odometry. In IEEE Access 8 (2020) 75530\u201375542. DOI:10.1109\/ACCESS.2020.2988527","DOI":"10.1109\/ACCESS.2020.2988527"},{"key":"e_1_3_1_29_2","volume-title":"Design, Automation, and Test in Europe","author":"Tagliavini Giuseppe","year":"2019","unstructured":"Giuseppe Tagliavini, Stefan Mach, Davide Rossi, Andrea Marongiu, and Luca Benini. 2019. Design and evaluation of SmallFloat SIMD extensions to the RISC-V ISA. In Design, Automation, and Test in Europe."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/JIOT.2021.3058015"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2019.2939682"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","unstructured":"Y. Wang C. Li C. Liu et\u00a0al. 2021. Advancing DSP into HPC AI and beyond: Challenges mechanisms and future directions. CCF Trans. HPC 3 114\u2013125 (2021). 10.1007\/s42514-020-00057-2","DOI":"10.1007\/s42514-020-00057-2"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/2847253"},{"key":"e_1_3_1_34_2","volume-title":"36th International Symposium on Computer Architecture (ISCA 2009), June 20\u201324, 2009, Austin, TX, USA","author":"Woh M.","year":"2009","unstructured":"M. 
Woh, S. Seo, S. A. Mahlke, T. N. Mudge, and K. Flautner. 2009. AnySP: Anytime anywhere anyway signal processing. In 36th International Symposium on Computer Architecture (ISCA 2009), June 20\u201324, 2009, Austin, TX, USA."},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/1273440.1250689"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNET.2020.3017500"},{"key":"e_1_3_1_37_2","volume-title":"2016 IEEE Hot Chips 28 Symposium (HCS\u201916)","author":"Yoshida T.","year":"2016","unstructured":"T. Yoshida. 2016. Introduction of Fujitsu\u2019s HPC processor for the Post-K computer. In 2016 IEEE Hot Chips 28 Symposium (HCS\u201916)."},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","unstructured":"D. Zhang C. Jia and L. Shen. 2022. Compressed page walk cache. Front. Comput. Sci. 16 163104 (2022). 10.1007\/s11704-020-9485-2","DOI":"10.1007\/s11704-020-9485-2"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3631528"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASICON.2011.6157297"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3743136","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T13:43:50Z","timestamp":1758116630000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3743136"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,17]]},"references-count":39,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,9,30]]}},"alternative-id":["10.1145\/3743136"],"URL":"https:\/\/doi.org\/10.1145\/3743136","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,17]]},"assertion":[{"value":"2024-12-02","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-19","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-17","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}