{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,21]],"date-time":"2025-09-21T07:14:35Z","timestamp":1758438875413,"version":"3.44.0"},"reference-count":42,"publisher":"Association for Computing Machinery (ACM)","issue":"3","funder":[{"name":"ANR project Maplurinum","award":["ANR-21-CE25-0016"],"award-info":[{"award-number":["ANR-21-CE25-0016"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:p>Although they differentiate between integer and floating-point datum, modern Instruction Set Architectures and their implementations do not differentiate integer datum used to address memory from integer datum used in purely arithmetic and logical computations. This is a perfectly reasonable choice as addresses are, in fact, integral quantities. However, in many cases, there is already a fundamental difference between addresses and integer data: Their width. As computer systems moved from 16 to 32, then to 64-bit pointers, with a potential future where 128-bit might be used for specific systems, the data width required to compute a given output with a given algorithm has remained the same, e.g., an ASCII character is still represented on a byte.<\/jats:p>\n          <jats:p>This work aims to leverage this dichotomy to revisit hardware clustering, a well-known microarchitectural technique used to mitigate the cost of scaling processor backend structures by dividing the backend into several mostly independent execution clusters. We show that by treating instructions as manipulating addresses or data and steering them to a \u201cdata\u201d or an \u201caddress\u201d cluster accordingly, reasonable cluster load balancing can be achieved without the need for complex steering policies that can lead to performance on par with the baseline with limited hardware overhead.<\/jats:p>\n          <jats:p>Moreover, we highlight two possible optimizations stemming from this distribution. First, the registers of the \u201caddress\u201d cluster can easily be compressed thanks to address spatial and temporal locality. Second, if a processor requires a large address space but only processes narrow data (e.g., 32-bit data with 64-bit pointers or 64-bit data with 128-bit pointers), the \u201cdata\u201d cluster datapath can be kept narrower than the \u201caddress\u201d cluster datapath.<\/jats:p>","DOI":"10.1145\/3744908","type":"journal-article","created":{"date-parts":[[2025,8,11]],"date-time":"2025-08-11T11:26:06Z","timestamp":1754911566000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Address\/Data Instruction Steering in Clustered General Purpose Processors"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-1952-2530","authenticated-orcid":false,"given":"Chandana S.","family":"Deshpande","sequence":"first","affiliation":[{"name":"TIMA, CNRS, Grenoble INP, University Grenoble Alpes","place":["Grenoble, France"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5757-2507","authenticated-orcid":false,"given":"Arthur","family":"Perais","sequence":"additional","affiliation":[{"name":"TIMA, CNRS, Grenoble INP, University Grenoble Alpes","place":["Grenoble, France"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0624-7373","authenticated-orcid":false,"given":"Fr\u00e9d\u00e9ric","family":"P\u00e9trot","sequence":"additional","affiliation":[{"name":"TIMA, CNRS, Grenoble INP, University Grenoble Alpes","place":["Grenoble, France"]}]}],"member":"320","published-online":{"date-parts":[[2025,9,19]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2000.898083"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CACML55074.2022.00056"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2000.824345"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.1999.807517"},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750407"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/780822.781165"},{"key":"e_1_3_2_8_2","unstructured":"Intel Corporation. 2019. Sunny Cove - Microarchitectures - Intel. Retrieved January 2025 from https:\/\/en.wikichip.org\/wiki\/intel\/microarchitectures\/sunny_cove"},{"key":"e_1_3_2_9_2","unstructured":"Standard Performance Evaluation Corporation. 2017. SPEC CPU. Retrieved January 2025 from https:\/\/www.spec.org\/cpu2017\/"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2014.6853219"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2023.3287762"},{"key":"e_1_3_2_12_2","unstructured":"Hewlett Packard Enterprise. 2012. CACTI. Retrieved January 2025 from https:\/\/github.com\/HewlettPackard\/cacti?tab=readme-ov-file"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/2207222.2207223"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.1997.645806"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2019.2934955"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/379240.379253"},{"key":"e_1_3_2_17_2","unstructured":"RISC-V Foundation. 2024. The RISC-V Instruction Set Manual Volume 1 Unprivileged Architecture (Version 20240411). (2024).Chapter 5. RV128I Base Integer Instruction Set Version 1.7 Page 43."},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/1054943.1054950"},{"issue":"4","key":"e_1_3_2_19_2","first-page":"1","article-title":"Simpoint 3.0: Faster and more flexible program phase analysis","volume":"7","author":"Hamerly G.","year":"2005","unstructured":"G. Hamerly, E. Perelman, J. Lau, and B. Calder. 2005. Simpoint 3.0: Faster and more flexible program phase analysis. Journal of Instruction Level Parallelism 7, 4 (2005), 1\u201328.","journal-title":"Journal of Instruction Level Parallelism"},{"key":"e_1_3_2_20_2","unstructured":"Intel Corporation. 2018. 5-Level Paging and 5-Level EPT. Retrieved January 2025 from https:\/\/www.intel.com\/content\/www\/us\/en\/content-details\/671442\/5-level-paging-and-5-level-ept-white-paper.html"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/1542275.1542349"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/40.755465"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2019.00009"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3410463.3414629"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/237090.237173"},{"key":"e_1_3_2_26_2","unstructured":"J. Lowe-Power et\u00a0al. 2021. The gem5 Simulator: Version 20.0+. Retrieved January 2025 from https:\/\/hal.science\/hal-03100818\/"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/2800787"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2003.1183532"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/264107.264201"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.1995.476838"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.1990.151442"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2008.4536229"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2004.1342543"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.1997.645805"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2005.6"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/277650.277709"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/232973.232985"},{"key":"e_1_3_2_38_2","first-page":"1","article-title":"A 256 kbits L-TAGE branch predictor","volume":"9","author":"Seznec A.","year":"2007","unstructured":"A. Seznec. 2007. A 256 kbits L-TAGE branch predictor. Journal of Instruction-Level Parallelism 9 (2007), 1\u201313.","journal-title":"Journal of Instruction-Level Parallelism"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.5555\/800048.801719"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/378993.379244"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2020.2974217"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/223982.224449"},{"key":"e_1_3_2_43_2","volume-title":"Proceedings of the Linux Plumbers Conference","author":"Wilcox M.","year":"2022","unstructured":"M. Wilcox. 2022. Zettalinux: It\u2019s not too late to start. In Proceedings of the Linux Plumbers Conference."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3744908","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,20]],"date-time":"2025-09-20T00:50:01Z","timestamp":1758329401000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3744908"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,19]]},"references-count":42,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,9,30]]}},"alternative-id":["10.1145\/3744908"],"URL":"https:\/\/doi.org\/10.1145\/3744908","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2025,9,19]]},"assertion":[{"value":"2025-01-14","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-03","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}