{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,30]],"date-time":"2025-09-30T10:09:51Z","timestamp":1759226991019,"version":"3.41.0"},"reference-count":70,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"DOI":"10.13039\/100000185","name":"DARPA","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000028","name":"Semiconductor Research Corporation","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100000028","id-type":"DOI","asserted-by":"crossref"}]},{"name":"CRISP Center"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2025,6,30]]},"abstract":"<jats:p>\n            As modern applications demand more data, processing-in-memory (PIM) architectures have emerged to address the challenges of data movement and parallelism. In this article, we propose VersaTile, a heterogeneous, fully CMOS-based tiled architecture that combines conventional out-of-order (OoO) superscalar CPUs and associative processors (APs), a type of CAM-based PIM core. Both CPUs and APs leverage the RISC-V ISA and its standard RVV vector extension. VersaTile fosters collaboration between multiple low-latency CPUs and high-throughput APs by sharing the same software stack and adopting a common CPU programming and compilation frontend. Moreover, we introduce tile stitching, a mechanism enabling the aggregation of multiple APs into a single vector super-unit with modest hardware support and no programming effort. Tile stitching allows us to configure an architecture for optimal performance across a wide range of applications. We provide a detailed case study, including a scalable floorplan example, as well as a comprehensive evaluation of various design points. 
Our experiments show that, when using only AP tiles, VersaTile can achieve, on average across the Phoenix benchmark suite and 3D convolution, a\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(5.7\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            speedup with respect to area-equivalent OoO CPU cores with SIMD ALUs (up to\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(23\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            ), and\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(4.6\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            with respect to an equivalent-sized monolithic AP baseline (up to\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(29\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            ). 
For applications with both DLP (vector) and ILP (scalar) regions, VersaTile can use APs and OoO cores collaboratively to achieve better performance than using either one of them only, up to\n            <jats:inline-formula content-type=\"math\/tex\">\n              <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(4.4\\times\\)<\/jats:tex-math>\n            <\/jats:inline-formula>\n            .\n          <\/jats:p>","DOI":"10.1145\/3716873","type":"journal-article","created":{"date-parts":[[2025,2,10]],"date-time":"2025-02-10T16:09:29Z","timestamp":1739203769000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["VersaTile: Flexible Tiled Architectures via Associative Processors"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7762-2751","authenticated-orcid":false,"given":"Kailin","family":"Yang","sequence":"first","affiliation":[{"name":"Computer Systems Laboratory, Cornell University","place":["Ithaca, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5451-5681","authenticated-orcid":false,"given":"Jos\u00e9 F.","family":"Mart\u00ednez","sequence":"additional","affiliation":[{"name":"Computer Systems Laboratory, Cornell University","place":["Ithaca, United States"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,6,30]]},"reference":[{"key":"e_1_3_3_2_2","unstructured":"1995. IEEE 1003.1c-1995 - Standard for Information Technology\u2013Portable Operating System Interface (POSIX(TM)) - System Application Program Interface (API) Amendment 2: Threads Extension (C Language). Retrieved April 14 2021 from https:\/\/standards.ieee.org\/standard\/1003_1c-1995.html. (1995)."},{"key":"e_1_3_3_3_2","unstructured":"1997. pthread_create. 
Retrieved April 14 2021 from https:\/\/pubs.opengroup.org\/onlinepubs\/7908799\/xsh\/pthread_create.html. (1997)."},{"key":"e_1_3_3_4_2","unstructured":"2015. Cortex-A53 - Performance and Power. Retrieved April 14 2021 from https:\/\/www.anandtech.com\/show\/8718\/the-samsung-galaxy-note-4-exynos-review\/4. (2015)."},{"key":"e_1_3_3_5_2","unstructured":"2015. Samsung Exynos 5433. Retrieved April 14 2021 from https:\/\/en.wikichip.org\/wiki\/samsung\/exynos\/5433. (2015)."},{"key":"e_1_3_3_6_2","unstructured":"2018. DC Ultra: Concurrent Timing Area Power and Test Optimization. Retrieved April 14 2021 from https:\/\/www.synopsys.com\/content\/dam\/synopsys\/implementation&signoff\/datasheets\/dc-ultra-ds.pdf. (2018)."},{"key":"e_1_3_3_7_2","unstructured":"2020. Innovus Implementation System. Retrieved April 13 2020 from https:\/\/www.cadence.com\/en_US\/home\/tools\/digital-design-and-signoff\/soc-implementation-and-floorplanning\/innovus-implementation-system.html. (2020)."},{"key":"e_1_3_3_8_2","unstructured":"2021. 16 nm lithography process. Retrieved April 14 2021 from https:\/\/en.wikichip.org\/wiki\/16_nm_lithography_process. (2021)."},{"key":"e_1_3_3_9_2","unstructured":"2021. 7 nm lithography process. Retrieved April 14 2021 from https:\/\/en.wikichip.org\/wiki\/7_nm_lithography_process. (2021)."},{"key":"e_1_3_3_10_2","unstructured":"2021. AMD Zen 2 CPU Core. Retrieved July 28 2021 from https:\/\/en.wikichip.org\/wiki\/amd\/microarchitectures\/zen_2. (2021)."},{"key":"e_1_3_3_11_2","unstructured":"2021. Cortex-A53. Retrieved April 14 2021 from https:\/\/developer.arm.com\/ip-products\/processors\/cortex-a\/cortex-a53. (2021)."},{"key":"e_1_3_3_12_2","unstructured":"2021. EPYC 7232P - AMD. Retrieved November 19 2021 from https:\/\/en.wikichip.org\/wiki\/amd\/epyc\/7232p. (2021)."},{"key":"e_1_3_3_13_2","unstructured":"2021. List of ALL Zen-Based Processors. 
Retrieved November 19 2021 from https:\/\/en.wikichip.org\/wiki\/amd\/microarchitectures\/zen_2#All_Zen_2_Chips. (2021)."},{"key":"e_1_3_3_14_2","unstructured":"2021. R-Car H3 - Renesas. Retrieved April 14 2021 from https:\/\/en.wikichip.org\/wiki\/renesas\/r-car\/h3. (2021)."},{"key":"e_1_3_3_15_2","unstructured":"2023. The ET-SoC-1 Chip. Retrieved March 14 2023 from https:\/\/www.esperanto.ai\/technology\/. (2023)."},{"key":"e_1_3_3_16_2","unstructured":"2023. Intel Skylake. Retrieved April 23 2023 from https:\/\/en.wikichip.org\/wiki\/intel\/microarchitectures\/skylake_(server). (2023)."},{"key":"e_1_3_3_17_2","unstructured":"2023. RISC-V GNU Toolchain. Retrieved April 03 2023 from https:\/\/github.com\/riscv\/riscv-gnu-toolchain. (2023)."},{"key":"e_1_3_3_18_2","unstructured":"2023. RISC-V \u201cV\u201d Vector Extension. Retrieved April 14 2023 from https:\/\/github.com\/riscv\/riscv-v-spec\/blob\/master. (2023)."},{"key":"e_1_3_3_19_2","unstructured":"2024. MESI and MOESI Protocol. Retrieved October 19 2024 from https:\/\/developer.arm.com\/documentation\/den0013\/d\/Multi-core-processors\/Cache-coherency\/MESI-and-MOESI-protocols. (2024)."},{"key":"e_1_3_3_20_2","unstructured":"2024. MOESI Protocol. Retrieved October 19 2024 from https:\/\/en.wikipedia.org\/wiki\/MOESI_protocol. (2024)."},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2017.21"},{"key":"e_1_3_3_22_2","article-title":"Minor CPU Model","author":"Bardsley A.","year":"2025","unstructured":"A. Bardsley. 2025. Minor CPU Model. Retrieved from https:\/\/www.gem5.org\/documentation\/general_docs\/cpu_models\/minor_cpu. 
(2025).","journal-title":"R"},{"key":"e_1_3_3_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2004.826581"},{"key":"e_1_3_3_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/2024716.2024718"},{"key":"e_1_3_3_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3470496.3527435"},{"key":"e_1_3_3_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00054"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00016"},{"key":"e_1_3_3_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00040"},{"key":"e_1_3_3_29_2","volume-title":"Proceedings of the 2023 56th Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Fan Renhao","year":"2023","unstructured":"Renhao Fan, Yikai Cui, Qilin Chen, Mingyu Wang, Youhui Zhang, Weimin Zheng, and Zhaolin Li. 2023. MAICC: A lightweight many-core architecture with in-cache computing for multi-DNN parallel inference. In Proceedings of the 2023 56th Annual IEEE\/ACM International Symposium on Microarchitecture."},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.5555\/540236"},{"key":"e_1_3_3_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322257"},{"key":"e_1_3_3_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/2.375174"},{"key":"e_1_3_3_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2023.3341389"},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485939"},{"key":"e_1_3_3_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2022.3174101"},{"key":"e_1_3_3_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3445814.3446749"},{"key":"e_1_3_3_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3470496.3527424"},{"key":"e_1_3_3_38_2","volume-title":"Proceedings of the 2023 56th Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Huang Yi","year":"2023","unstructured":"Yi Huang, Lingkun Kong, D. Chen, Z. Chen, X. Kong, J. Zhu, K. Mamouras, S. Wei, K. Yang, and L. Liu. 2023. 
CASA: An energy-efficient and high-speed CAM-based SMEM seeding accelerator for genome alignment. In Proceedings of the 2023 56th Annual IEEE\/ACM International Symposium on Microarchitecture."},{"key":"e_1_3_3_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322237"},{"key":"e_1_3_3_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/TETC.2016.2565262"},{"key":"e_1_3_3_41_2","volume-title":"Proceedings of the 2023 56th Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Jahshan Z.","year":"2023","unstructured":"Z. Jahshan, I. Merlin, E. Garz\u00f3n, and L. Yavits. 2023. DASH-CAM: Dynamic approximate search content addressable memory for genome classification. In Proceedings of the 2023 56th Annual IEEE\/ACM International Symposium on Microarchitecture."},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2016.2515510"},{"key":"e_1_3_3_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.41"},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/HOTCHIPS.2014.7478812"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/HCS59251.2023.10254711"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/HCS59251.2023.10254717"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00013"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/2491464"},{"key":"e_1_3_3_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3123977"},{"key":"e_1_3_3_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/2897937.2898064"},{"key":"e_1_3_3_51_2","unstructured":"Jos\u00e9 F. Mart\u00ednez Helena Caminal Kailin Yang Khalid Al-Hawaj and Christopher Batten. 2022. Content-addressable processing engine. (Oct. 4 2022). 
US Patent 11 461 097."},{"key":"e_1_3_3_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/2845084"},{"key":"e_1_3_3_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/40.592312"},{"key":"e_1_3_3_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/2.330039"},{"key":"e_1_3_3_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2007.346181"},{"key":"e_1_3_3_56_2","article-title":"Intel\u00ae AVX-512 Instructions","author":"Reinders James","year":"2013","unstructured":"James Reinders. 2013. Intel\u00ae AVX-512 Instructions. Retrieved October 19, 2021 from https:\/\/www.intel.com\/content\/www\/us\/en\/developer\/articles\/technical\/intel-avx-512-instructions.html. (2013).","journal-title":"R"},{"key":"e_1_3_3_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/359327.359336"},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-66400-7_9"},{"key":"e_1_3_3_59_2","doi-asserted-by":"publisher","DOI":"10.1145\/2687355"},{"key":"e_1_3_3_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3124544"},{"key":"e_1_3_3_61_2","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs.CV] https:\/\/arxiv.org\/abs\/1409.1556"},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2017.55"},{"key":"e_1_3_3_63_2","doi-asserted-by":"publisher","unstructured":"Nigel Stephens Stuart Biles Matthias Boettcher Jacob Eapen Mbou Eyole Giacomo Gabrielli Matt Horsnell Grigorios Magklis Alejandro Martinez Nathanael Premillieu Alastair Reid Alejandro Rico and Paul Walker. 2017. The ARM scalable vector extension. IEEE Micro 37 2 (2017) 26\u201339. 
10.1109\/MM.2017.35","DOI":"10.1109\/MM.2017.35"},{"key":"e_1_3_3_64_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.1970.5008902"},{"key":"e_1_3_3_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO56248.2022.00025"},{"key":"e_1_3_3_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD.2017.8203889"},{"key":"e_1_3_3_67_2","article-title":"The RISC-V Instruction Set Manual, Volume I: User-Level ISA","author":"Waterman Andrew","year":"2016","unstructured":"Andrew Waterman, Yunsup Lee, David A. Patterson, and Krste Asanovic. 2016. The RISC-V Instruction Set Manual, Volume I: User-Level ISA. Retrieved April 14 2023 from https:\/\/www2.eecs.berkeley.edu\/Pubs\/TechRpts\/2016\/EECS-2016-118.pdf. (2016).","journal-title":"R"},{"key":"e_1_3_3_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC56929.2023.10247818"},{"key":"e_1_3_3_69_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO61859.2024.00055"},{"key":"e_1_3_3_70_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2013.220"},{"key":"e_1_3_3_71_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00074"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3716873","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,1]],"date-time":"2025-07-01T11:11:41Z","timestamp":1751368301000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3716873"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,30]]},"references-count":70,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2025,6,30]]}},"alternative-id":["10.1145\/3716873"],"URL":"https:\/\/doi.org\/10.1145\/3716873","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2025,6,30]]},"assertion":[{"value":"2024-05-31","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-01-05","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-06-30","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}