{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,28]],"date-time":"2026-02-28T07:33:04Z","timestamp":1772263984293,"version":"3.50.1"},"reference-count":69,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,3,1]],"date-time":"2023-03-01T00:00:00Z","timestamp":1677628800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,6,30]]},"abstract":"<jats:p>\n            Multi-pod systolic arrays are emerging as the architecture of choice in DNN inference accelerators. Despite their potential, designing multi-pod systolic arrays to maximize effective throughput\/Watt\u2014i.e., throughput\/Watt adjusted when accounting for array utilization\u2014poses a unique set of challenges. In this work, we study three key pillars in multi-pod systolic array designs, namely array granularity, interconnect, and tiling. We identify optimal array granularity across workloads and show that state-of-the-art commercial accelerators use suboptimal array sizes for single-tenancy workloads. We then evaluate the bandwidth\/latency trade-offs in interconnects and show that Butterfly networks offer a scalable topology for accelerators with a large number of pods. Finally, we introduce a novel data tiling scheme with custom partition size to maximize utilization in optimally sized pods. We propose\n            <jats:italic>Scale-out Systolic Arrays<\/jats:italic>\n            , a multi-pod inference accelerator for both single- and multi-tenancy based on these three pillars. 
We show that SOSA exhibits scaling of up to 600\u00a0TeraOps\/s in effective throughput for state-of-the-art DNN inference workloads, and outperforms state-of-the-art multi-pod accelerators by a factor of 1.5 \u00d7.\n            <jats:xref ref-type=\"fn\">\n              <jats:sup>1<\/jats:sup>\n            <\/jats:xref>\n          <\/jats:p>","DOI":"10.1145\/3572917","type":"journal-article","created":{"date-parts":[[2022,11,29]],"date-time":"2022-11-29T12:05:35Z","timestamp":1669723535000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":24,"title":["Scale-out Systolic Arrays"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7809-9897","authenticated-orcid":false,"given":"Ahmet Caner","family":"Y\u00fcz\u00fcg\u00fcler","sequence":"first","affiliation":[{"name":"EPFL, Lausanne, VD, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4542-2947","authenticated-orcid":false,"given":"Canberk","family":"S\u00f6nmez","sequence":"additional","affiliation":[{"name":"EPFL, Lausanne, VD, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1981-3525","authenticated-orcid":false,"given":"Mario","family":"Drumond","sequence":"additional","affiliation":[{"name":"CodeDepot, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6442-3705","authenticated-orcid":false,"given":"Yunho","family":"Oh","sequence":"additional","affiliation":[{"name":"Korea University, South Korea"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5916-8068","authenticated-orcid":false,"given":"Babak","family":"Falsafi","sequence":"additional","affiliation":[{"name":"EPFL, Lausanne, VD, 
Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4010-714X","authenticated-orcid":false,"given":"Pascal","family":"Frossard","sequence":"additional","affiliation":[{"name":"EPFL, Lausanne, VD, Switzerland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,3]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"2017. Introduction to the IPU architecture.https:\/\/www.graphcore.ai\/nips2017_presentations. Accessed: 2019-08-06."},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.11"},{"key":"e_1_3_2_4_2","series-title":"MICRO-49","volume-title":"The 49th Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Alwani Manoj","year":"2016","unstructured":"Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In The 49th Annual IEEE\/ACM International Symposium on Microarchitecture (Taipei, Taiwan) (MICRO-49). IEEE Press, Article 22, 12 pages."},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00081"},{"key":"e_1_3_2_6_2","first-page":"1","volume-title":"2019 IEEE Hot Chips 31 Symposium (HCS), Cupertino, CA, USA, August 18\u201320, 2019","author":"Bannon Pete","year":"2019","unstructured":"Pete Bannon, Ganesh Venkataramanan, Debjit Das Sarma, and Emil Talpes. 2019. Computer and redundancy solution for the full self-driving computer. In 2019 IEEE Hot Chips 31 Symposium (HCS), Cupertino, CA, USA, August 18\u201320, 2019. 
IEEE, 1\u201322."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1002\/j.1538-7305.1964.tb04103.x"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541967"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.40"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2016.2616357"},{"key":"e_1_3_2_11_2","first-page":"609","volume-title":"47th Annual IEEE\/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 13\u201317, 2014","author":"Chen Yunji","year":"2014","unstructured":"Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A machine-learning supercomputer. In 47th Annual IEEE\/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 13\u201317, 2014. IEEE Computer Society, 609\u2013622."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/JETCAS.2019.2910232"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00027"},{"key":"e_1_3_2_14_2","unstructured":"Fran\u00e7ois Chollet. 2015. Keras. https:\/\/keras.io."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2021.3061394"},{"key":"e_1_3_2_16_2","volume-title":"NVIDIA A100 40GB PCIe GPU Accelerator Product Brief","author":"Corporation NVIDIA","year":"2020","unstructured":"NVIDIA Corporation. 2020. NVIDIA A100 40GB PCIe GPU Accelerator Product Brief. 
https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/a100\/pdf\/A100-PCIE-Prduct-Brief.pdf."},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/n19-1423"},{"key":"e_1_3_2_18_2","doi-asserted-by":"crossref","first-page":"265","DOI":"10.1109\/FPT.2016.7929549","volume-title":"2016 International Conference on Field-Programmable Technology (FPT)","author":"DiCecco R.","year":"2016","unstructured":"R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi. 2016. Caffeinated FPGAs: FPGA framework for convolutional neural networks. In 2016 International Conference on Field-Programmable Technology (FPT). 265\u2013268."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3466752.3480057"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.5075\/epfl-thesis-10265"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2011.5981829"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037702"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304014"},{"key":"e_1_3_2_25_2","first-page":"681","volume-title":"53rd Annual IEEE\/ACM International Symposium on Microarchitecture, MICRO 2020, Athens, Greece, October 17\u201321, 2020","author":"Ghodrati Soroush","year":"2020","unstructured":"Soroush Ghodrati, Byung Hoon Ahn, Joon Kyung Kim, Sean Kinzer, Brahmendra Reddy Yatham, Navateja Alla, Hardik Sharma, Mohammad Alian, Eiman Ebrahimi, Nam Sung Kim, Cliff Young, and Hadi Esmaeilzadeh. 2020. Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks. In 53rd Annual IEEE\/ACM International Symposium on Microarchitecture, MICRO 2020, Athens, Greece, October 17\u201321, 2020. IEEE, 681\u2013697."},{"key":"e_1_3_2_26_2","unstructured":"Google. 2017. Cloud TPU. https:\/\/cloud.google.com\/tpu. 
Accessed: 2018-01-31."},{"key":"e_1_3_2_27_2","unstructured":"Google. 2020. BERT.https:\/\/github.com\/google-research\/bert. Accessed: 2020-10-5."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.30"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00059"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.243"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/TVLSI.2022.3170233"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2918851"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA52012.2021.00010"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/3360307"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/MC.1982.1653825"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304028"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASAP.2019.00-31"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS.2019.8702753"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358252"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3173162.3173176"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD.2011.6105405"},{"key":"e_1_3_2_44_2","volume-title":"Principles of Broadband Switching and Networking","author":"Liew Soung C.","year":"2010","unstructured":"Soung C. Liew and Tony T. Lee. 2010. Principles of Broadband Switching and Networking. Vol. 32. John Wiley & Sons."},{"key":"e_1_3_2_45_2","article-title":"Sparse systolic tensor array for efficient CNN hardware acceleration","volume":"2009","author":"Liu Zhi Gang","year":"2020","unstructured":"Zhi Gang Liu, Paul N. 
Whatmough, and Matthew Mattina. 2020. Sparse systolic tensor array for efficient CNN hardware acceleration. CoRR abs\/2009.02381 (2020). arxiv:2009.02381. https:\/\/arxiv.org\/abs\/2009.02381.","journal-title":"CoRR"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2020.2979965"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2017.29"},{"key":"e_1_3_2_48_2","unstructured":"NVIDIA. 2019. NVIDIA Deep Learning SDK Documentation 52 pages. [Online] Available: https:\/\/docs.nvidia.com\/deeplearning\/sdk."},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/HOTCHIPS.2015.7477459"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD.2013.6657019"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA47549.2020.00015"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.32"},{"key":"e_1_3_2_53_2","series-title":"MICRO-49","volume-title":"The 49th Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Rhu Minsoo","year":"2016","unstructured":"Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. VDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In The 49th Annual IEEE\/ACM International Symposium on Microarchitecture (Taipei, Taiwan) (MICRO-49). IEEE Press, Article 18, 13 pages."},{"key":"e_1_3_2_54_2","unstructured":"Jonathan Ross. 2017. Prefetching weights for use in a neural network processor. US 2017\/0103314 A1."},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2931392"},{"key":"e_1_3_2_56_2","article-title":"SCALE-Sim: Systolic CNN accelerator","volume":"1811","author":"Samajdar Ananda","year":"2018","unstructured":"Ananda Samajdar, Yuhao Zhu, Paul N. Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic CNN accelerator. CoRR abs\/1811.02883 (2018). 
arxiv:1811.02883http:\/\/arxiv.org\/abs\/1811.02883.","journal-title":"CoRR"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358302"},{"key":"e_1_3_2_58_2","first-page":"1","volume-title":"Microarchitecture (MICRO), 2016 49th Annual IEEE\/ACM International Symposium on","author":"Sharma Hardik","year":"2016","unstructured":"Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In Microarchitecture (MICRO), 2016 49th Annual IEEE\/ACM International Symposium on. IEEE, 1\u201312."},{"key":"e_1_3_2_59_2","first-page":"764","volume-title":"45th ACM\/IEEE Annual International Symposium on Computer Architecture, ISCA 2018, Los Angeles, CA, USA, June 1\u20136, 2018","author":"Sharma Hardik","year":"2018","unstructured":"Hardik Sharma, Jongse Park, Naveen Suda, Liangzhen Lai, Benson Chau, Vikas Chandra, and Hadi Esmaeilzadeh. 2018. Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network. In 45th ACM\/IEEE Annual International Symposium on Computer Architecture, ISCA 2018, Los Angeles, CA, USA, June 1\u20136, 2018, Murali Annavaram, Timothy Mark Pinkston, and Babak Falsafi (Eds.). IEEE Computer Society, 764\u2013775."},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2019.2924007"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358304"},{"key":"e_1_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2017.2761740"},{"key":"e_1_3_2_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_2_64_2","unstructured":"Tencent. 2020. Turbo Transformers.https:\/\/github.com\/Tencent\/TurboTransformers. 
Accessed: 2020-09-26."},{"key":"e_1_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1145\/3200691.3178491"},{"key":"e_1_3_2_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA53966.2022.00010"},{"key":"e_1_3_2_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM53951.2022.9786215"},{"key":"e_1_3_2_68_2","first-page":"1","volume-title":"2017 54th ACM\/EDAC\/IEEE Design Automation Conference (DAC)","author":"Wei Xuechao","year":"2017","unstructured":"Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and J. Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In 2017 54th ACM\/EDAC\/IEEE Design Automation Conference (DAC). 1\u20136."},{"key":"e_1_3_2_69_2","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830810"},{"key":"e_1_3_2_70_2","doi-asserted-by":"publisher","DOI":"10.1109\/MDAT.2019.2952329"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3572917","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3572917","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:38Z","timestamp":1750182698000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3572917"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3]]},"references-count":69,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,6,30]]}},"alternative-id":["10.1145\/3572917"],"URL":"https:\/\/doi.org\/10.1145\/3572917","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,3]]},"assertion":[{"value":"2022-04-06","ord
er":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-11-13","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-03-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}