{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,27]],"date-time":"2026-03-27T09:36:40Z","timestamp":1774604200089,"version":"3.50.1"},"reference-count":24,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2021,5,24]],"date-time":"2021-05-24T00:00:00Z","timestamp":1621814400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100000185","name":"DARPA","doi-asserted-by":"crossref","award":["CRAFT"],"award-info":[{"award-number":["CRAFT"]}],"id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Commun. ACM"],"published-print":{"date-parts":[[2021,6]]},"abstract":"<jats:p>Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS\/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. 
To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images\/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.<\/jats:p>","DOI":"10.1145\/3460227","type":"journal-article","created":{"date-parts":[[2021,5,24]],"date-time":"2021-05-24T17:58:51Z","timestamp":1621879131000},"page":"107-116","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":22,"title":["Simba"],"prefix":"10.1145","volume":"64","author":[{"given":"Yakun Sophia","family":"Shao","sequence":"first","affiliation":[{"name":"UC Berkeley, CA"}]},{"given":"Jason","family":"Clemons","sequence":"additional","affiliation":[{"name":"NVIDIA, Austin, TX"}]},{"given":"Rangharajan","family":"Venkatesan","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, CA"}]},{"given":"Brian","family":"Zimmer","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, CA"}]},{"given":"Matthew","family":"Fojtik","sequence":"additional","affiliation":[{"name":"NVIDIA, Durham, NC"}]},{"given":"Nan","family":"Jiang","sequence":"additional","affiliation":[{"name":"NVIDIA, Westford, MA"}]},{"given":"Ben","family":"Keller","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, CA"}]},{"given":"Alicia","family":"Klinefelter","sequence":"additional","affiliation":[{"name":"NVIDIA, Durham, NC"}]},{"given":"Nathaniel","family":"Pinckney","sequence":"additional","affiliation":[{"name":"NVIDIA, Austin, TX"}]},{"given":"Priyanka","family":"Raina","sequence":"additional","affiliation":[{"name":"Stanford University, Stanford, CA"}]},{"given":"Stephen G.","family":"Tell","sequence":"additional","affiliation":[{"name":"NVIDIA, Durham, 
NC"}]},{"given":"Yanqing","family":"Zhang","sequence":"additional","affiliation":[{"name":"NVIDIA, Santa Clara, CA"}]},{"given":"William J.","family":"Dally","sequence":"additional","affiliation":[{"name":"NVIDIA, Incline Village, NV, USA\/Stanford University, Stanford, CA"}]},{"given":"Joel","family":"Emer","sequence":"additional","affiliation":[{"name":"Massachusetts Institute of Technology, Cambridge, MA\/NVIDIA, Westford, MA"}]},{"given":"C. Thomas","family":"Gray","sequence":"additional","affiliation":[{"name":"NVIDIA, Durham, NC"}]},{"given":"Brucek","family":"Khailany","sequence":"additional","affiliation":[{"name":"NVIDIA, Austin, TX"}]},{"given":"Stephen W.","family":"Keckler","sequence":"additional","affiliation":[{"name":"NVIDIA, Austin, TX"}]}],"member":"320","published-online":{"date-parts":[[2021,5,24]]},"reference":[
{"key":"e_1_2_1_1_1","volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA)","author":"Arunkumar A.","year":"2017","unstructured":"Arunkumar, A., Bolotin, E., Cho, B., Milic, U., Ebrahimi, E., Villa, O., Jaleel, A., Wu, C.-J., Nellans, D. MCM-GPU: Multi-chip-module GPUs for continued performance scalability. In Proceedings of the International Symposium on Computer Architecture (ISCA) (Toronto, ON, Canada, 2017), Association for Computing Machinery, New York, NY, USA."},
{"key":"e_1_2_1_3_1","volume-title":"Proceedings of the International Solid State Circuits Conference (ISSCC)","author":"Beck N.","year":"2018","unstructured":"Beck, N., White, S., Paraschou, M., Naffziger, S. Zeppelin: An SoC for multichip architectures. In Proceedings of the International Solid State Circuits Conference (ISSCC) (2018), IEEE, San Francisco, CA, USA."},
{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD.1999.810667"},
{"key":"e_1_2_1_5_1","volume-title":"Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)","author":"Chen T.","year":"2014","unstructured":"Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., Temam, O. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (Salt Lake City, Utah, USA, 2014), Association for Computing Machinery, New York, NY, USA."},
{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001177"},
{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2009.4798252"},
{"key":"e_1_2_1_8_1","volume-title":"International Symposium on Asynchronous Circuits and Systems (ASYNC)","author":"Fojtik M.","year":"2019","unstructured":"Fojtik, M., Keller, B., Klinefelter, A., Pinckney, N., Tell, S.G., Zimmer, B., Raja, T., Zhou, K., Dally, W.J., Khailany, B. A fine-grained GALS SoC with pausible adaptive clocking in 16nm FinFET. In International Symposium on Asynchronous Circuits and Systems (ASYNC) (2019), IEEE, Hirosaki, Japan."},
{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"},
{"key":"e_1_2_1_10_1","volume-title":"Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)","author":"Gao M.","year":"2019","unstructured":"Gao, M., Yang, X., Pu, J., Horowitz, M., Kozyrakis, C. Tangram: Optimized coarse-grained dataflow for scalable NN accelerators. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2019), Association for Computing Machinery, New York, NY, USA."},
{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the International Solid State Circuits Conference (ISSCC)","author":"Greenhill D.","year":"2017","unstructured":"Greenhill, D., Ho, R., Lewis, D., Schmit, H., Chan, K.H., Tong, A., Atsatt, S., How, D., McElheny, P., Duwel, K., Schulz, J., Faulkner, D., Iyer, G., Chen, G., Phoon, H.K., Lim, H.W., Koay, W.-Y., Garibay, T. A 14nm 1GHz FPGA with 2.5D transceiver integration. In Proceedings of the International Solid State Circuits Conference (ISSCC) (2017), IEEE, San Francisco, CA, USA."},
{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},
{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCPMT.2015.2511626"},
{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654889"},
{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},
{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2818950.2818951"},
{"key":"e_1_2_1_17_1","volume-title":"Proceedings of the International Conference on Machine Learning (ICML)","author":"Mirhoseini A.","year":"2017","unstructured":"Mirhoseini, A., Pham, H., Le, Q.V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., Dean, J. Device placement optimization with reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML) (2017), JMLR.org, Sydney, NSW, Australia."},
{"key":"e_1_2_1_18_1","volume-title":"NVIDIA Tesla deep learning product performance. https:\/\/developer.nvidia.com\/deep-learning-performance-training-inference","author":"NVIDIA.","year":"2019","unstructured":"NVIDIA. NVIDIA Tesla deep learning product performance. https:\/\/developer.nvidia.com\/deep-learning-performance-training-inference, 2019."},
{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00042"},
{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080254"},
{"key":"e_1_2_1_21_1","volume-title":"Proceedings of the International Symposium on Microarchitecture (MICRO)","author":"Shao Y.S.","year":"2019","unstructured":"Shao, Y.S., Clemons, J., Venkatesan, R., Zimmer, B., Fojtik, M., Jiang, N., Keller, B., Klinefelter, A., Pinckney, N., Raina, P., Tell, S.G., Zhang, Y., Dally, W.J., Emer, J.S., Gray, C.T., Keckler, S.W., Khailany, B. Simba: Scaling deep-learning inference with multi-chip-module-based architecture. In Proceedings of the International Symposium on Microarchitecture (MICRO) (Columbus, OH, USA, 2019), Association for Computing Machinery, New York, NY, USA."},
{"key":"e_1_2_1_22_1","volume-title":"Hot Chips","author":"Sijstermans F.","year":"2018","unstructured":"Sijstermans, F. The NVIDIA deep learning accelerator. In Hot Chips (2018)."},
{"key":"e_1_2_1_23_1","volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA)","author":"Venkataramani S.","year":"2017","unstructured":"Venkataramani, S., Ranjan, A., Banerjee, S., Das, D., Avancha, S., Jagannathan, A., Durg, A., Nagaraj, D., Kaul, B., Dubey, P., Raghunathan, A. ScaleDeep: A scalable compute architecture for learning and evaluating deep networks. In Proceedings of the International Symposium on Computer Architecture (ISCA) (Toronto, ON, Canada, 2017), Association for Computing Machinery, New York, NY, USA."},
{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2018.8310291"},
{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the International Symposia on VLSI Technology and Circuits (VLSI)","author":"Zimmer B.","year":"2019","unstructured":"Zimmer, B., Venkatesan, R., Shao, Y.S., Clemons, J., Fojtik, M., Jiang, N., Keller, B., Klinefelter, A., Pinckney, N., Raina, P., Tell, S.G., Zhang, Y., Dally, W.J., Emer, J.S., Gray, C.T., Keckler, S.W., Khailany, B. A 0.11 pJ\/Op, 0.32--128 TOPS, scalable multi-chip-module-based deep neural network accelerator with ground-reference signaling in 16nm. In Proceedings of the International Symposia on VLSI Technology and Circuits (VLSI) (2019), IEEE, Kyoto, Japan."}
],"container-title":["Communications of the ACM"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460227","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3460227","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3460227","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:12:16Z","timestamp":1750191136000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460227"}},"subtitle":["scaling deep-learning inference with chiplet-based architecture"],"short-title":[],"issued":{"date-parts":[[2021,5,24]]},"references-count":24,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2021,6]]}},"alternative-id":["10.1145\/3460227"],"URL":"https:\/\/doi.org\/10.1145\/3460227","relation":{},"ISSN":["0001-0782","1557-7317"],"issn-type":[{"value":"0001-0782","type":"print"},{"value":"1557-7317","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,5,24]]},"assertion":[{"value":"2021-05-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}