{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,24]],"date-time":"2026-03-24T15:19:18Z","timestamp":1774365558864,"version":"3.50.1"},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,9,8]],"date-time":"2023-09-08T00:00:00Z","timestamp":1694131200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"ERDF A way of making Europe","award":["RTI2018-098156-B-C53, MCIN\/AEI\/10.13039\/501100011033"],"award-info":[{"award-number":["RTI2018-098156-B-C53, MCIN\/AEI\/10.13039\/501100011033"]}]},{"name":"NSF OAC","award":["1909900"],"award-info":[{"award-number":["1909900"]}]},{"name":"US Department of Energy ARIAA","award":["PID2020-112827GB-I00, MCIN\/AEI\/10.13039\/501100011033"],"award-info":[{"award-number":["PID2020-112827GB-I00, MCIN\/AEI\/10.13039\/501100011033"]}]},{"name":"Fundaci\u00f3n S\u00e9neca","award":["20749\/FPI\/18"],"award-info":[{"award-number":["20749\/FPI\/18"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["J. Emerg. Technol. Comput. Syst."],"published-print":{"date-parts":[[2023,10,31]]},"abstract":"<jats:p>Increasing deployment of Deep Neural Networks (DNNs) recently fueled interest in the development of specific accelerator architectures capable of meeting their stringent performance and energy consumption requirements. DNN accelerators can be organized around three separate NoCs, namely distribution, multiplier, and reduction networks (or DN, MN, and RN, respectively) between the global buffer(s) and the compute units (multipliers\/adders). Among them, the RN, used to generate and reduce the partial sums produced during DNN processing, is a first-order driver of the area and energy efficiency of the accelerator. 
RNs can be orchestrated to exploit a Temporal, Spatial, or Spatio-Temporal reduction dataflow. Among these, Spatio-Temporal reduction has shown superior performance. However, as we demonstrate in this work, a state-of-the-art implementation of the Spatio-Temporal reduction dataflow, based on the addition of Accumulators (Ac) to the RN (i.e., the RN+Ac strategy), can result in significant area and energy expenses. To cope with this important issue, we propose STIFT (which stands for <jats:italic>Spatio-Temporal Integrated Folding Tree<\/jats:italic>), which implements the Spatio-Temporal reduction dataflow entirely on the RN hardware substrate (i.e., without the need for extra accumulators). STIFT results in significant area and power savings compared to the more complex RN+Ac strategy, while preserving its performance advantage.<\/jats:p>","DOI":"10.1145\/3531011","type":"journal-article","created":{"date-parts":[[2022,5,2]],"date-time":"2022-05-02T12:22:15Z","timestamp":1651494135000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["STIFT: A Spatio-Temporal Integrated Folding Tree for Efficient Reductions in Flexible DNN Accelerators"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1089-2191","authenticated-orcid":false,"given":"Francisco","family":"Mu\u00f1oz-Mart\u00ednez","sequence":"first","affiliation":[{"name":"Universidad de Murcia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3550-720X","authenticated-orcid":false,"given":"Jos\u00e9 L.","family":"Abell\u00e1n","sequence":"additional","affiliation":[{"name":"Universidad Cat\u00f3lica de Murcia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0935-4078","authenticated-orcid":false,"given":"Manuel E.","family":"Acacio","sequence":"additional","affiliation":[{"name":"Universidad de 
Murcia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5738-6942","authenticated-orcid":false,"given":"Tushar","family":"Krishna","sequence":"additional","affiliation":[{"name":"Georgia Institute of Technology"}]}],"member":"320","published-online":{"date-parts":[[2023,9,8]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"[n.d.]. Bluespec System Verilog (BSV). Retrieved from http:\/\/wiki.bluespec.com\/. Access dates 09\/20\/2020."},{"key":"e_1_3_2_3_2","unstructured":"[n.d.]. MAERI Code v1. Retrieved from https:\/\/github.com\/hyoukjun\/MAERI. Access dates 11\/20\/2020."},{"key":"e_1_3_2_4_2","unstructured":"[n.d.]. PyTorch. Retrieved from https:\/\/pytorch.org\/. Access dates 05\/15\/2021."},{"key":"e_1_3_2_5_2","unstructured":"[n.d.]. SIGMA Code v1. Retrieved from https:\/\/github.com\/georgia-tech-synergy-lab\/SIGMA. Access dates 05\/15\/2021."},{"key":"e_1_3_2_6_2","first-page":"149","article-title":"On-line algorithms for path selection in a nonblocking network","author":"Arora S.","year":"1990","unstructured":"S. Arora, T. Leighton, and B. Maggs. 1990. On-line algorithms for path selection in a nonblocking network. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing. 149\u2013158.","journal-title":"In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2016.2592330"},{"key":"e_1_3_2_8_2","article-title":"Matrix-based nonblocking routing algorithm for Bene\u0161 Networks","author":"Chakrabarty A.","year":"2009","unstructured":"A. Chakrabarty, M. Collier, and S. Mukhopadhyay. 2009. Matrix-based nonblocking routing algorithm for Bene\u0161 Networks. 
2009 Computation World: Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns(2009).","journal-title":"2009 Computation World: Future Computing, Service Computation, Cognitive, Adaptive, Content, Patterns"},{"key":"e_1_3_2_9_2","first-page":"1","volume-title":"Proceedings of the 2019 IEEE\/ACM International Symposium on Networks-on-Chip","author":"Chen Kun-Chih (Jimmy)","year":"2019","unstructured":"Kun-Chih (Jimmy) Chen, Masoumeh Ebrahimi, Ting-Yi Wang, and Yuch-Chi Yang. 2019. NoC-based DNN accelerator: A future design paradigm. In Proceedings of the 2019 IEEE\/ACM International Symposium on Networks-on-Chip. 1\u20138."},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2016.2616357"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2017.54"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/JETCAS.2019.2910232"},{"key":"e_1_3_2_13_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805."},{"key":"e_1_3_2_14_2","first-page":"92","article-title":"ShiDianNao: Shifting vision processing closer to the sensor","author":"Du Zidong","year":"2015","unstructured":"Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. 2015. ShiDianNao: Shifting vision processing closer to the sensor. In Proceedings of the 2015 International Symposium on Computer Architecture.92\u2013104.","journal-title":"Proceedings of the 2015 International Symposium on Computer Architecture."},{"key":"e_1_3_2_15_2","unstructured":"Vijay Janapa Reddi Christine Cheng David Kanter Peter Mattson Guenther Schmuelling Carole-Jean Wu Brian Anderson Maximilien Breughe Mark Charlebois William Chou Ramesh Chukka Cody Coleman Sam Davis Pan Deng Greg Diamos Jared Duke Dave Fick J. 
Scott Gardner Itay Hubara Sachin Idgunji Thomas B. Jablin Jeff Jiao Tom St. John Pankaj Kanwar David Lee Jeffery Liao Anton Lokhmotov Francisco Massa Peng Meng Paulius Micikevicius Colin Osborne Gennady Pekhimenko Arun Tejusve Raghunath Rajan Dilip Sequeira Ashish Sirasao Fei Sun Hanlin Tang Michael Thomson Frank Wei Ephrem Wu Lingjie Xu Koichi Yamada Bing Yu George Yuan Aaron Zhong Peizhao Zhang and Yuchen Zhou. 2019. MLPerf inference benchmark. arXiv:1911.02549. Retrieved from https:\/\/arxiv.org\/abs\/1911.02549."},{"key":"e_1_3_2_16_2","unstructured":"Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2015. Deep residual learning for image recognition. arXiv:1512.03385. Retrieved from https:\/\/arxiv.org\/abs\/1512.03385."},{"key":"e_1_3_2_17_2","unstructured":"Andrew G. Howard Menglong Zhu Bo Chen Dmitry Kalenichenko Weijun Wang Tobias Weyand Marco Andreetto and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. Retrieved from https:\/\/arxiv.org\/abs\/1704.04861."},{"key":"e_1_3_2_18_2","unstructured":"Forrest N. Iandola Song Han Matthew W. Moskewicz Khalid Ashraf William J. Dally and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and \\(\\lt\\) 0.5MB model size. arXiv:1602.07360. Retrieved from https:\/\/arxiv.org\/abs\/1602.07360."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_3_2_20_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/978-3-031-01767-4","volume-title":"Data orchestration in deep learning accelerators","author":"Krishna Tushar","year":"2020","unstructured":"Tushar Krishna, Hyoukjun Kwon, Angshuman Parashar, Michael Pellauer, and Ananda Samajdar. 2020. Data orchestration in deep learning accelerators. 
Synthesis Lectures on Computer Architecture 15, 3 (2020), 1\u2013164."},{"key":"e_1_3_2_21_2","first-page":"1106","article-title":"ImageNet classification with deep convolutional neural networks","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. International Conference on Neural Information Processing Systems.1106\u20131114.","journal-title":"International Conference on Neural Information Processing Systems."},{"key":"e_1_3_2_22_2","first-page":"754","article-title":"Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach","author":"Kwon Hyoukjun","year":"2019","unstructured":"Hyoukjun Kwon, Prasanth Chatarasi, Michael Pellauer, Angshuman Parashar, Vivek Sarkar, and Tushar Krishna. 2019. Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach. In Proceedings of the International Symposium on Microarchitecture.754\u2013768.","journal-title":"Proceedings of the International Symposium on Microarchitecture."},{"key":"e_1_3_2_23_2","first-page":"1","volume-title":"Proceedings of the 2017 11th IEEE\/ACM International Symposium on Networks-on-Chip","author":"Kwon Hyoukjun","year":"2017","unstructured":"Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2017. Rethinking NoCs for spatial neural network accelerators. In Proceedings of the 2017 11th IEEE\/ACM International Symposium on Networks-on-Chip. 1\u20138."},{"key":"e_1_3_2_24_2","first-page":"461","article-title":"MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects","author":"Kwon Hyoukjun","year":"2018","unstructured":"Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. 2018. MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. 
International Conference on Architectural Support for Programming Languages and Operating Systems. 461\u2013475.","journal-title":"International Conference on Architectural Support for Programming Languages and Operating Systems."},{"key":"e_1_3_2_25_2","unstructured":"T. T. Lee and Soung-Yue Liew. 1996. Parallel routing algorithms in Benes-Clos networks. In Proceedings of IEEE INFOCOM\u201996 Conference on Computer Communications."},{"key":"e_1_3_2_26_2","doi-asserted-by":"crossref","unstructured":"Wei Liu Dragomir Anguelov Dumitru Erhan Christian Szegedy Scott Reed Cheng-Yang Fu and Alexander C. Berg. 2015. SSD: Single shot MultiBox detector. arXiv:1512.02325. Retrieved from https:\/\/arxiv.org\/abs\/1512.02325.","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2016.2574353"},{"key":"e_1_3_2_28_2","doi-asserted-by":"crossref","first-page":"122","DOI":"10.1109\/LCA.2021.3097253","article-title":"STONNE: Enabling cycle-level microarchitectural simulation for DNN inference accelerators","volume":"20","author":"Mart\u00ednez Francisco Mu\u00f1oz","year":"2021","unstructured":"Francisco Mu\u00f1oz Mart\u00ednez, Jos\u00e9 L. Abell\u00e1n, Manuel E. Acacio, and Tushar Krishna. 2021. STONNE: Enabling cycle-level microarchitectural simulation for DNN inference accelerators. IEEE Computer Architecture Letters 20 (2021), 122\u2013125.","journal-title":"IEEE Computer Architecture Letters"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3479876.3481602"},{"key":"e_1_3_2_30_2","article-title":"SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training","author":"Qin Eric","year":"2020","unstructured":"Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan, Dipankar Das, Bharat Kaul, and Tushar Krishna. 2020. SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training. 
In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture.","journal-title":"Proceedings of the IEEE International Symposium on High-Performance Computer Architecture."},{"key":"e_1_3_2_31_2","unstructured":"Karen Simonyan and Andrew Zisserman. 2016. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Retrieved from https:\/\/arxiv.org\/abs\/1409.1556."},{"key":"e_1_3_2_32_2","unstructured":"Vivienne Sze Yu-Hsin Chen Tien-Ju Yang and Joel Emer. 2017. Efficient processing of deep neural networks: A tutorial and survey. arXiv:1703.09039. Retrieved from https:\/\/arxiv.org\/abs\/1703.09039."},{"key":"e_1_3_2_33_2","first-page":"282","article-title":"mRNA: Enabling efficient mapping space exploration for a reconfigurable neural accelerator","author":"Zhao Zhongyuan","year":"2019","unstructured":"Zhongyuan Zhao, Hyoukjun Kwon, Sachit Kuhar, Weiguang Sheng, Zhigang Mao, and Tushar Krishna. 2019. mRNA: Enabling efficient mapping space exploration for a reconfigurable neural accelerator. 
In Proceedings of the International Symposium on Performance Analysis of Systems and Software.282\u2013292.","journal-title":"Proceedings of the International Symposium on Performance Analysis of Systems and Software."}],"container-title":["ACM Journal on Emerging Technologies in Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3531011","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3531011","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:26Z","timestamp":1750186826000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3531011"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,8]]},"references-count":32,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,10,31]]}},"alternative-id":["10.1145\/3531011"],"URL":"https:\/\/doi.org\/10.1145\/3531011","relation":{},"ISSN":["1550-4832","1550-4840"],"issn-type":[{"value":"1550-4832","type":"print"},{"value":"1550-4840","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,8]]},"assertion":[{"value":"2021-09-17","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-04-10","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-09-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}