{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,24]],"date-time":"2025-08-24T01:11:42Z","timestamp":1755997902014,"version":"3.41.0"},"reference-count":32,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2020,5,29]],"date-time":"2020-05-29T00:00:00Z","timestamp":1590710400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2020,6,30]]},"abstract":"<jats:p>Recent trends in deep convolutional neural networks (DCNNs) impose hardware accelerators as a viable solution for computer vision and speech recognition. The Orlando SoC architecture from STMicroelectronics targets exactly this class of problems by integrating hardware-accelerated convolutional blocks together with DSPs and on-chip memory resources to enable energy-efficient designs of DCNNs. The main advantage of the Orlando platform is to have runtime configurable convolutional accelerators that can adapt to different DCNN workloads. This opens new challenges for mapping the computation to the accelerators and for managing the on-chip resources efficiently. In this work, we propose a runtime design space exploration and mapping methodology for runtime resource management in terms of on-chip memory, convolutional accelerators, and external bandwidth. 
Experimental results are reported in terms of power\/performance scalability, Pareto analysis, mapping adaptivity, and accelerator utilization for the Orlando architecture mapping the VGG-16, Tiny-Yolo(v2), and MobileNet topologies.<\/jats:p>","DOI":"10.1145\/3379933","type":"journal-article","created":{"date-parts":[[2020,5,30]],"date-time":"2020-05-30T04:22:06Z","timestamp":1590812526000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Runtime Design Space Exploration and Mapping of DCNNs for the Ultra-Low-Power Orlando SoC"],"prefix":"10.1145","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-5140-6463","authenticated-orcid":false,"given":"Ahmet","family":"Erdem","sequence":"first","affiliation":[{"name":"Politecnico di Milano, Milan, Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1668-0883","authenticated-orcid":false,"given":"Cristina","family":"Silvano","sequence":"additional","affiliation":[{"name":"Politecnico di Milano, Milan, Italy"}]},{"given":"Thomas","family":"Boesch","sequence":"additional","affiliation":[{"name":"STMicroelectronics, Geneva, Switzerland"}]},{"given":"Andrea Carlo","family":"Ornstein","sequence":"additional","affiliation":[{"name":"STMicroelectronics, Geneva, Switzerland"}]},{"given":"Surinder-Pal","family":"Singh","sequence":"additional","affiliation":[{"name":"STMicroelectronics, Geneva, Switzerland"}]},{"given":"Giuseppe","family":"Desoli","sequence":"additional","affiliation":[{"name":"STMicroelectronics, Geneva, Switzerland"}]}],"member":"320","published-online":{"date-parts":[[2020,5,29]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2015.41"},{"key":"e_1_2_1_2_1","volume-title":"TVM: End-to-end optimization stack for deep learning. arxiv:1802.04799.","author":"Chen Tianqi","year":"2018","unstructured":"Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q. Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: End-to-end optimization stack for deep learning. arxiv:1802.04799."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.58"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2017.2749425"},{"key":"e_1_2_1_5_1","volume-title":"Proceedings of Embedded World","author":"Desoli Giuseppe","year":"2018","unstructured":"Giuseppe Desoli, Thomas Boesch, Surinder Pal-Singh, and Nitin Chawla. 2018. A new scalable architecture to accelerate deep convolutional neural networks for low power IoT applications. In Proceedings of Embedded World 2018."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2017.7870349"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASAP.2018.8445096"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3037697.3037702"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304014"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2017.25"},{"key":"e_1_2_1_11_1","volume-title":"Dally","author":"Han Song","year":"2015","unstructured":"Song Han, Huizi Mao, and William J. Dally. 2015. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. arxiv:1510.00149."},{"key":"e_1_2_1_12_1","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. arxiv:1512.03385."},{"key":"e_1_2_1_13_1","unstructured":"Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arxiv:1704.04861."},{"key":"e_1_2_1_14_1","volume-title":"Hinton","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS\u201912). 1106--1114. http:\/\/papers.nips.cc\/paper\/4824-imagenet-classification-with-deep-convolutional-neural-networks."},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3173162.3173176"},{"key":"e_1_2_1_16_1","volume-title":"Compiled. Retrieved","author":"Leary Chris","year":"2017","unstructured":"Chris Leary and Todd Wang. 2017. XLA: TensorFlow, Compiled. Retrieved July 17, 2019 from https:\/\/www.tensorflow.org\/xla."},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.23919\/DATE.2018.8342033"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2017.29"},{"key":"e_1_2_1_19_1","volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201919)","author":"Parashar Angshuman","year":"2019","unstructured":"Angshuman Parashar, Priyanka Raina, Yakun Sophia Shao, Yu-Hsin Chen, Victor A. Ying, Anurag Mukkara, Rangharajan Venkatesan, Brucek Khailany, Stephen W. Keckler, and Joel S. Emer. 2019. Timeloop: A systematic approach to DNN accelerator evaluation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201919). IEEE, Los Alamitos, CA, 304--315. DOI:https:\/\/doi.org\/10.1109\/ISPASS.2019.00042"},{"key":"e_1_2_1_20_1","doi-asserted-by":"crossref","unstructured":"Joseph Redmon and Ali Farhadi. 2016. YOLO9000: Better, faster, stronger. arxiv:1612.08242.","DOI":"10.1109\/CVPR.2017.690"},{"key":"e_1_2_1_21_1","volume-title":"Glow: Graph lowering compiler techniques for neural networks. arxiv:1805.00907.","author":"Rotem Nadav","year":"2018","unstructured":"Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, et al. 2018. Glow: Graph lowering compiler techniques for neural networks. arxiv:1805.00907."},{"key":"e_1_2_1_22_1","unstructured":"Ananda Samajdar, Yuhao Zhu, Paul N. Whatmough, Matthew Mattina, and Tushar Krishna. 2018. SCALE-Sim: Systolic CNN accelerator. arxiv:1811.02883."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2016.7783720"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-94-007-1488-5_4"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2016.7418008"},{"key":"e_1_2_1_26_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arxiv:1409.1556."},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going deeper with convolutions. arxiv:1409.4842.","DOI":"10.1109\/CVPR.2015.7298594"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2016.22"},{"volume-title":"Proceedings of the 2017 54th ACM\/EDAC\/IEEE Design Automation Conference (DAC\u201917)","author":"Wei Xuechao","key":"e_1_2_1_29_1","unstructured":"Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and J. Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Proceedings of the 2017 54th ACM\/EDAC\/IEEE Design Automation Conference (DAC\u201917). ACM, New York, NY, 1--6. DOI:https:\/\/doi.org\/10.1145\/3061639.3062207"},{"key":"e_1_2_1_30_1","volume-title":"XLA: Domain-specific compiler for linear algebra that optimizes TensorFlow computations.","author":"Team XLA","year":"2019","unstructured":"XLA Team. 2019. XLA: Domain-specific compiler for linear algebra that optimizes TensorFlow computations."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00040"}],"container-title":["ACM Transactions on Architecture and Code
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3379933","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3379933","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:41:20Z","timestamp":1750200080000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3379933"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,5,29]]},"references-count":32,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2020,6,30]]}},"alternative-id":["10.1145\/3379933"],"URL":"https:\/\/doi.org\/10.1145\/3379933","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2020,5,29]]},"assertion":[{"value":"2019-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-05-29","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}