{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,19]],"date-time":"2026-01-19T09:50:47Z","timestamp":1768816247033,"version":"3.49.0"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"5s","license":[{"start":{"date-parts":[[2019,10,7]],"date-time":"2019-10-07T00:00:00Z","timestamp":1570406400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2019,10,31]]},"abstract":"<jats:p>Deep Neural Networks (DNNs) have become an essential component of various applications. While today\u2019s DNNs are mainly restricted to cloud services, network connectivity, energy, and data privacy problems make it important to support efficient DNN computation on low-cost, low-power processors like microcontrollers. However, due to the constrained computation resources, it is challenging to execute large DNN models on microcontrollers. Using sub-byte low-precision input activations and weights is a typical method to reduce DNN computation. But on byte-addressable microcontrollers, the sub-byte computation is not well supported. The sub-byte inputs and weights need to be unpacked from bitstreams before computation, which incurs significant computation and energy overhead.<\/jats:p>\n          <jats:p>In this paper, we propose the TF-Net pipeline to efficiently deploy sub-byte DNNs on microcontrollers. While TF-Net allows for a range of weight and input precision, we find Ternary weights and Four-bit inputs provide the optimal balance between model accuracy, computation performance, and energy efficiency. TF-Net first includes a training framework for sub-byte low-precision DNN models. Two algorithms are then introduced to accelerate the trained models. The first, direct buffer convolution, amortizes unpacking overhead by caching unpacked inputs. The second, packed sub-byte multiply-accumulate, utilizes a single multiplication instruction to perform multiple sub-byte multiply-accumulate computations. To further accelerate DNN computation, we propose two instructions, Multiply-Shift-Accumulate and Unpack, to extend the existing microcontroller instruction set. On the tested networks, TF-Net can help improve the computation performance and energy efficiency by 1.83\u00d7 and 2.28\u00d7 on average, respectively.<\/jats:p>","DOI":"10.1145\/3358189","type":"journal-article","created":{"date-parts":[[2019,10,10]],"date-time":"2019-10-10T13:13:05Z","timestamp":1570713185000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":15,"title":["TF-Net"],"prefix":"10.1145","volume":"18","author":[{"given":"Jiecao","family":"Yu","sequence":"first","affiliation":[{"name":"University of Michigan"}]},{"given":"Andrew","family":"Lukefahr","sequence":"additional","affiliation":[{"name":"Indiana University Bloomington"}]},{"given":"Reetuparna","family":"Das","sequence":"additional","affiliation":[{"name":"University of Michigan"}]},{"given":"Scott","family":"Mahlke","sequence":"additional","affiliation":[{"name":"University of Michigan"}]}],"member":"320","published-online":{"date-parts":[[2019,10,7]]},"reference":[
{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3123982"},
{"key":"e_1_2_1_2_1","unstructured":"Atmel. 2010. Atmel AT86RF212 transceiver. http:\/\/ww1.microchip.com\/downloads\/en\/DeviceDoc\/doc8168.pdf."},
{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.574"},
{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001177"},
{"key":"e_1_2_1_5_1","volume-title":"cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759","author":"Chetlur Sharan","year":"2014","unstructured":"Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)."},
{"key":"e_1_2_1_6_1","volume-title":"Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems. 3123--3131.","author":"Courbariaux Matthieu","year":"2015","unstructured":"Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. 2015. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems. 3123--3131."},
{"key":"e_1_2_1_7_1","volume-title":"Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830","author":"Courbariaux Matthieu","year":"2016","unstructured":"Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830 (2016)."},
{"key":"e_1_2_1_8_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},
{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00040"},
{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"},
{"key":"e_1_2_1_11_1","volume-title":"International Conference on Machine Learning. 1737--1746","author":"Gupta Suyog","year":"2015","unstructured":"Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In International Conference on Machine Learning. 1737--1746."},
{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.322"},
{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},
{"key":"e_1_2_1_14_1","unstructured":"iFixit. 2013. Fitbit Flex Teardown. https:\/\/www.ifixit.com\/Teardown\/Fitbit+Flex+Teardown\/16050."},
{"key":"e_1_2_1_15_1","volume-title":"Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)."},
{"key":"e_1_2_1_16_1","unstructured":"Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. (2009)."},
{"key":"e_1_2_1_17_1","volume-title":"http:\/\/www.latticesemi.com\/Products\/FPGAandCPLD\/iCE40","unstructured":"Lattice. 2013. Lattice. http:\/\/www.latticesemi.com\/Products\/FPGAandCPLD\/iCE40."},
{"key":"e_1_2_1_18_1","volume-title":"Ternary weight networks. arXiv preprint arXiv:1605.04711","author":"Li Fengfu","year":"2016","unstructured":"Fengfu Li, Bo Zhang, and Bin Liu. 2016. Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016)."},
{"key":"e_1_2_1_19_1","volume-title":"Network in network. arXiv preprint arXiv:1312.4400","author":"Lin Min","year":"2013","unstructured":"Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in network. arXiv preprint arXiv:1312.4400 (2013)."},
{"key":"e_1_2_1_20_1","unstructured":"Arm MBED. 2017. STM32 NUCLEO-F411RE development board. https:\/\/os.mbed.com\/platforms\/ST-Nucleo-F411RE\/."},
{"key":"e_1_2_1_21_1","unstructured":"Mark McDermott. 2008. The ARM Instruction Set Architecture. http:\/\/users.ece.utexas.edu\/valvano\/EE345M\/Arm_EE382N_4.pdf."},
{"key":"e_1_2_1_22_1","unstructured":"NVIDIA. 2019. cuDNN Installation Guide :: Deep Learning SDK Documentation. https:\/\/docs.nvidia.com\/deeplearning\/sdk\/cudnn-developer-guide\/index.html."},
{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00063"},
{"key":"e_1_2_1_24_1","volume-title":"Precision highway for ultra low-precision quantization. arXiv preprint arXiv:1812.09818","author":"Park Eunhyeok","year":"2018","unstructured":"Eunhyeok Park, Dongyoung Kim, Sungjoo Yoo, and Peter Vajda. 2018. Precision highway for ultra low-precision quantization. arXiv preprint arXiv:1812.09818 (2018)."},
{"key":"e_1_2_1_25_1","volume-title":"Faster cnns with direct sparse convolutions and guided pruning. arXiv preprint arXiv:1608.01409","author":"Park Jongsoo","year":"2016","unstructured":"Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. 2016. Faster cnns with direct sparse convolutions and guided pruning. arXiv preprint arXiv:1608.01409 (2016)."},
{"key":"e_1_2_1_26_1","unstructured":"Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W."},
{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46493-0_32"},
{"key":"e_1_2_1_28_1","volume-title":"Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767","author":"Redmon Joseph","year":"2018","unstructured":"Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)."},
{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080221"},
{"key":"e_1_2_1_30_1","unstructured":"STMicroelectronics. 2018. STM32 Power Shield Datasheet. http:\/\/www.st.com\/content\/ccc\/resource\/technical\/document\/data_brief\/group1\/1d\/46\/2a\/b9\/60\/98\/47\/13\/DM00417848\/files\/DM00417848.pdf\/jcr:content\/translations\/en.DM00417848.pdf."},
{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00037"},
{"key":"e_1_2_1_32_1","volume-title":"Improving the speed of neural networks on CPUs","author":"Vanhoucke Vincent","year":"2011","unstructured":"Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on CPUs. Citeseer."},
{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2019.00029"},
{"key":"e_1_2_1_34_1","unstructured":"Xilinx. 2017. Deep Learning with INT8 Optimization on Xilinx Devices. https:\/\/www.xilinx.com\/support\/documentation\/white_papers\/wp486-deep-learning-int8.pdf."},
{"key":"e_1_2_1_35_1","unstructured":"Xilinx. 2017. Xilinx 8-Bit Dot-Product Acceleration. https:\/\/pdfs.semanticscholar.org\/3ac6\/4259d37ad76c640333bf8cfccd36bb9bc4f0.pdf."},
{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01237-3_23"},
{"key":"e_1_2_1_37_1","volume-title":"Hello edge: Keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128","author":"Zhang Yundong","year":"2017","unstructured":"Yundong Zhang, Naveen Suda, Liangzhen Lai, and Vikas Chandra. 2017. Hello edge: Keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128 (2017)."},
{"key":"e_1_2_1_38_1","volume-title":"DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160","author":"Zhou Shuchang","year":"2016","unstructured":"Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016)."},
{"key":"e_1_2_1_39_1","volume-title":"Trained ternary quantization. arXiv preprint arXiv:1612.01064","author":"Zhu Chenzhuo","year":"2016","unstructured":"Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. 2016. Trained ternary quantization. arXiv preprint arXiv:1612.01064 (2016)."},
{"key":"e_1_2_1_40_1","volume-title":"Euphrates: Algorithm-soc co-design for low-power mobile continuous vision. arXiv preprint arXiv:1803.11232","author":"Zhu Yuhao","year":"2018","unstructured":"Yuhao Zhu, Anand Samajdar, Matthew Mattina, and Paul Whatmough. 2018. Euphrates: Algorithm-soc co-design for low-power mobile continuous vision. arXiv preprint arXiv:1803.11232 (2018)."}
],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3358189","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3358189","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:32:58Z","timestamp":1750199578000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3358189"}},"subtitle":["Deploying Sub-Byte Deep Neural Networks on Microcontrollers"],"short-title":[],"issued":{"date-parts":[[2019,10,7]]},"references-count":40,"journal-issue":{"issue":"5s","published-print":{"date-parts":[[2019,10,31]]}},"alternative-id":["10.1145\/3358189"],"URL":"https:\/\/doi.org\/10.1145\/3358189","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"value":"1539-9087","type":"print"},{"value":"1558-3465","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,10,7]]},"assertion":[{"value":"2019-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-10-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}