{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,20]],"date-time":"2026-04-20T22:49:19Z","timestamp":1776725359195,"version":"3.51.2"},"reference-count":57,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2022,12,16]],"date-time":"2022-12-16T00:00:00Z","timestamp":1671148800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,3,31]]},"abstract":"<jats:p>\n            Convolutional neural networks (CNNs) are emerging as powerful tools for image processing in important commercial applications. We focus on the important problem of improving the latency of image recognition. While CNNs are highly amenable to prefetching and multithreading to avoid memory latency issues, CNNs\u2019 large data \u2013 each layer\u2019s input, filters, and output \u2013 poses a memory bandwidth problem. While previous work captures only some of the enormous data reuse,\n            <jats:italic>full reuse<\/jats:italic>\n            implies that the initial input image and filters are read once from off-chip and the final output is written once off-chip without spilling the intermediate layers\u2019 data to off-chip. We propose\n            <jats:italic>Occam<\/jats:italic>\n            to capture full reuse via four contributions. First, we identify the necessary conditions for full reuse. Second, we identify the\n            <jats:italic>dependence closure<\/jats:italic>\n            as the sufficient condition to capture full reuse using the least on-chip memory. 
Third, because the dependence closure is often too large to fit in on-chip memory, we propose a dynamic programming algorithm that optimally partitions a given CNN to guarantee the least off-chip traffic at the partition boundaries for a given on-chip capacity. While tiling is well-known, our contribution determines the optimal cross-layer tiles. Occam\u2019s partitions reside on different chips, forming a pipeline so that a partition\u2019s filters and dependence closure remain on-chip as different images pass through (i.e., each partition incurs off-chip traffic only for its inputs and outputs). Finally, because the optimal partitions may result in an unbalanced pipeline, we propose\n            <jats:italic>staggered asynchronous pipelines (STAPs)<\/jats:italic>\n            that replicate bottleneck stages to improve throughput by staggering mini-batches across replicas. Importantly, STAPs achieve balanced pipelines\n            <jats:italic>without<\/jats:italic>\n            changing Occam\u2019s optimal partitioning. Our simulations show that, on average, Occam cuts off-chip transfers by 21\u00d7 and achieves 2.04\u00d7 and 1.21\u00d7 better performance, and 33% better energy than the base case, respectively. 
Using a field-programmable gate array (FPGA) implementation, Occam performs 6.1\u00d7 and 1.5\u00d7 better, on average, than the base case and Layer Fusion, respectively.\n          <\/jats:p>","DOI":"10.1145\/3566052","type":"journal-article","created":{"date-parts":[[2022,12,16]],"date-time":"2022-12-16T14:04:55Z","timestamp":1671199495000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Occam: Optimal Data Reuse for Convolutional Neural Networks"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3370-3576","authenticated-orcid":false,"given":"Ashish","family":"Gondimalla","sequence":"first","affiliation":[{"name":"Purdue University, West Lafayette, Indiana, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4022-8897","authenticated-orcid":false,"given":"Jianqiao","family":"Liu","sequence":"additional","affiliation":[{"name":"Google, USA, Mountain View, California, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4164-4542","authenticated-orcid":false,"given":"Mithuna","family":"Thottethodi","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, Indiana, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6624-4372","authenticated-orcid":false,"given":"T. N.","family":"Vijaykumar","sequence":"additional","affiliation":[{"name":"Purdue University, West Lafayette, Indiana, USA"}]}],"member":"320","published-online":{"date-parts":[[2022,12,16]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"2022. NVIDIA Deep Learning Performance documentation. Retrieved October 11 2022 from https:\/\/docs.nvidia.com\/deeplearning\/performance\/dl-performance-convolutional\/index.html. 
Updated May 17 2022."},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3123982"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.11"},{"key":"e_1_3_2_5_2","volume-title":"49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Alwani Manoj","year":"2016","unstructured":"Manoj Alwani, Han Chen, Michael Ferdman, and Peter Milder. 2016. Fused-layer CNN accelerators. In 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916), Taipei, Taiwan. IEEE, 1\u201312."},{"key":"e_1_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919648"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628106"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2615094"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.1145\/3461648.3463848"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541967"},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.58"},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2016.7418007"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/207110.207162"},{"key":"e_1_3_2_14_2","volume-title":"Introduction to Algorithms (3rd ed.)","author":"Cormen Thomas H.","year":"2009","unstructured":"Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (3rd ed.). 
The MIT Press."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/984458.984486"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750389"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/JETCAS.2019.2905361"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS.2017.8050809"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2016.30"},{"key":"e_1_3_2_22_2","article-title":"PipeDream: Fast and efficient pipeline parallel DNN training","volume":"1806","author":"Harlap Aaron","year":"2018","unstructured":"Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, and Phillip B. Gibbons. 2018. PipeDream: Fast and efficient pipeline parallel DNN training. CoRR abs\/1806.03377 (2018). arxiv:1806.03377 http:\/\/arxiv.org\/abs\/1806.03377.","journal-title":"CoRR"},{"key":"e_1_3_2_23_2","doi-asserted-by":"crossref","unstructured":"Aaron Harlap Deepak Narayanan Amar Phanishayee Vivek Seshadri Gregory R. Ganger and Phillip B. Gibbons. 2018. PipeDream: Pipeline Parallelism for DNN Training. SysML 2018: Conference on Systems and Machine Learning Extended Abstract and Poster Huntsville Ontario Canada. Association for Computing Machinery.","DOI":"10.1145\/3341301.3359646"},{"key":"e_1_3_2_24_2","article-title":"Deep residual learning for image recognition","volume":"1512","author":"He Kaiming","year":"2015","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. 
CoRR abs\/1512.03385 (2015).","journal-title":"CoRR"},{"key":"e_1_3_2_25_2","article-title":"GPipe: Efficient training of giant neural networks using pipeline parallelism","volume":"1811","author":"Huang Yanping","year":"2018","unstructured":"Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. 2018. GPipe: Efficient training of giant neural networks using pipeline parallelism. CoRR abs\/1811.06965 (2018). arxiv:1811.06965.","journal-title":"CoRR"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICECS49266.2020.9294981"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/73560.73588"},{"key":"e_1_3_2_28_2","article-title":"Exploring hidden dimensions in parallelizing convolutional neural networks","volume":"1802","author":"Jia Zhihao","year":"2018","unstructured":"Zhihao Jia, Sina Lin, Charles R. Qi, and Alex Aiken. 2018. Exploring hidden dimensions in parallelizing convolutional neural networks. CoRR abs\/1802.04924 (2018). arxiv:1802.04924.","journal-title":"CoRR"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080246"},{"key":"e_1_3_2_30_2","article-title":"One weird trick for parallelizing convolutional neural networks","volume":"1404","author":"Krizhevsky Alex","year":"2014","unstructured":"Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. CoRR abs\/1404.5997 (2014). arxiv:1404.5997.","journal-title":"CoRR"},{"key":"e_1_3_2_31_2","first-page":"1097","volume-title":"Advances in Neural Information Processing Systems 25","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Lake Tahoe, Nevada. 
Curran Associates, Inc., 1097\u20131105."},{"key":"e_1_3_2_32_2","article-title":"Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization","volume":"1811","author":"Kung H. T.","year":"2018","unstructured":"H. T. Kung, Bradley McDanel, and Sai Qian Zhang. 2018. Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization. CoRR abs\/1811.04770 (2018). arxiv:1811.04770.","journal-title":"CoRR"},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/106972.106981"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/5.726791"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485964"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458744.3473353"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.5555\/3045390.3045690"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/2694344.2694358"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPA-BDCloud-SocialCom-SustainCom51426.2020.00041"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2016.2549523"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/2694344.2694364"},{"key":"e_1_3_2_42_2","unstructured":"Micron Technical Note. 2017. The next generation graphics DRAM. Micron 20. 
Retrieved October 11 2022 from https:\/\/www.micron.com\/-\/media\/client\/global\/documents\/products\/technical-note\/dram\/tned03_gddr6.pdf."},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447818.3460378"},{"key":"e_1_3_2_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2019.00042"},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080254"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462176"},{"key":"e_1_3_2_47_2","article-title":"ImageNet large scale visual recognition challenge","volume":"1409","author":"Russakovsky Olga","year":"2014","unstructured":"Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. 2014. ImageNet large scale visual recognition challenge. CoRR abs\/1409.0575 (2014).","journal-title":"CoRR"},{"key":"e_1_3_2_48_2","volume-title":"25th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM\u201917)","author":"Shen Yongming","year":"2017","unstructured":"Yongming Shen, Michael Ferdman, and Peter Milder. 2017. Escher: A CNN accelerator with flexible buffering to minimize off-chip transfer. In 25th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM\u201917), Napa, CA, USA. IEEE, 93\u2013100."},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-019-0197-0"},{"key":"e_1_3_2_50_2","article-title":"HyPar: Towards hybrid parallelism for deep learning accelerator array","volume":"1901","author":"Song Linghao","year":"2019","unstructured":"Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Li, and Yiran Chen. 2019. HyPar: Towards hybrid parallelism for deep learning accelerator array. CoRR abs\/1901.02067 (2019). 
arxiv:1901.02067.","journal-title":"CoRR"},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISVLSI49217.2020.00051"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-57675-2_14"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/113445.113449"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.5555\/353939"},{"key":"e_1_3_2_55_2","volume-title":"International Conference on Learning Representations (ICLR\u201916)","author":"Yu Fisher","year":"2016","unstructured":"Fisher Yu and Vladlen Koltun. 2016. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR\u201916), San Juan, Puerto Rico."},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/NORCHIP.2016.7792892"},{"key":"e_1_3_2_57_2","first-page":"818","volume-title":"Visualizing and Understanding Convolutional Networks","author":"Zeiler Matthew D.","year":"2014","unstructured":"Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. 
Springer International Publishing, 818\u2013833."},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3566052","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3566052","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:08:33Z","timestamp":1750183713000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3566052"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,16]]},"references-count":57,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,3,31]]}},"alternative-id":["10.1145\/3566052"],"URL":"https:\/\/doi.org\/10.1145\/3566052","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,12,16]]},"assertion":[{"value":"2021-09-17","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-09-07","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-12-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}