{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T15:40:24Z","timestamp":1777736424727,"version":"3.51.4"},"reference-count":45,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2021,6,28]],"date-time":"2021-06-28T00:00:00Z","timestamp":1624838400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100003725","name":"National Research Foundation of Korea","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100003725","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Korea government","award":["NRF-2019R1A5A1027055, NRF-2020R1A2C2004329, and 2020-0-01309"],"award-info":[{"award-number":["NRF-2019R1A5A1027055, NRF-2020R1A2C2004329, and 2020-0-01309"]}]},{"name":"Institute of Information & communications Technology Planning & Evaluation"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Des. Autom. Electron. Syst."],"published-print":{"date-parts":[[2021,11,30]]},"abstract":"<jats:p>\n            This article discusses a high-performance near-memory neural network (NN) accelerator architecture that uses the logic die in three-dimensional (3D) High Bandwidth Memory (HBM)-like memory. As most previously reported 3D memory-based near-memory NN accelerator designs used Hybrid Memory Cube (HMC) memory, we first identify the key differences between HBM and HMC in terms of near-memory NN accelerator design. One of the major differences between the two 3D memories is that HBM has centralized through-silicon-via (TSV) channels while HMC has distributed TSV channels for separate vaults. 
Based on the observation, we introduce the\n            <jats:italic>Round-Robin Data Fetching<\/jats:italic>\n            and\n            <jats:italic>Groupwise Broadcast<\/jats:italic>\n            schemes to exploit the centralized TSV channels for improvement of the data feeding rate for the processing elements. Using synthesized designs in a 28-nm CMOS technology, performance and energy consumption of the proposed architectures with various dataflow models are evaluated. Experimental results show that the proposed schemes reduce the runtime by 16.4\u201339.3% on average and the energy consumption by 2.1\u20135.1% on average compared to conventional data fetching schemes.\n          <\/jats:p>","DOI":"10.1145\/3460971","type":"journal-article","created":{"date-parts":[[2021,6,28]],"date-time":"2021-06-28T17:06:56Z","timestamp":1624900016000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["High-throughput Near-Memory Processing on CNNs with 3D HBM-like Memory"],"prefix":"10.1145","volume":"26","author":[{"given":"Naebeom","family":"Park","sequence":"first","affiliation":[{"name":"Pohang University of Science and Technology, Pohang"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sungju","family":"Ryu","sequence":"additional","affiliation":[{"name":"Pohang University of Science and Technology, Pohang"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jaeha","family":"Kung","sequence":"additional","affiliation":[{"name":"DGIST, Daegu"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jae-Joon","family":"Kim","sequence":"additional","affiliation":[{"name":"Pohang University of Science and Technology, Pohang"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,6,28]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"NVIDIA Turing Architecture In-Depth. 
Retrieved","year":"2018","unstructured":"[n.d.]. NVIDIA Turing Architecture In-Depth. Retrieved December 7, 2018 from https:\/\/devblogs.nvidia.com\/nvidia-turing-architecture-in-depth\/."},{"key":"e_1_2_1_2_1","volume-title":"Reinventing Memory Technology. Retrieved","year":"2018","unstructured":"[n.d.]. Reinventing Memory Technology. Retrieved December 7, 2018 from https:\/\/www.amd.com\/en\/technologies\/hbm\/."},{"key":"e_1_2_1_3_1","unstructured":"Takuya Akiba Shuji Suzuki and Keisuke Fukuda. 2017. Extremely large minibatch SGD: Training ResNet-50 on ImageNet in 15 minutes. arXiv:1711.04325. Retrieved from https:\/\/arxiv.org\/abs\/1711.04325."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2017.2752706"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.58"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2016.2616357"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2018.8310257"},{"key":"e_1_2_1_8_1","volume-title":"Proceedings of the IEEE Hot Chips 29 Symposium (HCS\u201917)","author":"Dean Jeff","year":"2017","unstructured":"Jeff Dean. 2017. Recent advances in artificial intelligence and the implications for computer system design. In Proceedings of the IEEE Hot Chips 29 Symposium (HCS\u201917). 
1\u2013116."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_2_1_10_1","volume-title":"2nd Workshop on Near-Data Processing (WoNDP\u201914)","author":"Eckert Yasuko","unstructured":"Yasuko Eckert, Nuwan Jayasena, and Gabriel H. Loh. 2014. Thermal feasibility of die-stacked processing in memory. In 2nd Workshop on Near-Data Processing (WoNDP\u201914)."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/3093337.3037702"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2017.7989385"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001163"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3020078.3021745"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00140"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/HOTCHIPS.2011.7477494"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/VLSIT.2012.6242474"},{"key":"e_1_2_1_19_1","volume-title":"High bandwidth memory (HBM) dram. J. Educ. Sust. Dev.235","author":"Standard JEDEC","year":"2013","unstructured":"JEDEC Standard. 2013. High bandwidth memory (HBM) dram. J. Educ. Sust. Dev.235 (2013)."},{"key":"e_1_2_1_20_1","volume-title":"Scarpazza","author":"Jia Zhe","year":"2018","unstructured":"Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. 
Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. Technical Report. Citadel."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/3140659.3080246"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001178"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/3294771.3294937"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/2999134.2999257"},{"key":"e_1_2_1_25_1","volume-title":"Deep learning. Nature 521, 7553","author":"LeCun Yann","year":"2015","unstructured":"Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521, 7553 (2015), 436."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2019.8662302"},{"key":"e_1_2_1_27_1","volume-title":"Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC\u201916)","author":"Lee Jong Chern","year":"2016","unstructured":"Jong Chern Lee, Jihwan Kim, Kyung Whan Kim, Young Jun Ku, Dae Suk Kim, Chunseok Jeong, Tae Sik Yun, Hongjung Kim, Ho Sung Cho, Yeon Ok Kim, Jae Hwan Kim, Jin Ho Kim, Sangmuk Oh, Hyun Sung Lee, Ki Hun Kwon, Dong Beom Lee, Young Jae Choi, Jeajin Lee, Hyeon Gon Kim, Jun Hyun Chun, Jonghoon Oh, and Seok Hee Lee. 2016. 
18.3 A 1.2 V 64Gb 8-channel 256GB\/s HBM DRAM with peripheral-base-die architecture and small-swing technique on heavy load interface. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC\u201916). IEEE, 318\u2013319."},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.5555\/2132325.2132479"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSCC.2017.7870353"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3124545"},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 1\u201312","author":"Sancho Jos\u00e9 Carlos","unstructured":"Jos\u00e9 Carlos Sancho and Darren J. Kerbyson. 2008. Analysis of double buffering on two different multicore architectures: Quad-core Opteron and the Cell-BE. In Proceedings of the 2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 1\u201312."},{"key":"e_1_2_1_32_1","unstructured":"Andreas Schlapka. 2018. Micron Announces Shift in High-Performance Memory Roadmap Strategy. Retrieved from https:\/\/www.micron.com\/about\/blogs\/2018\/august\/micron-announces-shift-in-high-performance-memory-roadmap-strategy."},{"key":"e_1_2_1_33_1","unstructured":"Pierre Sermanet David Eigen Xiang Zhang Micha\u00ebl Mathieu Rob Fergus and Yann LeCun. 2013. OverFeat: Integrated recognition localization and detection using convolutional networks. arXiv:1312.6229. 
Retrieved from https:\/\/arxiv.org\/abs\/1312.6229."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1038\/nature16961"},{"key":"e_1_2_1_35_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556. Retrieved from https:\/\/arxiv.org\/abs\/1409.1556."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSSC.2016.2602221"},{"key":"e_1_2_1_37_1","unstructured":"Synopsys. 2017. PrimeTime Static Timing Analysis. Retrieved from http:\/\/www.synopsys.com\/Tools\/Implementation\/RTLSynthesis\/DesignCompiler\/Pages\/default.aspx."},{"key":"e_1_2_1_38_1","unstructured":"Synopsys. 2018. Design Compiler. Retrieved from http:\/\/www.synopsys.com\/Tools\/Implementation\/RTLSynthesis\/DesignCompiler\/Pages\/default.aspx."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2017.2761740"},{"key":"e_1_2_1_40_1","volume-title":"Proceedings of the IEEE Hot Chips 20 Symposium (HCS\u201908)","author":"Williams S.","unstructured":"S. Williams, D. Patterson, L. Oliker, J. Shalf, and K. Yelick. 2008. The roofline model: A pedagogical tool for program analysis and optimization. 
In Proceedings of the IEEE Hot Chips 20 Symposium (HCS\u201908). 1\u201371."},{"key":"e_1_2_1_41_1","volume-title":"Michael Niemier, Jason Cong, Yu Hu, and Yiyu Shi.","author":"Xu Xiaowei","year":"2018","unstructured":"Xiaowei Xu, Yukun Ding, Sharon Xiaobo Hu, Michael Niemier, Jason Cong, Yu Hu, and Yiyu Shi. 2018. Scaling for edge inference of deep neural networks. Nat. Electr. 1 (Apr. 2018)."},{"key":"e_1_2_1_42_1","unstructured":"Tom Young Devamanyu Hazarika Soujanya Poria and Erik Cambria. 2017. Recent trends in deep learning based natural language processing. arXiv:1708.02709. Retrieved from https:\/\/arxiv.org\/abs\/1708.02709."},{"key":"e_1_2_1_43_1","volume-title":"Zeiler and Rob Fergus","author":"Matthew","year":"2014","unstructured":"Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision. 
Springer, 818\u2013833."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.23919\/VLSIC.2019.8778193"}],"container-title":["ACM Transactions on Design Automation of Electronic Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460971","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3460971","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:22Z","timestamp":1750193302000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460971"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,6,28]]},"references-count":45,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2021,11,30]]}},"alternative-id":["10.1145\/3460971"],"URL":"https:\/\/doi.org\/10.1145\/3460971","relation":{},"ISSN":["1084-4309","1557-7309"],"issn-type":[{"value":"1084-4309","type":"print"},{"value":"1557-7309","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,6,28]]},"assertion":[{"value":"2021-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-06-28","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}