{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:57:39Z","timestamp":1750309059997,"version":"3.41.0"},"reference-count":86,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2023,7,19]],"date-time":"2023-07-19T00:00:00Z","timestamp":1689724800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2023,9,30]]},"abstract":"<jats:p>With the growing number of data-intensive workloads, GPU, which is the state-of-the-art single-instruction-multiple-thread (SIMT) processor, is hindered by the memory bandwidth wall. To alleviate this bottleneck, previously proposed 3D-stacking near-bank computing accelerators benefit from abundant bank-internal bandwidth by bringing computations closer to the DRAM banks. However, these accelerators are specialized for certain application domains with simple architecture data paths and customized software mapping schemes. For general-purpose scenarios, lightweight hardware designs for diverse data paths, architectural supports for the SIMT programming model, and end-to-end software optimizations remain challenging.<\/jats:p><jats:p>To address these issues, we propose Memory-centric Processing Unit (MPU), the first SIMT processor based on 3D-stacking near-bank computing architecture. First, to realize diverse data paths with small overheads, MPU adopts a hybrid pipeline with the capability of offloading instructions to near-bank compute-logic. Second, we explore two architectural supports for the SIMT programming model, including a near-bank shared memory design and a multiple activated row-buffers enhancement. Third, we present an end-to-end compilation flow for MPU to support CUDA programs. To fully utilize MPU\u2019s hybrid pipeline, we develop a backend optimization for the instruction offloading decision. The evaluation results of MPU demonstrate 3.46\u00d7 speedup and 2.57\u00d7 energy reduction compared with an NVIDIA Tesla V100 GPU on a set of representative data-intensive workloads.<\/jats:p>","DOI":"10.1145\/3603113","type":"journal-article","created":{"date-parts":[[2023,5,29]],"date-time":"2023-05-29T11:01:13Z","timestamp":1685358073000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["MPU: Memory-centric SIMT Processor via In-DRAM Near-bank Computing"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7285-6682","authenticated-orcid":false,"given":"Xinfeng","family":"Xie","sequence":"first","affiliation":[{"name":"University of California, Santa Barbara, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7206-2396","authenticated-orcid":false,"given":"Peng","family":"Gu","sequence":"additional","affiliation":[{"name":"University of California, Santa Barbara, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8716-5793","authenticated-orcid":false,"given":"Yufei","family":"Ding","sequence":"additional","affiliation":[{"name":"University of California, Santa Barbara, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8440-3875","authenticated-orcid":false,"given":"Dimin","family":"Niu","sequence":"additional","affiliation":[{"name":"Alibaba Group Inc., USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7696-9799","authenticated-orcid":false,"given":"Hongzhong","family":"Zheng","sequence":"additional","affiliation":[{"name":"Alibaba Group Inc., USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0771-4992","authenticated-orcid":false,"given":"Yuan","family":"Xie","sequence":"additional","affiliation":[{"name":"Alibaba Group Inc., USA"}]}],"member":"320","published-online":{"date-parts":[[2023,7,19]]},"reference":[{"unstructured":"NVIDIA. 2018. NVIDIA Tesla V100 GPU Architecture. Retrieved from http:\/\/www.nvidia.com https:\/\/images.nvidia.com\/content\/volta-architecture\/pdf\/volta-architecture-whitepaper.pdf.","key":"e_1_3_2_2_2"},{"key":"e_1_3_2_3_2","first-page":"265","volume-title":"Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916)","author":"Abadi Mart\u00edn","year":"2016","unstructured":"Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI\u201916). USENIX Association, 265\u2013283. Retrieved from https:\/\/www.usenix.org\/conference\/osdi16\/technical-sessions\/presentation\/abadi."},{"doi-asserted-by":"publisher","key":"e_1_3_2_4_2","DOI":"10.1145\/3357526.3357532"},{"key":"e_1_3_2_5_2","first-page":"336","volume-title":"Proceedings of the ACM\/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA\u201915)","author":"Ahn Junwhan","year":"2015","unstructured":"Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture. In Proceedings of the ACM\/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA\u201915). IEEE, 336\u2013348."},{"doi-asserted-by":"publisher","key":"e_1_3_2_6_2","DOI":"10.1109\/LCA.2019.2894800"},{"issue":"1","key":"e_1_3_2_7_2","doi-asserted-by":"crossref","first-page":"14","DOI":"10.1109\/MM.2015.129","article-title":"Hamlet architecture for parallel data reorganization in memory","volume":"36","author":"Akin Berkin","year":"2015","unstructured":"Berkin Akin, Franz Franchetti, and James C. Hoe. 2015. Hamlet architecture for parallel data reorganization in memory. IEEE Micro 36, 1 (2015), 14\u201323.","journal-title":"IEEE Micro"},{"doi-asserted-by":"publisher","key":"e_1_3_2_8_2","DOI":"10.1109\/MICRO.2018.00070"},{"key":"e_1_3_2_9_2","first-page":"1","volume-title":"Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC\u201919)","author":"Arafa Yehia","year":"2019","unstructured":"Yehia Arafa, Abdel-Hameed A. Badawy, Gopinath Chennupati, Nandakishore Santhi, and Stephan Eidenbenz. 2019. Low overhead instruction latency characterization for nvidia gpgpus. In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC\u201919). IEEE, 1\u20138."},{"doi-asserted-by":"publisher","key":"e_1_3_2_10_2","DOI":"10.1145\/3387902.3392613"},{"key":"e_1_3_2_11_2","first-page":"1","volume-title":"Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916)","author":"Asghari-Moghaddam Hadi","year":"2016","unstructured":"Hadi Asghari-Moghaddam, Young Hoon Son, Jung Ho Ahn, and Nam Sung Kim. 2016. Chameleon: Versatile and practical near-DRAM acceleration architecture for large memory systems. In Proceedings of the 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201916). IEEE, 1\u201313."},{"doi-asserted-by":"publisher","key":"e_1_3_2_12_2","DOI":"10.1109\/ISPASS.2009.4919648"},{"doi-asserted-by":"publisher","key":"e_1_3_2_13_2","DOI":"10.1109\/MM.2014.55"},{"doi-asserted-by":"publisher","key":"e_1_3_2_14_2","DOI":"10.1109\/LCA.2016.2577557"},{"doi-asserted-by":"publisher","key":"e_1_3_2_15_2","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"e_1_3_2_16_2","first-page":"33","volume-title":"Proceedings of the Conference on Design, Automation and Test in Europe","author":"Chen Ke","year":"2012","unstructured":"Ke Chen, Sheng Li, Naveen Muralimanohar, Jung Ho Ahn, Jay B. Brockman, and Norman P. Jouppi. 2012. CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory. In Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 33\u201338."},{"doi-asserted-by":"publisher","key":"e_1_3_2_17_2","DOI":"10.1145\/3007787.3001140"},{"key":"e_1_3_2_18_2","article-title":"CHoNDA: Near data acceleration with concurrent host access","author":"Cho Benjamin Y.","year":"2020","unstructured":"Benjamin Y. Cho, Yongkee Kwon, Sangkug Lym, and Mattan Erez. 2020. CHoNDA: Near data acceleration with concurrent host access. In Proceedings of the International Symposium on Computer Architecture.","journal-title":"Proceedings of the International Symposium on Computer Architecture"},{"doi-asserted-by":"publisher","key":"e_1_3_2_19_2","DOI":"10.1145\/514191.514197"},{"unstructured":"Guy Dupenloup. 2004. Automatic synthesis script generation for synopsys design compiler. U.S. Patent 6 836 877.","key":"e_1_3_2_20_2"},{"unstructured":"Yasuko Eckert Nuwan Jayasena and Gabriel H. Loh. 2014. Thermal feasibility of die-stacked processing in memory. In 2nd Workshop on Near-Data Processing (WoNDP) .","key":"e_1_3_2_21_2"},{"doi-asserted-by":"publisher","key":"e_1_3_2_22_2","DOI":"10.1145\/3307650.3322257"},{"doi-asserted-by":"publisher","key":"e_1_3_2_23_2","DOI":"10.1109\/MICRO.2007.12"},{"doi-asserted-by":"publisher","key":"e_1_3_2_24_2","DOI":"10.1145\/3037697.3037702"},{"key":"e_1_3_2_25_2","volume-title":"Proceedings of the ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA\u201920)","author":"Gu Peng","year":"2020","unstructured":"Peng Gu, Xinfeng Xie, Yufei Ding, Guoyang Chen, Weifeng Zhang, Dimin Niu, and Yuan Xie. 2020. iPIM: Programmable in-memory image processing accelerator using near-bank architecture. In Proceedings of the ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA\u201920). IEEE."},{"key":"e_1_3_2_26_2","first-page":"10","volume-title":"Proceedings of the IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC\u201914)","author":"Horowitz Mark","year":"2014","unstructured":"Mark Horowitz. 2014. 1.1 computing\u2019s energy problem (and what we can do about it). In Proceedings of the IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC\u201914). IEEE, 10\u201314."},{"doi-asserted-by":"publisher","key":"e_1_3_2_27_2","DOI":"10.1145\/3007787.3001159"},{"key":"e_1_3_2_28_2","doi-asserted-by":"crossref","first-page":"587","DOI":"10.1145\/3352460.3358329","volume-title":"Proceedings of the 52nd Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Huangfu Wenqin","year":"2019","unstructured":"Wenqin Huangfu, Xueqi Li, Shuangchen Li, Xing Hu, Peng Gu, and Yuan Xie. 2019. MEDAL: Scalable DIMM based Near Data Processing Accelerator for DNA Seeding Algorithm. In Proceedings of the 52nd Annual IEEE\/ACM International Symposium on Microarchitecture. 587\u2013599."},{"key":"e_1_3_2_29_2","first-page":"802","volume-title":"Proceedings of the ACM\/IEEE 46th Annual International Symposium on Computer Architecture (ISCA\u201919)","author":"Imani Mohsen","year":"2019","unstructured":"Mohsen Imani, Saransh Gupta, Yeseong Kim, and Tajana Rosing. 2019. FloatPIM: In-memory acceleration of deep neural network training with high precision. In Proceedings of the ACM\/IEEE 46th Annual International Symposium on Computer Architecture (ISCA\u201919). IEEE, 802\u2013815."},{"key":"e_1_3_2_30_2","doi-asserted-by":"crossref","first-page":"726","DOI":"10.1145\/3352460.3358297","volume-title":"Proceedings of the 52nd Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Jang Jaeyoung","year":"2019","unstructured":"Jaeyoung Jang, Jun Heo, Yejin Lee, Jaeyeon Won, Seonghak Kim, Sung Jun Jung, Hakbeom Jang, Tae Jun Ham, and Jae W. Lee. 2019. Charon: Specialized near-memory processing architecture for clearing dead objects in memory. In Proceedings of the 52nd Annual IEEE\/ACM International Symposium on Microarchitecture. 726\u2013739."},{"key":"e_1_3_2_31_2","first-page":"86","volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201913)","author":"Jiang Nan","year":"2013","unstructured":"Nan Jiang, Daniel U. Becker, George Michelogiannakis, James Balfour, Brian Towles, David E. Shaw, John Kim, and William J. Dally. 2013. A detailed and flexible cycle-accurate network-on-chip simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS\u201913). IEEE, 86\u201396."},{"doi-asserted-by":"publisher","key":"e_1_3_2_32_2","DOI":"10.1109\/ICCD.1999.808425"},{"doi-asserted-by":"publisher","key":"e_1_3_2_33_2","DOI":"10.1145\/3132402.3132426"},{"doi-asserted-by":"publisher","key":"e_1_3_2_34_2","DOI":"10.1145\/3126908.3126965"},{"doi-asserted-by":"publisher","key":"e_1_3_2_35_2","DOI":"10.5555\/2337159.2337202"},{"doi-asserted-by":"publisher","key":"e_1_3_2_36_2","DOI":"10.1109\/LCA.2015.2414456"},{"key":"e_1_3_2_37_2","doi-asserted-by":"crossref","first-page":"103","DOI":"10.1145\/1296907.1296909","volume-title":"Proceedings of the 6th International Symposium on Memory Management (ISMM\u201907)","volume":"7","author":"Kirk David","year":"2007","unstructured":"David Kirk et\u00a0al. 2007. NVIDIA CUDA software and GPU parallel computing architecture. In Proceedings of the 6th International Symposium on Memory Management (ISMM\u201907), Vol. 7. 103\u2013104."},{"doi-asserted-by":"publisher","key":"e_1_3_2_38_2","DOI":"10.1109\/ICPP.1994.108"},{"key":"e_1_3_2_39_2","first-page":"1","volume-title":"Proceedings of the IEEE Hot Chips 34 Symposium (HCS\u201922)","author":"Kwon Yongkee","year":"2022","unstructured":"Yongkee Kwon, Kornijcuk Vladimir, Nahsung Kim, Woojae Shin, Jongsoon Won, Minkyu Lee, Hyunha Joo, Haerang Choi, Guhyun Kim, Byeongju An et\u00a0al. 2022. System architecture and software stack for GDDR6-AiM. In Proceedings of the IEEE Hot Chips 34 Symposium (HCS\u201922). IEEE, 1\u201325."},{"key":"e_1_3_2_40_2","first-page":"350","volume-title":"Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC\u201921)","volume":"64","author":"Kwon Young-Cheon","year":"2021","unstructured":"Young-Cheon Kwon, Suk Han Lee, Jaehoon Lee, Sang-Hyuk Kwon, Je Min Ryu, Jong-Pil Son, O. Seongil, Hak-Soo Yu, Haesuk Lee, Soo Young Kim et\u00a0al. 2021. 25.4 A 20 nm 6 GB function-in-memory DRAM, based on HBM2 with a 1.2 TFLOPS programmable computing unit using bank-level parallelism, for machine learning applications. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC\u201921), Vol. 64. IEEE, 350\u2013352."},{"doi-asserted-by":"publisher","key":"e_1_3_2_41_2","DOI":"10.1109\/ISCA52012.2021.00013"},{"key":"e_1_3_2_42_2","first-page":"621","volume-title":"Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201916)","author":"Leidel John D.","year":"2016","unstructured":"John D. Leidel and Yong Chen. 2016. HMC-Sim-2.0: A simulation platform for exploring custom memory cube operations. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW\u201916). IEEE, 621\u2013630."},{"doi-asserted-by":"publisher","key":"e_1_3_2_43_2","DOI":"10.1109\/TCAD.2015.2445741"},{"unstructured":"John Erik Lindholm Ming Y. Siu Simon S. Moy Samuel Liu and John R. Nickolls. 2008. Simulating multiported memories using lower port count memories. U.S. Patent 7 339 592.","key":"e_1_3_2_44_2"},{"key":"e_1_3_2_45_2","first-page":"1","volume-title":"Proceedings of the 16th International Symposium on High-Performance Computer Architecture (HPCA\u201910)","author":"Liu Fang","year":"2010","unstructured":"Fang Liu, Xiaowei Jiang, and Yan Solihin. 2010. Understanding how off-chip memory bandwidth partitioning in chip multiprocessors affects system performance. In Proceedings of the 16th International Symposium on High-Performance Computer Architecture (HPCA\u201910). IEEE, 1\u201312."},{"key":"e_1_3_2_46_2","first-page":"417","volume-title":"Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems","author":"Lockerman Elliot","year":"2020","unstructured":"Elliot Lockerman, Axel Feldmann, Mohammad Bakhshalipour, Alexandru Stanescu, Shashwat Gupta, Daniel Sanchez, and Nathan Beckmann. 2020. Livia: Data-centric computing throughout the memory hierarchy. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. 417\u2013433."},{"doi-asserted-by":"publisher","key":"e_1_3_2_47_2","DOI":"10.1145\/3295500.3356156"},{"key":"e_1_3_2_48_2","first-page":"1","volume-title":"Proceedings of the IEEE Hot Chips 27 Symposium (HCS\u201915)","author":"Macri Joe","year":"2015","unstructured":"Joe Macri. 2015. AMD\u2019s next generation GPU and high bandwidth memory architecture: FURY. In Proceedings of the IEEE Hot Chips 27 Symposium (HCS\u201915). IEEE, 1\u201326."},{"key":"e_1_3_2_49_2","first-page":"444","volume-title":"Proceedings of the European Conference on Parallel Processing","author":"Martineau Matt","year":"2018","unstructured":"Matt Martineau, Patrick Atkinson, and Simon McIntosh-Smith. 2018. Benchmarking the NVIDIA V100 GPU and tensor cores. In Proceedings of the European Conference on Parallel Processing. Springer, 444\u2013455."},{"issue":"2009","key":"e_1_3_2_50_2","article-title":"Introduction to discrete-event simulation and the SimPy language","volume":"2","author":"Matloff Norm","year":"2008","unstructured":"Norm Matloff. 2008. Introduction to discrete-event simulation and the SimPy language. Dept. of Computer Science, University of California at Davis. Retrieved on August 2, 2009. https:\/\/web.cs.ucdavis.edu\/matloff\/matloff\/public_html\/156\/PLN\/DESimIntro.pdf.","journal-title":"Dept. of Computer Science, University of California at Davis. Retrieved on August"},{"key":"e_1_3_2_51_2","first-page":"175","volume-title":"Proceedings of the IEEE 30th International Conference on Computer Design (ICCD\u201912)","author":"Milojevic Dragomir","year":"2012","unstructured":"Dragomir Milojevic, Sachin Idgunji, Djordje Jevdjic, Emre Ozer, Pejman Lotfi-Kamran, Andreas Panteli, Andreas Prodromou, Chrysostomos Nicopoulos, Damien Hardy, Babak Falsari et\u00a0al. 2012. Thermal characterization of cloud workloads on a power-efficient server-on-chip. In Proceedings of the IEEE 30th International Conference on Computer Design (ICCD\u201912). IEEE, 175\u2013182."},{"doi-asserted-by":"publisher","key":"e_1_3_2_52_2","DOI":"10.3390\/make1010005"},{"key":"e_1_3_2_53_2","first-page":"457","volume-title":"Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201917)","author":"Nai Lifeng","year":"2017","unstructured":"Lifeng Nai, Ramyad Hadidi, Jaewoong Sim, Hyojong Kim, Pranith Kumar, and Hyesoon Kim. 2017. GraphPIM: Enabling instruction-level PIM offloading in graph computing frameworks. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201917). IEEE, 457\u2013468."},{"doi-asserted-by":"publisher","key":"e_1_3_2_54_2","DOI":"10.1109\/IPDPS.2018.00077"},{"doi-asserted-by":"publisher","key":"e_1_3_2_55_2","DOI":"10.2200\/S00458ED1V01Y201212CAC021"},{"volume-title":"CUB Library","year":"2020","unstructured":"NVIDIA. 2020. CUB Library. Retrieved from https:\/\/github.com\/NVlabs\/cub.","key":"e_1_3_2_56_2"},{"volume-title":"cuBLAS Library","year":"2020","unstructured":"NVIDIA. 2020. cuBLAS Library. Retrieved from https:\/\/docs.nvidia.com\/cuda\/cublas\/index.html.","key":"e_1_3_2_57_2"},{"volume-title":"Parallel Thread Extension ISA","year":"2020","unstructured":"NVIDIA. 2020. Parallel Thread Extension ISA. Retrieved from https:\/\/docs.nvidia.com\/cuda\/parallel-thread-execution\/index.html.","key":"e_1_3_2_58_2"},{"unstructured":"CUDA Nvidia. 2007. Compute unified device architecture programming guide. https:\/\/docs.nvidia.com\/cuda\/cuda-c-programming-guide\/index.html.","key":"e_1_3_2_59_2"},{"key":"e_1_3_2_60_2","article-title":"Compiler driver NVCC","author":"NVIDIA CUDA","year":"2013","unstructured":"CUDA NVIDIA. 2013. Compiler driver NVCC. Options for Steering GPU Code Generation URL. https:\/\/docs.nvidia.com\/cuda\/cuda-compiler-driver-nvcc\/index.html.","journal-title":"Options for Steering GPU Code Generation URL"},{"unstructured":"Lars Nyland John R Nickolls Gentaro Hirota and Tanmoy Mandal. 2011. Systems and methods for coalescing memory accesses of parallel threads. U.S. Patent 8 086 806.","key":"e_1_3_2_61_2"},{"unstructured":"Milad Hashemi Khubaib Eiman Ebrahimi Onur Mutlu and Yale N. Patt. 2015. Reducing memory access latency via an enhanced (compute capable) memory controller. https:\/\/hps.ece.utexas.edu\/pub\/TR-HPS-2015-001.pdf..","key":"e_1_3_2_62_2"},{"key":"e_1_3_2_63_2","volume-title":"Proceedings of the Memory Forum Workshop","author":"O\u2019Connor Mike","year":"2014","unstructured":"Mike O\u2019Connor. 2014. Highlights of the high-bandwidth memory standard. In Proceedings of the Memory Forum Workshop."},{"key":"e_1_3_2_64_2","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1145\/3123939.3124545","volume-title":"Proceedings of the 50th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO)","author":"O\u2019Connor Mike","year":"2017","unstructured":"Mike O\u2019Connor, Niladrish Chatterjee, Donghyuk Lee, John Wilson, Aditya Agrawal, Stephen W. Keckler, and William J. Dally. 2017. Fine-grained DRAM: energy-efficient DRAM for extreme bandwidth systems. In Proceedings of the 50th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO). IEEE, 41\u201354."},{"doi-asserted-by":"publisher","key":"e_1_3_2_65_2","DOI":"10.1109\/40.592312"},{"key":"e_1_3_2_66_2","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1145\/2967938.2967940","volume-title":"Proceedings of the International Conference on Parallel Architectures and Compilation","author":"Pattnaik Ashutosh","year":"2016","unstructured":"Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, and Chita R. Das. 2016. Scheduling techniques for GPU architectures with processing-in-memory capabilities. In Proceedings of the International Conference on Parallel Architectures and Compilation. 31\u201344."},{"key":"e_1_3_2_67_2","first-page":"303","volume-title":"Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201917)","author":"Picorel Javier","year":"2017","unstructured":"Javier Picorel, Djordje Jevdjic, and Babak Falsafi. 2017. Near-memory address translation. In Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT\u201917). IEEE, 303\u2013317."},{"doi-asserted-by":"publisher","key":"e_1_3_2_68_2","DOI":"10.1109\/ISPASS.2014.6844483"},{"doi-asserted-by":"publisher","key":"e_1_3_2_69_2","DOI":"10.1145\/2499370.2462176"},{"doi-asserted-by":"publisher","key":"e_1_3_2_70_2","DOI":"10.1145\/1555754.1555801"},{"doi-asserted-by":"publisher","key":"e_1_3_2_71_2","DOI":"10.1109\/L-CA.2011.4"},{"doi-asserted-by":"publisher","key":"e_1_3_2_72_2","DOI":"10.1145\/3007787.3001139"},{"doi-asserted-by":"publisher","key":"e_1_3_2_73_2","DOI":"10.1109\/TCAD.2018.2857044"},{"doi-asserted-by":"publisher","key":"e_1_3_2_74_2","DOI":"10.1109\/JSSC.2016.2602221"},{"doi-asserted-by":"publisher","key":"e_1_3_2_75_2","DOI":"10.1109\/HPCA.2017.55"},{"doi-asserted-by":"publisher","key":"e_1_3_2_76_2","DOI":"10.1038\/nature06932"},{"volume-title":"UPMem","year":"2020","unstructured":"UPMem. 2020. UPMem. Retrieved from https:\/\/www.upmem.com\/.","key":"e_1_3_2_77_2"},{"doi-asserted-by":"publisher","key":"e_1_3_2_78_2","DOI":"10.1016\/j.micpro.2017.01.005"},{"doi-asserted-by":"publisher","key":"e_1_3_2_79_2","DOI":"10.1145\/3392717.3392760"},{"key":"e_1_3_2_80_2","first-page":"1","volume-title":"Proceedings of the 53rd Annual Design Automation Conference","author":"Xia Lixue","year":"2016","unstructured":"Lixue Xia, Tianqi Tang, Wenqin Huangfu, Ming Cheng, Xiling Yin, Boxun Li, Yu Wang, and Huazhong Yang. 2016. Switched by input: Power efficient structure for RRAM-based convolutional neural network. In Proceedings of the 53rd Annual Design Automation Conference. 1\u20136."},{"key":"e_1_3_2_81_2","first-page":"637","volume-title":"Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201917)","author":"Xie Chenhao","year":"2017","unstructured":"Chenhao Xie, Shuaiwen Leon Song, Jing Wang, Weigong Zhang, and Xin Fu. 2017. Processing-in-memory enabled graphics processors for 3D rendering. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA\u201917). IEEE, 637\u2013648."},{"issue":"1","key":"e_1_3_2_82_2","first-page":"1","article-title":"MPU-Sim: A simulator for In-DRAM near-bank processing architectures","volume":"21","author":"Xie Xinfeng","year":"2021","unstructured":"Xinfeng Xie, Peng Gu, Jiayi Huang, Yufei Ding, and Yuan Xie. 2021. MPU-Sim: A simulator for In-DRAM near-bank processing architectures. IEEE Comput. Arch. Lett. 21, 1 (2021), 1\u20134.","journal-title":"IEEE Comput. Arch. Lett."},{"doi-asserted-by":"publisher","key":"e_1_3_2_83_2","DOI":"10.1109\/HPCA51647.2021.00055"},{"doi-asserted-by":"publisher","key":"e_1_3_2_84_2","DOI":"10.2200\/S00644ED1V01Y201505CAC031"},{"doi-asserted-by":"publisher","key":"e_1_3_2_85_2","DOI":"10.1145\/3243176.3243188"},{"doi-asserted-by":"publisher","key":"e_1_3_2_86_2","DOI":"10.1145\/2600212.2600213"},{"key":"e_1_3_2_87_2","first-page":"402","volume-title":"Proceedings of the 2nd International Symposium on Memory Systems","author":"Zhu Yuxiong","year":"2016","unstructured":"Yuxiong Zhu, Borui Wang, Dong Li, and Jishen Zhao. 2016. Integrated thermal analysis for processing in die-stacking memory. In Proceedings of the 2nd International Symposium on Memory Systems. 402\u2013414."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3603113","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3603113","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:29:51Z","timestamp":1750285791000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3603113"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,19]]},"references-count":86,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2023,9,30]]}},"alternative-id":["10.1145\/3603113"],"URL":"https:\/\/doi.org\/10.1145\/3603113","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2023,7,19]]},"assertion":[{"value":"2022-07-06","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-05-16","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-07-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}