{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,24]],"date-time":"2026-07-24T15:00:13Z","timestamp":1784905213760,"version":"3.55.0"},"reference-count":51,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2022,3,24]],"date-time":"2022-03-24T00:00:00Z","timestamp":1648080000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"NSF","award":["CCF-2047516"],"award-info":[{"award-number":["CCF-2047516"]}]},{"name":"US Department Of Energy, Office of Science, Offices of Nuclear Physics and Advanced Scientific Computing Research, through the SciDAC program","award":["DE-AC05-06OR23177"],"award-info":[{"award-number":["DE-AC05-06OR23177"]}]},{"DOI":"10.13039\/100011665","name":"Jefferson Lab","doi-asserted-by":"crossref","award":["17-SC-20-SC"],"award-info":[{"award-number":["17-SC-20-SC"]}],"id":[{"id":"10.13039\/100011665","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2022,6,30]]},"abstract":"<jats:p>The<jats:italic>many-body correlation function<\/jats:italic>is a fundamental computation kernel in modern physics computing applications, e.g., Hadron Contractions in Lattice quantum chromodynamics (QCD). This kernel is both computation and memory intensive, involving a series of tensor contractions, and thus usually runs on accelerators like GPUs. Existing optimizations on many-body correlation mainly focus on individual tensor contractions (e.g., cuBLAS libraries and others). In contrast, this work discovers a new optimization dimension for many-body correlation by exploring the optimization opportunities among tensor contractions. 
More specifically, it targets general GPU architectures (both NVIDIA and AMD) and optimizes many-body correlation\u2019s memory management by exploiting a set of <jats:italic>memory allocation and communication redundancy elimination<\/jats:italic> opportunities: first, <jats:italic>GPU memory allocation redundancy<\/jats:italic>: the intermediate output frequently occurs as input in the subsequent calculations; second, <jats:italic>CPU-GPU communication redundancy<\/jats:italic>: although all tensors are allocated on both CPU and GPU, many of them are used (and reused) on the GPU side only, and thus, many CPU\/GPU communications (like that in existing Unified Memory designs) are unnecessary; third, <jats:italic>GPU oversubscription:<\/jats:italic> limited GPU memory size causes oversubscription issues, and existing memory management usually results in near-reuse data eviction, thus incurring extra CPU\/GPU memory communications.<\/jats:p><jats:p>Targeting these memory optimization opportunities, this article proposes MemHC, an optimized systematic GPU memory management framework that aims to accelerate the calculation of many-body correlation functions utilizing a series of new memory reduction designs. These designs involve optimizations for GPU memory allocation, CPU\/GPU memory movement, and GPU memory oversubscription, respectively. More specifically, first, MemHC employs duplication-aware management and lazy release of GPU memories to corresponding host managing for better data reusability. Second, it implements data reorganization and on-demand synchronization to eliminate redundant (or unnecessary) data transfer. Third, MemHC exploits an optimized Least Recently Used (LRU) eviction policy called Pre-Protected LRU to reduce evictions and leverage memory hits. Additionally, MemHC is portable for various platforms including NVIDIA GPUs and AMD GPUs. 
The evaluation demonstrates that MemHC outperforms unified memory management by<jats:inline-formula content-type=\"math\/tex\"><jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\( 2.18\\times \\)<\/jats:tex-math><\/jats:inline-formula>to<jats:inline-formula content-type=\"math\/tex\"><jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\( 10.73\\times \\)<\/jats:tex-math><\/jats:inline-formula>. The proposed Pre-Protected LRU policy outperforms the original LRU policy by up to<jats:inline-formula content-type=\"math\/tex\"><jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\( 1.36\\times \\)<\/jats:tex-math><\/jats:inline-formula>improvement.<jats:xref ref-type=\"fn\"><jats:sup>1<\/jats:sup><\/jats:xref><\/jats:p>","DOI":"10.1145\/3506705","type":"journal-article","created":{"date-parts":[[2022,3,25]],"date-time":"2022-03-25T06:51:19Z","timestamp":1648191079000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9832-9060","authenticated-orcid":false,"given":"Qihan","family":"Wang","sequence":"first","affiliation":[{"name":"William &amp; Mary, Williamsburg, VA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Zhen","family":"Peng","sequence":"additional","affiliation":[{"name":"William &amp; Mary, Williamsburg, VA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Bin","family":"Ren","sequence":"additional","affiliation":[{"name":"William &amp; Mary, Williamsburg, VA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Jie","family":"Chen","sequence":"additional","affiliation":[{"name":"Jefferson Lab, Newport News, VA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Robert 
G.","family":"Edwards","sequence":"additional","affiliation":[{"name":"Jefferson Lab, Newport News, VA, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2022,3,24]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"NVIDIA. 2015. http:\/\/docs.nvidia.com\/cuda\/cublas\/."},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.procs.2016.05.302"},{"key":"e_1_3_2_4_2","volume-title":"2016 IEEE HPCA","author":"Agarwal Neha","unstructured":"Neha Agarwal, David Nellans, Eiman Ebrahimi, Thomas F. Wenisch, John Danskin, and Stephen W. Keckler. Selective GPU caches to eliminate CPU-GPU HW cache coherence. In 2016 IEEE HPCA."},{"key":"e_1_3_2_5_2","unstructured":"Rachata Ausavarungnirun Joshua Landgraf Vance Miller Saugata Ghose Jayneel Gandhi Christopher J. Rossbach and Onur Mutlu. 2017. Mosaic: A GPU memory manager with application-transparent support for multiple page sizes. In Proceedings of the 50th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO) . 136\u2013150."},{"key":"e_1_3_2_6_2","volume-title":"FAST","author":"Bansal Sorav","year":"2004","unstructured":"Sorav Bansal and Dharmendra S. Modha. 2004. CAR: Clock with adaptive replacement. In FAST."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.physletb.2016.12.024"},{"key":"e_1_3_2_8_2","first-page":"93","volume-title":"International Workshop on Languages and Compilers for Parallel Computing","author":"Bibireata Alina","year":"2003","unstructured":"Alina Bibireata, Sandhya Krishnan, Gerald Baumgartner, Daniel Cociorva, Chi-Chung Lam, P. Sadayappan, J. Ramanujam, David E. Bernholdt, and Venkatesh Choppella. 2003. Memory-constrained data locality optimization for tensor contractions. In International Workshop on Languages and Compilers for Parallel Computing. 
Springer, 93\u2013108."},{"key":"e_1_3_2_9_2","volume-title":"2017 IEEE ICDE","author":"Chen Guoyang","unstructured":"Guoyang Chen, Yufei Ding, and Xipeng Shen. Sweet KNN: An efficient KNN on GPU through reconciliation between redundancy removal and regularity. In 2017 IEEE ICDE."},{"issue":"03","key":"e_1_3_2_10_2","article-title":"Graph-based contractions with optimal evaluation strategies","author":"Chen Jie","year":"2017","unstructured":"Jie Chen, Robert Edwards, and Frank Winter. 2017. Graph-based contractions with optimal evaluation strategies. ADSE03-LatticeQCD Application Strategy WBS 1.2.1.03Milestone ADSE03-7 (2017).","journal-title":"ADSE03-LatticeQCD Application Strategy WBS 1.2.1.03"},{"issue":"03","key":"e_1_3_2_11_2","article-title":"Performance enhancement to the graph-based contraction calculations","author":"Chen Jie","year":"2018","unstructured":"Jie Chen, Robert Edwards, and Frank Winter. 2018. Performance enhancement to the graph-based contraction calculations. ADSE03-LatticeQCD Application Strategy WBS 1.2.1.03Milestone ADSE03-7 (2018).","journal-title":"ADSE03-LatticeQCD Application Strategy WBS 1.2.1.03"},{"issue":"03","key":"e_1_3_2_12_2","article-title":"Enabling graph based contraction calculations for multi-nucleon systems","author":"Chen Jie","year":"2019","unstructured":"Jie Chen, Robert Edwards, and Frank Winter. 2019. Enabling graph based contraction calculations for multi-nucleon systems. ADSE03-LatticeQCD Application Strategy WBS 1.2.1.03Milestone ADSE03-14 (2019).","journal-title":"ADSE03-LatticeQCD Application Strategy WBS 1.2.1.03"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3092255.3092256"},{"key":"e_1_3_2_14_2","doi-asserted-by":"crossref","unstructured":"Kelu Diao Ioannis Papapanagiotou and Thomas J. Hacker. HARENS: Hardware accelerated redundancy elimination in network systems. In 2016 IEEE International Conference on Cloud Computing Technology and Science (CloudCom) . 
IEEE 237\u2013244.","DOI":"10.1109\/CloudCom.2016.0048"},{"key":"e_1_3_2_15_2","doi-asserted-by":"crossref","unstructured":"Feng-Kun Guo Christoph Hanhart Ulf-G. Mei\u00dfner Qian Wang Qiang Zhao and Bing-Song Zou. 2018. Hadronic molecules. Reviews of Modern Physics 90 1 (2018) 015004.","DOI":"10.1103\/RevModPhys.90.015004"},{"key":"e_1_3_2_16_2","doi-asserted-by":"crossref","unstructured":"Chaofeng Hou Ji Xu Peng Wang Wenlai Huang and Xiaowei Wang. 2013. Efficient GPU-accelerated molecular dynamics simulation of solid covalent crystals. Computer Physics Communications 184 5 (2013) 1364\u20131371.","DOI":"10.1016\/j.cpc.2013.01.001"},{"key":"e_1_3_2_17_2","doi-asserted-by":"crossref","unstructured":"Mohamed Assem Ibrahim Hongyuan Liu Onur Kayiran and Adwait Jog. Analyzing and leveraging remote-core bandwidth for enhanced performance in GPUs. In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT) . IEEE 258\u2013271.","DOI":"10.1109\/PACT.2019.00028"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2013.234"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CLUSTER.2012.72"},{"key":"e_1_3_2_20_2","first-page":"323","volume-title":"USENIX Annual Technical Conference, General Track","author":"Jiang Song","year":"2005","unstructured":"Song Jiang, Feng Chen, and Xiaodong Zhang. 2005. CLOCK-Pro: An effective improvement of the CLOCK replacement. In USENIX Annual Technical Conference, General Track. 323\u2013336."},{"key":"e_1_3_2_21_2","first-page":"725","volume-title":"2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201920)","author":"Kim Hyeonjin","year":"2020","unstructured":"Hyeonjin Kim, Sungwoo Ahn, Yunho Oh, Bogil Kim, Won Woo Ro, and William J. Song. 2020. Duplo: Lifting redundant memory accesses of deep neural networks for GPU tensor cores. In 2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO\u201920). 
IEEE, 725\u2013737."},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378529"},{"key":"e_1_3_2_23_2","doi-asserted-by":"crossref","unstructured":"Jinsung Kim Aravind Sukumaran-Rajam Changwan Hong Ajay Panyala Rohit Kumar Srivastava Sriram Krishnamoorthy and Ponnuswamy Sadayappan. 2018. Optimizing tensor contractions in CCSD(T) for efficient execution on GPUs. In Proceedings of the 2018 International Conference on Supercomputing . 96\u2013106.","DOI":"10.1145\/3205289.3205296"},{"key":"e_1_3_2_24_2","first-page":"85","volume-title":"2019 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201919)","author":"Kim Jinsung","year":"2019","unstructured":"Jinsung Kim, Aravind Sukumaran-Rajam, Vineeth Thumma, Sriram Krishnamoorthy, Ajay Panyala, Louis-No\u00ebl Pouchet, Atanas Rountev, and Ponnuswamy Sadayappan. 2019. A code generator for high-performance tensor contractions on GPUs. In 2019 IEEE\/ACM International Symposium on Code Generation and Optimization (CGO\u201919). IEEE, 85\u201395."},{"key":"e_1_3_2_25_2","doi-asserted-by":"crossref","unstructured":"Marcin Knap and Pawe\u0142 Czarnul. 2019. Performance evaluation of unified memory with prefetching and oversubscription for selected parallel CUDA applications on NVIDIA Pascal and Volta GPUs. The Journal of Supercomputing 75 11 (2019) 7625\u20137645.","DOI":"10.1007\/s11227-019-02966-8"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICETECH.2016.7569243"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC.2014.7040988"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3297858.3304044"},{"key":"e_1_3_2_29_2","doi-asserted-by":"crossref","unstructured":"Lingda Li and Barbara Chapman. 2019. Compiler assisted hybrid implicit and explicit GPU memory management under unified address space. In Proceedings of the International Conference for High Performance Computing Networking Storage and Analysis . 
1\u201316.","DOI":"10.1145\/3295500.3356141"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356218"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/CCGrid.2015.105"},{"key":"e_1_3_2_32_2","volume-title":"ICS","author":"Liu Jiawen","year":"2021","unstructured":"Jiawen Liu, Dong Li, and Jiajia Li. 2021. Athena: High-performance sparse tensor contraction sequence on heterogeneous memory. In ICS."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.5555\/2451462.2451482"},{"key":"e_1_3_2_34_2","article-title":"High-performance tensor contraction without BLAS","volume":"40","author":"Matthews Devin A.","year":"2016","unstructured":"Devin A. Matthews. 2016. High-performance tensor contraction without BLAS. SIAM Journal on Scientific Computing 40 (2016).","journal-title":"SIAM Journal on Scientific Computing"},{"key":"e_1_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.advengsoft.2016.05.013"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1145\/3148173.3148184"},{"key":"e_1_3_2_37_2","volume-title":"2015 ICPP","author":"Nelson Thomas","unstructured":"Thomas Nelson, Axel Rivera, Prasanna Balaprakash, Mary Hall, Paul D. Hovland, Elizabeth Jessup, and Boyana Norris. Generating efficient tensor contractions for GPUs. In 2015 ICPP. IEEE."},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378534"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/170036.170081"},{"key":"e_1_3_2_40_2","doi-asserted-by":"crossref","unstructured":"Roman Poya Antonio J. Gil and Rogelio Ortigosa. 2017. A high performance data parallel tensor contraction framework: Application to coupled electro-mechanics. 
Computer Physics Communications 216 (2017) 35\u201352.","DOI":"10.1016\/j.cpc.2017.02.016"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1145\/1345206.1345220"},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1155\/2009\/681708"},{"key":"e_1_3_2_43_2","doi-asserted-by":"crossref","unstructured":"Yang Shi Uma Naresh Niranjan Animashree Anandkumar and Cris Cecka. Tensor contractions with extended BLAS kernels on CPU and GPU. In 2016 IEEE 23rd International Conference on High Performance Computing (HiPC) . 193\u2013202.","DOI":"10.1109\/HiPC.2016.031"},{"key":"e_1_3_2_44_2","first-page":"308","volume-title":"IPDPS","author":"Tomas Andres","year":"2012","unstructured":"Andres Tomas, Chia-Chen Chang, Richard Scalettar, and Zhaojun Bai. 2012. Advancing large scale many-body QMC simulations on GPU accelerated multicore systems. In IPDPS. 308\u2013319."},{"key":"e_1_3_2_45_2","volume-title":"International Conference on Parallel Processing and Applied Mathematics","author":"Valero-Lara Pedro","year":"2017","unstructured":"Pedro Valero-Lara, Ivan Mart\u00ednez-P\u00e9rez, Ra\u00fcl Sirvent, Xavier Martorell, and Antonio J. Pena. 2017. NVIDIA GPUs scalability to solve multiple (batch) tridiagonal systems implementation of cuthomasbatch. In International Conference on Parallel Processing and Applied Mathematics. Springer."},{"key":"e_1_3_2_46_2","doi-asserted-by":"crossref","unstructured":"Hao Wang Sreeram Potluri Miao Luo Ashish Kumar Singh Xiangyong Ouyang Sayantan Sur and Dhabaleswar K. Panda. Optimized non-contiguous MPI datatype communication for GPU clusters: Design implementation and evaluation with MVAPICH2. In 2011 IEEE International Conference on Cluster Computing . IEEE 308\u2013316.","DOI":"10.1109\/CLUSTER.2011.42"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/HiPC53243.2021.00039"},{"key":"e_1_3_2_48_2","volume-title":"ICAIF","author":"Wu Q.","year":"2021","unstructured":"Q. Wu, C. Brinton, Z. Zhang, M. 
Cucuringu, A. Pizzoferrato, and Z. Liu. 2021. Equity2vec: End-to-end deep learning framework for cross-sectional asset pricing. In ICAIF."},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3468268"},{"key":"e_1_3_2_50_2","volume-title":"2019 ICSC","author":"Wu Qiong","unstructured":"Qiong Wu, Wen-Ling Hsu, Tan Xu, Zhenming Liu, George Ma, Guy Jacobson, and Shuai Zhao. Speaking with actions-learning customer journey behavior. In 2019 ICSC. IEEE."},{"key":"e_1_3_2_51_2","article-title":"Rosella: A self-driving distributed scheduler for heterogeneous clusters","author":"Wu Qiong","year":"2020","unstructured":"Qiong Wu and Zhenming Liu. 2020. Rosella: A self-driving distributed scheduler for heterogeneous clusters. arXiv preprint arXiv:2010.15206 (2020).","journal-title":"arXiv preprint arXiv:2010.15206"},{"key":"e_1_3_2_52_2","article-title":"Adaptive reduced rank regression","author":"Wu Qiong","year":"2019","unstructured":"Qiong Wu, Felix Ming Fai Wong, Zhenming Liu, Yanhua Li, and Varun Kanade. 2019. Adaptive reduced rank regression. 
arXiv preprint arXiv:1905.11566 (2019).","journal-title":"arXiv preprint arXiv:1905.11566"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3506705","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3506705","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3506705","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:11:50Z","timestamp":1750191110000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3506705"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,3,24]]},"references-count":51,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,6,30]]}},"alternative-id":["10.1145\/3506705"],"URL":"https:\/\/doi.org\/10.1145\/3506705","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,3,24]]},"assertion":[{"value":"2021-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-12-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-03-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}