{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,31]],"date-time":"2025-10-31T13:38:48Z","timestamp":1761917928610,"version":"3.41.0"},"reference-count":48,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2022,2,24]],"date-time":"2022-02-24T00:00:00Z","timestamp":1645660800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["119236, 2122155, 2028929, 1931531, 1763681"],"award-info":[{"award-number":["119236, 2122155, 2028929, 1931531, 1763681"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Meas. Anal. Comput. Syst."],"published-print":{"date-parts":[[2022,2,24]]},"abstract":"<jats:p>Many program codes from different application domains process very large amounts of data, making their cache memory behavior critical for high performance. Most of the existing work targeting cache memory hierarchies focuses on improving data access patterns, e.g., maximizing sequential accesses to program data structures via code and\/or data layout restructuring strategies. Prior work has addressed this data locality optimization problem in the context of both single-core and multi-core systems. Another dimension of optimization, which can be as important\/beneficial as improving data access patterns, is to reduce the data volume (total number of addresses) accessed by the program code. Compared to data access pattern restructuring, this volume minimization problem has received much less attention. In this work, we focus on this volume minimization problem and address it in both single-core and multi-core execution scenarios. 
Specifically, we explore the idea of rewriting an application program code to reduce its \"memory space footprint\". The main idea behind this approach is to reuse\/recycle, for a given data element, a memory location that was originally assigned to another data element, provided that the lifetimes of these two data elements do not overlap. A unique aspect of this approach is that it is \"distance aware\", i.e., in identifying the memory\/cache locations to recycle, it takes into account the physical distance between the core and the memory\/cache location to be recycled. We present a detailed experimental evaluation of our proposed memory space recycling strategy, using five different metrics: memory space consumption, network footprint, data access distance, cache miss rate, and execution time. The experimental results show that our proposed approach brings, respectively, 33.2%, 48.6%, 46.5%, 31.8%, and 27.9% average improvements in these metrics, in the case of single-threaded applications. 
With the multi-threaded versions of the same applications, the achieved improvements are 39.5%, 55.5%, 53.4%, 26.2%, and 22.2%, in the same order.<\/jats:p>","DOI":"10.1145\/3508034","type":"journal-article","created":{"date-parts":[[2022,2,28]],"date-time":"2022-02-28T23:44:29Z","timestamp":1646091869000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Memory Space Recycling"],"prefix":"10.1145","volume":"6","author":[{"given":"Jihyun","family":"Ryoo","sequence":"first","affiliation":[{"name":"The Pennsylvania State University, State College, PA, USA"}]},{"given":"Mahmut Taylan","family":"Kandemir","sequence":"additional","affiliation":[{"name":"The Pennsylvania State University, State College, PA, USA"}]},{"given":"Mustafa","family":"Karakoy","sequence":"additional","affiliation":[{"name":"TUBITAK-BILGEM, Gebze, Turkey"}]}],"member":"320","published-online":{"date-parts":[[2022,2,28]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"D'Hollander","author":"Beyls Kristof E.","year":"2000","unstructured":"Kristof E. Beyls and Erik H. D'Hollander. 2000. Compiler generated multithreading to alleviate memory latency. Universal Computer Science, special issue on Multithreaded Processors and Chip-Multiprocessors, Vol. 6, 10 (2000), 968-993."},{"volume-title":"Benchmarking Modern Multiprocessors. Ph.D. Dissertation","author":"Bienia Christian","key":"e_1_2_1_2_1","unstructured":"Christian Bienia. 2011. Benchmarking Modern Multiprocessors. Ph.D. Dissertation. 
Princeton University."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454128"},{"key":"e_1_2_1_4_1","unstructured":"Jim Butterfield. 1986. Part 4: Overlaying. In Loading and Linking Commodore Programs. Compute!"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/195473.195557"},{"key":"e_1_2_1_6_1","volume-title":"Dietz","author":"Chang Lung-Yu","year":"1990","unstructured":"Lung-Yu Chang and Henry G. Dietz. 1990. Data Layout Optimization and Code Transformation for Paged Memory Systems. Technical Report TR-EE 90--43. Purdue University."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/223428.207162"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/115372.115320"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2005.167"},{"volume-title":"Compiler Support for Optimizing Memory Bank-Level Parallelism. In International Symposium on Microarchitecture","author":"Ding Wei","key":"e_1_2_1_10_1","unstructured":"Wei Ding, Diana Guttman, and Mahmut T. Kandemir. 2014. Compiler Support for Optimizing Memory Bank-Level Parallelism. In International Symposium on Microarchitecture (Cambridge, UK). 
571--582."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/LCA.2014.2333735"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/1755888.1755894"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8191(97)00089-6"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISLPED.2011.5993675"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ETS.2013.6569370"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1816018"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3316781.3317808"},{"volume-title":"Proceedings of the International Symposium on Computer Architecture","author":"Hsieh Kevin","key":"e_1_2_1_19_1","unstructured":"Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, and Stephen W. Keckler. 2016. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent near-Data Processing in GPU Systems. In Proceedings of the International Symposium on Computer Architecture (Seoul, Republic of Korea). 204-216."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1177\/1094342005056138"},{"volume-title":"Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems","author":"Kim Changkyu","key":"e_1_2_1_21_1","unstructured":"Changkyu Kim, Doug Burger, and Stephen W. Keckler. 2002. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated on-Chip Caches. 
In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems (San Jose, California). 211-222."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3192366.3192386"},{"volume-title":"Symposium on Principles of Programming Languages","author":"Kuck D. J.","key":"e_1_2_1_23_1","unstructured":"D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. 1981. Dependence graphs and compiler optimizations. In Symposium on Principles of Programming Languages (Williamsburg, Virginia). 207-218."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/12.752659"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.5555\/977395.977673"},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation","author":"Lin Wen-Yen","year":"1998","unstructured":"Wen-Yen Lin and Jean-Luc Gaudiot. 1998. The Design of an I-Structure Software Cache System. In Proceedings of the Workshop on Multithreaded Execution, Architecture and Compilation (Las Vegas, NV, USA)."},{"key":"e_1_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Yu Liu, Hong An, Xiaomei Li, Peng Leng, Sun Sun, and Junshi Chen. 2012. 
VSCP: A Cache Controlling Method for Improving Single Thread Performance in Multicore System. In International Conference on High Performance Computing and Communication & International Conference on Embedded Software and Systems (Liverpool, England, UK). 161--168.","DOI":"10.1109\/HPCC.2012.30"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1018780200739"},{"key":"e_1_2_1_29_1","doi-asserted-by":"crossref","unstructured":"Debabrata Mohapatra, Vinay K. Chippa, Anand Raghunathan, and Kaushik Roy. 2011. Design of voltage-scalable meta-functions for approximate computing. In Design Automation Test in Europe (Grenoble, France). 1--6.","DOI":"10.1109\/DATE.2011.5763154"},{"key":"e_1_2_1_30_1","volume-title":"SPEC OMP2012 -- An Application Benchmark Suite for Parallel Systems Using OpenMP. In OpenMP in a Heterogeneous World. 223--236","author":"M\u00fcller Matthias S.","year":"2012","unstructured":"Matthias S. M\u00fcller, John Baron, William C. 
Brantley, Huiyu Feng, Daniel Hackenberg, Robert Henschel, Gabriele Jost, Daniel Molka, Chris Parrott, Joe Robichaux, Pavel Shelepugin, Matthijs van Waveren, Brian Whitney, and Kalyan Kumaran. 2012. SPEC OMP2012 -- An Application Benchmark Suite for Parallel Systems Using OpenMP. In OpenMP in a Heterogeneous World. 223--236."},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/43.908427"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/146628.140397"},{"key":"e_1_2_1_33_1","volume-title":"NDC: Analyzing the impact of 3D-stacked memory","author":"Pugsley Seth H.","year":"2014","unstructured":"Seth H. Pugsley, Jeffrey Jestes, Huihui Zhang, Rajeev Balasubramonian, Vijayalakshmi Srinivasan, Alper Buyuktosunoglu, Al Davis, and Feifei Li. 2014. NDC: Analyzing the impact of 3D-stacked memory"},{"volume-title":"International Symposium on Performance Analysis of Systems and Software. 190--200","author":"MapReduce","key":"e_1_2_1_34_1","unstructured":"logic devices on MapReduce workloads. In International Symposium on Performance Analysis of Systems and Software. 190--200."},{"volume-title":"Proc. of the International Conference on Parallel Processing","author":"Muhammad","key":"e_1_2_1_35_1","unstructured":"Muhammad M. Rafique and Zhichun Zhu. 2018. CAMPS: Conflict-Aware Memory-Side Prefetching Scheme for Hybrid Memory Cube. In Proc. of the International Conference on Parallel Processing (Eugene, OR, USA). 
1--9."},{"key":"e_1_2_1_36_1","volume-title":"Structure Layout Optimization for Multithreaded Programs. In International Symposium on Code Generation and Optimization","author":"Raman Easwaran","year":"2007","unstructured":"Easwaran Raman, Robert Hundt, and Sandya Mannarswamy. 2007. Structure Layout Optimization for Multithreaded Programs. In International Symposium on Code Generation and Optimization (San Jose, CA, USA). 271--282."},{"volume-title":"Vtune performance analyzer essentials","author":"Reinders James","key":"e_1_2_1_37_1","unstructured":"James Reinders. 2005. Vtune performance analyzer essentials. In Intel Press."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897937.2906199"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2016.25"},{"volume-title":"Proceedings of Supercomputing","author":"Thomas","key":"e_1_2_1_40_1","unstructured":"Thomas L. Sterling and Hans P. Zima. 2002. Gilgamesh: A Multithreaded Processor-in-Memory Architecture for Petaflops Computing. In Proceedings of Supercomputing (Baltimore, MD, USA). 48--48."},{"key":"e_1_2_1_41_1","volume-title":"A Logic-in-Memory Computer. Transactions on Computers","author":"Stone Harold S.","year":"1970","unstructured":"Harold S. Stone. 1970. A Logic-in-Memory Computer. Transactions on Computers, Vol. 
C-19, 1 (1970), 73--78."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/291069.291015"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3314221.3314599"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3287321"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123939.3123954"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540710"},{"key":"e_1_2_1_47_1","volume-title":"Proceedings of International Symposium on Parallel Architectures, Algorithms and Networks","author":"Lin Wen-Yen","year":"1997","unstructured":"Wen-Yen Lin and Jean-Luc Gaudiot. 1997. Exploiting global data locality in non-blocking multithreaded architectures. In Proceedings of International Symposium on Parallel Architectures, Algorithms and Networks (Taipei, Taiwan). 78--84."},{"key":"e_1_2_1_48_1","doi-asserted-by":"crossref","unstructured":"Qian Zhang, Ting Wang, Ye Tian, Feng Yuan, and Qiang Xu. 2015. ApproxANN: An approximate computing framework for artificial neural network. In Design Automation Test in Europe (Grenoble, France). 701--706.","DOI":"10.7873\/DATE.2015.0618"},{"key":"e_1_2_1_49_1","volume-title":"Kandemir","author":"Zhang Yuanrui","year":"2011","unstructured":"Yuanrui Zhang, Wei Ding, Jun Liu, and Mahmut T. Kandemir. 2011. 
Optimizing Data Layouts for Parallel Computation on Multicores. In Parallel Architectures and Compilation Techniques (Galveston, TX, USA). 143--154."}],"container-title":["Proceedings of the ACM on Measurement and Analysis of Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3508034","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3508034","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3508034","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:12:29Z","timestamp":1750191149000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3508034"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,2,24]]},"references-count":48,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,2,24]]}},"alternative-id":["10.1145\/3508034"],"URL":"https:\/\/doi.org\/10.1145\/3508034","relation":{},"ISSN":["2476-1249"],"issn-type":[{"type":"electronic","value":"2476-1249"}],"subject":[],"published":{"date-parts":[[2022,2,24]]},"assertion":[{"value":"2022-02-28","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}