{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T09:53:03Z","timestamp":1763459583923,"version":"3.45.0"},"reference-count":43,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2017,4,6]],"date-time":"2017-04-06T00:00:00Z","timestamp":1491436800000},"content-version":"vor","delay-in-days":365,"URL":"http:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000185","name":"DARPA","doi-asserted-by":"crossref","award":["HR0011-13-2-0005"],"award-info":[{"award-number":["HR0011-13-2-0005"]}],"id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"crossref"}]},{"name":"NSF","award":["CCF-1117042"],"award-info":[{"award-number":["CCF-1117042"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Comput. Syst."],"published-print":{"date-parts":[[2016,4,6]]},"abstract":"<jats:p>To enable performance improvements in a power-efficient manner, computer architects have been building CPUs that exploit greater amounts of thread-level parallelism. A key consideration in such CPUs is properly designing the on-chip cache hierarchy. Unfortunately, this can be hard to do, especially for CPUs with high core counts and large amounts of cache. The enormous design space formed by the combinatorial number of ways in which to organize the cache hierarchy makes it difficult to identify power-efficient configurations. Moreover, the problem is exacerbated by the slow speed of architectural simulation, which is the primary means for conducting such design space studies.<\/jats:p>\n                  <jats:p>A powerful tool that can help architects optimize CPU cache hierarchies is reuse distance (RD) analysis. 
Recent work has extended uniprocessor RD techniques (i.e., by introducing concurrent RD and private-stack RD profiling) to enable analysis of different types of caches in multicore CPUs. Once acquired, parallel locality profiles can predict the performance of numerous cache configurations, permitting highly efficient design space exploration. To date, existing work on multicore RD analysis has focused on developing the profiling techniques and assessing their accuracy. Unfortunately, there has been no work on using RD analysis to optimize CPU performance or power consumption.<\/jats:p>\n                  <jats:p>This article investigates applying multicore RD analysis to identify the most power-efficient cache configurations for a multicore CPU. First, we develop analytical models that use the cache-miss counts from parallel locality profiles to estimate CPU performance and power consumption. Although future scalable CPUs will likely employ multithreaded (and even out-of-order) cores, our current study assumes single-threaded in-order cores to simplify the models, allowing us to focus on the cache hierarchy and our RD-based techniques. Second, to demonstrate the utility of our techniques, we apply our models to optimize a large-scale tiled CPU architecture with a two-level cache hierarchy. We show that the most power-efficient configuration varies considerably across different benchmarks, and that our locality profiles provide deep insights into why certain configurations are power efficient. We also show that picking the best configuration can provide significant gains, as there is a 2.01x power efficiency spread across our tiled CPU design space. Finally, we validate the accuracy of our techniques using detailed simulation. Among several simulated configurations, our techniques can usually pick the most power-efficient configuration, or one that is very close to the best. 
In addition, across all simulated configurations, we can predict power efficiency with 15.2% error.<\/jats:p>","DOI":"10.1145\/2851503","type":"journal-article","created":{"date-parts":[[2016,4,7]],"date-time":"2016-04-07T18:16:10Z","timestamp":1460052970000},"page":"1-30","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":10,"title":["Identifying Power-Efficient Multicore Cache Hierarchies via Reuse Distance Analysis"],"prefix":"10.1145","volume":"34","author":[{"given":"Michael","family":"Badamo","sequence":"first","affiliation":[{"name":"University of Maryland at College Park, MD"}]},{"given":"Jeff","family":"Casarona","sequence":"additional","affiliation":[{"name":"University of Maryland at College Park, MD"}]},{"given":"Minshu","family":"Zhao","sequence":"additional","affiliation":[{"name":"University of Maryland at College Park, MD"}]},{"given":"Donald","family":"Yeung","sequence":"additional","affiliation":[{"name":"University of Maryland at College Park, MD"}]}],"member":"320","published-online":{"date-parts":[[2016,4,6]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2004.1291352"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/1064212.1064232"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454128"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2011.6114194"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2005.42"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/363095.363141"},{"key":"e_1_2_1_7_1","unstructured":"Chen Ding and Trishul Chilimbi. 2009. A Composable Model for Analyzing Locality of Multi-Threaded Programs. Technical Report MSR-TR-2009-107. 
Microsoft Research."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/781131.781159"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1944862.1944885"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2010.5452069"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1534909.1534910"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1147\/sj.92.0078"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-010-9321-5"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1555754.1555779"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1105734.1105739"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.5555\/645988.674164"},{"key":"e_1_2_1_17_1","unstructured":"Intel. 2014. Intel Xeon Phi Product Family. Available at http:\/\/www.intel.com\/XeonPhi."},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1168857.1168882"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-11970-5_15"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1168857.1168881"},{"volume-title":"Proceedings of the International Symposium on Performance Analysis of Systems and Software.","author":"Li Jian","key":"e_1_2_1_21_1","unstructured":"Jian Li and Jose F. Martinez. 2005. Power-performance implications of thread-level parallelism on chip multiprocessors. In Proceedings of the International Symposium on Performance Analysis of Systems and Software."},{"volume-title":"Proceedings of the International Symposium on Microarchitecture.","author":"Li Sheng","key":"e_1_2_1_22_1","unstructured":"Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. 2009. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. 
In Proceedings of the International Symposium on Microarchitecture."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2006.1598109"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2008.15"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1065010.1065034"},{"key":"e_1_2_1_26_1","volume-title":"Proceedings of the 6th International Workshop on Feedback Control Implementation and Design in Computing Systems and Networks.","author":"Maggio Martina","year":"2011","unstructured":"Martina Maggio, Henry Hoffman, Anant Agarwal, and Alberto Leva. 2011. Control-theoretical CPU allocation: Design and implementation with feedback control. In Proceedings of the 6th International Workshop on Feedback Control Implementation and Design in Computing Systems and Networks."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2010.5416635"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2006.302743"},{"key":"e_1_2_1_29_1","unstructured":"Apan Qasem and Ken Kennedy. 2005. Evaluating a Model for Cache Conflict Miss Prediction. Technical Report CS-TR05-457. Rice University."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1555754.1555801"},{"volume-title":"Proceedings of the International Conference on Parallel Architectures and Compilation Techniques.","author":"Schuff Derek L.","key":"e_1_2_1_31_1","unstructured":"Derek L. Schuff, Milind Kulkarni, and Vijay S. Pai. 2010. Accelerating multicore reuse distance analysis with sampling and parallelization. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques."},{"key":"e_1_2_1_32_1","unstructured":"Derek L. Schuff, Benjamin S. Parsons, and Vijay S. Pai. 2009. Multicore-Aware Reuse Distance Analysis. Technical Report TR-ECE-09-08. 
Purdue University."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/NOCS.2012.31"},{"key":"e_1_2_1_34_1","volume-title":"PHD: A Hierarchical Cache Coherent Protocol. Master\u2019s Thesis","author":"Wallach Deborah A.","year":"1993","unstructured":"Deborah A. Wallach. 1993. PHD: A Hierarchical Cache Coherent Protocol. Master\u2019s Thesis. Massachusetts Institute of Technology."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/223982.223990"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.58"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2247684.2247687"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2427631.2427632"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485965"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/1080695.1069998"},{"key":"e_1_2_1_41_1","volume-title":"Proceedings of the Workshop on Chip Multiprocessor Memory Systems and Interconnect.","author":"Zhao Li","year":"2007","unstructured":"Li Zhao, Ravi Iyer, Srihari Makineni, Jaideep Moses, Ramesh Illikkal, and Donald Newell. 2007. Performance, area and bandwidth implications on large-scale CMP cache design. 
In Proceedings of the Workshop on Chip Multiprocessor Memory Systems and Interconnect."},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2003.1238004"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/1552309.1552310"}],"container-title":["ACM Transactions on Computer Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2851503","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2851503","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2851503","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,18]],"date-time":"2025-11-18T09:47:48Z","timestamp":1763459268000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2851503"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,4,6]]},"references-count":43,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2016,4,6]]}},"alternative-id":["10.1145\/2851503"],"URL":"https:\/\/doi.org\/10.1145\/2851503","relation":{},"ISSN":["0734-2071","1557-7333"],"issn-type":[{"type":"print","value":"0734-2071"},{"type":"electronic","value":"1557-7333"}],"subject":[],"published":{"date-parts":[[2016,4,6]]},"assertion":[{"value":"2014-10-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-11-01","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-04-06","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}