{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:09:00Z","timestamp":1750306140173,"version":"3.41.0"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2017,5,31]],"date-time":"2017-05-31T00:00:00Z","timestamp":1496188800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100000001","name":"NSF","doi-asserted-by":"publisher","award":["#CCF-1117042"],"award-info":[{"award-number":["#CCF-1117042"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000185","name":"DARPA","doi-asserted-by":"crossref","award":["#HR0011-13-2-0005"],"award-info":[{"award-number":["#HR0011-13-2-0005"]}],"id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Comput. Syst."],"published-print":{"date-parts":[[2017,5,31]]},"abstract":"<jats:p>\n            Researchers have proposed numerous techniques to improve the scalability of coherence directories. The effectiveness of these techniques not only depends on application behavior, but also on the CPU's configuration, for example, its core count and cache size. As CPUs continue to scale, it is essential to explore the directory's application\n            <jats:italic>and<\/jats:italic>\n            architecture dependencies. However, this is challenging given the slow speed of simulators. 
While it is common practice to simulate different applications, previous research on directory designs has explored only a few\u2014and in most cases, only one\u2014CPU configuration, which can lead to an incomplete and inaccurate view of the directory's behavior.\n          <\/jats:p>\n          <jats:p>\n            This article proposes to use\n            <jats:italic>multicore reuse distance analysis<\/jats:italic>\n            to study coherence directories. We develop a framework to extract the directory access stream from parallel least recently used (LRU) stacks, enabling rapid analysis of the directory's accesses and contents across both core count and cache size scaling. A key part of our framework is the notion of\n            <jats:italic>relative reuse distance between sharers<\/jats:italic>\n            , which defines sharing in a capacity-dependent fashion and facilitates our analyses along the data cache size dimension.\n          <\/jats:p>\n          <jats:p>We implement our framework in a profiler and then apply it to gain insights into the impact of multicore CPU scaling on directory behavior. Our profiling results show that directory accesses decrease by 3.3\u00d7 when scaling the data cache size from 16KB to 1MB, despite an increase in sharing-based directory accesses. We also show that increased sharing caused by data cache scaling allows the portion of on-chip memory occupied by the directory to be reduced by 43.3%, compared to a reduction of only 2.6% when scaling the number of cores. We further show that certain directory entries exhibit high temporal reuse. In addition to gaining insights, we also validate our profile-based results, and find they are within 2--10% of cache simulations on average, across different validation experiments. Finally, we conduct four case studies that illustrate our insights on existing directory techniques. 
In particular, we demonstrate our directory occupancy insights on a Cuckoo directory; we apply our sharing insights to provide bounds on the size of Scalable Coherence Directories (SCD) and Dual-Grain Directories (DGD); and, we demonstrate our directory entry reuse insights on a multilevel directory design.<\/jats:p>","DOI":"10.1145\/3092702","type":"journal-article","created":{"date-parts":[[2017,7,31]],"date-time":"2017-07-31T12:12:00Z","timestamp":1501503120000},"page":"1-49","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Using Multicore Reuse Distance to Study Coherence Directories"],"prefix":"10.1145","volume":"35","author":[{"given":"Minshu","family":"Zhao","sequence":"first","affiliation":[{"name":"University of Maryland at College Park"}]},{"given":"Donald","family":"Yeung","sequence":"additional","affiliation":[{"name":"University of Maryland at College Park, MD"}]}],"member":"320","published-online":{"date-parts":[[2017,7,28]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2001.903255"},{"key":"e_1_2_1_2_1","volume-title":"Proceedings of the Symposium on High Performance Chips.","author":"Agarwal Anant","year":"2007","unstructured":"Anant Agarwal , Liewei Bao , John Brown , Bruce Edwards , Matt Mattina , Chyi-Chang Miao , Carl Ramey , and David Wentzlaff . 2007 . Tile processor: Embedded multicore for networking and multimedia . In Proceedings of the Symposium on High Performance Chips. Anant Agarwal, Liewei Bao, John Brown, Bruce Edwards, Matt Mattina, Chyi-Chang Miao, Carl Ramey, and David Wentzlaff. 2007. Tile processor: Embedded multicore for networking and multimedia. 
In Proceedings of the Symposium on High Performance Chips."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.1988.5238"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.1999.809463"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.39"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/339647.339696"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2006.1620793"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454128"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/106972.106995"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/3-540-56891-3_27"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.5555\/645608.662005"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2011.241"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2000064.2000076"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1944862.1944885"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2011.5749726"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11390-010-9321-5"},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the International Conference on Parallel Processing. 312--321","author":"Gupta Anoop","year":"1990","unstructured":"Anoop Gupta , Wolf dietrich Weber , and Todd Mowry . 1990 . Reducing memory and traffic requirements for scalable directory-based cache coherence schemes . In Proceedings of the International Conference on Parallel Processing. 312--321 . Anoop Gupta, Wolf dietrich Weber, and Todd Mowry. 1990. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In Proceedings of the International Conference on Parallel Processing. 
312--321."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1105734.1105739"},{"volume-title":"Intel Xeon Phi Product Family (July","year":"2014","key":"e_1_2_1_20_1","unstructured":"Intel. 2014. Intel Xeon Phi Product Family (July 2014 ). http:\/\/www.intel.com\/content\/www\/us\/en\/processors\/xeon\/xeon-phi-detail.html. Intel. 2014. Intel Xeon Phi Product Family (July 2014). http:\/\/www.intel.com\/content\/www\/us\/en\/processors\/xeon\/xeon-phi-detail.html."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-11970-5_15"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854291"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1065010.1065034"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2009.4798261"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/344166.344610"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1147\/sj.92.0078"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1127577.1127586"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2006.302743"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/344166.344526"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2012.6168950"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854286"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370816.2370891"},{"key":"e_1_2_1_34_1","volume-title":"PHD: A Hierarchical Cache Coherent Protocol (Master\u2019s Thesis).","author":"Wallach Deborah A.","year":"1993","unstructured":"Deborah A. Wallach . 1993 . PHD: A Hierarchical Cache Coherent Protocol (Master\u2019s Thesis). (1993). Deborah A. Wallach. 1993. PHD: A Hierarchical Cache Coherent Protocol (Master\u2019s Thesis). 
(1993)."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/223982.223990"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.58"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2427631.2427632"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485965"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2002.995706"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540739"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669166"},{"key":"e_1_2_1_42_1","volume-title":"Proceedings of the Workshop on Chip Multiprocessor Memory Systems and Interconnect.","author":"Zhao Li","year":"2007","unstructured":"Li Zhao , Ravi Iyer , Srihari Makineni , Jaideep Moses , Ramesh Illikkal , and Donald Newell . 2007 . Performance, area and bandwidth implications on large-scale CMP cache design . In Proceedings of the Workshop on Chip Multiprocessor Memory Systems and Interconnect. Li Zhao, Ravi Iyer, Srihari Makineni, Jaideep Moses, Ramesh Illikkal, and Donald Newell. 2007. Performance, area and bandwidth implications on large-scale CMP cache design. 
In Proceedings of the Workshop on Chip Multiprocessor Memory Systems and Interconnect."},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056065"}],"container-title":["ACM Transactions on Computer Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3092702","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3092702","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3092702","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:37:26Z","timestamp":1750217846000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3092702"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,5,31]]},"references-count":41,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2017,5,31]]}},"alternative-id":["10.1145\/3092702"],"URL":"https:\/\/doi.org\/10.1145\/3092702","relation":{},"ISSN":["0734-2071","1557-7333"],"issn-type":[{"type":"print","value":"0734-2071"},{"type":"electronic","value":"1557-7333"}],"subject":[],"published":{"date-parts":[[2017,5,31]]},"assertion":[{"value":"2015-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-04-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2017-07-28","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}