{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:20:35Z","timestamp":1750306835470,"version":"3.41.0"},"reference-count":44,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2013,12,1]],"date-time":"2013-12-01T00:00:00Z","timestamp":1385856000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2013,12]]},"abstract":"<jats:p>\n            As microprocessor designs integrate more cores, the scalability of cache coherence protocols becomes a challenging problem. Most directory-based protocols avoid races by using blocking tag directories that can impact the performance of parallel applications. In this article, we first quantitatively demonstrate that state-of-the-art blocking protocols significantly constrain throughput at large core counts for several parallel applications. Nonblocking protocols address this throughput concern at the expense of scalability in the interconnection network or in the required resource overheads. To address this concern, we enhance nonblocking directory protocols by migrating the point of service of responses. Our approach uses in-flight chains of cores making parallel memory requests to incorporate scalability while maintaining high throughput. The proposed cache coherence protocol, called\n            <jats:italic>chained cache coherence<\/jats:italic>\n            , can outperform blocking protocols by up to 20% on scientific and 12% on commercial applications. It also has low resource overheads and simple address ordering requirements, making it both a high-performance and a scalable protocol. 
Furthermore, in-flight chains provide a scalable solution to building hierarchical and nonblocking tag directories as well as to optimizing communication latencies.\n          <\/jats:p>","DOI":"10.1145\/2541228.2541235","type":"journal-article","created":{"date-parts":[[2014,1,2]],"date-time":"2014-01-02T13:09:43Z","timestamp":1388668183000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Using in-flight chains to build a scalable cache coherence protocol"],"prefix":"10.1145","volume":"10","author":[{"given":"Samantika","family":"Subramaniam","sequence":"first","affiliation":[{"name":"Intel Corporation"}]},{"given":"Simon C.","family":"Steely","sequence":"additional","affiliation":[{"name":"Intel Corporation"}]},{"given":"Will","family":"Hasenplaugh","sequence":"additional","affiliation":[{"name":"Intel Corporation and MIT"}]},{"given":"Aamer","family":"Jaleel","sequence":"additional","affiliation":[{"name":"Intel Corporation"}]},{"given":"Carl","family":"Beckmann","sequence":"additional","affiliation":[{"name":"Intel Corporation"}]},{"given":"Tryggve","family":"Fossum","sequence":"additional","affiliation":[{"name":"Intel Corporation"}]},{"given":"Joel","family":"Emer","sequence":"additional","affiliation":[{"name":"Intel Corporation and MIT"}]}],"member":"320","published-online":{"date-parts":[[2013,12]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Bailey D. H. 1994. The NAS Parallel Benchmarks. www.davidhbailey.com\/dhbpapers\/npb-encycpc.pdf.  Bailey D. H. 1994. The NAS Parallel Benchmarks. 
www.davidhbailey.com\/dhbpapers\/npb-encycpc.pdf."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/300979.301004"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2024716.2024718"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-009-0096-7"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.55500"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2003.1214336"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2007.346210"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/165123.165146"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/647765.735832"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.982918"},{"volume-title":"Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201911)","author":"Ferdman M.","key":"e_1_2_1_11_1","unstructured":"Ferdman , M. , Lotfi-Kamran , P. , Balet , K. , and Falsafi , B . 2011. Cuckoo directory: A scalable directory for many-core systems . In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201911) . Ferdman, M., Lotfi-Kamran, P., Balet, K., and Falsafi, B. 2011. Cuckoo directory: A scalable directory for many-core systems. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201911)."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/378993.378997"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-87475-1_21"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/35.533919"},{"volume-title":"Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201999)","author":"Hagersten E.","key":"e_1_2_1_15_1","unstructured":"Hagersten , E. and Koster , M . 1999. WildFire: A scalable path for SMPs . 
In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201999) . Hagersten, E. and Koster, M. 1999. WildFire: A scalable path for SMPs. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201999)."},{"volume-title":"Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201906)","author":"Jaleel A.","key":"e_1_2_1_16_1","unstructured":"Jaleel , A. , Mattina , M. , and Jacob , B . 2006. Last level cache (LLC) performance of data mining workloads on a CMP\u2014a case study of parallel bioinformatics workloads . In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201906) . Jaleel, A., Mattina, M., and Jacob, B. 2006. Last level cache (LLC) performance of data mining workloads on a CMP\u2014a case study of parallel bioinformatics workloads. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA\u201906)."},{"key":"e_1_2_1_17_1","unstructured":"Jeffers J. 2012. Intel\u00ae many integrated core architecture: An overview and programming models. http:\/\/www.olcf.ornl.gov\/wp-content\/training\/electronic-structure-2012\/ORNL_Elec_Struct_WS_ 02062012.pdf.  Jeffers J. 2012. Intel\u00ae many integrated core architecture: An overview and programming models. http:\/\/www.olcf.ornl.gov\/wp-content\/training\/electronic-structure-2012\/ORNL_Elec_Struct_WS_ 02062012.pdf."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/237578.237583"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2010.82"},{"volume-title":"Proceedings of the International Symposium on High-Performance Architecture (HPCA\u201999)","author":"Kaxiras S.","key":"e_1_2_1_21_1","unstructured":"Kaxiras , S. and Goodman , J . 1999. Improving CC-NUMA performance using instruction-based prediction . 
In Proceedings of the International Symposium on High-Performance Architecture (HPCA\u201999) . Kaxiras, S. and Goodman, J. 1999. Improving CC-NUMA performance using instruction-based prediction. In Proceedings of the International Symposium on High-Performance Architecture (HPCA\u201999)."},{"key":"e_1_2_1_22_1","unstructured":"Kong J. Yew P.-C. Y. and Gyungho L. 1999. A Non-blocking Directory Protocol for Large-Scale Multiprocessors. Tech. rep. TR 99-012 University of Minnesota.  Kong J. Yew P.-C. Y. and Gyungho L. 1999. A Non-blocking Directory Protocol for Large-Scale Multiprocessors. Tech. rep. TR 99-012 University of Minnesota."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/1378533.1378536"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/264107.264206"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.121510"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/379189.379198"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/859618.859642"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/2209249.2209269"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/859618.859640"},{"volume-title":"Proceedings of the International Symposium on High-Performance Computer Architecture (ISCA\u201902)","author":"Martin M. M.","key":"e_1_2_1_30_1","unstructured":"Martin , M. M. , Sorin , D. J. , Hill , M. D. , and Wood , D. A . 2002. Bandwidth adaptive snooping . In Proceedings of the International Symposium on High-Performance Computer Architecture (ISCA\u201902) . Martin, M. M., Sorin, D. J., Hill, M. D., and Wood, D. A. 2002. Bandwidth adaptive snooping. In Proceedings of the International Symposium on High-Performance Computer Architecture (ISCA\u201902)."},{"volume-title":"Proceedings of the International Symposium on Computer Architecture (ISCA\u201908)","author":"Marty M.","key":"e_1_2_1_32_1","unstructured":"Marty , M. and Hill , M . 
2008. Virtual hierarchies . In Proceedings of the International Symposium on Computer Architecture (ISCA\u201908) . Marty, M. and Hill, M. 2008. Virtual hierarchies. In Proceedings of the International Symposium on Computer Architecture (ISCA\u201908)."},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2005.17"},{"key":"e_1_2_1_34_1","unstructured":"Morgan T. P. 2012. Intel teaches Xeon Phi x86 coprossor snappy new tricks: The interconnect rings a bell. http:\/\/www.theregister.co.uk\/2012\/09\/05\/intel_xeon_phi_coprocessor.  Morgan T. P. 2012. Intel teaches Xeon Phi x86 coprossor snappy new tricks: The interconnect rings a bell. http:\/\/www.theregister.co.uk\/2012\/09\/05\/intel_xeon_phi_coprocessor."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/279358.279386"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/SPDP.1992.242703"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2008.4771778"},{"key":"e_1_2_1_38_1","unstructured":"Rajamony R. Shafi H. Williams D. and Wright K. 2005. Chained cache coherency states for sequential non-homogeneous access to a cache line. US patent US20070083716 Al.  Rajamony R. Shafi H. Williams D. and Wright K. 2005. Chained cache coherency states for sequential non-homogeneous access to a cache line. 
US patent US20070083716 Al."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2012.6168950"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.5555\/891382"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.5555\/888927"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/223982.223990"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.139202"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/IFCSTA.2009.97"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/1669112.1669166"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2010.11"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2541228.2541235","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2541228.2541235","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T08:09:55Z","timestamp":1750234195000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2541228.2541235"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2013,12]]},"references-count":44,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2013,12]]}},"alternative-id":["10.1145\/2541228.2541235"],"URL":"https:\/\/doi.org\/10.1145\/2541228.2541235","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2013,12]]},"assertion":[{"value":"2013-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2013-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2013-12-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}