{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,6]],"date-time":"2026-04-06T12:07:10Z","timestamp":1775477230855,"version":"3.50.1"},"reference-count":41,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2018,10,10]],"date-time":"2018-10-10T00:00:00Z","timestamp":1539129600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100000923","name":"Australian Research Council","doi-asserted-by":"crossref","award":["DP170103956 and DP180104069"],"award-info":[{"award-number":["DP170103956 and DP180104069"]}],"id":[{"id":"10.13039\/501100000923","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Innovative Team Support Program of Hunan","award":["2017RS3047"],"award-info":[{"award-number":["2017RS3047"]}]},{"name":"National Natural Science Foundation of Hunan","award":["2018JJ3616"],"award-info":[{"award-number":["2018JJ3616"]}]},{"name":"National Key Research and Development Program of China","award":["2017YFB0202003"],"award-info":[{"award-number":["2017YFB0202003"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2018,12,31]]},"abstract":"<jats:p>GEneral Matrix Multiply (GEMM) is the most fundamental computational kernel routine in the BLAS library. To achieve high performance, in-memory data must be prefetched into fast on-chip caches before they are used. Two techniques, software prefetching and data packing, have been used to effectively exploit the capability of on-chip least recently used (LRU) caches, which are popular in traditional high-performance processors used in high-end servers and supercomputers. 
However, the market has recently witnessed a new diversity in processor design, resulting in high-performance processors equipped with shared caches with non-LRU replacement policies. This poses a challenge to the development of high-performance GEMM in a multithreaded context. As several threads try to load data into a shared cache simultaneously, interthread cache conflicts will increase significantly. We present a Shared Cache Partitioning (SCP) method to eliminate interthread cache conflicts in the GEMM routines, by partitioning a shared cache into physically disjoint sets and assigning different sets to different threads. We have implemented SCP in the OpenBLAS library and evaluated it on Phytium 2000+, a 64-core AArch64 processor with private LRU L1 caches and shared pseudo-random L2 caches (per four-core cluster). Our evaluation shows that SCP has effectively reduced the conflict misses in both L1 and L2 caches in a highly optimized GEMM implementation, resulting in an improvement of its performance by 2.75% to 6.91%.<\/jats:p>","DOI":"10.1145\/3274654","type":"journal-article","created":{"date-parts":[[2018,10,10]],"date-time":"2018-10-10T13:30:46Z","timestamp":1539178246000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["SCP"],"prefix":"10.1145","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-7514-1495","authenticated-orcid":false,"given":"Xing","family":"Su","sequence":"first","affiliation":[{"name":"National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, China"}]},{"given":"Xiangke","family":"Liao","sequence":"additional","affiliation":[{"name":"National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, China"}]},{"given":"Hao","family":"Jiang","sequence":"additional","affiliation":[{"name":"National Laboratory for Parallel and Distributed Processing, National 
University of Defense Technology, China"}]},{"given":"Canqun","family":"Yang","sequence":"additional","affiliation":[{"name":"National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, China"}]},{"given":"Jingling","family":"Xue","sequence":"additional","affiliation":[{"name":"UNSW Sydney, Australia"}]}],"member":"320","published-online":{"date-parts":[[2018,10,10]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2018. The Phytium processor family. Retrieved from http:\/\/www.phytium.com.cn."},{"key":"e_1_2_1_2_1","unstructured":"November 2017. Top500 supercomputer sites. Retrieved from https:\/\/www.top500.org."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1375581.1375595"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2008.10.002"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.33"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/996841.996853"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1356052.1356053"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1377603.1377607"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1454115.1454145"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/292395.292412"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/292395.292426"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-014-1098-9"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2491956.2462187"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.5555\/1673012.1673014"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/106972.106981"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI"
:"10.1145\/349299.349320"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/355841.355847"},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the International Conference on High-Performance Computer Architecture (HPCA'08)","author":"Lin Jiang","unstructured":"Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan. 2008. Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems. In Proceedings of the International Conference on High-Performance Computer Architecture (HPCA'08). 367--378."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33065-0_18"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/2925987"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2006.49"},{"key":"e_1_2_1_22_1","unstructured":"I. Rosen, D. Nuzman, and A. Zaks. 2007. Loop-aware SLP in GCC. In GCC Developers\u2019 Summit. 131--142."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2014.110"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2581122.2544155"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2017.7863734"},{"key":"e_1_2_1_26_1","volume-title":"Tools for High Performance Computing","author":"Terpstra Dan","year":"2009","unstructured":"
Dan Terpstra, Heike Jagode, Haihang You, and Jack Dongarra. 2010. Collecting performance data with PAPI-C. In Tools for High Performance Computing 2009. Springer, Berlin, 157--173."},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/2764454"},{"key":"e_1_2_1_28_1","volume-title":"Proceedings of the Conference on High Performance Computing (SC'08)","author":"Volkov Vasily","unstructured":"Vasily Volkov and James W. Demmel. 2008. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the Conference on High Performance Computing (SC'08). 31:1--31:11."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2015.29"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503219"},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the Conference on Supercomputing (SC\u201998)","author":"Clint Whaley R.","unstructured":"R. Clint Whaley and Jack J. Dongarra. 1998. Automatically tuned linear algebra software. In Proceedings of the Conference on Supercomputing (SC\u201998). 1--27."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.5555\/353939"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-016-0441-6"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.5555\/2190025.2190057"},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'07)","author":"Yi Qing","unstructured":"
Qing Yi, K. Seymour, H. You, R. Vuduc, and D. Quinlan. 2007. POET: Parameterized optimizations for empirical tuning. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'07). 1--8."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.14"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2004.840444"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/2755561"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS.2012.97"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2886101"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/2854038.2854054"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3274654","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3274654","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:57:56Z","timestamp":1750208276000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3274654"}},"subtitle":["Shared Cache Partitioning for High-Performance 
GEMM"],"short-title":[],"issued":{"date-parts":[[2018,10,10]]},"references-count":41,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2018,12,31]]}},"alternative-id":["10.1145\/3274654"],"URL":"https:\/\/doi.org\/10.1145\/3274654","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,10,10]]},"assertion":[{"value":"2018-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}