{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,24]],"date-time":"2025-10-24T07:32:10Z","timestamp":1761291130586,"version":"3.37.3"},"reference-count":31,"publisher":"Springer Science and Business Media LLC","issue":"2","license":[{"start":{"date-parts":[[2022,4,1]],"date-time":"2022-04-01T00:00:00Z","timestamp":1648771200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2022,4,5]],"date-time":"2022-04-05T00:00:00Z","timestamp":1649116800000},"content-version":"vor","delay-in-days":4,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Technische Universit\u00e4t Hamburg"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Parallel Prog"],"published-print":{"date-parts":[[2022,4]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>GPUs are capable of delivering peak performance in TFLOPs, however, peak performance is often difficult to achieve due to several performance bottlenecks. Memory divergence is one such performance bottleneck that makes it harder to exploit locality, cause cache thrashing, and high miss rate, therefore, impeding GPU performance. As data locality is crucial for performance, there have been several efforts to exploit data locality in GPUs. However, there is a lack of quantitative analysis of data locality, which could pave the way for optimizations. In this paper, we quantitatively study the data locality and its limits in GPUs at different granularities. We show that, in contrast to previous studies, there is a significantly higher inter-warp locality at the L1 data cache for memory-divergent workloads. We further show that about 50% of the cache capacity and other scarce resources such as NoC bandwidth are wasted due to data over-fetch caused by memory divergence. 
While the low spatial utilization of cache lines justifies the sectored-cache design to only fetch those sectors of a cache line that are needed during a request, our limit study reveals the lost spatial locality for which additional memory requests are needed to fetch the other sectors of the same cache line. The lost spatial locality presents opportunities for further optimizing the cache design.<\/jats:p>","DOI":"10.1007\/s10766-022-00729-2","type":"journal-article","created":{"date-parts":[[2022,4,5]],"date-time":"2022-04-05T20:02:20Z","timestamp":1649188940000},"page":"189-216","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["A Quantitative Study of Locality in GPU Caches for Memory-Divergent Workloads"],"prefix":"10.1007","volume":"50","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2325-1705","authenticated-orcid":false,"given":"Sohan","family":"Lal","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Bogaraju Sharatchandra","family":"Varma","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ben","family":"Juurlink","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2022,4,5]]},"reference":[{"key":"729_CR1","doi-asserted-by":"crossref","unstructured":"Al-Kiswany, S., Gharaibeh, A., Santos-Neto, E., Yuan, G., Ripeanu, M.: StoreGPU: Exploiting graphics processing units to accelerate distributed storage systems. In: Proceedings of the 17th International symposium on high performance distributed computing, HPDC (2008)","DOI":"10.1145\/1383422.1383443"},{"key":"729_CR2","doi-asserted-by":"crossref","unstructured":"Bakhoda, A., Yuan, G.L., Fung, W.W.L., Wong, H., Aamodt, T.M.: Analyzing CUDA workloads using a detailed GPU simulator. 
In: Proceedings of the IEEE international symposium on performance analysis of systems and software, ISPASS, https:\/\/github.com\/gpgpu-sim\/ (2009)","DOI":"10.1109\/ISPASS.2009.4919648"},{"key":"729_CR3","doi-asserted-by":"crossref","unstructured":"Cederman, D., Chatterjee, B., Tsigas, P.: Understanding the performance of concurrent data structures on graphics processors. In: Proceedings of the 18th international conference on parallel processing, Euro-Par (2012)","DOI":"10.1007\/978-3-642-32820-6_87"},{"key":"729_CR4","doi-asserted-by":"crossref","unstructured":"Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J., Lee, S.H., Skadron, K.: Rodinia: A benchmark suite for heterogeneous computing. In: Proceedings of the IEEE International Symposium on Workload Characterization, IISWC (2009)","DOI":"10.1109\/IISWC.2009.5306797"},{"key":"729_CR5","doi-asserted-by":"crossref","unstructured":"Chen, X., Chang, L.W., Rodrigues, C.I., Lv, J., Wang, Z., Hwu, W.M.: Adaptive cache management for energy-efficient GPU computing. In: Proceedings of the 47th IEEE\/ACM International Symposium on Microarchitecture, MICRO (2014)","DOI":"10.1109\/MICRO.2014.11"},{"key":"729_CR6","unstructured":"Hennessy, J., Patterson, D.: Computer architecture: a quantitative approach, Fifth Edition (2012)"},{"key":"729_CR7","doi-asserted-by":"crossref","unstructured":"Hong, S., Oguntebi, T., Olukotun, K.: Efficient parallel graph exploration on multi-core CPU and GPU. In: International conference on parallel architectures and compilation techniques, PACT (2011)","DOI":"10.1109\/PACT.2011.14"},{"key":"729_CR8","doi-asserted-by":"crossref","unstructured":"Huang, J.C., Lee, J.H., Kim, H., Lee, H.H.S.: GPUMech: GPU performance modeling technique based on interval analysis. 
In: Proceedings of the international symposium on microarchitecture, MICRO (2014)","DOI":"10.1109\/MICRO.2014.59"},{"key":"729_CR9","doi-asserted-by":"crossref","unstructured":"Jia, W., Shaw, K.A., Martonosi, M.: Characterizing and improving the use of demand-fetched caches in GPUs. In: Proceedings of the 26th ACM international conference on supercomputing, ICS (2012)","DOI":"10.1145\/2304576.2304582"},{"key":"729_CR10","doi-asserted-by":"crossref","unstructured":"Kumar, S., Wilkerson, C.: Exploiting spatial locality in data caches using spatial footprints (1998)","DOI":"10.1145\/279361.279404"},{"key":"729_CR11","doi-asserted-by":"crossref","unstructured":"Lal, S., Lucas, J., Andersch, M., Alvarez-Mesa, M., Elhossini, A., Juurlink, B.: GPGPU Workload characteristics and performance analysis. In: Proceedings of the 14th international conference on embedded computer systems: architectures, modeling, and simulation, SAMOS (2014)","DOI":"10.1109\/SAMOS.2014.6893202"},{"key":"729_CR12","doi-asserted-by":"crossref","unstructured":"Lal, S., Juurlink, B.: A quantitative study of locality in GPU caches. In: Proceedings of the international conference on embedded computer systems: architectures, modeling, and simulation, SAMOS (2020)","DOI":"10.1007\/978-3-030-60939-9_16"},{"key":"729_CR13","doi-asserted-by":"crossref","unstructured":"Li, A., van\u00a0den Braak, G.J., Kumar, A., Corporaal, H.: Adaptive and transparent cache bypassing for GPUs. In: Proceedings of the international conference for high performance computing, networking, storage and analysis, SC (2015)","DOI":"10.1145\/2807591.2807606"},{"key":"729_CR14","doi-asserted-by":"crossref","unstructured":"Li, A., Song, S.L., Liu, W., Liu, X., Kumar, A., Corporaal, H.: Locality-aware CTA clustering for modern GPUs. 
In: Proceedings of the international conference on architectural support for programming languages and operating systems, ASPLOS (2017)","DOI":"10.1145\/3037697.3037709"},{"key":"729_CR15","doi-asserted-by":"crossref","unstructured":"Li, C., Song, S.L., Dai, H., Sidelnik, A., Hari, S.K.S., Zhou, H.: Locality-driven dynamic GPU cache bypassing. In: Proceedings of the 29th ACM on international conference on supercomputing, ICS (2015)","DOI":"10.1145\/2751205.2751237"},{"key":"729_CR16","doi-asserted-by":"crossref","unstructured":"Meng, J., Tarjan, D., Skadron, K.: Dynamic warp subdivision for integrated branch and memory divergence tolerance. In: Proceedings of the 37th annual international symposium on Computer architecture, ISCA (2010)","DOI":"10.1145\/1815961.1815992"},{"key":"729_CR17","doi-asserted-by":"crossref","unstructured":"Narasiman, V., Shebanow, M., Lee, C.J., Miftakhutdinov, R., Mutlu, O., Patt, Y.N.: Improving GPU performance via large warps and two-level warp scheduling. In: Proceedings of the 44th annual IEEE\/ACM international symposium on microarchitecture, MICRO (2011)","DOI":"10.1145\/2155620.2155656"},{"key":"729_CR18","unstructured":"NVIDIA: CUDA: compute unified device architecture. http:\/\/developer.nvidia.com\/object\/gpucomputing.html (2007)"},{"key":"729_CR19","doi-asserted-by":"crossref","unstructured":"Reguly, I.Z., Giles, M.: Efficient sparse matrix-vector multiplication on cache-based GPUs. In: Proceedings of the innovative parallel computing, InPar (2012)","DOI":"10.1109\/InPar.2012.6339602"},{"key":"729_CR20","doi-asserted-by":"crossref","unstructured":"Rhu, M., Sullivan, M., Leng, J., Erez, M.: A locality-aware memory hierarchy for energy-efficient GPU architectures. 
In: Proceedings of the 46th annual IEEE\/ACM international symposium on microarchitecture, MICRO (2013)","DOI":"10.1145\/2540708.2540717"},{"key":"729_CR21","doi-asserted-by":"crossref","unstructured":"Rogers, T.G., O\u2019Connor, M., Aamodt, T.M.: Cache-conscious wavefront scheduling. In: Proceedings of the 45th annual IEEE\/ACM international symposium on microarchitecture, MICRO (2012)","DOI":"10.1109\/MICRO.2012.16"},{"key":"729_CR22","doi-asserted-by":"crossref","unstructured":"Sandokji, S., Essa, F., Fadel, M.: A survey of techniques for warp scheduling in GPUs. In: Proceedings of the international conference on intelligent computing and information systems, (ICICIS) (2015)","DOI":"10.1109\/IntelCIS.2015.7397284"},{"key":"729_CR23","doi-asserted-by":"crossref","unstructured":"Sartori, J., Kumar, R.: Branch and data herding: reducing control and memory divergence for error-tolerant GPU applications. In: Proceedings of the 21st international conference on parallel architectures and compilation techniques, PACT (2012)","DOI":"10.1145\/2370816.2370879"},{"key":"729_CR24","doi-asserted-by":"crossref","unstructured":"Shi, Z., Huang, X., Jain, A., Lin, C.: Applying deep learning to the cache replacement problem. In: Proceedings of the IEEE\/ACM international symposium on microarchitecture, MICRO (2019)","DOI":"10.1145\/3352460.3358319"},{"key":"729_CR25","doi-asserted-by":"crossref","unstructured":"Tang, X., Pattnaik, A., Kayiran, O., Jog, A., Kandemir, M.T., Das, C.: Quantifying data locality in dynamic parallelism in GPUs. Proc. ACM Meas. Anal. Comput. Syst. (2018)","DOI":"10.1145\/3309697.3331473"},{"key":"729_CR26","doi-asserted-by":"crossref","unstructured":"Tarjan, D., Meng, J., Skadron, K.: Increasing memory miss tolerance for SIMD cores. 
In: Proceedings of the conference on high performance computing networking, storage and analysis, SC (2009)","DOI":"10.1145\/1654059.1654082"},{"key":"729_CR27","doi-asserted-by":"crossref","unstructured":"Vijaykumar, N., Ebrahimi, E., Hsieh, K., Gibbons, P.B., Mutlu, O.: The locality descriptor: a holistic cross-layer abstraction to express data locality in GPUs. In: Proceedings of the international symposium on computer architecture (ISCA) (2018)","DOI":"10.1109\/ISCA.2018.00074"},{"key":"729_CR28","doi-asserted-by":"crossref","unstructured":"Wang, L., Jahre, M., Adileh, A., Eeckhout, L.: MDM: The GPU memory divergence model. In: Proceedings of the international symposium on microarchitecture (MICRO) (2020)","DOI":"10.1109\/MICRO50266.2020.00085"},{"key":"729_CR29","doi-asserted-by":"crossref","unstructured":"Xiao, S., Lin, H., Feng, W.C.: Accelerating protein sequence search in a heterogeneous computing system. In: IEEE International parallel and distributed processing symposium, IPDPS (2011)","DOI":"10.1109\/IPDPS.2011.115"},{"key":"729_CR30","doi-asserted-by":"crossref","unstructured":"Xie, X., Liang, Y., Wang, Y., Sun, G., Wang, T.: Coordinated static and dynamic cache bypassing for GPUs. In: 2015 IEEE 21st International symposium on high performance computer architecture, HPCA (2015)","DOI":"10.1109\/HPCA.2015.7056023"},{"key":"729_CR31","doi-asserted-by":"crossref","unstructured":"Zhang, E.Z., Jiang, Y., Guo, Z., Tian, K., Shen, X.: On-the-fly elimination of dynamic irregularities for GPU computing. 
In: Proceedings of the international conference on architectural support for programming languages and operating systems, ASPLOS (2011)","DOI":"10.1145\/1950365.1950408"}],"container-title":["International Journal of Parallel Programming"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10766-022-00729-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10766-022-00729-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10766-022-00729-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,4,29]],"date-time":"2022-04-29T22:05:33Z","timestamp":1651269933000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10766-022-00729-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,4]]},"references-count":31,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2022,4]]}},"alternative-id":["729"],"URL":"https:\/\/doi.org\/10.1007\/s10766-022-00729-2","relation":{},"ISSN":["0885-7458","1573-7640"],"issn-type":[{"type":"print","value":"0885-7458"},{"type":"electronic","value":"1573-7640"}],"subject":[],"published":{"date-parts":[[2022,4]]},"assertion":[{"value":"15 April 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"9 March 2022","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 April 2022","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}