{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:24:01Z","timestamp":1750220641930,"version":"3.41.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2019,12,17]],"date-time":"2019-12-17T00:00:00Z","timestamp":1576540800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2019,12,31]]},"abstract":"<jats:p>Heterogeneous microprocessors integrate a CPU and GPU on the same chip, providing fast CPU-GPU communication and enabling cores to compute on data \u201cin place.\u201d This permits exploiting a finer granularity of parallelism on the integrated GPUs, and enables the use of GPUs for accelerating more complex and irregular codes. One challenge, however, is exposing enough parallelism such that both the CPU and GPU are effectively utilized to achieve maximum gain.<\/jats:p>\n          <jats:p>\n            In this article, we propose exploiting nested parallelism for integrated CPU-GPU chips. We look for loop structures in which one or more regular data parallel loops are nested within a parallel outer loop that can contain irregular code (e.g., with control divergence). By scheduling the outer loop on multiple CPU cores, multiple dynamic instances of the inner regular loop(s) can be scheduled on the GPU cores. This boosts GPU utilization and parallelizes the outer loop. 
We find that such\n            <jats:italic>nested MIMD-SIMD parallelization<\/jats:italic>\n            provides greater levels of parallelism for integrated CPU-GPU chips, and additionally there is ample opportunity to perform such parallelization in OpenMP programs.\n          <\/jats:p>\n          <jats:p>Our results show nested MIMD-SIMD parallelization provides a 16.1x and 8.67x speedup over sequential execution on a simulator and a physical machine, respectively. Our technique beats CPU-only parallelization by 4.13x and 2.40x, respectively, and GPU-only parallelization by 2.74x and 2.26x, respectively. Compared to the next-best scheme (either CPU- or GPU-only parallelization) per benchmark, our approach provides a 1.46x and 1.23x speedup for the simulator and physical machine, respectively.<\/jats:p>","DOI":"10.1145\/3368304","type":"journal-article","created":{"date-parts":[[2019,12,18]],"date-time":"2019-12-18T13:21:11Z","timestamp":1576675271000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["Nested MIMD-SIMD Parallelization for Heterogeneous Microprocessors"],"prefix":"10.1145","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4277-9994","authenticated-orcid":false,"given":"Daniel","family":"Gerzhoy","sequence":"first","affiliation":[{"name":"University of Maryland at College Park, College Park, MD"}]},{"given":"Xiaowu","family":"Sun","sequence":"additional","affiliation":[{"name":"University of Maryland at College Park, College Park, MD"}]},{"given":"Michael","family":"Zuzak","sequence":"additional","affiliation":[{"name":"University of Maryland at College Park, College Park, MD"}]},{"given":"Donald","family":"Yeung","sequence":"additional","affiliation":[{"name":"University of Maryland at College Park, College Park, MD"}]}],"member":"320","published-online":{"date-parts":[[2019,12,17]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Intel 
Corporation. [n.d.]. Intel Sandy Bridge Microarchitecture. Available at http:\/\/www.intel.com."},{"key":"e_1_2_1_2_1","unstructured":"N. Brookwood. 2010. AMD Fusion Family of APUs: Enabling a Superior Immersive PC Experience. White Paper. AMD."},{"key":"e_1_2_1_3_1","unstructured":"Apple Inc. [n.d.]. iPhone. Available at: https:\/\/www.apple.com\/eg\/iphone\/."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.5555\/2015039.2015535"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750393"},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the Workshop on Programming Models for Emerging Architectures held in conjunction with the Symposium on Parallel Architectures and Compilation Techniques.","author":"Guevara Marisabel","year":"2009","unstructured":"Marisabel Guevara, Chris Gregg, Kim Hazelwood, and Kevin Skadron. 2009. Enabling task parallelism in the CUDA scheduler. In Proceedings of the Workshop on Programming Models for Emerging Architectures held in conjunction with the Symposium on Parallel Architectures and Compilation Techniques."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2381056.2381081"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/SAAHPC.2012.12"},{"key":"e_1_2_1_9_1","unstructured":"OpenMP. 2014. 
The OpenMP API Specification for Parallel Programming. Available at http:\/\/www.openmp.org."},{"key":"e_1_2_1_10_1","first-page":"1","article-title":"gem5-gpu: A heterogeneous CPU-GPU simulator","volume":"13","author":"Power Jason","year":"2014","unstructured":"Jason Power, Joel Hestness, Marc S. Orr, Mark D. Hill, and David A. Wood. 2014. gem5-gpu: A heterogeneous CPU-GPU simulator. IEEE Computer Architecture Letters 13, 1 (Jan. 2014), 34--36.","journal-title":"IEEE Computer Architecture Letters"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/SAAHPC.2011.29"},{"volume-title":"Proceedings of the ACM International Conference on Computing Frontiers.","author":"Spafford Kyle","key":"e_1_2_1_12_1","unstructured":"Kyle Spafford, Jeremy S. Meredith, Seyong Lee, Dong Li, Philip C. Roth, and Jeffrey S. Vetter. 2012. The tradeoffs of fused memory hierarchies in heterogeneous computing architectures. In Proceedings of the ACM International Conference on Computing Frontiers."},{"volume-title":"Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units.","author":"Danalis Anthony","key":"e_1_2_1_13_1","unstructured":"Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. 2010. The scalable heterogeneous computing (SHOC) benchmark suite. 
In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units."},{"key":"e_1_2_1_14_1","doi-asserted-by":"crossref","unstructured":"J. Dongarra and P. Luszczek. 2005. Introduction to the HPC Challenge Benchmark Suite. Technical Report. University of Tennessee--Knoxville.","DOI":"10.21236\/ADA439315"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2012.57"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2010.5650274"},{"key":"e_1_2_1_17_1","unstructured":"SPEC. 2015. SPEC\u2019s Benchmarks. Retrieved November 5, 2019 from http:\/\/www.spec.org\/benchmarks.html."},{"key":"e_1_2_1_18_1","volume-title":"Proceedings of the 10th International Workshop on Programmability and Architectures for Heterogeneous Multicores.","author":"Zuzak Michael","year":"2017","unstructured":"Michael Zuzak and Donald Yeung. 2017. Exploiting multi-loop parallelism on heterogeneous microprocessors. 
In Proceedings of the 10th International Workshop on Programmability and Architectures for Heterogeneous Multicores."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/2628071.2628088"},{"volume-title":"Proceedings of the 18th International Conference on High Performance Computing.","author":"Vignesh","key":"e_1_2_1_20_1","unstructured":"Vignesh T. Ravi and Gagan Agrawal. 2011. A dynamic scheduling framework for emerging heterogeneous systems. In Proceedings of the 18th International Conference on High Performance Computing."},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1810085.1810106"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/2555243.2555254"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2004.840491"},{"volume-title":"Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-Based Processing. 244--250","author":"Dorta A. J.","key":"e_1_2_1_26_1","unstructured":"A. J. Dorta, C. Rodriguez, F. d. Sande, and A. Gonzalez-Escribano. 2005. The OpenMP Source Code Repository. In Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-Based Processing. 244--250. DOI:10.1109\/EMPDP.2005.41"},{"key":"e_1_2_1_27_1","volume-title":"SPEC OMP 2001","author":"SPEC.","year":"2001","unstructured":"SPEC. 2001. SPEC OMP 2001. 
Retrieved November 5, 2019 from https:\/\/www.spec.org\/omp2001\/."},{"volume-title":"Advanced Compiler Design and Implementation. Morgan Kaufmann","author":"Muchnick Steven","key":"e_1_2_1_28_1","unstructured":"Steven Muchnick. 1997. Advanced Compiler Design and Implementation. Morgan Kaufmann, San Francisco, CA."},{"key":"e_1_2_1_29_1","unstructured":"Michael Kruse and Hal Finkel. 2018. Loop optimization framework. arXiv:1811.00632."},{"key":"e_1_2_1_30_1","volume-title":"SPEC OMP 2012","author":"SPEC.","year":"2012","unstructured":"SPEC. 2012. SPEC OMP 2012. Retrieved November 5, 2019 from https:\/\/www.spec.org\/omp2012\/."},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the International Symposium on Code Generation and Optimization.","author":"Kruse Michael","year":"2018","unstructured":"Michael Kruse and Tobias Grosser. 2018. DeLICM: Scalar dependence removal at zero memory cost. In Proceedings of the International Symposium on Code Generation and Optimization."},{"key":"e_1_2_1_32_1","first-page":"2","article-title":"The gem5 simulator","volume":"39","author":"Binkert N.","year":"2011","unstructured":"N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, et al. 2011. The gem5 simulator. 
ACM SIGARCH Computer Architecture News 39, 2 (May 2011), 1--7.","journal-title":"ACM SIGARCH Computer Architecture News"},{"key":"e_1_2_1_33_1","volume-title":"Intel Core i7-6770HQ Processor. Retrieved","author":"Intel Corporation","year":"2019","unstructured":"Intel Corporation. [n.d.]. Intel Core i7-6770HQ Processor. Retrieved November 5, 2019 from https:\/\/ark.intel.com\/products\/93341\/Intel-Core-i7-6770HQ-Processor-6M-Cache-up-to-3_50-GHz."},{"volume-title":"Proceedings of the International Symposium on Performance Analysis of Systems and Software.","author":"Bakhoda Ali","key":"e_1_2_1_34_1","unstructured":"Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the International Symposium on Performance Analysis of Systems and Software."},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2013.6522332"},{"key":"e_1_2_1_36_1","unstructured":"Gem5. 2009. Gem5 M5threads. Retrieved November 5, 2019 from https:\/\/github.com\/gem5\/m5threads."},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/2909437.2909450"},{"volume-title":"Proceedings of the 2008 International Symposium on Code Generation and Optimization.","author":"Raman Easwaran","key":"e_1_2_1_38_1","unstructured":"Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew J. Bridges, and David I. August. 2008. 
Parallel-stage decoupled software pipelining. In Proceedings of the 2008 International Symposium on Code Generation and Optimization."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3368304","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3368304","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:02:06Z","timestamp":1750197726000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3368304"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,12,17]]},"references-count":36,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2019,12,31]]}},"alternative-id":["10.1145\/3368304"],"URL":"https:\/\/doi.org\/10.1145\/3368304","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2019,12,17]]},"assertion":[{"value":"2019-03-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-12-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}