{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:10:12Z","timestamp":1750291812261,"version":"3.41.0"},"reference-count":44,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2014,2,1]],"date-time":"2014-02-01T00:00:00Z","timestamp":1391212800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2014,2]]},"abstract":"<jats:p>\n            Graphics processing units, or GPUs, provide TFLOPs of additional performance potential in commodity computer systems that frequently go unused by most applications. Even with the emergence of languages such as CUDA and OpenCL, programming GPUs remains a difficult challenge for a variety of reasons, including the inherent algorithmic characteristics and data structure choices used by applications as well as the tedious performance optimization cycle that is necessary to achieve high performance. The goal of this work is to increase the applicability of GPUs beyond CUDA\/OpenCL to implicitly data-parallel applications written in C\/C++ using speculative parallelization. To achieve this goal, we propose\n            <jats:italic>Paragon<\/jats:italic>\n            : a static\/dynamic compiler platform to speculatively run possibly data-parallel portions of sequential applications on the GPU while cooperating with the system CPU. For such loops, Paragon utilizes the GPU in an opportunistic way while orchestrating a cooperative relation between the CPU and GPU to reduce the overhead of miss-speculations. 
Paragon monitors the dependencies for the loops running speculatively on the GPU and nonspeculatively on the CPU using a lightweight distributed conflict detection scheme designed specifically for GPUs, and transfers execution to the CPU if a conflict is detected. Paragon resumes execution on the GPU after the CPU resolves the dependency. Our experiments show that Paragon achieves an average speedup of 4x (up to 30x) compared to unsafe CPU execution with four threads, and an average speedup of 7x (up to 64x) versus sequential execution, across a set of sequential but implicitly data-parallel applications.\n          <\/jats:p>","DOI":"10.1145\/2579617","type":"journal-article","created":{"date-parts":[[2014,3,18]],"date-time":"2014-03-18T12:09:07Z","timestamp":1395144547000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Leveraging GPUs using cooperative loop speculation"],"prefix":"10.1145","volume":"11","author":[{"given":"Mehrzad","family":"Samadi","sequence":"first","affiliation":[{"name":"University of Michigan, Ann Arbor, MI"}]},{"given":"Amir","family":"Hormati","sequence":"additional","affiliation":[{"name":"Google Inc., Seattle, WA"}]},{"given":"Janghaeng","family":"Lee","sequence":"additional","affiliation":[{"name":"University of Michigan, Ann Arbor, MI"}]},{"given":"Scott","family":"Mahlke","sequence":"additional","affiliation":[{"name":"University of Michigan, Ann Arbor, 
MI"}]}],"member":"320","published-online":{"date-parts":[[2014,2]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1375527.1375562"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-11970-5_14"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.5555\/1025127.1025992"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/362686.362692"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.546612"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1345206.1345242"},{"volume-title":"Proc. of the 39th Annual International Symposium on Computer Architecture. 49--60","author":"Brunie N.","key":"e_1_2_1_7_1"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/2386208.2386228"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1941553.1941561"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/PRDC.2009.55"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2011.63"},{"volume-title":"Proc. of the 2010 IEEE International Symposium on Parallel and Distributed Processing. 1--12","author":"Diamos G.","key":"e_1_2_1_12_1"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155655"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2010.62"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1133981.1133984"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/2259016.2259029"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2008.59"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1504176.1504181"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/1250734.1250759"},{"volume-title":"Proc. of the 16th Workshop on Languages and Compilers for Parallel Computing.","author":"Lee S. 
I.","key":"e_1_2_1_20_1"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1816021"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/1596655.1596670"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10766-010-0155-0"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1542476.1542495"},{"volume-title":"Proc. of the 39th Annual International Symposium on Computer Architecture. 72--83","author":"Menon J.","key":"e_1_2_1_25_1"},{"key":"e_1_2_1_26_1","unstructured":"NVIDIA. 2010. GPUs Are Only Up To 14 Times Faster than CPUs says Intel. Retrieved http:\/\/blogs.nvidia.com\/ntersect\/2010\/06\/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel.html.  NVIDIA. 2010. GPUs Are Only Up To 14 Times Faster than CPUs says Intel. Retrieved http:\/\/blogs.nvidia.com\/ntersect\/2010\/06\/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel.html."},{"volume-title":"Proc. of the 11th Static Analysis Symposium. 165--180","author":"Nystrom E.","key":"e_1_2_1_27_1"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/1370082.1370090"},{"key":"e_1_2_1_29_1","unstructured":"Polybench. 2011. The Polyhedral Benchmark suite. Retrieved from http:\/\/www.cse.ohio-state.edu\/pouchet\/software\/polybench.  Polybench. 2011. The Polyhedral Benchmark suite. Retrieved from http:\/\/www.cse.ohio-state.edu\/pouchet\/software\/polybench."},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1375581.1375594"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.250105"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.752782"},{"volume-title":"Proc. of the 1st Workshop on General Purpose Processing on Graphics Processing Units. 
1--4.","author":"Roger D.","key":"e_1_2_1_33_1"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/1356058.1356084"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/2254064.2254067"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/1082469.1082471"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1168857.1168898"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2008.4771802"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-03013-0_7"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/193209.193217"},{"volume-title":"High Performance Compilers for Parallel Computing","author":"Wolfe M.","key":"e_1_2_1_41_1"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/1735688.1735697"},{"volume-title":"Proc. of the 2010 IEEE International Symposium on Parallel and Distributed Processing. 1--12","year":"2010","author":"Xiao S.","key":"e_1_2_1_43_1"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/1950365.1950408"}],"container-title":["ACM Transactions on Architecture and Code 
Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2579617","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2579617","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:43:50Z","timestamp":1750290230000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2579617"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,2]]},"references-count":44,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2014,2]]}},"alternative-id":["10.1145\/2579617"],"URL":"https:\/\/doi.org\/10.1145\/2579617","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2014,2]]},"assertion":[{"value":"2013-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2013-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-02-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}