{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,20]],"date-time":"2025-12-20T22:26:51Z","timestamp":1766269611717,"version":"3.41.0"},"reference-count":30,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2014,12,8]],"date-time":"2014-12-08T00:00:00Z","timestamp":1417996800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2015,1,9]]},"abstract":"<jats:p>\n            The shift toward parallel processor architectures has made programming and code generation increasingly challenging. To address this\n            <jats:italic>programmability<\/jats:italic>\n            challenge, this article presents a technique to fully automatically generate efficient and readable code for parallel processors (with a focus on GPUs). This is made possible by combining algorithmic skeletons, traditional compilation, and \u201c\n            <jats:italic>algorithmic species<\/jats:italic>\n            ,\u201d a classification of program code. Compilation starts by automatically annotating C code with class information (the algorithmic species). This code is then fed into the skeleton-based source-to-source compiler\n            <jats:sc>bones<\/jats:sc>\n            to generate CUDA code. To generate efficient code,\n            <jats:sc>bones<\/jats:sc>\n            also performs optimizations including host-accelerator transfer optimization and kernel fusion. This results in a unique approach, integrating a skeleton-based compiler for the first time into an automated flow. 
The benefits are demonstrated experimentally for PolyBench GPU kernels, showing geometric mean speed-ups of 1.4\u00d7 and 2.4\u00d7 compared to\n            <jats:sc>ppcg<\/jats:sc>\n            and\n            <jats:sc>Par4All<\/jats:sc>\n            , and for five Rodinia GPU benchmarks, showing a gap of only 1.2\u00d7 compared to hand-optimized code.\n          <\/jats:p>","DOI":"10.1145\/2665079","type":"journal-article","created":{"date-parts":[[2014,12,8]],"date-time":"2014-12-08T16:17:14Z","timestamp":1418055434000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":14,"title":["Bones"],"prefix":"10.1145","volume":"11","author":[{"given":"Cedric","family":"Nugteren","sequence":"first","affiliation":[{"name":"Eindhoven University of Technology, Eindhoven, The Netherlands"}]},{"given":"Henk","family":"Corporaal","sequence":"additional","affiliation":[{"name":"Eindhoven University of Technology, Eindhoven, The Netherlands"}]}],"member":"320","published-online":{"date-parts":[[2014,12,8]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Marco Aldinucci Marco Danelutto Peter Kilpatrick and Massimo Torquati. 2013. FastFlow: High-level and efficient streaming on multi-core. Programming Multi-core and Many-core Computing Systems 13 (January 2013). Wiley."},{"volume-title":"Proceedings of the IMPACT Workshop.","year":"2012","author":"Amini Mehdi","key":"e_1_2_1_2_1"},{"volume-title":"Proceedings of the CPC Workshop. 
INRIA.","year":"2010","author":"Baghdadi Soufiane","key":"e_1_2_1_3_1"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-11970-5_14"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/1375581.1375595"},{"volume-title":"Algorithmic Skeletons: Structured Management of Parallel Computation","year":"1991","author":"Cole Murray","key":"e_1_2_1_6_1"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.5555\/645674.663458"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8191(00)00034-X"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-45293-2_13"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/1863482.1863487"},{"key":"e_1_2_1_11_1","first-page":"509","article-title":"Data parallel skeletons for GPU clusters and multi-GPU systems","volume":"22","author":"Ernsting Steffen","year":"2011","journal-title":"Advances in Parallel Computing"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/InPar.2012.6339595"},{"volume-title":"Proceedings of the LCPC Workshop. 
Springer.","year":"2012","author":"Guelton Serge","key":"e_1_2_1_13_1"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2010.62"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2259016.2259038"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.5555\/645671.665526"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CGO.2013.6494995"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2503210.2503268"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCC.2011.73"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/MuCoCoS.2013.6633604"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/2400682.2400699"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-45293-2_14"},{"volume-title":"Proceedings of the International Conference on Code Generation and Optimization. IEEE.","author":"Park Eunjung","key":"e_1_2_1_23_1"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2010.14"},{"key":"e_1_2_1_25_1","unstructured":"Louis-Noel Pouchet and Scott Grauer-Gray. 2013. PolyBench: The Polyhedral Benchmark Suite. (2013). On-line: http:\/\/www.cse.ohio-state.edu\/~pouchet\/software\/polybench\/."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-10672-9_8"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPPW.2012.18"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS.2011.269"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2400682.2400713"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1735688.1735697"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2665079","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2665079","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T06:13:25Z","timestamp":1750227205000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2665079"}},"subtitle":["An Automatic Skeleton-Based C-to-CUDA Compiler for GPUs"],"short-title":[],"issued":{"date-parts":[[2014,12,8]]},"references-count":30,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2015,1,9]]}},"alternative-id":["10.1145\/2665079"],"URL":"https:\/\/doi.org\/10.1145\/2665079","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2014,12,8]]},"assertion":[{"value":"2013-11-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2014-12-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}