{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,4]],"date-time":"2025-12-04T09:55:28Z","timestamp":1764842128915,"version":"3.41.0"},"reference-count":68,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2018,8,28]],"date-time":"2018-08-28T00:00:00Z","timestamp":1535414400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Microsoft Azure for Research program"},{"name":"CCES CEPID\/FAPESP","award":["2014\/25694-8 and 2017\/21339-7"],"award-info":[{"award-number":["2014\/25694-8 and 2017\/21339-7"]}]},{"name":"AWS Cloud Credits for Research program"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2018,9,30]]},"abstract":"<jats:p>Computation offloading is a programming model in which program fragments (e.g., hot loops) are annotated so that their execution is performed in dedicated hardware or accelerator devices. Although offloading has been extensively used to move computation to GPUs, through directive-based annotation standards like OpenMP, offloading computation to very large computer clusters can become a complex and cumbersome task. It typically requires mixing programming models (e.g., OpenMP and MPI) and languages (e.g., C\/C++ and Scala), dealing with various access control mechanisms from different cloud providers (e.g., AWS and Azure), and integrating all this into a single application. This article introduces computer cluster nodes as simple OpenMP offloading devices that can be used either from a local computer or from the cluster head-node. 
It proposes a methodology that transforms OpenMP directives into Spark runtime calls with fully integrated communication management, so that a cluster appears to the programmer as yet another accelerator device. Experiments using LLVM 3.8 and OpenMP 4.5 on well-known cloud infrastructures (Microsoft Azure and Amazon EC2) show the viability of the proposed approach, enable a thorough analysis of its performance, and compare it with an MPI implementation. The results show that although data transfers can impose overheads, cloud offloading from a local machine can still achieve promising speedups for larger granularity: up to 115\u00d7 on 256 cores for the <jats:italic>2MM<\/jats:italic> benchmark using 1GB sparse matrices. In addition, the parallel implementation of a complex and relevant scientific application achieves an 80\u00d7 speedup on a 320-core machine when executed directly from the head-node of the cluster.<\/jats:p>","DOI":"10.1145\/3226112","type":"journal-article","created":{"date-parts":[[2018,8,30]],"date-time":"2018-08-30T13:45:11Z","timestamp":1535636711000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["Cluster Programming using the OpenMP Accelerator Model"],"prefix":"10.1145","volume":"15","author":[{"given":"Herv\u00e9","family":"Yviquel","sequence":"first","affiliation":[{"name":"Institute of Computing, University of Campinas (UNICAMP), S\u00e3o Paulo, Brazil"}]},{"given":"Lauro","family":"Cruz","sequence":"additional","affiliation":[{"name":"Institute of Computing, University of Campinas (UNICAMP), S\u00e3o Paulo, Brazil"}]},{"given":"Guido","family":"Araujo","sequence":"additional","affiliation":[{"name":"Institute of Computing, University of Campinas (UNICAMP), S\u00e3o Paulo, 
Brazil"}]}],"member":"320","published-online":{"date-parts":[[2018,8,28]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPP.2017.44"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/360827.360844"},{"volume-title":"Proceedings of the International Conference on Parallel Processing (ICPP\u201986)","year":"1986","author":"Cytron Ron","key":"e_1_2_1_3_1"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2005.13"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/989393.989437"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/71.97902"},{"volume-title":"Proceedings of the 6th Symposium on Operating Systems Design and Implementation. 137--149","year":"2004","author":"Dean Jeffrey","key":"e_1_2_1_8_1"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1016\/0167-8191(96)00024-5"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/99.660313"},{"key":"e_1_2_1_11_1","unstructured":"OpenMP. 2013. OpenMP Application Program Interface. Technical report."},{"volume-title":"Above the clouds: A Berkeley view of cloud computing. Technical report","author":"Armbrust Michael","key":"e_1_2_1_12_1"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.is.2014.07.006"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/MSST.2010.5496972"},{"key":"e_1_2_1_15_1","first-page":"281","article-title":"Map-reduce for machine learning on multicore","volume":"19","author":"Chu Cheng","year":"2007","journal-title":"Adv. Neural Info. Process. 
Syst."},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1101\/gr.107524.110"},{"volume-title":"Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud\u201910)","year":"2010","author":"Zaharia Matei","key":"e_1_2_1_17_1"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2038916.2038944"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-42553-5_26"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-31153-1_6"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btu343"},{"key":"e_1_2_1_22_1","unstructured":"OpenACC. 2013. The OpenACC Application Programming Interface. Technical report."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-65578-9_4"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.5555\/3019057.3019063"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-24595-9_3"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/3018869.3018870"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.5555\/977395.977673"},
{"key":"e_1_2_1_28_1","doi-asserted-by":"crossref","unstructured":"L. Peter Deutsch. 1996. GZIP file format specification version 4.3.","DOI":"10.17487\/rfc1952"},{"volume-title":"Zlib: A massively spiffy yet delicately unobtrusive compression library","year":"2017","author":"Mark Adler Gailly","key":"e_1_2_1_29_1"},{"volume-title":"Proceedings of the IEEE International Conference on Consumer Electronics (ICCE\u201917)","author":"Pescador F.","key":"e_1_2_1_30_1"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11227-016-1865-x"},{"key":"e_1_2_1_32_1","volume-title":"Mobile Information Systems","volume":"2016","author":"Liang Tyng Yeu","year":"2016"},{"key":"e_1_2_1_33_1","unstructured":"Blender: free and open source 3D creation suite. Retrieved 2018 from https:\/\/www.blender.org\/."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-24369-6_7"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1016\/0167-8191(96)00024-5"},{"volume-title":"Extending OpenMP to clusters","author":"Hoeflinger Jay P.","key":"e_1_2_1_36_1"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.5555\/364048.364051"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.5555\/1788374.1788386"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626412500107"},{"key":"e_1_2_1_40_1","unstructured":"Louis-Noel Pouchet. 2015. PolyBench: The polyhedral benchmark suite. http:\/\/web.cse.ohio-state.edu\/~pouchet.2\/software\/polybench\/."},
{"volume-title":"K\u00e9zia Andrade, and Gleison Souza.","year":"2015","author":"Quintao Pereira Fernando Magno","key":"e_1_2_1_41_1"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1021\/jp961623v"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1016\/0009-2614(96)00941-4"},{"key":"e_1_2_1_44_1","unstructured":"Leandro Zanotto, Gabriel Heerdt, Paulo C. T. Souza, Guido Araujo, and Munir Skaf. High performance collision cross section calculation\u2014HPCCS. J. Comput. Chem."},{"key":"e_1_2_1_45_1","unstructured":"Apache Livy: A REST Service for Apache Spark. Retrieved 2017 from https:\/\/livy.incubator.apache.org\/."},{"key":"e_1_2_1_46_1","unstructured":"spark-jobserver: REST job server for Apache Spark. Retrieved 2018 from https:\/\/github.com\/spark-jobserver\/spark-jobserver."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11036-012-0368-0"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/1814433.1814441"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFCOM.2012.6195845"},{"volume-title":"Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation. 93--106","year":"2012","author":"Gordon M. 
S.","key":"e_1_2_1_50_1"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/INFCOM.2013.6566921"},{"volume-title":"Proceedings of the 12th International Conference on Advanced Computing and Communication.","year":"2004","author":"Nadiminti Krishna","key":"e_1_2_1_52_1"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.5555\/1018421.1019750"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.advengsoft.2014.12.006"},{"volume-title":"Proceedings of the 3rd International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering. 1--21","author":"Hajdukovi\u0107 Miroslav","key":"e_1_2_1_55_1"},{"key":"e_1_2_1_56_1","doi-asserted-by":"crossref","first-page":"100","DOI":"10.1007\/978-3-319-14896-0_9","article-title":"Parallelisation of the 3D fast Fourier transform using the hybrid OpenMP\/MPI decomposition","volume":"8934","author":"Nikl Vojtech","year":"2014","journal-title":"Mathematical and Engineering Methods in Computer Science"},{"volume-title":"Ong","year":"2014","author":"Haynes Ronald D.","key":"e_1_2_1_57_1"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACCPD.2014.6"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/SBAC-PAD.2014.46"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1142\/S0129626411000151"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-45550-1_19"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.5555\/2033345.2033405"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/2464996.2465017"},{"volume-title":"GASNet: A portable high-performance communication layer for global address-space languages. 
Technical report.","year":"2002","author":"Bonachea Dan","key":"e_1_2_1_64_1"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASAP.2017.7995280"},{"key":"e_1_2_1_66_1","volume-title":"Proceedings of the International Workshop on Scaling OpenMP for Exascale Performance and Portability.","volume":"10468","author":"Grinberg Leopold","year":"2017"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-65578-9_2"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-65578-9_13"},{"volume-title":"Proceedings of the International Workshop on Scaling OpenMP for Exascale Performance and Portability. 325--337","author":"Hahnfeld Jonas","key":"e_1_2_1_69_1"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3226112","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3226112","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T21:15:08Z","timestamp":1750281308000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3226112"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,8,28]]},"references-count":68,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2018,9,30]]}},"alternative-id":["10.1145\/3226112"],"URL":"https:\/\/doi.org\/10.1145\/3226112","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2018,8,28]]},"assertion":[{"value":"2018-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2018-05-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2018-08-28","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}