{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T00:50:30Z","timestamp":1740099030190,"version":"3.37.3"},"publisher-location":"Cham","reference-count":33,"publisher":"Springer International Publishing","isbn-type":[{"type":"print","value":"9783319748955"},{"type":"electronic","value":"9783319748962"}],"license":[{"start":{"date-parts":[[2018,1,1]],"date-time":"2018-01-01T00:00:00Z","timestamp":1514764800000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/www.springer.com\/tdm"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018]]},"DOI":"10.1007\/978-3-319-74896-2_9","type":"book-chapter","created":{"date-parts":[[2018,1,30]],"date-time":"2018-01-30T05:22:32Z","timestamp":1517289752000},"page":"160-182","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices"],"prefix":"10.1007","author":[{"given":"Jonas","family":"Hahnfeld","sequence":"first","affiliation":[]},{"given":"Christian","family":"Terboven","sequence":"additional","affiliation":[]},{"given":"James","family":"Price","sequence":"additional","affiliation":[]},{"given":"Hans Joachim","family":"Pflug","sequence":"additional","affiliation":[]},{"given":"Matthias S.","family":"M\u00fcller","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2018,1,31]]},"reference":[{"unstructured":"Vulkan - Industry Forged. https:\/\/www.khronos.org\/vulkan\/ . 
Accessed 6 July 2017","key":"9_CR1"},{"doi-asserted-by":"crossref","unstructured":"Abraham, M.J., Murtola, T., Schulz, R., P\u00e1ll, S., Smith, J.C., Hess, B., Lindahl, E.: GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 12, 19\u201325 (2015). http:\/\/www.sciencedirect.com\/science\/article\/pii\/S2352711015000059","key":"9_CR2","DOI":"10.1016\/j.softx.2015.06.001"},{"doi-asserted-by":"crossref","unstructured":"Aji, A.M., Dinan, J., Buntinas, D., Balaji, P., Feng, W.-C., Bisset, K.R., Thakur, R.: MPI-ACC: an integrated and extensible approach to data movement in accelerator-based systems. In: 2012 IEEE 14th International Conference on High Performance Computing and Communication 2012 IEEE 9th International Conference on Embedded Software and Systems, pp. 647\u2013654, June 2012","key":"9_CR3","DOI":"10.1109\/HPCC.2012.92"},{"doi-asserted-by":"crossref","unstructured":"Allada, V., Benjegerdes, T., Bode, B.: Performance analysis of memory transfers and GEMM subroutines on NVIDIA Tesla GPU cluster. In: 2009 IEEE International Conference on Cluster Computing and Workshops, pp. 1\u20139, August 2009","key":"9_CR4","DOI":"10.1109\/CLUSTR.2009.5289124"},{"doi-asserted-by":"crossref","unstructured":"Augonnet, C., Clet-Ortega, J., Thibault, S., Namyst, R.: Data-aware task scheduling on multi-accelerator based platforms. In: 2010 IEEE 16th International Conference on Parallel and Distributed Systems, pp. 291\u2013298 (Dec 2010)","key":"9_CR5","DOI":"10.1109\/ICPADS.2010.129"},{"doi-asserted-by":"crossref","unstructured":"Beri, T., Bansal, S., Kumar, S.: A scheduling and runtime framework for a cluster of heterogeneous machines with multiple accelerators. In: 2015 IEEE International Parallel and Distributed Processing Symposium, pp. 146\u2013155, May 2015","key":"9_CR6","DOI":"10.1109\/IPDPS.2015.12"},{"doi-asserted-by":"crossref","unstructured":"Bernaschi, M., Salvadore, F.: Multi-Kepler GPU vs. 
Multi-Intel MIC: a two test case performance study. In: 2014 International Conference on High Performance Computing Simulation (HPCS), pp. 1\u20138, July 2014","key":"9_CR7","DOI":"10.1109\/HPCSim.2014.6903662"},{"doi-asserted-by":"crossref","unstructured":"Boku, T., Ishikawa, K.I., Kuramashi, Y., Meadows, L., D\u2018Mello, M., Troute, M., Vemuri, R.: A performance evaluation of CCS QCD benchmark on the COMA (Intel(R) Xeon Phi, KNC) system (2016)","key":"9_CR8","DOI":"10.22323\/1.256.0261"},{"unstructured":"Davis, T.: The SuiteSparse Matrix Collection (formerly known as the University of Florida Sparse Matrix Collection). https:\/\/www.cise.ufl.edu\/research\/sparse\/matrices\/ . Accessed 30 May 2017","key":"9_CR9"},{"key":"9_CR10","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"489","DOI":"10.1007\/978-3-319-46079-6_34","volume-title":"High Performance Computing","author":"T Deakin","year":"2016","unstructured":"Deakin, T., Price, J., Martineau, M., McIntosh-Smith, S.: GPU-STREAM v2.0: benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. In: Taufer, M., Mohr, B., Kunkel, J.M. (eds.) ISC High Performance 2016. LNCS, vol. 9945, pp. 489\u2013507. Springer, Cham (2016). https:\/\/doi.org\/10.1007\/978-3-319-46079-6_34"},{"unstructured":"Hahnfeld, J.: CGxx - Object-Oriented Implementation of the Conjugate Gradients Method, August 2017. 
https:\/\/github.com\/hahnjo\/CGxx","key":"9_CR11"},{"doi-asserted-by":"crossref","unstructured":"Hahnfeld, J.: Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices, July 2017, Bachelor thesis","key":"9_CR12","DOI":"10.1007\/978-3-319-74896-2_9"},{"doi-asserted-by":"crossref","unstructured":"Hahnfeld, J., Cramer, T., Klemm, M., Terboven, C., M\u00fcller, M.S.: A Pattern for Overlapping Communication and Computation with OpenMP Target Directives (2017)","key":"9_CR13","DOI":"10.1007\/978-3-319-65578-9_22"},{"unstructured":"Hahnfeld, J., Terboven, C., Price, J., Pflug, H.J., M\u00fcller, M.: Measurement data for paper \u201cEvaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices\u201d (2017). http:\/\/dx.doi.org\/10.18154\/RWTH-2017-10493","key":"9_CR14"},{"issue":"6","key":"9_CR15","doi-asserted-by":"crossref","first-page":"409","DOI":"10.6028\/jres.049.044","volume":"49","author":"MR Hestenes","year":"1952","unstructured":"Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stan. 49(6), 409\u2013436 (1952)","journal-title":"J. Res. Natl. Bur. Stan."},{"doi-asserted-by":"crossref","unstructured":"Hoshino, T., Maruyama, N., Matsuoka, S., Takaki, R.: CUDA vs OpenACC: performance case studies with kernel benchmarks and a memory-bound CFD application. In: 2013 13th IEEE\/ACM International Symposium on Cluster, Cloud, and Grid Computing, pp. 136\u2013143, May 2013","key":"9_CR16","DOI":"10.1109\/CCGrid.2013.12"},{"doi-asserted-by":"publisher","unstructured":"J\u00e4\u00e4skel\u00e4inen, P., de La Lama, C.S., Schnetter, E., Raiskila, K., Takala, J., Berg, H.: pocl: A performance-portable OpenCL Implementation. Int. J. Parallel Program. 43(5), 752\u2013785 (2015). 
https:\/\/doi.org\/10.1007\/s10766-014-0320-y","key":"9_CR17","DOI":"10.1007\/s10766-014-0320-y"},{"issue":"7","key":"9_CR18","doi-asserted-by":"crossref","first-page":"1814","DOI":"10.1109\/TPDS.2014.2321742","volume":"26","author":"G Jo","year":"2015","unstructured":"Jo, G., Nah, J., Lee, J., Kim, J., Lee, J.: Accelerating LINPACK with MPI-OpenCL on clusters of Multi-GPU nodes. IEEE Trans. Parallel Distrib. Syst. 26(7), 1814\u20131825 (2015)","journal-title":"IEEE Trans. Parallel Distrib. Syst."},{"doi-asserted-by":"publisher","unstructured":"Krieder, S.J., Wozniak, J.M., Armstrong, T., Wilde, M., Katz, D.S., Grimmer, B., Foster, I.T., Raicu, I.: Design and evaluation of the GeMTC framework for GPU-enabled many-task computing. In: Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC 2014, pp. 153\u2013164. ACM, New York (2014). https:\/\/doi.org\/10.1145\/2600212.2600228","key":"9_CR19","DOI":"10.1145\/2600212.2600228"},{"doi-asserted-by":"crossref","unstructured":"Lawlor, O.S.: Message passing for GPGPU clusters: CudaMPI. In: 2009 IEEE International Conference on Cluster Computing and Workshops, pp. 1\u20138, August 2009","key":"9_CR20","DOI":"10.1109\/CLUSTR.2009.5289129"},{"unstructured":"McCalpin, J.D.: Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter, pp. 19\u201325, December 1995","key":"9_CR21"},{"doi-asserted-by":"publisher","unstructured":"Meng, Q., Humphrey, A., Schmidt, J., Berzins, M.: Preliminary experiences with the Uintah framework on Intel Xeon Phi and Stampede. In: Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery, XSEDE 2013, pp. 48:1\u201348:8. ACM, New York (2013). 
https:\/\/doi.org\/10.1145\/2484762.2484779","key":"9_CR22","DOI":"10.1145\/2484762.2484779"},{"doi-asserted-by":"publisher","unstructured":"Mu, D., Chen, P., Wang, L.: Accelerating the discontinuous Galerkin method for seismic wave propagation simulations using multiple GPUs with CUDA and MPI. Earthquake Sci. 26(6), 377\u2013393 (2013). https:\/\/doi.org\/10.1007\/s11589-013-0047-7","key":"9_CR23","DOI":"10.1007\/s11589-013-0047-7"},{"doi-asserted-by":"publisher","unstructured":"Quintana-Ort\u00ed, G., Igual, F.D., Quintana-Ort\u00ed, E.S., van de Geijn, R.A.: Solving dense linear systems on platforms with multiple hardware accelerators. In: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2009, pp. 121\u2013130. ACM, New York (2009). https:\/\/doi.org\/10.1145\/1504176.1504196","key":"9_CR24","DOI":"10.1145\/1504176.1504196"},{"doi-asserted-by":"crossref","unstructured":"Stuart, J.A., Owens, J.D.: Message passing on data-parallel architectures. In: 2009 IEEE International Symposium on Parallel Distributed Processing, pp. 1\u201312, May 2009","key":"9_CR25","DOI":"10.1109\/IPDPS.2009.5161065"},{"doi-asserted-by":"publisher","unstructured":"Stuart, J.A., Balaji, P., Owens, J.D.: Extending MPI to accelerators. In: Proceedings of the 1st Workshop on Architectures and Systems for Big Data, ASBD 2011, pp. 19\u201323. ACM, New York (2011). https:\/\/doi.org\/10.1145\/2377978.2377981","key":"9_CR26","DOI":"10.1145\/2377978.2377981"},{"unstructured":"V\u00e1zquez, F., Garz\u00f3n, E.M.: The sparse matrix vector product on GPUs (2009)","key":"9_CR27"},{"doi-asserted-by":"crossref","unstructured":"Vinogradov, S., Fedorova, J., Curran, D., Cownie, J.: OpenMP 4.0 vs. OpenCL: performance comparison. 
In: OpenMPCon 2015, October 2015","key":"9_CR28","DOI":"10.1016\/B978-0-12-803819-2.00005-7"},{"key":"9_CR29","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"330","DOI":"10.1007\/978-3-642-38750-0_25","volume-title":"Supercomputing","author":"S Wienke","year":"2013","unstructured":"Wienke, S., an Mey, D., M\u00fcller, M.S.: Accelerators for technical computing: is it worth the pain? A TCO perspective. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds.) ISC 2013. LNCS, vol. 7905, pp. 330\u2013342. Springer, Heidelberg (2013). https:\/\/doi.org\/10.1007\/978-3-642-38750-0_25"},{"key":"9_CR30","series-title":"Lecture Notes in Computer Science","doi-asserted-by":"publisher","first-page":"812","DOI":"10.1007\/978-3-319-09873-9_68","volume-title":"Euro-Par 2014 Parallel Processing","author":"S Wienke","year":"2014","unstructured":"Wienke, S., Terboven, C., Beyer, J.C., M\u00fcller, M.S.: A pattern-based comparison of OpenACC and OpenMP for accelerator computing. In: Silva, F., Dutra, I., Santos Costa, V. (eds.) Euro-Par 2014. LNCS, vol. 8632, pp. 812\u2013823. Springer, Cham (2014). https:\/\/doi.org\/10.1007\/978-3-319-09873-9_68"},{"doi-asserted-by":"publisher","unstructured":"Wozniak, J.M., Armstrong, T.G., Wilde, M., Katz, D.S., Lusk, E., Foster, I.T.: Swift\/T: scalable data flow programming for many-task applications. In: Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2013, pp. 309\u2013310. ACM, New York (2013). https:\/\/doi.org\/10.1145\/2442516.2442559","key":"9_CR31","DOI":"10.1145\/2442516.2442559"},{"unstructured":"Yamazaki, I., Tomov, S., Dongarra, J.: One-sided dense matrix factorizations on a multicore with multiple GPU accelerators. Procedia Comput. Sci. 9, 37\u201346 (2012). http:\/\/www.sciencedirect.com\/science\/article\/pii\/S1877050912001263 . 
Proceedings of the International Conference on Computational Science, ICCS 2012","key":"9_CR32"},{"doi-asserted-by":"publisher","unstructured":"Yan, Y., Lin, P.H., Liao, C., de Supinski, B.R., Quinlan, D.J.: Supporting multiple accelerators in high-level programming models. In: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM 2015, pp. 170\u2013180. ACM, New York (2015). https:\/\/doi.org\/10.1145\/2712386.2712405","key":"9_CR33","DOI":"10.1145\/2712386.2712405"}],"container-title":["Lecture Notes in Computer Science","Accelerator Programming Using Directives"],"original-title":[],"link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/978-3-319-74896-2_9","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,10,9]],"date-time":"2019-10-09T19:37:54Z","timestamp":1570649874000},"score":1,"resource":{"primary":{"URL":"http:\/\/link.springer.com\/10.1007\/978-3-319-74896-2_9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018]]},"ISBN":["9783319748955","9783319748962"],"references-count":33,"URL":"https:\/\/doi.org\/10.1007\/978-3-319-74896-2_9","relation":{},"ISSN":["0302-9743","1611-3349"],"issn-type":[{"type":"print","value":"0302-9743"},{"type":"electronic","value":"1611-3349"}],"subject":[],"published":{"date-parts":[[2018]]}}}