{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,19]],"date-time":"2026-02-19T02:35:01Z","timestamp":1771468501493,"version":"3.50.1"},"reference-count":30,"publisher":"Springer Science and Business Media LLC","issue":"8","license":[{"start":{"date-parts":[[2017,1,2]],"date-time":"2017-01-02T00:00:00Z","timestamp":1483315200000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/www.springer.com\/tdm"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Computing"],"published-print":{"date-parts":[[2017,8]]},"DOI":"10.1007\/s00607-016-0537-2","type":"journal-article","created":{"date-parts":[[2017,1,2]],"date-time":"2017-01-02T13:50:11Z","timestamp":1483365011000},"page":"791-811","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":15,"title":["LU factorization on heterogeneous systems: an energy-efficient approach towards high performance"],"prefix":"10.1007","volume":"99","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0255-8182","authenticated-orcid":false,"given":"Cheng","family":"Chen","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jianbin","family":"Fang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tao","family":"Tang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Canqun","family":"Yang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2017,1,2]]},"reference":[{"issue":"17","key":"537_CR1","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TSP.2015.2440219","volume":"63","author":"X Luciani","year":"2015","unstructured":"Luciani X, Albera L (2015) Joint eigenvalue decomposition of non-defective 
matrices based on the LU factorization with application to ICA. IEEE Trans Signal Process 63(17):1","journal-title":"IEEE Trans Signal Process"},{"key":"537_CR2","unstructured":"Petitet A, Whaley RC, Dongarra J, Cleary A (2004) HPL-a portable implementation of the high-performance linpack benchmark for distributed-memory computers. http:\/\/www.netlib.org\/benchmark\/hpl\/"},{"key":"537_CR3","unstructured":"http:\/\/www.top500.org"},{"issue":"5","key":"537_CR4","first-page":"223","volume":"45","author":"AM Castaldo","year":"2010","unstructured":"Castaldo AM, Clint Whaley R, Samuel S (2010) Scaling LAPACK panel operations using parallel cache assignment. ACM Trans Math Softw 45(5):223\u2013232","journal-title":"ACM Trans Math Softw"},{"key":"537_CR5","doi-asserted-by":"crossref","unstructured":"Xu W, Lu Y, Li Q, Zhou E, Song Z, Dong Y, Zhang W (2014) Hybrid hierarchy storage system in MilkyWay-2 supercomputer. Front Comput Sci 8(3):367\u2013377","DOI":"10.1007\/s11704-014-3499-6"},{"key":"537_CR6","unstructured":"Kogge P, Borkar S, Dan C, Carlson W, Dally W, Denneau M, Franzon P, Harrod W, Hiller J, Stephen K (2008) Exascale computing study: technology challenges in achieving exascale systems. DARPA Information Processing Techniques Office"},{"key":"537_CR7","doi-asserted-by":"crossref","unstructured":"Heinecke A, Vaidyanathan K, Smelyanskiy M, Kobotov A, Dubtsov R, Henry G, Shet AG, Chrysos G, Dubey P (2013) Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel Xeon Phi coprocessor. In: 2013 IEEE 27th international symposium on parallel and distributed processing (IPDPS), pp 126\u2013137","DOI":"10.1109\/IPDPS.2013.113"},{"key":"537_CR8","doi-asserted-by":"crossref","unstructured":"Fatica M (2009) Accelerating linpack with CUDA on heterogenous clusters. 
In: Proceedings of 2nd workshop on general purpose processing on graphics processing units, GPGPU-2, pp 46\u201351","DOI":"10.1145\/1513895.1513901"},{"key":"537_CR9","doi-asserted-by":"crossref","unstructured":"Endo T, Matsuoka S, Nukada A, Maruyama N (2010) Linpack evaluation on a supercomputer with heterogeneous accelerators. In: 2010 IEEE international symposium on parallel and distributed processing (IPDPS), pp 1\u20138","DOI":"10.1109\/IPDPS.2010.5470353"},{"key":"537_CR10","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1109\/TPDS.2014.2367831","volume":"26","author":"Gangwon Jo","year":"2015","unstructured":"Jo Gangwon, Nah Jeongho, Lee Jun, Kim Jungwon, Lee Jaejin (2015) Accelerating LINPACK with MPI-OpenCL on clusters of multi-GPU nodes. IEEE Trans Parallel Distrib Syst 26:1","journal-title":"IEEE Trans Parallel Distrib Syst"},{"issue":"5","key":"537_CR11","doi-asserted-by":"crossref","first-page":"854","DOI":"10.1007\/s11390-011-0184-1","volume":"26","author":"F Wang","year":"2011","unstructured":"Wang F, Yang CQ, Du YF, Chen J, Yi HZ, Xu WX (2011) Optimizing linpack benchmark on GPU-accelerated petascale supercomputer. J Comput Sci Technol 26(5):854\u2013865","journal-title":"J Comput Sci Technol"},{"issue":"24","key":"537_CR12","doi-asserted-by":"crossref","first-page":"1613","DOI":"10.1109\/TPDS.2012.242","volume":"24","author":"J Kurzak","year":"2013","unstructured":"Kurzak J, Luszczek P, Faverge M, Dongarra J (2013) LU factorization with partial pivoting for a multicore system with accelerators. IEEE Trans Parallel Distrib Syst 24(24):1613\u20131621","journal-title":"IEEE Trans Parallel Distrib Syst"},{"issue":"3\u20134","key":"537_CR13","doi-asserted-by":"crossref","first-page":"211","DOI":"10.1007\/s00450-011-0169-x","volume":"26","author":"M Deisher","year":"2011","unstructured":"Deisher M, Smelyanskiy M, Nickerson B, Lee VW, Chuvelev M, Dubey P (2011) Designing and dynamically load balancing hybrid LU for multi\/many-core. 
Comput Sci Res Dev 26(3\u20134):211\u2013220","journal-title":"Comput Sci Res Dev"},{"key":"537_CR14","unstructured":"Chen X, Chang LW, Rodrigues CI, Lv J, Wang Z, Hwu WM (2015) Adaptive cache management for energy-efficient GPU computing. In: Proceedings of the 47th annual IEEE\/ACM international symposium on microarchitecture, pp 343\u2013355"},{"key":"537_CR15","doi-asserted-by":"crossref","unstructured":"Dongarra JJ, Duff LS, Sorensen DC, Vander Vorst HA (1998) Numerical linear algebra for high-performance computers. Society for Industrial and Applied Mathematics, Siam","DOI":"10.1137\/1.9780898719611"},{"issue":"6","key":"537_CR16","doi-asserted-by":"crossref","first-page":"737","DOI":"10.1147\/rd.416.0737","volume":"41","author":"FG Gustavson","year":"1997","unstructured":"Gustavson FG (1997) Recursion leads to automatic variable blocking for dense linear algebra algorithms. IBM J Res Dev 41(6):737\u2013755","journal-title":"IBM J Res Dev"},{"issue":"1","key":"537_CR17","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1002\/cpe.4330020102","volume":"2","author":"EF Velde Van De","year":"1990","unstructured":"Van De Velde EF (1990) Experiments with multicomputer LU-decomposition. Concurr Pract Exper 2(1):1\u20136","journal-title":"Concurr Pract Exper"},{"key":"537_CR18","doi-asserted-by":"crossref","unstructured":"Fox GC, Johnson MA, Lyzenga GA, Otto SW, Salmon JK, Walker DW (1988) Solving problems on concurrent processors. Vol. 1: general techniques and regular problems, Prentice Hall, Old Tappan","DOI":"10.1063\/1.4822815"},{"key":"537_CR19","unstructured":"Hipes PG, Kuppermann A (1989) Gauss\u2013Jordan inversion with pivoting on the caltech mark ii hypercube. 
In: Hypercube concurrent computers and applications, pp 1621\u20131634"},{"issue":"3","key":"537_CR20","doi-asserted-by":"crossref","first-page":"153","DOI":"10.1007\/s00450-011-0161-5","volume":"26","author":"M Bach","year":"2011","unstructured":"Bach M, Kretz M, Lindenstruth V, Rohr D (2011) Optimized HPL for AMD GPU and multi-core CPU usage. Comput Sci Res Dev 26(3):153\u2013164","journal-title":"Comput Sci Res Dev"},{"key":"537_CR21","unstructured":"Michael K, Gunnels J, Brokenshire D, Benton B (2009) Petascale computing with accelerators. In: Proceedings of the 14th ACM SIGPLAN symposium on principles and practice of parallel programming, PPoPP\u201909, pp 241\u2013250"},{"key":"537_CR22","unstructured":"Dongarra J, Gates M, Haidar A, Jia Y, Kabir K, Luszczek P, Tomov S (2013) Portable HPC programming on intel many-integrated-core hardware with MAGMA Port to Xeon Phi. In: International conference on parallel processing and applied mathematics, Springer, pp 571\u2013581"},{"key":"537_CR23","doi-asserted-by":"crossref","unstructured":"Beckingsale D, Gaudin W, Herdman A, Jarvis S (2015) Resident block-structured adaptive mesh refinement on thousands of graphics processing units. In: 2015 44th international conference on parallel processing (ICPP), pp 61\u201370","DOI":"10.1109\/ICPP.2015.15"},{"key":"537_CR24","doi-asserted-by":"crossref","unstructured":"Tan L, Kothapalli S, Chen L, Hussaini O, Bissiri R, Chen Z (2014) A survey of power and energy efficient techniques for high performance numerical linear algebra operations. In: Parallel Comput, December 2014","DOI":"10.1016\/j.parco.2014.09.001"},{"key":"537_CR25","doi-asserted-by":"crossref","unstructured":"Haidar A, Dong T, Luszczek P, Tomov S, Dongarra J (2015) Optimization for performance and energy for batched matrix computations on GPUs. 
In: Proceedings of the 8th workshop on general purpose processing using GPUs, GPGPU-8, pp 59\u201369","DOI":"10.1145\/2716282.2716288"},{"key":"537_CR26","doi-asserted-by":"crossref","unstructured":"Haidar A, Dong T, Tomov S, Luszczek P, Dongarra J (2015) Framework for batched and gpu-resident factorization algorithms to block householder transformations. In: ISC high performance, pp 07\u201325","DOI":"10.1007\/978-3-319-20119-1_3"},{"key":"537_CR27","doi-asserted-by":"crossref","unstructured":"Liu C, Li J, Huang W, Rubio J, Speight E, Lin X (2012) Power-efficient time-sensitive mapping in heterogeneous systems. In: Proceedings of the 21st international conference on parallel architectures and compilation techniques, PACT \u201912, pp 23\u201332","DOI":"10.1145\/2370816.2370822"},{"key":"537_CR28","doi-asserted-by":"crossref","unstructured":"Hong S, Kim H (2010) An integrated gpu power and performance model. In: Proceedings of the 37th annual international symposium on computer architecture, ISCA \u201910, pp 280\u2013289","DOI":"10.1145\/1815961.1815998"},{"key":"537_CR29","doi-asserted-by":"crossref","unstructured":"Alonso P, Dolz MF, Igual FD, Mayo R, Quintana-Ort\u00ed ES (2012) Reducing energy consumption of dense linear algebra operations on hybrid CPU\u2013GPU platforms. 
In: 2012 IEEE 10th international symposium on parallel and distributed processing with applications, pp 56\u201362","DOI":"10.1109\/ISPA.2012.16"},{"key":"537_CR30","unstructured":"Intel Math Kernel Library (Intel MKL)"}],"container-title":["Computing"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/article\/10.1007\/s00607-016-0537-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s00607-016-0537-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1007\/s00607-016-0537-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,9,16]],"date-time":"2019-09-16T23:26:42Z","timestamp":1568676402000},"score":1,"resource":{"primary":{"URL":"http:\/\/link.springer.com\/10.1007\/s00607-016-0537-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,1,2]]},"references-count":30,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2017,8]]}},"alternative-id":["537"],"URL":"https:\/\/doi.org\/10.1007\/s00607-016-0537-2","relation":{},"ISSN":["0010-485X","1436-5057"],"issn-type":[{"value":"0010-485X","type":"print"},{"value":"1436-5057","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,1,2]]}}}