{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,29]],"date-time":"2025-09-29T08:15:30Z","timestamp":1759133730433,"version":"3.41.0"},"reference-count":25,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2009,3,1]],"date-time":"2009-03-01T00:00:00Z","timestamp":1235865600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100004359","name":"Vetenskapsr\u00e4det","doi-asserted-by":"publisher","award":["VR 621-2001-3284"],"award-info":[{"award-number":["VR 621-2001-3284"]}],"id":[{"id":"10.13039\/501100004359","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001729","name":"Swedish Foundation for Strategic Research","doi-asserted-by":"publisher","award":["A3 02:128"],"award-info":[{"award-number":["A3 02:128"]}],"id":[{"id":"10.13039\/501100001729","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Math. Softw."],"published-print":{"date-parts":[[2009,3]]},"abstract":"<jats:p>The minimal block storage Distributed Square Block Packed (DSBP) format for distributed memory computing on symmetric and triangular matrices is presented. Three algorithm variants (Basic, Static, and Dynamic) of the blocked right-looking Cholesky factorization are designed for the DSBP format, implemented, and evaluated. On our target machine, all variants outperform standard full-storage implementations while saving almost half the storage. Communication overhead is shown to be virtually eliminated by the Static and Dynamic variants, both of which take advantage of hardware parallelism to hide communication costs. The Basic variant is shown to yield comparable or slightly better performance than the full-storage ScaLAPACK routine PDPOTRF while clearly outperformed by both Static and Dynamic. Models of execution assuming zero communication costs and overhead are developed. For medium- and larger-sized problems, the Static schedule is near optimal on our target machine based on comparisons with these models and measurements of synchronization overhead.<\/jats:p>","DOI":"10.1145\/1499096.1499100","type":"journal-article","created":{"date-parts":[[2009,4,6]],"date-time":"2009-04-06T16:34:22Z","timestamp":1239035662000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["Distributed SBP Cholesky factorization algorithms with near-optimal scheduling"],"prefix":"10.1145","volume":"36","author":[{"given":"Fred G.","family":"Gustavson","sequence":"first","affiliation":[{"name":"IBM T.J. Watson Research Center, Yorktown Heights, NY and Ume\u00e5 University, Ume\u00e5, Sweden"}]},{"given":"Lars","family":"Karlsson","sequence":"additional","affiliation":[{"name":"Ume\u00e5 University, Ume\u00e5, Sweden"}]},{"given":"Bo","family":"K\u00e5gstr\u00f6m","sequence":"additional","affiliation":[{"name":"Ume\u00e5 University, Ume\u00e5, Sweden"}]}],"member":"320","published-online":{"date-parts":[[2009,4,7]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"Agarwal R. C. and Gustavson F. G. 1988. A parallel implementation of matrix multiplication and LU factorization on the IBM 3090. In Aspects of Computation on Asynchronous and Parallel Processors M. Wright Ed. IFIP North-Holland Amsterdam The Netherlands 217--221.  Agarwal R. C. and Gustavson F. G. 1988. A parallel implementation of matrix multiplication and LU factorization on the IBM 3090. In Aspects of Computation on Asynchronous and Parallel Processors M. Wright Ed. IFIP North-Holland Amsterdam The Netherlands 217--221."},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1145\/76263.76287"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1147\/rd.386.0673"},{"key":"e_1_2_1_4_1","doi-asserted-by":"crossref","unstructured":"Baboulin M. Giraud L. and Gratton S. 2005a. A parallel distributed solver for large dense symmetric systems: Applications to geodesy and electromagnetism problems. Int. J. High Perform. Comput. Appl.\/19 353--363.  Baboulin M. Giraud L. and Gratton S. 2005a. A parallel distributed solver for large dense symmetric systems: Applications to geodesy and electromagnetism problems. Int. J. High Perform. Comput. Appl.\/19 353--363.","DOI":"10.1177\/1094342005056134"},{"key":"e_1_2_1_5_1","unstructured":"Baboulin M. Giraud L. Gratton S. and Langou J. 2005b. A distributed packed storage for large parallel calculations. Tech. rep. TR\/PA\/05\/30. CERFACS Toulouse France.  Baboulin M. Giraud L. Gratton S. and Langou J. 2005b. A distributed packed storage for large parallel calculations. Tech. rep. TR\/PA\/05\/30. CERFACS Toulouse France."},{"key":"e_1_2_1_6_1","unstructured":"Brent R. P. and Luk F. T. 1982. Computing the Cholesky factorization using a systolic architecture. Tech. rep. TR 82--H521. Department of Computer Science Cornell University Cornell University Ithaca NY.   Brent R. P. and Luk F. T. 1982. Computing the Cholesky factorization using a systolic architecture. Tech. rep. TR 82--H521. Department of Computer Science Cornell University Cornell University Ithaca NY."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/1006209.1006251"},{"key":"e_1_2_1_8_1","unstructured":"Buttari A. Langou J. Kurzak J. and Dongarra J. 2007. A class of parallel tiled linear algebra algorithms for multicore architectures. Tech. rep. UT-CS-07-600. University of Tennessee Knoxville TN.  Buttari A. Langou J. Kurzak J. and Dongarra J. 2007. A class of parallel tiled linear algebra algorithms for multicore architectures. Tech. rep. UT-CS-07-600. University of Tennessee Knoxville TN."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1248377.1248397"},{"key":"e_1_2_1_10_1","first-page":"3","article-title":"Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines. Sci","volume":"5","author":"Choi J.","year":"1996","unstructured":"Choi , J. , Dongarra , J. J. , Ostrouchov , S. , Petitet , A. P. , Walker , D. W. , and Whaley , R. C. 1996 . Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines. Sci . Program. 5 , 3 (Fall), 173--184. Choi, J., Dongarra, J. J., Ostrouchov, S., Petitet, A. P., Walker, D. W., and Whaley, R. C. 1996. Design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines. Sci. Program. 5, 3 (Fall), 173--184.","journal-title":"Program."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1177\/109434209200600105"},{"volume-title":"SIAM Conference on Parallel Processing for Scientific Computing, R. S. et al, Ed. SIAM Publications","author":"Dackland K.","key":"e_1_2_1_12_1","unstructured":"Dackland , K. , Elmroth , E. , and K\u00e5gstr\u00f6m , B . 1993. A ring--oriented approach for block matrix factorizations on shared and distributed memory architectures . In SIAM Conference on Parallel Processing for Scientific Computing, R. S. et al, Ed. SIAM Publications , Philadelphia, PA, 330--338. Dackland, K., Elmroth, E., and K\u00e5gstr\u00f6m, B. 1993. A ring--oriented approach for block matrix factorizations on shared and distributed memory architectures. In SIAM Conference on Parallel Processing for Scientific Computing, R. S. et al, Ed. SIAM Publications, Philadelphia, PA, 330--338."},{"key":"e_1_2_1_13_1","doi-asserted-by":"crossref","unstructured":"D'Azevedo E. and Dongarra J. 1998. Packed storage extension for ScaLAPACK. Tech. rep. UT-CS-98-385. University of Tennessee Knoxville TN.   D'Azevedo E. and Dongarra J. 1998. Packed storage extension for ScaLAPACK. Tech. rep. UT-CS-98-385. University of Tennessee Knoxville TN.","DOI":"10.2172\/754353"},{"volume-title":"Numerical Linear Algebra on High-Performance Computers","author":"Dongarra J. J.","key":"e_1_2_1_14_1","unstructured":"Dongarra , J. J. , Duff , I. S. , Sorenson , D. C. , and van der Vorst , H. A. 1998. Numerical Linear Algebra on High-Performance Computers . SIAM , Philadelphia, PA . Dongarra, J. J., Duff, I. S., Sorenson, D. C., and van der Vorst, H. A. 1998. Numerical Linear Algebra on High-Performance Computers. SIAM, Philadelphia, PA."},{"key":"e_1_2_1_15_1","unstructured":"Geist G. A. and Heath M. T. 1985. Parallel Cholesky factorization on a hypercube multiprocessor. Tech. rep. ORNL--6190. Oak Ridge National Lab. Oak Ridge TN.  Geist G. A. and Heath M. T. 1985. Parallel Cholesky factorization on a hypercube multiprocessor. Tech. rep. ORNL--6190. Oak Ridge National Lab. Oak Ridge TN."},{"volume-title":"Proceedings of the fourth SIAM Conference on Parallel Processing for Scientific Computing. 68--95","author":"Gerasoulis A.","key":"e_1_2_1_16_1","unstructured":"Gerasoulis , A. and Nelken , I . 1989. Scheduling linear algebra parallel algorithms on MIMD architectures . In Proceedings of the fourth SIAM Conference on Parallel Processing for Scientific Computing. 68--95 . Gerasoulis, A. and Nelken, I. 1989. Scheduling linear algebra parallel algorithms on MIMD architectures. In Proceedings of the fourth SIAM Conference on Parallel Processing for Scientific Computing. 68--95."},{"key":"e_1_2_1_17_1","volume-title":"Matrix Computations","author":"Golub G. H.","unstructured":"Golub , G. H. and van Loan , C. F. 1996. Matrix Computations , 3 rd ed. Johns Hopkins University Press , Baltimore, MD . Golub, G. H. and van Loan, C. F. 1996. Matrix Computations, 3rd ed. Johns Hopkins University Press, Baltimore, MD.","edition":"3"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/1356052.1356053"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/88.242438"},{"key":"e_1_2_1_20_1","volume-title":"PARA 2006: State of the Art in Scientific and Parallel Computing, B. K\u00e5gstr\u00f6m et al., Eds. Lecture Notes in Computer Science","volume":"4699","author":"Gustavson F. G.","unstructured":"Gustavson , F. G. , Gunnels , J. A. , and Sexton , J. C . 2007a. Minimal data copy for dense linear algebra factorization . In PARA 2006: State of the Art in Scientific and Parallel Computing, B. K\u00e5gstr\u00f6m et al., Eds. Lecture Notes in Computer Science , vol. 4699 . Springer, Berlin, Germany, 540--549. Gustavson, F. G., Gunnels, J. A., and Sexton, J. C. 2007a. Minimal data copy for dense linear algebra factorization. In PARA 2006: State of the Art in Scientific and Parallel Computing, B. K\u00e5gstr\u00f6m et al., Eds. Lecture Notes in Computer Science, vol. 4699. Springer, Berlin, Germany, 540--549."},{"volume-title":"PARA 2006: State of the Art in Scientific and Parallel Computing, B. K\u00e5gstr\u00f6m et al., Eds. Lecture Notes in Computer Science, vol 4699","author":"Gustavson F. G.","key":"e_1_2_1_21_1","unstructured":"Gustavson , F. G. , Karlsson , L. , and K\u00e5gstr\u00f6m , B . 2007b. Three algorithms for Cholesky factorization on distributed memory using packed storage . In PARA 2006: State of the Art in Scientific and Parallel Computing, B. K\u00e5gstr\u00f6m et al., Eds. Lecture Notes in Computer Science, vol 4699 . Springer. Berlin, Germany, 550--559. Also IBM Tech. rep. RC24137. Gustavson, F. G., Karlsson, L., and K\u00e5gstr\u00f6m, B. 2007b. Three algorithms for Cholesky factorization on distributed memory using packed storage. In PARA 2006: State of the Art in Scientific and Parallel Computing, B. K\u00e5gstr\u00f6m et al., Eds. Lecture Notes in Computer Science, vol 4699. Springer. Berlin, Germany, 550--559. Also IBM Tech. rep. RC24137."},{"key":"e_1_2_1_22_1","volume-title":"MPI: A message passing interface standard","author":"Forum","year":"1995","unstructured":"MPI Forum . 1995 . MPI: A message passing interface standard . http:\/\/www.mpi-forum.org\/. MPI Forum. 1995. MPI: A message passing interface standard. http:\/\/www.mpi-forum.org\/."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/4021.4025"},{"key":"e_1_2_1_24_1","unstructured":"Strazdins P. 1998. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Tech. rep. TR-CS-98-07. Canberra 0200 ACT Australia.  Strazdins P. 1998. A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Tech. rep. TR-CS-98-07. Canberra 0200 ACT Australia."},{"volume-title":"Using PLAPACK","author":"van de Geijn R. A.","key":"e_1_2_1_25_1","unstructured":"van de Geijn , R. A. 1997. Using PLAPACK . MIT Press , Cambridge, MA . van de Geijn, R. A. 1997. Using PLAPACK. MIT Press, Cambridge, MA."}],"container-title":["ACM Transactions on Mathematical Software"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1499096.1499100","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/1499096.1499100","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T13:38:39Z","timestamp":1750253919000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/1499096.1499100"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,3]]},"references-count":25,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2009,3]]}},"alternative-id":["10.1145\/1499096.1499100"],"URL":"https:\/\/doi.org\/10.1145\/1499096.1499100","relation":{},"ISSN":["0098-3500","1557-7295"],"issn-type":[{"type":"print","value":"0098-3500"},{"type":"electronic","value":"1557-7295"}],"subject":[],"published":{"date-parts":[[2009,3]]},"assertion":[{"value":"2007-07-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2008-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2009-04-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}