{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,3,1]],"date-time":"2025-03-01T18:10:21Z","timestamp":1740852621313,"version":"3.38.0"},"reference-count":11,"publisher":"SAGE Publications","issue":"4","license":[{"start":{"date-parts":[[1992,12,1]],"date-time":"1992-12-01T00:00:00Z","timestamp":723168000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["The International Journal of Supercomputing Applications"],"published-print":{"date-parts":[[1992,12]]},"abstract":"<jats:p> We describe a subset of the level-1, level-2, and level-3 BLAS implemented for each node of the Connection Machine system CM-200. The routines, collectively called LBLAS, have interfaces consistent with languages with an array syntax such as Fortran 90. One novel feature, important for distributed memory architectures, is the capability of performing computations on multiple instances of objects in a single call. The number of instances and their allocation across memory units, and the strides for the different axes within the local memories, are derived from an array descriptor that contains type, shape, and data distribution information. Another novel feature of the LBLAS is a selection of loop order for rank-1 updates and matrix-matrix multiplication based on array shapes, strides, and DRAM page faults. The peak efficiencies for the routines are in excess of 75%. Matrix-vector multiplication achieves a peak efficiency of 92%. The optimization of loop ordering has a success rate exceeding 99.8% for matrices for which the sum of the lengths of the axes is at most 60. The success rate is even higher for all possible matrix shapes. The performance loss when a nonoptimal choice is made is less than \u223c15% of peak and typically less than 1% of peak. 
We also show that the performance gain for high-rank updates may be as much as a factor of 6 over rank-1 updates. <\/jats:p>","DOI":"10.1177\/109434209200600403","type":"journal-article","created":{"date-parts":[[2007,3,5]],"date-time":"2007-03-05T01:17:47Z","timestamp":1173057467000},"page":"322-350","source":"Crossref","is-referenced-by-count":6,"title":["Local Basic Linear Algebra Subroutines (Lblas) for Distributed Memory Architectures and Languages With Array Syntax"],"prefix":"10.1177","volume":"6","author":[{"given":"S. Lennart","family":"Johnsson","sequence":"first","affiliation":[{"name":"THINKING MACHINES CORPORATION CAMBRIDGE, MASSACHUSETTS 02142"}]},{"given":"Luis F.","family":"Ortiz","sequence":"additional","affiliation":[{"name":"THINKING MACHINES CORPORATION CAMBRIDGE, MASSACHUSETTS 02142"}]}],"member":"179","published-online":{"date-parts":[[1992,12,1]]},"reference":[{"volume-title":"A set of level 3 basic linear algebra subprograms. Technical Report Reprint No. 1","year":"1988","author":"Dongarra, J.J.","key":"atypb1"},{"key":"atypb2","unstructured":"Dongarra, J.J., Du Croz, J., Hammarling, S., and Hanson, R.J. 1986. An extended set of Fortran basic linear algebra subprograms. Technical Report Technical Memorandum 41. Argonne National Laboratories,"},{"volume-title":"The IBM RISC system\/6000 and linear algebra operations. LAPACK working note 28. Technical Report CS-90-122","year":"1990","author":"Mathematics and Computer Science Division.","key":"atypb3"},{"volume-title":"Language and compiler issues in scalable high performance libraries","year":"1992","author":"Johnsson, S.L.","key":"atypb4"},{"volume-title":"High performance GEMM-based level-3 BLAS: sample routines for double precision real data. In: High performance computing II, edited by M. Durand and F. 
El Dabaghi","year":"1991","author":"K\u00e5gstr\u00f6m, B.","key":"atypb5"},{"key":"atypb6","doi-asserted-by":"publisher","DOI":"10.1145\/355841.355847"},{"issue":"5","key":"atypb7","volume":"14","author":"Lichtenstein, W.","year":"1993","journal-title":"SIAM J. Sci. Comput."},{"key":"atypb8","doi-asserted-by":"crossref","unstructured":"Ling, P. 1992. A set of high performance level-3 BLAS structured and tuned for the IBM 3090 VF and implemented in Fortran 77. Technical Report UMINF-179.90. University of Ume\u00e5, Department of Information Processing.","DOI":"10.1007\/BF01206242"},{"volume-title":"Multiplication of matrices of arbitrary shape on a data parallel computer. Technical Report 216","year":"1991","author":"Mathur, K.K.","key":"atypb9"},{"volume-title":"Fortran 90 explained. Oxford, UK","year":"1991","author":"Metcalf, M.","key":"atypb10"},{"key":"atypb11","doi-asserted-by":"publisher","DOI":"10.1016\/0167-8191(90)90093-O"}],"container-title":["The International Journal of Supercomputing 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/109434209200600403","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/109434209200600403","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,1]],"date-time":"2025-03-01T17:42:31Z","timestamp":1740850951000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/109434209200600403"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[1992,12]]},"references-count":11,"journal-issue":{"issue":"4","published-print":{"date-parts":[[1992,12]]}},"alternative-id":["10.1177\/109434209200600403"],"URL":"https:\/\/doi.org\/10.1177\/109434209200600403","relation":{},"ISSN":["0890-2720"],"issn-type":[{"type":"print","value":"0890-2720"}],"subject":[],"published":{"date-parts":[[1992,12]]}}}