{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,10,24]],"date-time":"2023-10-24T11:11:59Z","timestamp":1698145919961},"reference-count":35,"publisher":"Wiley","issue":"7","license":[{"start":{"date-parts":[[2006,10,25]],"date-time":"2006-10-25T00:00:00Z","timestamp":1161734400000},"content-version":"vor","delay-in-days":4407,"URL":"http:\/\/onlinelibrary.wiley.com\/termsAndConditions#vor"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Concurrency: Pract. Exper."],"published-print":{"date-parts":[[1994,10]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed memory message\u2010passing architecture with a two\u2010dimensional mesh topology. We analyze and compare three algorithms and obtain an implementation, BiMMeR, that uses communication primitives highly suited to the Delta and exploits the single node assembly\u2010coded matrix multiplication. Our algorithm is completely general, i.e. able to deal with various data layouts as well as arbitrary mesh aspect ratios and matrix dimensions, and has achieved parallel efficiency of 86 %, with overall peak performance in excess of 8 Gflops on 256 nodes for an 8800 \u00d7 8800 matrix. We describe BiMMeR's design and implementation and present performance results that demonstrate scalability and robust behavior over varying mesh topologies.<\/jats:p>","DOI":"10.1002\/cpe.4330060703","type":"journal-article","created":{"date-parts":[[2006,11,18]],"date-time":"2006-11-18T06:53:57Z","timestamp":1163832837000},"page":"571-594","source":"Crossref","is-referenced-by-count":21,"title":["Matrix multiplication on the Intel Touchstone Delta"],"prefix":"10.1002","volume":"6","author":[{"given":"Steven","family":"Huss\u2010Lederman","sequence":"first","affiliation":[]},{"given":"Elaine M.","family":"Jacobson","sequence":"additional","affiliation":[]},{"given":"Anna","family":"Tsao","sequence":"additional","affiliation":[]},{"given":"Guodong","family":"Zhang","sequence":"additional","affiliation":[]}],"member":"311","published-online":{"date-parts":[[2006,10,25]]},"reference":[{"key":"e_1_2_1_2_2","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611971811"},{"key":"e_1_2_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/SUPERC.1990.129995"},{"key":"e_1_2_1_4_2","volume-title":"LAPACK Users' Guide","author":"Anderson E.","year":"1992"},{"key":"e_1_2_1_5_2","unstructured":"C. H.Bischof LAPACK: Linear Algebra Software for Supercomputers Preprint MCS\u2010P236\u20100491 Argonne National Laboratory July 1991."},{"key":"e_1_2_1_6_2","first-page":"120","volume-title":"ScaLAPACK: A scalable linear algebra library for distributed memory concurrent computers","author":"Choi J.","year":"1992"},{"key":"e_1_2_1_7_2","volume-title":"Level 3 BLAS for distributed memory concurrent computers","author":"Choi J.","year":"1992"},{"key":"e_1_2_1_8_2","volume-title":"Solving Problems on Concurrent Processors","author":"Fox G.","year":"1988"},{"key":"e_1_2_1_9_2","unstructured":"K. K.MathurandS. L.Johnsson \u2018Multiplication of matrices of arbitrary shape on a data parallel computer\u2019 Thinking Machines Corporation 1992 preprint 1992; also released as Technical Report TR\u2010216."},{"key":"e_1_2_1_10_2","doi-asserted-by":"publisher","DOI":"10.1016\/0196-8858(92)90011-K"},{"key":"e_1_2_1_11_2","first-page":"367","volume-title":"A parallel implementation of the invariant subspace decomposition algorithm for dense symmetric matrices","author":"Huss\u2010Lederman S.","year":"1993"},{"key":"e_1_2_1_12_2","volume-title":"Solving Linear Systems on Vector and Shared Memory Computers","author":"Dongarra J. J.","year":"1991"},{"key":"e_1_2_1_13_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/77626.79170","article-title":"A set of level 3 basic linear algebra subprograms","volume":"16","author":"Dongarra J. J.","year":"1990","journal-title":"ACM Trans. Math. Software"},{"key":"e_1_2_1_14_2","first-page":"142","volume-title":"Proceedings, Scalable Parallel Libraries Conference","author":"Huss\u2010Lederman S.","year":"1993"},{"key":"e_1_2_1_15_2","unstructured":"Kuck & Associates Inc.CLASSPACK Basic Math Library User's Guide Release 1.2 Document #9202003 1992."},{"key":"e_1_2_1_16_2","volume-title":"Matrix Computations","author":"Golub G.","year":"1989"},{"key":"e_1_2_1_17_2","unstructured":"L. E.Cannon A Cellular Computer to Implement the Kalman Filter Algorithm Ph.D. Thesis Montana State University 1969."},{"key":"e_1_2_1_18_2","first-page":"1042","volume-title":"Domain decomposition in distributed and shared memory environments","author":"Fox G.","year":"1987"},{"key":"e_1_2_1_19_2","doi-asserted-by":"publisher","DOI":"10.1016\/0167-8191(87)90060-3"},{"key":"e_1_2_1_20_2","volume-title":"The distributed solution of linear systems using the torus wrap data mapping","author":"Ashcraft C. C.","year":"1990"},{"key":"e_1_2_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/SHPCC.1992.232679"},{"key":"e_1_2_1_22_2","unstructured":"C. H.Bischof S.Huss\u2010Lederman E. M.Jacobson X.Sun andA.Tsao \u2018On the impact of HPF data layout on the design of efficient and maintainable parallel linear algebra libraries\u2019 Technical Report ANL\/MCS\u2010TM\u2010184 Argonne National Lab (also available from the archives of the HPF Forum)."},{"key":"e_1_2_1_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/DMCC.1991.633353"},{"issue":"6","key":"e_1_2_1_24_2","first-page":"775","article-title":"Performance analysis of k\u2010ary n\u2010cube interconnection networks","volume":"39","author":"Dally W. J.","year":"1990","journal-title":"IEEE Trans."},{"key":"e_1_2_1_25_2","volume-title":"Intel Touchstone Delta System Description","author":"Intel Supercomputing Systems Division","year":"1991"},{"key":"e_1_2_1_26_2","volume-title":"Paragon Supercomputers","author":"Intel Supercomputing Systems Division","year":"1992"},{"key":"e_1_2_1_27_2","unstructured":"Intel Corporation i860 64\u2010Bit Microprocessor Programmer's Reference Manual 1990."},{"key":"e_1_2_1_28_2","volume-title":"i860 Microprocessor Architecture","author":"Margulis Neal","year":"1990"},{"key":"e_1_2_1_29_2","first-page":"534","volume-title":"Performance and Assembly language programming of the iPSC\/860 system","author":"Scott D. S.","year":"1991"},{"key":"e_1_2_1_30_2","unstructured":"R.Littlefield \u2018Characterizing and tuning communications performance for real applications' presentation overheads \u2019 Proceedings First Intel Delta Applications Workshop CCSF\u201014\u201092 February 1992 Caltech Concurrent Supercomputing Facilities Pasadena California 1992 pp.179\u2013190."},{"key":"e_1_2_1_31_2","unstructured":"G.Regnier \u2018Delta message passing protocol\u2019 presentation overheads Proceedings First Intel Delta Applications Workshop CCSF\u201014\u201092 February 1992 Caltech Concurrent Supercomputing Facilities Pasadena California 1992 pp.173\u2013178."},{"key":"e_1_2_1_32_2","doi-asserted-by":"crossref","unstructured":"T. H.Dunigan \u2018Communication performance of the Intel Touchstone Delta mesh\u2019 Technical Report ORNL\/TM\u201011983 Oak Ridge National Laboratory January 1992.","DOI":"10.2172\/5955605"},{"key":"e_1_2_1_33_2","doi-asserted-by":"crossref","unstructured":"S.Huss\u2010Lederman E. M.Jacobson A.Tsao andG.Zhang \u2018Optimizing communication primitives on the Intel Touchstone Delta\u2019 Technical Report Supercomputing Research Center 1994 to be published.","DOI":"10.1002\/cpe.4330060703"},{"key":"e_1_2_1_34_2","unstructured":"M.Barnett D. G.Payne andR.van de GeijnOptimal Broadcasting in Mesh\u2010Connected Architectures preprint; also appears as University of Texas Computer Science Technical Report TR\u201091\u201038 December 1991."},{"key":"e_1_2_1_35_2","volume-title":"Intel Touchstone Delta Message Passing Performance","author":"Regnier G.","year":"1991"},{"key":"e_1_2_1_36_2","unstructured":"R. A.Van de Geijn \u2018Massively parallel LINPACK benchmark on the Intel Touchstone Delta and iPSC\/860 systems: progress report\u2019 Computer Science Technical Report TR\u201091\u201028 University of Texas 1991."}],"container-title":["Concurrency: Practice and Experience"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fcpe.4330060703","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/cpe.4330060703","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,10,24]],"date-time":"2023-10-24T00:13:29Z","timestamp":1698106409000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/cpe.4330060703"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[1994,10]]},"references-count":35,"journal-issue":{"issue":"7","published-print":{"date-parts":[[1994,10]]}},"alternative-id":["10.1002\/cpe.4330060703"],"URL":"https:\/\/doi.org\/10.1002\/cpe.4330060703","archive":["Portico"],"relation":{},"ISSN":["1040-3108","1096-9128"],"issn-type":[{"value":"1040-3108","type":"print"},{"value":"1096-9128","type":"electronic"}],"subject":[],"published":{"date-parts":[[1994,10]]}}}