{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,10,25]],"date-time":"2023-10-25T16:10:53Z","timestamp":1698250253771},"reference-count":13,"publisher":"Wiley","issue":"1","license":[{"start":{"date-parts":[[2006,10,24]],"date-time":"2006-10-24T00:00:00Z","timestamp":1161648000000},"content-version":"vor","delay-in-days":5379,"URL":"http:\/\/onlinelibrary.wiley.com\/termsAndConditions#vor"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Concurrency: Pract. Exper."],"published-print":{"date-parts":[[1992,2]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We present in this paper a study of the computation and communication costs on RP3 and on some issues about algorithm designs on a three\u2010level memory hierarchy multi\u2010processor. Using very simple algorithms (vector\u2010add, vector\u2010sum, saxpy, \u2026 ), we compare different implementations which differ on data localization (global or local) and data cacheability (cacheable or non\u2010cacheable). This comparison is done using a performance monitoring system (VPMC) that records instructions, data movement, cache requests and misses. The output of the VPMC was then used as input to an analytical performance model which we used to compute the elemental computation and communication times of every basic algorithm. Regarding cacheability (marking the data cacheable instead of non\u2010cacheable), we found it worthwhile as long as data are blocked adequately. For our simple 1\u2010D data structures, a block size equal to a multiple of the cache line size gives the best results. However, considering possible load imbalance, a block size equal to the cache line seems optimal. 
Regarding localization (copying data from global to local, working on local data instead of global and copying data back), we found it ineffective, at least with the RP3 local and global communication speed ratios (1:10:15).<\/jats:p>","DOI":"10.1002\/cpe.4330040105","type":"journal-article","created":{"date-parts":[[2006,11,17]],"date-time":"2006-11-17T16:22:00Z","timestamp":1163780520000},"page":"57-78","source":"Crossref","is-referenced-by-count":2,"title":["Computation and data movement on RP3"],"prefix":"10.1002","volume":"4","author":[{"given":"Luigi","family":"Brochard","sequence":"first","affiliation":[]},{"given":"Alex","family":"Freau","sequence":"additional","affiliation":[]}],"member":"311","published-online":{"date-parts":[[2006,10,24]]},"reference":[{"key":"e_1_2_1_2_2","doi-asserted-by":"publisher","DOI":"10.1002\/cpe.4330040106"},{"key":"e_1_2_1_3_2","first-page":"35","volume-title":"\u2018RP3 Performance Monitoring Hardware. Instrumentation for Parallel Computer Systems\u2019","author":"Brantley W. C.","year":"1989"},{"key":"e_1_2_1_4_2","volume-title":"Support Environment for RP3 Performance Monitor. Performance Instrumentation and Visualization for Parallel Computer Systems","author":"Brantley W. 
C.","year":"1990"},{"key":"e_1_2_1_5_2","doi-asserted-by":"publisher","DOI":"10.1142\/S0129053389000299"},{"key":"e_1_2_1_6_2","doi-asserted-by":"publisher","DOI":"10.1016\/0167-8191(88)90094-4"},{"key":"e_1_2_1_7_2","doi-asserted-by":"publisher","DOI":"10.1137\/0909041"},{"key":"e_1_2_1_8_2","unstructured":"C.Moler \u2018Matrix computation on distributed memory multiprocessors\u2019 Proceedings of the First Conference on Hypercube Multiprocessors 1985 pp.181\u2013195."},{"key":"e_1_2_1_9_2","unstructured":"L.Brochard \u2018Scalability granularity and parallelism of numerical algorithms\u2019 IBM Research Report Report RC 14786 1989."},{"key":"e_1_2_1_10_2","doi-asserted-by":"crossref","unstructured":"K.Gallivan W.Jalby U.Meier and A.Sameh \u2018The impact of hierarchical memory systems on linear algebra algorithm design\u2019 Int. J. Supercomput. Appl. 12\u201348(1988).","DOI":"10.1177\/109434208800200103"},{"key":"e_1_2_1_11_2","volume-title":"\u2018Solving problems on concurrent processors\u2019","author":"Fox G.","year":"1988"},{"key":"e_1_2_1_12_2","doi-asserted-by":"publisher","DOI":"10.1016\/0167-8191(89)90004-5"},{"key":"e_1_2_1_13_2","unstructured":"G. F.Pfister W. C.Brantley D. A.George S. L.Harvey W. J.Kleinfelder K. P.McAuliffe E. A.Melton V. A.Norton and J.Weiss \u2018The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture\u2019 Proceedings of the 1985 International Conference on Parallel Processing 1985 pp.764\u2013771."},{"key":"e_1_2_1_14_2","unstructured":"W. C.Brantley K. 
P.McAuliffe and J.Weiss \u2018RP3 Processor\u2010Memory Element\u2019 Proceedings of the 1985 International Conference on Parallel Processing 1985 pp.782\u2013789."}],"container-title":["Concurrency: Practice and Experience"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fcpe.4330040105","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/cpe.4330040105","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,10,24]],"date-time":"2023-10-24T13:47:11Z","timestamp":1698155231000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/cpe.4330040105"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[1992,2]]},"references-count":13,"journal-issue":{"issue":"1","published-print":{"date-parts":[[1992,2]]}},"alternative-id":["10.1002\/cpe.4330040105"],"URL":"https:\/\/doi.org\/10.1002\/cpe.4330040105","archive":["Portico"],"relation":{},"ISSN":["1040-3108","1096-9128"],"issn-type":[{"value":"1040-3108","type":"print"},{"value":"1096-9128","type":"electronic"}],"subject":[],"published":{"date-parts":[[1992,2]]}}}