{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,20]],"date-time":"2025-02-20T05:16:48Z","timestamp":1740028608733,"version":"3.37.3"},"reference-count":0,"publisher":"IOS Press","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2014]]},"abstract":"<jats:p>The work presented refers to the calculation of two-dimensional Fourier transforms of distributed data, of which matrix transposition is a major ingredient. The motivation stems from a well-known parallel scalability bottleneck related to Fourier filtering within the global gyrokinetic NEMORB code, whose aim is to simulate plasma turbulence within a Tokamak fusion device. Since this pure MPI code is very HPC-resource demanding, with good scaling up to 65536 tasks, such a bottleneck naturally impairs further parallel scalability. To overcome this limitation the filtering algorithm is modified. Firstly, exploring the Fourier transform Hermitian symmetry of the purely real-valued input data to discard the redundant Fourier modes. This not only reduces the amount of needed computing power but also the amount of communication between MPI tasks during the transpose phase. Secondly, several distributed transpose algorithms are tested and the high count of zeros yielded by the NEMORB's low pass filter is used to further reduce the inter-task data exchange by avoiding communicating zeros. Finally, the behavior of the transpose at the scale of a full petaflop system is explored. Execution time, MPI initialization and memory footprint can become critical on such large runs. Four implementation strategies have been tested and scaling results are shown on four petaflop systems, among which two belong to the PRACE Research Infrastructure.<\/jats:p>","DOI":"10.3233\/978-1-61499-381-0-415","type":"book-chapter","created":{"date-parts":[[2025,2,19]],"date-time":"2025-02-19T15:30:51Z","timestamp":1739979051000},"source":"Crossref","is-referenced-by-count":0,"title":["NEMORB's Fourier Filter and Distributed Matrix Transposition on Petaflop Systems"],"prefix":"10.3233","author":[{"family":"Ribeiro Tiago","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"family":"Haefele Matthieu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"7437","container-title":["Advances in Parallel Computing","Parallel Computing: Accelerating Computational Science and Engineering (CSE)"],"original-title":[],"deposited":{"date-parts":[[2025,2,19]],"date-time":"2025-02-19T15:36:03Z","timestamp":1739979363000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.medra.org\/servlet\/aliasResolver?alias=iospressISSNISBN&issn=0927-5452&volume=25&spage=415"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014]]},"references-count":0,"URL":"https:\/\/doi.org\/10.3233\/978-1-61499-381-0-415","relation":{},"ISSN":["0927-5452"],"issn-type":[{"value":"0927-5452","type":"print"}],"subject":[],"published":{"date-parts":[[2014]]}}}