{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,2,21]],"date-time":"2025-02-21T06:41:40Z","timestamp":1740120100800,"version":"3.37.3"},"reference-count":6,"publisher":"World Scientific Pub Co Pte Ltd","issue":"03n04","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Parallel Process. Lett."],"published-print":{"date-parts":[[2017,12]]},"abstract":"<jats:p> Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) form basic building blocks for several High Performance Computing (HPC) applications and hence dictate performance of the HPC applications. Performance in such tuned packages is attained through tuning of several algorithmic and architectural parameters such as number of parallel operations in the Directed Acyclic Graph of the BLAS\/LAPACK routines, sizes of the memories in the memory hierarchy of the underlying platform, bandwidth of the memory, and structure of the compute resources in the underlying platform. In this paper, we closely investigate the impact of the Floating Point Unit (FPU) micro-architecture for performance tuning of BLAS and LAPACK. We present theoretical analysis for pipeline depth of different floating point operations like multiplier, adder, square root, and divider followed by characterization of BLAS and LAPACK to determine several parameters required in the theoretical framework for deciding optimum pipeline depth of the floating operations. A simple design of a Processing Element (PE) is presented and shown that the PE outperforms the most recent custom realizations of BLAS and LAPACK by 1.1X to 1.5X in GFlops\/W, and 1.9X to 2.1X in Gflops\/mm<jats:sup>2<\/jats:sup>. Compared to multicore, General Purpose Graphics Processing Unit (GPGPU), Field Programmable Gate Array (FPGA), and ClearSpeed CSX700, performance improvement of 1.8-80x is reported in\u00a0PE. <\/jats:p>","DOI":"10.1142\/s0129626417500062","type":"journal-article","created":{"date-parts":[[2017,12,5]],"date-time":"2017-12-05T22:27:31Z","timestamp":1512512851000},"page":"1750006","source":"Crossref","is-referenced-by-count":8,"title":["Accelerating BLAS and LAPACK via Efficient Floating Point Architecture Design"],"prefix":"10.1142","volume":"27","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3708-5621","authenticated-orcid":false,"given":"Farhad","family":"Merchant","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, Nanyang Technological University, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Anupam","family":"Chattopadhyay","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Nanyang Technological University, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Soumyendu","family":"Raha","sequence":"additional","affiliation":[{"name":"Department of Computational and Data Science, Indian Institute of Science, Bangalore, India 560012, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"S. K.","family":"Nandy","sequence":"additional","affiliation":[{"name":"Department of Computational and Data Science, Indian Institute of Science, Bangalore, India 560012, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ranjani","family":"Narayan","sequence":"additional","affiliation":[{"name":"Morphing Machines Pvt. Ltd, India"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"219","published-online":{"date-parts":[[2017,12,5]]},"reference":[{"key":"S0129626417500062BIB001","doi-asserted-by":"publisher","DOI":"10.1137\/1.9780898719604"},{"key":"S0129626417500062BIB002","doi-asserted-by":"publisher","DOI":"10.1109\/40.782563"},{"key":"S0129626417500062BIB005","doi-asserted-by":"publisher","DOI":"10.1145\/98267.98290"},{"key":"S0129626417500062BIB006","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2015.4"},{"key":"S0129626417500062BIB012","doi-asserted-by":"publisher","DOI":"10.1145\/2735839"},{"key":"S0129626417500062BIB014","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2014.2315627"}],"container-title":["Parallel Processing Letters"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.worldscientific.com\/doi\/pdf\/10.1142\/S0129626417500062","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,8,6]],"date-time":"2019-08-06T16:51:40Z","timestamp":1565110300000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.worldscientific.com\/doi\/abs\/10.1142\/S0129626417500062"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,12]]},"references-count":6,"journal-issue":{"issue":"03n04","published-online":{"date-parts":[[2017,12,5]]},"published-print":{"date-parts":[[2017,12]]}},"alternative-id":["10.1142\/S0129626417500062"],"URL":"https:\/\/doi.org\/10.1142\/s0129626417500062","relation":{},"ISSN":["0129-6264","1793-642X"],"issn-type":[{"type":"print","value":"0129-6264"},{"type":"electronic","value":"1793-642X"}],"subject":[],"published":{"date-parts":[[2017,12]]}}}