{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,18]],"date-time":"2026-04-18T16:41:19Z","timestamp":1776530479921,"version":"3.51.2"},"reference-count":26,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2020,5,19]],"date-time":"2020-05-19T00:00:00Z","timestamp":1589846400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"DyConPV","award":["0324166B"],"award-info":[{"award-number":["0324166B"]}]},{"name":"German Federal Ministry for Economic Affairs and Energy (BMWi) via eco4wind","award":["0324125B"],"award-info":[{"award-number":["0324125B"]}]},{"name":"DFG via Research Unit FOR 2401"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Math. Softw."],"published-print":{"date-parts":[[2020,6,30]]},"abstract":"<jats:p>Basic Linear Algebra Subroutines For Embedded Optimization (BLASFEO) is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization and other applications targeting relatively small matrices. BLASFEO defines an application programming interface (API) which uses a packed matrix format as its native format. This format is analogous to the internal memory buffers of optimized BLAS, but it is exposed to the user and it removes the packing cost from the routine call. For matrices fitting in cache, BLASFEO outperforms optimized BLAS implementations, both open source and proprietary. This article investigates the addition of a standard BLAS API to the BLASFEO framework, and proposes an implementation switching between two or more algorithms optimized for different matrix sizes. 
Thanks to the modular assembly framework in BLASFEO, tailored linear algebra kernels with mixed column- and panel-major arguments are easily developed. This BLAS API has lower performance than the BLASFEO API, but it nonetheless outperforms optimized BLAS and especially LAPACK libraries for matrices fitting in cache. Therefore, it can boost a wide range of applications, where standard BLAS and LAPACK libraries are employed and the matrix size is moderate. In particular, this article investigates the benefits in scientific programming languages such as Octave, SciPy, and Julia.<\/jats:p>","DOI":"10.1145\/3378671","type":"journal-article","created":{"date-parts":[[2020,5,22]],"date-time":"2020-05-22T23:58:23Z","timestamp":1590191903000},"page":"1-36","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":12,"title":["The BLAS API of BLASFEO"],"prefix":"10.1145","volume":"46","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9440-9699","authenticated-orcid":false,"given":"Gianluca","family":"Frison","sequence":"first","affiliation":[{"name":"University of Freiburg, Freiburg, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tommaso","family":"Sartor","sequence":"additional","affiliation":[{"name":"University of Freiburg, Freiburg, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Andrea","family":"Zanelli","sequence":"additional","affiliation":[{"name":"University of Freiburg, Freiburg, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Moritz","family":"Diehl","sequence":"additional","affiliation":[{"name":"University of Freiburg, Freiburg, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2020,5,19]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"crossref","unstructured":"E. Anderson Z. Bai C. Bischof S. Blackford J. Demmel J. Dongarra J. Du Croz A. Greenbaum S. Hammarling A. 
McKenney and D. Sorensen. 1999. LAPACK Users\u2019 Guide (3rd ed.). SIAM.","DOI":"10.1137\/1.9780898719604"},{"key":"e_1_2_1_2_1","unstructured":"BLASFEO. 2016. Retrieved from https:\/\/github.com\/giaf\/blasfeo."},{"key":"e_1_2_1_3_1","unstructured":"Blaze. 2012. Retrieved from http:\/\/bitbucket.org\/blaze-lib\/blaze."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2008.10.002"},{"key":"e_1_2_1_5_1","unstructured":"Eigen. 2010. Eigen v3. Retrieved from http:\/\/eigen.tuxfamily.org\/."},{"key":"e_1_2_1_6_1","volume-title":"Proceedings of the IFAC World Congress.","author":"Ferreau H. J.","unstructured":"H. J. Ferreau, S. Almer, R. Verschueren, M. Diehl, D. Frick, A. Domahidi, J. L. Jerez, G. Stathopoulos, and C. Jones. 2017. Embedded optimization methods for industrial automatic control. In Proceedings of the IFAC World Congress."},{"key":"e_1_2_1_7_1","volume-title":"Proceedings of the European Control Conference (ECC\u201914)","author":"Frison G.","unstructured":"G. Frison, H. B. Sorensen, B. Dammann, and J. B. J\u00f8rgensen. 2014. 
High-performance small-scale solvers for linear model predictive control. In Proceedings of the European Control Conference (ECC\u201914). 128--133."},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3210754"},{"key":"e_1_2_1_11_1","volume-title":"Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.","author":"Heinecke Alexander","year":"2016","unstructured":"Alexander Heinecke, Greg Henry, Maxwell Hutchinson, and Hans Pabst. 2016. LIBXSMM: Accelerating small matrix multiplications by runtime code generation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.automatica.2011.08.020"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAC.2014.2351991"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1356052.1356053"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1377603.1377607"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/292395.292412"},{"key":"e_1_2_1_17_1","volume-title":"European Conference on Parallel Processing, Springer. 659--671","author":"Masliah I.","unstructured":"I. Masliah, A. Abdelfattah, A. Haidar, S. Tomov, M. Baboulin, J. Falcou, and J. Dongarra. 2016. High-performance matrix-matrix multiplications of very small matrices. 
In European Conference on Parallel Processing, Springer. 659--671."},{"key":"e_1_2_1_18_1","unstructured":"Intel. 2019. Math Kernel Library. Retrieved from https:\/\/software.intel.com\/en-us\/mkl."},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/355841.355847"},{"key":"e_1_2_1_20_1","volume-title":"A Guide to NumPy","author":"Oliphant Travis E.","unstructured":"Travis E. Oliphant. 2006. A Guide to NumPy. Trelgol Publishing USA."},{"key":"e_1_2_1_21_1","unstructured":"OpenBLAS. 2011. OpenBLAS: An optimized BLAS library. Retrieved from http:\/\/www.openblas.net\/."},{"key":"e_1_2_1_22_1","volume-title":"International Symposium on Code Generation and Optimization (CGO). 117--127","author":"Spampinato Daniele G.","unstructured":"Daniele G. Spampinato and Markus P\u00fcschel. 2016. A basic linear algebra compiler for structured matrices. In International Symposium on Code Generation and Optimization (CGO). 117--127."},{"key":"e_1_2_1_23_1","volume-title":"International Symposium on Code Generation and Optimization (CGO). 327--339","author":"Spampinato D. G.","unstructured":"D. G. Spampinato, D. Fabregat-Traver, P. Bientinesi, and M. P\u00fcschel. 2018. Program generation for small-scale linear algebra applications. In International Symposium on Code Generation and Optimization (CGO). 
327--339."},{"key":"e_1_2_1_24_1","unstructured":"Tim Peters. 2004. PEP 20\u2014The Zen of Python. Retrieved from https:\/\/www.python.org\/dev\/peps\/pep-0020\/."},{"key":"e_1_2_1_25_1","volume-title":"Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing.","author":"Whaley R. C.","unstructured":"R. C. Whaley and J. Dongarra. 1999. Automatically tuned linear algebra software. In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing."},{"key":"e_1_2_1_26_1","first-page":"6","article-title":"The libflame library for dense matrix computations","volume":"11","author":"Van Zee F. G.","year":"2009","unstructured":"F. G. Van Zee, E. Chan, R. A. van de Geijn, E. S. Quintana-Orti, and G. Quintana-Orti. 2009. The libflame library for dense matrix computations. In IEEE Computing in Science and Engineering 11, 6 (2009).","journal-title":"IEEE Computing in Science and Engineering"},{"key":"e_1_2_1_27_1","first-page":"3","article-title":"BLIS: A framework for rapidly instantiating BLAS functionality","volume":"41","author":"Van Zee Field G.","year":"2015","unstructured":"Field G. Van Zee and Robert A. van de Geijn. 2015. BLIS: A framework for rapidly instantiating BLAS functionality. 
ACM Transactions on Mathematical Software 41, 3 (2015), 14:1--14:33.","journal-title":"ACM Transactions on Mathematical Software"},{"key":"e_1_2_1_28_1","first-page":"2","article-title":"The BLIS framework: Experiments in portability","volume":"42","author":"Van Zee Field G.","year":"2016","unstructured":"Field G. Van Zee, Tyler Smith, Francisco D. Igual, Mikhail Smelyanskiy, Xianyi Zhang, Michael Kistler, Vernon Austel, John Gunnels, Tze Meng Low, Bryan Marker, Lee Killough, and Robert A. van de Geijn. 2016. The BLIS framework: Experiments in portability. ACM Transactions on Mathematical Software 42, 2 (2016), 12:1--12:19.","journal-title":"ACM Transactions on Mathematical Software"}],"container-title":["ACM Transactions on Mathematical Software"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3378671","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3378671","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T22:41:19Z","timestamp":1750200079000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3378671"}},"subtitle":["Optimizing Performance for Small 
Matrices"],"short-title":[],"issued":{"date-parts":[[2020,5,19]]},"references-count":26,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2020,6,30]]}},"alternative-id":["10.1145\/3378671"],"URL":"https:\/\/doi.org\/10.1145\/3378671","relation":{},"ISSN":["0098-3500","1557-7295"],"issn-type":[{"value":"0098-3500","type":"print"},{"value":"1557-7295","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,5,19]]},"assertion":[{"value":"2019-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2020-05-19","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}