{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,11]],"date-time":"2026-03-11T17:53:24Z","timestamp":1773251604727,"version":"3.50.1"},"reference-count":97,"publisher":"SAGE Publications","issue":"1","license":[{"start":{"date-parts":[[2004,2,1]],"date-time":"2004-02-01T00:00:00Z","timestamp":1075593600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["The International Journal of High Performance Computing Applications"],"published-print":{"date-parts":[[2004,2]]},"abstract":"<jats:p> Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e. actually running the code). This paper presents quantitative data that motivate the development of such a search-based system, using dense matrix multiply as a case study. The statistical distributions of performance within spaces of reasonable implementations, when observed on a variety of hardware platforms, lead us to pose and address two general problems which arise during the search process. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Secondly, we show how to construct run-time decision rules, based on run-time inputs, for selecting from among a subset of the best implementations when the space of inputs can be described by continuously varying features. 
We address both problems by using statistical modeling techniques that exploit the large amount of performance data collected during the search. We demonstrate these methods on actual performance data collected by the PHiPAC tuning system for dense matrix multiply. We close with a survey of recent projects that use or otherwise advocate an empirical search-based approach to code generation and algorithm selection, whether at the level of computational kernels, compiler and run-time systems, or problem-solving environments. Collectively, these efforts suggest a number of possible software architectures for constructing platform-adapted libraries and applications. <\/jats:p>","DOI":"10.1177\/1094342004041293","type":"journal-article","created":{"date-parts":[[2004,4,22]],"date-time":"2004-04-22T01:27:18Z","timestamp":1082597238000},"page":"65-94","source":"Crossref","is-referenced-by-count":72,"title":["Statistical Models for Empirical Search-Based Performance Tuning"],"prefix":"10.1177","volume":"18","author":[{"given":"Richard","family":"Vuduc","sequence":"first","affiliation":[{"name":"COMPUTER SCIENCE DIVISION DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES UNIVERSITY OF CALIFORNIA AT BERKELEY, BERKELEY, CA 94720, USA"}]},{"given":"James W.","family":"Demmel","sequence":"additional","affiliation":[{"name":"COMPUTER SCIENCE DIVISION DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES AND DEPARTMENT OF MATHEMATICS UNIVERSITY OF CALIFORNIA AT BERKELEY, BERKELEY, CA 94720, USA"}]},{"given":"Jeff A.","family":"Bilmes","sequence":"additional","affiliation":[{"name":"DEPARTMENT OF ELECTRICAL ENGINEERING UNIVERSITY OF WASHINGTON, SEATTLE, WA, USA"}]}],"member":"179","published-online":{"date-parts":[[2004,2,1]]},"reference":[{"key":"atypb1","doi-asserted-by":"crossref","unstructured":"Andersen, B. S., Gustavson, F., Karaivanov, A., Wasniewski, J., and Yalamov, P. 
Y. September 1999. LAWRA\u2013Linear algebra with recursive algorithms . In Proceedings of the Conference on Parallel Processing and Applied Mathematics, Kazimierz Dolny, Poland.","DOI":"10.3846\/13926292.1999.9637105"},{"key":"atypb2","doi-asserted-by":"publisher","DOI":"10.1145\/945394.945395"},{"key":"atypb3","doi-asserted-by":"crossref","unstructured":"Arnold, M., Fink, S., Grove, D., Hind, M., and Sweeney, P. F. December 2000. Adaptive optimization in the Jalape\u00f1o JVM: The controller\u2019s analytical model. In MICRO-33: Third ACM Workshop on Feedback-Directed Dynamic Optimization, Monterey, CA .","DOI":"10.1145\/354222.353175"},{"key":"atypb4","unstructured":"Ball, T. and Larus, J. R. December 1996. Efficient path profiling . In Proceedings of MICRO 96, Paris, France, pp. 46\u201357 ."},{"key":"atypb5","unstructured":"Barnes, R. November 1999. Feedback-directed data cache optimizations for the x86 . In Proceedings of the 32nd Annual International Symposium on Microarchitecture, Second Workshop on Feedback-Directed Optimization, Haifa, Israel."},{"key":"atypb6","doi-asserted-by":"crossref","unstructured":"Baumgartner, G., Bernholdt, D. E., Cociorva, D., Harrison, R., Hirata, S., Lam, C.C., Nooijen, M., Pitzer, R., Ramanujam, J., and Sadayappan, P. November 2002. A high-level approach to synthesis of high-performance codes for quantum chemistry . In Proceedings of the IEEE\/ACM Conference on Supercomputing, Baltimore, MD.","DOI":"10.1109\/SC.2002.10056"},{"key":"atypb7","doi-asserted-by":"crossref","unstructured":"Beckmann, O. and Kelly, P. H. J. August 1997. Run-time interprocedural data placement optimization for lazy parallel libraries. In EuroPar, Lecture Notes in Computer Science, Springer-Verlag, Berlin .","DOI":"10.1007\/BFb0002749"},{"key":"atypb8","unstructured":"Bickel, P. J. and Doksum, K. A. 1977. 
Mathematical Statistics: Basic Ideas and Selected Topics, Holden-Day, San Francisco, CA ."},{"key":"atypb9","doi-asserted-by":"publisher","DOI":"10.1006\/jpdc.1995.1141"},{"key":"atypb10","doi-asserted-by":"crossref","unstructured":"Bilmes, J., Asanovi\u0107, K., Chin, C., and Demmel, J. July 1997. Optimizing matrix multiply using PHiPAC: a Portable, High-Performance, ANSI C coding methodology . In Proceedings of the International Conference on Supercomputing, Vienna, Austria.","DOI":"10.1145\/263580.263662"},{"key":"atypb11","unstructured":"Bilmes, J., Asanovi\u0107, K., Demmel, J., Lam, D., and Chin, C. October 1998. The PHiPAC v1.0 matrix-multiply distribution, Technical Report UCB\/CSD-98-1020, University of California, Berkeley, CA ."},{"key":"atypb12","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1952.10501182"},{"key":"atypb13","unstructured":"Blackford, S. et al. 2001. Document for the Basic Linear Algebra Subprograms (BLAS) standard: BLAS Technical Forum, http:\/\/www.netlib.org\/blas\/blast-forum."},{"key":"atypb14","doi-asserted-by":"crossref","unstructured":"Brewer, E. July 1995. High-level optimization via automated statistical modeling . In Symposium on Parallel Architectures and Algorithms, Santa Barbara, CA.","DOI":"10.1145\/209936.209946"},{"key":"atypb15","unstructured":"Bruening, D., Garnett, T., and Amarasinghe, S. March 2003. An infrastructure for adaptive dynamic optimization . In Proceedings of the 1st International Symposium on Code Generation and Optimization, San Francisco, CA."},{"key":"atypb16","doi-asserted-by":"crossref","unstructured":"Carr, S. and Kennedy, K. 1992. Compiler blockability of numerical algorithms . In Proceedings of Supercomputing, Minneapolis, MN, pp. 114\u2013124 .","DOI":"10.1109\/SUPERC.1992.236704"},{"key":"atypb17","doi-asserted-by":"publisher","DOI":"10.1002\/spe.4380211204"},{"key":"atypb18","doi-asserted-by":"crossref","unstructured":"Chatterjee, S., Parker, E., Hanlon, P. J., and Lebeck, A. R. June 2001. 
Exact analysis of the cache behavior of nested loops . In Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation, Snowbird, UT, pp. 286\u2013297 .","DOI":"10.1145\/378795.378859"},{"key":"atypb19","doi-asserted-by":"crossref","unstructured":"Chen, Z., Dongarra, J., Luszczek, P., and Roche, K. January 2003. Self-adapting software for numerical linear algebra and LAPACK for clusters, Technical Report UT-CS-03-499, University of Tennessee .","DOI":"10.1007\/3-540-44863-2_65"},{"key":"atypb20","unstructured":"Chow, K. and Wu, Y. November 1999. Feedback-directed selection and characterization of compiler optimizations. In Second Workshop on Feedback-Directed Optimization, Haifa, Israel ."},{"key":"atypb21","unstructured":"Chow, Y. S., Robbins, H., and Siegmund, D. 1971. Great Expectations: The Theory of Optimal Stopping, Houghton-Mifflin, Boston, MA ."},{"key":"atypb22","doi-asserted-by":"publisher","DOI":"10.1023\/A:1015729001611"},{"key":"atypb23","unstructured":"Cooper, K. D., Harvey, T. J., Subramanian, D., and Torczon, L. January 2002b. Compilation order matters, Technical Report, Rice University, Houston, TX ."},{"key":"atypb24","unstructured":"Darcy, J. D. 2002. Finding a fast quicksort implementation for Java, http:\/\/www.sonic.net\/jddarcy\/Research\/cs339-quicksort.pdf."},{"key":"atypb25","doi-asserted-by":"crossref","unstructured":"Diniz, P. and Rinard, M. June 1997. Dynamic feedback: An effective technique for adaptive computing . In Proceedings of Programming Language Design and Implementation, Las Vegas, NV.","DOI":"10.1145\/258915.258923"},{"key":"atypb26","doi-asserted-by":"publisher","DOI":"10.1007\/s101070100263"},{"key":"atypb27","doi-asserted-by":"publisher","DOI":"10.1145\/77626.79170"},{"key":"atypb28","doi-asserted-by":"crossref","unstructured":"Dongarra, J. and Eijkhout, V. June 2003. Self-adapting numerical software and automatic tuning of heuristics . 
In Proceedings of the International Conference on Computational Science, Melbourne, Australia.","DOI":"10.1007\/3-540-44864-0_78"},{"key":"atypb29","doi-asserted-by":"crossref","unstructured":"Fraguela, B. B., Doallo, R., and Zapata, E. L. October 1999. Automatic analytic modeling for the estimation of cache misses . In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Newport Beach, CA, pp. 221\u2013231 .","DOI":"10.1109\/PACT.1999.807544"},{"key":"atypb30","doi-asserted-by":"crossref","unstructured":"Frens, J. D. and Wise, D. S. July 1997. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code . In Proceedings of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Las Vegas, NV, pp. 206\u2013216 .","DOI":"10.1145\/263764.263789"},{"key":"atypb31","doi-asserted-by":"crossref","unstructured":"Frey, B. 1998. Graphical Models for Machine Learning and Digital Communications, MIT Press, Boston, MA .","DOI":"10.7551\/mitpress\/3348.001.0001"},{"key":"atypb32","unstructured":"Frigo, M. and Johnson, S. May 1998. FFTW: An adaptive software architecture for the FFT . In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, WA."},{"key":"atypb33","unstructured":"Frigo, M., Leiserson, C. E., Prokop, H., and Ramachandran, S. October 1999. Cache-oblivious algorithms . In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, New York, NY."},{"key":"atypb34","doi-asserted-by":"crossref","unstructured":"Gatlin, K. S. and Carter, L. November 1999. Architecture-cognizant divide and conquer algorithms . In Proceedings of Supercomputing, Portland, OR.","DOI":"10.1145\/331532.331557"},{"key":"atypb35","doi-asserted-by":"crossref","unstructured":"Geist, A., Gropp, W., Huss-Lederman, S., Lumsdaine, A., Lusk, E., Saphir, W., Skjellum, T., and Snir, M. 1996. MPI-2: extending the message-passing interface . 
In Proceedings of the 2nd European Conference on Parallel Processing (Euro-Par\u201996), Lyon, France, Lecture Notes in Computer Science Vol. 1123\u20131124, Springer-Verlag, Berlin, pp. 128\u2013135 . http:\/\/www.mpi-forum.org.","DOI":"10.1007\/3-540-61626-8_16"},{"key":"atypb36","doi-asserted-by":"publisher","DOI":"10.1145\/325478.325479"},{"key":"atypb37","unstructured":"Goto, K. and van de Geijn, R. November 2002. On reducing TLB misses in matrix multiplication, Technical Report TR-2002-55, University of Texas at Austin ."},{"key":"atypb38","doi-asserted-by":"publisher","DOI":"10.1145\/872726.806987"},{"key":"atypb39","doi-asserted-by":"publisher","DOI":"10.1145\/143103.143146"},{"key":"atypb40","doi-asserted-by":"publisher","DOI":"10.1145\/504210.504213"},{"key":"atypb41","doi-asserted-by":"crossref","unstructured":"Hong, J. W. and Kung, H. T. May 1981. I\/O complexity: the red\u2013blue pebble game . In Proceedings of the 13th Annual ACM Symposium on Theory of Computing, Milwaukee, WI, pp. 326\u2013333 .","DOI":"10.1145\/800076.802486"},{"key":"atypb42","doi-asserted-by":"crossref","unstructured":"Houstis, E. N., Catlin, A. C., Rice, J. R., Verykios, V. S., Ramakrishnan, N., and Houstis, C. E. 2000. PYTHIA-II: a knowledge\/database system for managing performance data and recommending scientific software . ACM Transactions on Mathematical Software 26(2): 227\u2013253 .","DOI":"10.1145\/353474.353475"},{"key":"atypb43","doi-asserted-by":"crossref","unstructured":"Huss-Lederman, S., Jacobson, E. M., Johnson, J. R., Tsao, A., and Turnbull, T. November 1996. Implementation of Strassen\u2019s algorithm for matrix multiplication . In Proceedings of Supercomputing, Pittsburgh, PA.","DOI":"10.1145\/369028.369096"},{"key":"atypb44","unstructured":"Im, E.J. and Yelick, K. March 1999. Optimizing sparse matrix vector multiplication on SMPs . 
In Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing, San Antonio, TX."},{"key":"atypb45","unstructured":"Jordan, M. I. 1995. Why the logistic function? Technical Report 9503, MIT, Cambridge, MA ."},{"key":"atypb46","doi-asserted-by":"crossref","unstructured":"Joshi, R., Nelson, G., and Randall, K. August 2001. Denali: A goal-directed superoptimizer, Technical Report 171, Compaq SRC.","DOI":"10.1145\/512529.512566"},{"key":"atypb47","doi-asserted-by":"publisher","DOI":"10.1145\/292395.292412"},{"key":"atypb48","doi-asserted-by":"publisher","DOI":"10.1145\/778559.778562"},{"key":"atypb49","unstructured":"Kisuki, T., Knijnenburg, P. M., O\u2019Boyle, M. F., and Wijshoff, H. 2000. Iterative compilation in program optimization . In Proceedings of the 8th International Workshop on Compilers for Parallel Computers, Aussois, France, pp. 35\u201344 ."},{"key":"atypb50","doi-asserted-by":"publisher","DOI":"10.1002\/spe.4380010203"},{"key":"atypb51","unstructured":"Ko, A. N. and Izaguirre, J. A. 2003. MDSimAid: automatic optimization of fast electrostatics algorithms for molecular simulations . In Proceedings of the International Conference on Computational Science, Melbourne, Australia, LNCS Vol. 2659, Springer-Verlag, Berlin."},{"key":"atypb52","unstructured":"Lagoudakis, M. G. and Littman, M. L. June 2000. Algorithm selection using reinforcement learning . In Proceedings of the 17th International Conference on Machine Learning, Stanford, CA, pp. 511\u2013518 ."},{"key":"atypb53","doi-asserted-by":"crossref","unstructured":"Lam, M. S., Rothberg, E. E., and Wolf, M. E. April 1991. The cache performance and optimizations of blocked algorithms . In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA.","DOI":"10.1145\/106972.106981"},{"key":"atypb54","unstructured":"Leone, M. and Dybvig, R. K. September 1997. 
Dynamo: a staged compiler architecture for dynamic program optimization, Technical Report TR-490, Department of Computer Science, Indiana University ."},{"key":"atypb55","doi-asserted-by":"crossref","unstructured":"Liniker, P., Beckmann, O., and Kelly, P. H. J. August 2002. Delayed evaluation, self-optimising software components as a programming model. In Euro-Par, Paderborn, Germany .","DOI":"10.1007\/3-540-45706-2_92"},{"key":"atypb56","doi-asserted-by":"publisher","DOI":"10.1145\/128745.128747"},{"key":"atypb57","doi-asserted-by":"crossref","unstructured":"McCalpin, J. D. and Smotherman, M. March 1995. Automatic benchmark generation for cache optimization of matrix algorithms . In Proceedings of the 33rd Annual Southeast Conference, Clemson, SC, USA, R. Geist and S. Junkins, editors, ACM, New York, pp. 195\u2013204 .","DOI":"10.1145\/1122018.1122054"},{"key":"atypb58","doi-asserted-by":"publisher","DOI":"10.1145\/233561.233564"},{"key":"atypb59","doi-asserted-by":"crossref","unstructured":"Massalin, H. 1987. Superoptimizer\u2013a look at the smallest program . In Proceedings of the 2nd International Conference on Architectural Support for Programming Languages and Operating Systems, Palo Alto, CA, pp. 122\u2013126 .","DOI":"10.1145\/36204.36194"},{"key":"atypb60","doi-asserted-by":"crossref","unstructured":"Mirkovic, D., Mahasoom, R., and Johnsson, L. May 2000. An adaptive software library for fast Fourier transforms . In Proceedings of the International Conference on Supercomputing, Santa Fe, NM, pp. 215\u2013224 .","DOI":"10.1145\/335231.335252"},{"key":"atypb61","doi-asserted-by":"publisher","DOI":"10.1023\/A:1018782528453"},{"key":"atypb62","doi-asserted-by":"crossref","unstructured":"Mitchell, N., Carter, L., and Ferrante, J. May 2001. A modal model of memory . In Proceedings of the International Conference on Computational Science, San Francisco, CA, LNCS Vol. 2073, Springer-Verlag, Berlin, pp. 
81\u201396 .","DOI":"10.1007\/3-540-45545-0_18"},{"key":"atypb63","unstructured":"Nisbet, A. June 1998. GAPS: Iterative feedback directed parallelization using genetic algorithms . In Proceedings of the Workshop on Profile and Feedback Directed Compilation, Paris, France."},{"key":"atypb64","doi-asserted-by":"publisher","DOI":"10.1007\/BF02613966"},{"key":"atypb65","unstructured":"Olsen, J. H. and Skov, S. C. 2002. Cache-oblivious algorithms in practice, Master\u2019s thesis, University of Copenhagen, Copenhagen, Denmark."},{"key":"atypb66","doi-asserted-by":"crossref","unstructured":"Parello, D., Temam, O., and Verdun, J.M. November 2002. On increasing architecture awareness in program optimizations to bridge the gap between peak and sustained processor performance\u2014matrix multiply revisited . In Proceedings of the IEEE\/ACM Conference on Supercomputing, Baltimore, MD.","DOI":"10.1109\/SC.2002.10054"},{"key":"atypb67","doi-asserted-by":"crossref","unstructured":"Petrank, E. and Rawitz, D. January 2002. The hardness of cache conscious data placement . In Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on the Principles of Programming Languages, Portland, OR, ACM, New York, pp. 101\u2013112 .","DOI":"10.1145\/503272.503283"},{"key":"atypb68","doi-asserted-by":"crossref","unstructured":"Pike, G. and Hilfinger, P. November 2002. Better tiling and array contraction for compiling scientific programs . In Proceedings of the IEEE\/ACM Conference on Supercomputing, Baltimore, MD.","DOI":"10.1109\/SC.2002.10040"},{"key":"atypb69","doi-asserted-by":"crossref","unstructured":"Platt, J. January 1999. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods \u2014 Support Vector Learning, B. Sch\u00f6lkopf, C. Burges, and A. Smola, editors, MIT Press, Cambridge, MA , pp. 185\u2013208.","DOI":"10.7551\/mitpress\/1130.003.0016"},{"key":"atypb70","doi-asserted-by":"crossref","unstructured":"Pugh, W. and Shpeisman, T. 
August 1998. Generation of efficient code for sparse matrix computations . In Proceedings of the 11th Workshop on Languages and Compilers for Parallel Computing, Chapel Hill, NC, LNCS, Springer-Verlag, Berlin.","DOI":"10.1007\/3-540-48319-5_14"},{"key":"atypb71","doi-asserted-by":"crossref","unstructured":"P\u00fcschel, M., Singer, B., Veloso, M., and Moura, J. M. F. May 2001. Fast automatic generation of DSP algorithms . In Proceedings of the International Conference on Computational Science, San Francisco, CA, LNCS Vol. 2073, Springer-Verlag, Berlin, pp. 97\u2013106 .","DOI":"10.1007\/3-540-45545-0_19"},{"key":"atypb72","doi-asserted-by":"publisher","DOI":"10.1145\/365723.365734"},{"key":"atypb73","doi-asserted-by":"publisher","DOI":"10.1016\/S0065-2458(08)60520-3"},{"key":"atypb74","doi-asserted-by":"crossref","unstructured":"Santiago, N. G., Rover, D. T., and Rodriguez, D. August 2002. A statistical approach for the analysis of the relation between low-level performance information, the code, and the environment . In Proceedings of the ICPP 4th Workshop on High Performance Scientific and Engineering Computing with Applications, Vancouver, BC, Canada, pp. 282\u2013289 .","DOI":"10.1109\/ICPPW.2002.1039742"},{"key":"atypb75","doi-asserted-by":"crossref","unstructured":"Savage, J. E. 1995. Extending the Hong-Kung model to memory hierarchies. In Computing and Combinatorics, D.Z. Du and M. Li, editors, LNCS Vol. 959, Springer-Verlag, Berlin , pp. 270\u2013281.","DOI":"10.1007\/BFb0030842"},{"key":"atypb76","unstructured":"Schwartz, D. A., Judd, R. R., Harrod, W. J., and Manley, D. P. March 2000. VSIPL 1.0 API. http:\/\/www.vsipl.org."},{"key":"atypb77","doi-asserted-by":"crossref","unstructured":"Siek, J. G. and Lumsdaine, A. 1998. A rational approach to portable high performance: the Basic Linear Algebra Instruction Set (BLAIS) and the Fixed Algorithm Size Template (FAST) library . 
In Proceedings of ECOOP, Brussels, Belgium.","DOI":"10.1007\/3-540-49255-0_153"},{"key":"atypb78","doi-asserted-by":"crossref","unstructured":"Smith, M. D. January 2000. Overcoming the challenges to feedback-directed optimization . In Proceedings of the ACM SIGPLAN Workshop on Dynamic and Adaptive Compilation and Optimization (Dynamo), Boston, MA.","DOI":"10.1145\/351397.351408"},{"key":"atypb79","unstructured":"Smola, A. J. and Sch\u00f6lkopf, B. 1998. A tutorial on support vector regression, Technical Report NC2-TR-1998-030, European Community ESPRIT Working Group in Neural and Computational Learning Theory. http:\/\/www.neurocolt.com."},{"key":"atypb80","doi-asserted-by":"crossref","unstructured":"Stephenson, M., Amarasinghe, S., Martin, M., and O\u2019Reilly, U. M. June 2003. Meta optimization: improving compiler heuristics with machine learning . In Proceedings of the ACM Conference on Programming Language Design and Implementation, San Diego, CA.","DOI":"10.1145\/781139.781141"},{"key":"atypb81","unstructured":"Stodghill, P. August 1997. A relational approach to the automatic generation of sequential sparse matrix codes, Ph.D. Thesis, Cornell University."},{"key":"atypb82","doi-asserted-by":"crossref","unstructured":"Thottethodi, M., Chatterjee, S., and Lebeck, A. R. November 1998. Tuning Strassen\u2019s matrix multiplication for memory efficiency . In Proceedings of Supercomputing \u201998, Orlando, FL.","DOI":"10.1109\/SC.1998.10045"},{"key":"atypb83","doi-asserted-by":"publisher","DOI":"10.1137\/S0895479896297744"},{"key":"atypb84","unstructured":"Triantafyllis, S., Vachharajani, M., Vachharajani, N., and August, D. I. March 2003. Compiler optimization-space exploration. In Proceedings of the International Symposium on Code Generation and Optimization , San Francisco, CA, pp. 204\u2013215 ."},{"key":"atypb85","doi-asserted-by":"crossref","unstructured":"Tapus, C., Chung, I.H., and Hollingsworth, J. K. November 2002. 
Active Harmony: towards automated performance tuning . In Proceedings of the IEEE\/ACM Conference on Supercomputing, Baltimore, MD.","DOI":"10.1109\/SC.2002.10062"},{"key":"atypb86","doi-asserted-by":"crossref","unstructured":"Vadhiyar, S. S., Fagg, G. E., and Dongarra, J. November 2000. Automatically tuned collective operations . In Proceedings of Supercomputing 2000, Dallas, TX.","DOI":"10.1109\/SC.2000.10024"},{"key":"atypb87","unstructured":"van der Mark, P., Rohou, E., Bodin, F., Chamski, Z., and Eisenbeis, C. September 1999. Using iterative compilation for managing software pipeline-unrolling trade-offs . In Proceedings of the 4th International Workshop on Compilers for Embedded Systems, St. Goar, Germany."},{"key":"atypb88","unstructured":"Vapnik, V. N. 1998. Statistical Learning Theory, Wiley, New York ."},{"key":"atypb89","doi-asserted-by":"crossref","unstructured":"Veldhuizen, T. 1998. Arrays in Blitz++ . In Proceedings of ISCOPE, LNCS Vol. 1505, Springer-Verlag, Berlin.","DOI":"10.1007\/3-540-49372-7_24"},{"key":"atypb90","unstructured":"Veldhuizen, T. L. and Gannon, D. 1998. Active libraries: rethinking the roles of compilers and libraries . In Proceedings of the SIAM Workshop on Object Oriented Methods for Interoperable Scientific and Engineering Computing, Philadelphia, PA."},{"key":"atypb91","unstructured":"Voss, M. J. and Eigenmann, R. August 2000. ADAPT: automated de-coupled adaptive program transformation . In Proceedings of the International Conference on Parallel Processing, Toronto, Canada."},{"key":"atypb92","doi-asserted-by":"crossref","unstructured":"Vuduc, R., Demmel, J. W., Yelick, K. A., Kamil, S., Nishtala, R., and Lee, B. November 2002. Performance optimizations and bounds for sparse matrix-vector multiply . 
In Proceedings of Supercomputing, Baltimore, MD.","DOI":"10.1109\/SC.2002.10025"},{"key":"atypb93","doi-asserted-by":"publisher","DOI":"10.1016\/S0167-8191(00)00087-9"},{"key":"atypb94","doi-asserted-by":"crossref","unstructured":"Wise, D. S., Frens, J. D., Gu, Y., and Alexander, G. A. 2001. Language support for Morton-order matrices . In Proceedings of the 8th ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, Snowbird, UT, ACM, New York, pp. 24\u201333 .","DOI":"10.1145\/379539.379559"},{"key":"atypb95","doi-asserted-by":"crossref","unstructured":"Wolf, M. E. and Lam, M. S. June 1991. A data locality optimizing algorithm . In Proceedings of the ACM SIGPLAN \u201991 Conference on Programming Language Design and Implementation, Toronto, Ontario, Canada.","DOI":"10.1145\/113445.113449"},{"key":"atypb96","doi-asserted-by":"crossref","unstructured":"Yi, Q., Adve, V., and Kennedy, K. June 2000. Transforming loops to recursion for multi-level memory hierarchies . In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, Vancouver, BC, Canada, pp. 169\u2013181 .","DOI":"10.1145\/349299.349323"},{"key":"atypb97","doi-asserted-by":"crossref","unstructured":"Yotov, K., Li, X., Ren, G., Cibulskis, M., DeJong, G., Garzaran, M., Padua, D., Pingali, K., Stodghill, P., and Wu, P. June 2003. A comparison of empirical and model-driven optimization . 
In Proceedings of the ACM Conference on Programming Language Design and Implementation, San Diego, CA.","DOI":"10.1145\/781139.781140"}],"container-title":["The International Journal of High Performance Computing Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342004041293","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/1094342004041293","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,2]],"date-time":"2025-03-02T21:42:32Z","timestamp":1740951752000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/1094342004041293"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2004,2]]},"references-count":97,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2004,2]]}},"alternative-id":["10.1177\/1094342004041293"],"URL":"https:\/\/doi.org\/10.1177\/1094342004041293","relation":{},"ISSN":["1094-3420","1741-2846"],"issn-type":[{"value":"1094-3420","type":"print"},{"value":"1741-2846","type":"electronic"}],"subject":[],"published":{"date-parts":[[2004,2]]}}}