{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,7]],"date-time":"2025-08-07T09:02:08Z","timestamp":1754557328412,"version":"3.41.0"},"reference-count":46,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2015,1,9]],"date-time":"2015-01-09T00:00:00Z","timestamp":1420761600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"EU FP7 Adept project number 610490"},{"name":"European Research Council under the European Community's Seventh Framework Programme (FP7\/2007-2013)\/ERC","award":["259295"],"award-info":[{"award-number":["259295"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2015,1,9]]},"abstract":"<jats:p>Superscalar in-order processors form an interesting alternative to out-of-order processors because of their energy efficiency and lower design complexity. However, despite the reduced design complexity, it is nontrivial to get performance estimates or insight in the application--microarchitecture interaction without running slow, detailed cycle-level simulations, because performance highly depends on the order of instructions within the application\u2019s dynamic instruction stream, as in-order processors stall on interinstruction dependences and functional unit contention. To limit the number of detailed cycle-level simulations needed during design space exploration, we propose a mechanistic analytical performance model that is built from understanding the internal mechanisms of the processor.<\/jats:p>\n          <jats:p>The mechanistic performance model for superscalar in-order processors is shown to be accurate with an average performance prediction error of 3.2% compared to detailed cycle-accurate simulation using gem5. 
We also validate the model against hardware, using the ARM Cortex-A8 processor and show that it is accurate within 10% on average. We further demonstrate the usefulness of the model through three case studies: (1) design space exploration, identifying the optimum number of functional units for achieving a given performance target; (2) program--machine interactions, providing insight into microarchitecture bottlenecks; and (3) compiler--architecture interactions, visualizing the impact of compiler optimizations on performance.<\/jats:p>","DOI":"10.1145\/2678277","type":"journal-article","created":{"date-parts":[[2015,1,12]],"date-time":"2015-01-12T20:02:10Z","timestamp":1421092930000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":16,"title":["Mechanistic Analytical Modeling of Superscalar In-Order Processor Performance"],"prefix":"10.1145","volume":"11","author":[{"given":"Maximilien B.","family":"Breughe","sequence":"first","affiliation":[{"name":"Ghent University, Belgium"}]},{"given":"Stijn","family":"Eyerman","sequence":"additional","affiliation":[{"name":"Ghent University, Belgium"}]},{"given":"Lieven","family":"Eeckhout","sequence":"additional","affiliation":[{"name":"Ghent University, Belgium"}]}],"member":"320","published-online":{"date-parts":[[2015,1,9]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629575.1629577"},{"volume-title":"Cortex-A8 Technical: Reference Manual (3p2 ed.). ARM Holdings","author":"Holdings ARM","key":"e_1_2_1_2_1","unstructured":"ARM Holdings . 2010. Cortex-A8 Technical: Reference Manual (3p2 ed.). ARM Holdings , Cambridge, NJ . ARM Holdings. 2010. Cortex-A8 Technical: Reference Manual (3p2 ed.). 
ARM Holdings, Cambridge, NJ."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/2024716.2024718"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2012.6189202"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/SASP.2011.5941070"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2008.4771779"},{"volume-title":"Proceedings of the IEEE 15th International Symposium on High-Performance Computer Architecture (HPCA). IEEE","author":"Chen X. E.","key":"e_1_2_1_7_1","unstructured":"X. E. Chen and T. M. Aamodt . 2009. A first-order fine-grained multithreaded throughput model . In Proceedings of the IEEE 15th International Symposium on High-Performance Computer Architecture (HPCA). IEEE , Los Alamitos, CA, 329--340. X. E. Chen and T. M. Aamodt. 2009. A first-order fine-grained multithreaded throughput model. In Proceedings of the IEEE 15th International Symposium on High-Performance Computer Architecture (HPCA). IEEE, Los Alamitos, CA, 329--340."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/1806596.1806647"},{"volume-title":"Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA). 76--87","author":"Chou Y.","key":"e_1_2_1_9_1","unstructured":"Y. Chou , B. Fahs , and S. Abraham . 2004. Microarchitecture optimizations for exploiting memory-level parallelism . In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA). 76--87 . Y. Chou, B. Fahs, and S. Abraham. 2004. Microarchitecture optimizations for exploiting memory-level parallelism. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA). 
76--87."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/379240.565338"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.26"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/12.278481"},{"volume-title":"Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT). 25--34","author":"Eeckhout L.","key":"e_1_2_1_13_1","unstructured":"L. Eeckhout and K. De Bosschere . 2001. Hybrid analytical-statistical modeling for efficiently exploring architecture and workload design spaces . In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT). 25--34 . L. Eeckhout and K. De Bosschere. 2001. Hybrid analytical-statistical modeling for efficiently exploring architecture and workload design spaces. In Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT). 25--34."},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2003.1240210"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/1944862.1944885"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/1736020.1736033"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/1534909.1534910"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.5555\/2015039.2015543"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.5555\/1128020.1128563"},{"volume-title":"Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). 13--22","author":"Gutierrez A.","key":"e_1_2_1_20_1","unstructured":"A. Gutierrez , J. Pusdesris , R. G. Dreslinski , T. Mudge , C. Sudanthi , C. D. Emmons , M. Hayenga , and N. Paver . 2014. Sources of error in full-system simulation . In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). 13--22 . A. Gutierrez, J. 
Pusdesris, R. G. Dreslinski, T. Mudge, C. Sudanthi, C. D. Emmons, M. Hayenga, and N. Paver. 2014. Sources of error in full-system simulation. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). 13--22."},{"key":"e_1_2_1_21_1","first-page":"1","article-title":"SimPoint 3.0: Faster and more flexible program analysis","volume":"7","author":"Hamerly G.","year":"2005","unstructured":"G. Hamerly , E. Perelman , J. Lau , and B. Calder . 2005 . SimPoint 3.0: Faster and more flexible program analysis . Journal of Instruction-Level Parallelism 7 , 1 -- 28 . G. Hamerly, E. Perelman, J. Lau, and B. Calder. 2005. SimPoint 3.0: Faster and more flexible program analysis. Journal of Instruction-Level Parallelism 7, 1--28.","journal-title":"Journal of Instruction-Level Parallelism"},{"volume-title":"Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA). 7--13","author":"Hartstein A.","key":"e_1_2_1_22_1","unstructured":"A. Hartstein and T. R. Puzak . 2002. The optimal pipeline depth for a microprocessor . In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA). 7--13 . A. Hartstein and T. R. Puzak. 2002. The optimal pipeline depth for a microprocessor. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA). 7--13."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/12.40842"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/1168857.1168882"},{"volume-title":"Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA). 99--108","author":"Joseph P. J.","key":"e_1_2_1_25_1","unstructured":"P. J. Joseph , K. Vaswani , and M. J. Thazhuthaveetil . 2006a. Construction and use of linear regression models for processor performance analysis . In Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA). 99--108 . P. J. Joseph, K. 
Vaswani, and M. J. Thazhuthaveetil. 2006a. Construction and use of linear regression models for processor performance analysis. In Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA). 99--108."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2006.6"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1016\/0166-5316(94)90041-8"},{"volume-title":"Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA). 338--349","author":"Karkhanis T.","key":"e_1_2_1_28_1","unstructured":"T. Karkhanis and J. E. Smith . 2004. A first-order superscalar processor model . In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA). 338--349 . T. Karkhanis and J. E. Smith. 2004. A first-order superscalar processor model. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA). 338--349."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2005.35"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1168857.1168881"},{"key":"e_1_2_1_31_1","volume-title":"Proceedings of the ISCA 25th International Conference on Computers and Their Applications (CATA). 294--299","author":"Lee J.","year":"2010","unstructured":"J. Lee . 2010 . A superscalar processor model for limited functional units using instruction dependencies . In Proceedings of the ISCA 25th International Conference on Computers and Their Applications (CATA). 294--299 . J. Lee. 2010. A superscalar processor model for limited functional units using instruction dependencies. In Proceedings of the ISCA 25th International Conference on Computers and Their Applications (CATA). 
294--299."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2008.37"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2514641.2514647"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1147\/sj.92.0078"},{"volume-title":"Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques (PACT). 2--10","author":"Michaud P.","key":"e_1_2_1_35_1","unstructured":"P. Michaud , A. Seznec , and S. Jourdan . 1999. Exploring instruction-fetch bandwidth requirement in wide-issue superscalar processors . In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques (PACT). 2--10 . P. Michaud, A. Seznec, and S. Jourdan. 1999. Exploring instruction-fetch bandwidth requirement in wide-issue superscalar processors. In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques (PACT). 2--10."},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/192724.192730"},{"volume-title":"Proceedings of the 3rd International Symposium on High-Performance Computer Architecture (HPCA). 298--309","author":"Noonburg D. B.","key":"e_1_2_1_37_1","unstructured":"D. B. Noonburg and J. P. Shen . 1997. A framework for statistical modeling of superscalar processor performance . In Proceedings of the 3rd International Symposium on High-Performance Computer Architecture (HPCA). 298--309 . D. B. Noonburg and J. P. Shen. 1997. A framework for statistical modeling of superscalar processor performance. In Proceedings of the 3rd International Symposium on High-Performance Computer Architecture (HPCA). 298--309."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/339647.339656"},{"volume-title":"Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 116--125","author":"Ould-Ahmed-Vall E.","key":"e_1_2_1_39_1","unstructured":"E. Ould-Ahmed-Vall , J. 
Woodlee , C. Yount , K. A. Doshi , and S. Abraham . 2007. Using model trees for computer architecture performance analysis of software applications . In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 116--125 . E. Ould-Ahmed-Vall, J. Woodlee, C. Yount, K. A. Doshi, and S. Abraham. 2007. Using model trees for computer architecture performance analysis of software applications. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 116--125."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1816002"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2007.70817"},{"volume-title":"AM335x Sitara Processors: Technical Reference Manual. Texas Instruments Incorporated","key":"e_1_2_1_42_1","unstructured":"Texas Instruments Incorporated. 2014. AM335x Sitara Processors: Technical Reference Manual. Texas Instruments Incorporated , Dallas, TX . Texas Instruments Incorporated. 2014. AM335x Sitara Processors: Technical Reference Manual. Texas Instruments Incorporated, Dallas, TX."},{"volume-title":"Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). 45--56","author":"Van Biesbrouck M.","key":"e_1_2_1_43_1","unstructured":"M. Van Biesbrouck , T. Sherwood , and B. Calder . 2004. A co-phase matrix to guide simultaneous multithreading simulation . In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). 45--56 . M. Van Biesbrouck, T. Sherwood, and B. Calder. 2004. A co-phase matrix to guide simultaneous multithreading simulation. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). 
45--56."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/IISWC.2011.6114194"},{"volume-title":"Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA). 213--224","author":"Van Craeynest K.","key":"e_1_2_1_45_1","unstructured":"K. Van Craeynest , A. Jaleel , L. Eeckhout , P. Narvaez , and J. Emer . 2012. Scheduling heterogeneous multi-cores through performance impact estimation (pie) . In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA). 213--224 . K. Van Craeynest, A. Jaleel, L. Eeckhout, P. Narvaez, and J. Emer. 2012. Scheduling heterogeneous multi-cores through performance impact estimation (pie). In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA). 213--224."},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/1120725.1120764"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2678277","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2678277","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T06:12:53Z","timestamp":1750227173000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2678277"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,1,9]]},"references-count":46,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2015,1,9]]}},"alternative-id":["10.1145\/2678277"],"URL":"https:\/\/doi.org\/10.1145\/2678277","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2015,1,9]]},"assertion":[{"value":"2014-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-10-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2015-01-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}