{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,5]],"date-time":"2026-03-05T15:47:15Z","timestamp":1772725635763,"version":"3.50.1"},"reference-count":72,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2024,5,21]],"date-time":"2024-05-21T00:00:00Z","timestamp":1716249600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"NSF","award":["2326141"],"award-info":[{"award-number":["2326141"]}]},{"DOI":"10.13039\/501100006374","name":"NSF","doi-asserted-by":"publisher","award":["2331536"],"award-info":[{"award-number":["2331536"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. ACM Meas. Anal. Comput. Syst."],"published-print":{"date-parts":[[2024,5,21]]},"abstract":"<jats:p>Microarchitecture simulators are indispensable tools for microarchitecture designers to validate, estimate, optimize, and manufacture new hardware that meets specific design requirements. While the quest for a fast, accurate and detailed microarchitecture simulation has been ongoing for decades, existing simulators excel and fall short at different aspects: (i) Although execution-driven simulation is accurate and detailed, it is extremely slow and requires expert-level experience to design. (ii) Trace-driven simulation reuses the execution traces in pursuit of fast simulation but faces accuracy concerns and fails to achieve significant speedup. (iii) Emerging deep learning (DL)-based simulations are remarkably fast and have acceptable accuracy, but fail to provide adequate low-level microarchitectural performance metrics such as branch mispredictions or cache misses, which is crucial for microarchitectural bottleneck analysis. Additionally, they introduce substantial overheads from trace regeneration and model re-training when simulating a new microarchitecture.<\/jats:p>\n          <jats:p>Re-thinking the advantages and limitations of the aforementioned three mainstream simulation paradigms, this paper introduces TAO that redesigns the DL-based simulation with three primary contributions: First, we propose a new training dataset design such that the subsequent simulation (i.e., inference) only needs functional trace as inputs, which can be rapidly generated and reused across microarchitectures. Second, to increase the detail of the simulation, we redesign the input features and the DL model using self-attention to support predicting various performance metrics of interest. Third, we propose techniques to train a microarchitecture agnostic embedding layer that enables fast transfer learning between different microarchitectural configurations and effectively reduces the re-training overhead of conventional DL-based simulators. TAO can predict various performance metrics of interest, significantly reduce the simulation time, and maintain similar simulation accuracy as state-of-the-art DL-based endeavors. Our extensive evaluation shows TAO can reduce the overall training and simulation time by 18.06x over the state-of-the-art DL-based endeavors.<\/jats:p>","DOI":"10.1145\/3656012","type":"journal-article","created":{"date-parts":[[2024,5,29]],"date-time":"2024-05-29T10:40:32Z","timestamp":1716979232000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["TAO: Re-Thinking DL-based Microarchitecture Simulation"],"prefix":"10.1145","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3528-6868","authenticated-orcid":false,"given":"Santosh","family":"Pandey","sequence":"first","affiliation":[{"name":"Rutgers University, New Brunswick, NJ, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8199-7671","authenticated-orcid":false,"given":"Amir","family":"Yazdanbakhsh","sequence":"additional","affiliation":[{"name":"Google DeepMind, Mountain View, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6323-7388","authenticated-orcid":false,"given":"Hang","family":"Liu","sequence":"additional","affiliation":[{"name":"Rutgers University, New Brunswick, NJ, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,5,29]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2013.6557148"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2019.2917698"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/773146.773043"},{"key":"e_1_2_1_4_1","volume-title":"CausalSim: A Causal Framework for Unbiased Trace-Driven Simulation. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23)","author":"Alomar Abdullah","year":"2023","unstructured":"Abdullah Alomar, Pouya Hamadanian, Arash Nasr-Esfahany, Anish Agarwal, Mohammad Alizadeh, and Devavrat Shah. 2023. CausalSim: A Causal Framework for Unbiased Trace-Driven Simulation. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 1115--1147."},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476221"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830780"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.982917"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3613424.3614289"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/SBAC-PAD.2014.30"},{"key":"e_1_2_1_10_1","volume-title":"FREENIX Track","volume":"41","author":"Bellard Fabrice","year":"2005","unstructured":"Fabrice Bellard. 2005. QEMU, a fast and portable dynamic translator.. In USENIX annual technical conference, FREENIX Track, Vol. 41. California, USA, 46."},{"key":"e_1_2_1_11_1","doi-asserted-by":"crossref","unstructured":"Nathan Binkert Bradford Beckmann Gabriel Black Steven K Reinhardt Ali Saidi Arkaprava Basu Joel Hestness Derek R Hower Tushar Krishna Somayeh Sardashti et al. 2011. The gem5 simulator. ACM SIGARCH computer architecture news Vol. 39 2 (2011) 1--7.","DOI":"10.1145\/2024716.2024718"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3372393"},{"key":"e_1_2_1_13_1","first-page":"1","article-title":"A simple model for portable and fast prediction of execution time and power consumption of GPU kernels","volume":"18","author":"Braun Lorenz","year":"2020","unstructured":"Lorenz Braun, Sotirios Nikas, Chen Song, Vincent Heuveline, and Holger Fr\u00f6ning. 2020. A simple model for portable and fast prediction of execution time and power consumption of GPU kernels. ACM Transactions on Architecture and Code Optimization (TACO), Vol. 18, 1 (2020), 1--25.","journal-title":"ACM Transactions on Architecture and Code Optimization (TACO)"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/2151024.2151043"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3185768.3185771"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASPDAC.2015.7059093"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/782814.782836"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2063384.2063454"},{"key":"e_1_2_1_19_1","volume-title":"International conference on machine learning. PMLR, 794--803","author":"Chen Zhao","year":"2018","unstructured":"Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. 2018. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International conference on machine learning. PMLR, 794--803."},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/183018.183032"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1145\/781131.781159"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2007.12"},{"key":"e_1_2_1_23_1","volume-title":"Improving cache management policies using dynamic reuse distances. In 2012 45Th annual IEEE\/ACM international symposium on microarchitecture","author":"Duong Nam","unstructured":"Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero, and Alexander V Veidenbaum. 2012. Improving cache management policies using dynamic reuse distances. In 2012 45Th annual IEEE\/ACM international symposium on microarchitecture. IEEE, 389--400."},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPDS.2017.2713782"},{"key":"e_1_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/1534909.1534910"},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.5555\/956417.956543"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/166962.167001"},{"key":"e_1_2_1_28_1","volume-title":"Computer architecture: a quantitative approach","author":"Hennessy John L","unstructured":"John L Hennessy and David A Patterson. 2011. Computer architecture: a quantitative approach fifth ed.). Elsevier."},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2007.56"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/1168917.1168882"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2006.6"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2006.1598116"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2745844.2745867"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2333660.2333722"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00047"},{"key":"e_1_2_1_36_1","volume-title":"Macsim: A cpu-gpu heterogeneous simulation framework user guide","author":"Kim Hyesoon","year":"2012","unstructured":"Hyesoon Kim, Jaekyu Lee, Nagesh B Lakshminarayana, Jaewoong Sim, Jieun Lim, and Tri Pho. 2012. Macsim: A cpu-gpu heterogeneous simulation framework user guide. Georgia Institute of Technology (2012)."},{"key":"e_1_2_1_37_1","first-page":"9","article-title":"MASE: a novel infrastructure for detailed microarchitectural modeling","volume":"1","author":"Larson Eric","year":"2001","unstructured":"Eric Larson, Saugata Chatterjee, and Todd M Austin. 2001. MASE: a novel infrastructure for detailed microarchitectural modeling.. In ISPASS, Vol. 1. 9.","journal-title":"ISPASS"},{"key":"e_1_2_1_38_1","volume-title":"Accurate and efficient regression modeling for microarchitectural performance and power prediction. ACM SIGOPS operating systems review","author":"Lee Benjamin C","year":"2006","unstructured":"Benjamin C Lee and David M Brooks. 2006. Accurate and efficient regression modeling for microarchitectural performance and power prediction. ACM SIGOPS operating systems review, Vol. 40, 5 (2006), 185--194."},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2007.346211"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919655"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2009.4919641"},{"key":"e_1_2_1_42_1","unstructured":"Lingda Li. [n. d.]. Lingda-li\/simnet. https:\/\/github.com\/lingda-li\/simnet"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3489048.3530958"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3613424.3614277"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/2611758"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2022.3180327"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33019977"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF02834632"},{"key":"e_1_2_1_49_1","volume-title":"International Conference on machine learning. PMLR, 4505--4515","author":"Mendis Charith","year":"2019","unstructured":"Charith Mendis, Alex Renda, Saman Amarasinghe, and Michael Carbin. 2019. Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks. In International Conference on machine learning. PMLR, 4505--4515."},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2010.5416635"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2008.4510747"},{"key":"e_1_2_1_52_1","volume-title":"SESC: SuperESCalar simulator. In 17 th Euro micro conference on real time systems (ECRTS'05). Citeseer, 1--4.","author":"Ortego Pablo Montesinos","year":"2004","unstructured":"Pablo Montesinos Ortego and Paul Sack. 2004. SESC: SuperESCalar simulator. In 17 th Euro micro conference on real time systems (ECRTS'05). Citeseer, 1--4."},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2018.00032"},{"key":"e_1_2_1_54_1","doi-asserted-by":"crossref","unstructured":"S. Pandey L. Li T. Flynn A. Hoisie and H. Liu. 2022. Scalable Deep Learning-Based Microarchitecture Simulation on GPUs. In 2022 SC22: International Conference for High Performance Computing Networking Storage and Analysis (SC) (SC). IEEE Computer Society Los Alamitos CA USA 1138--1152. https:\/\/doi.ieeecomputersociety.org\/","DOI":"10.1109\/SC41404.2022.00084"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.5555\/2015039.2015523"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485963"},{"key":"e_1_2_1_57_1","volume-title":"Multi-task learning as multi-objective optimization. Advances in neural information processing systems","author":"Sener Ozan","year":"2018","unstructured":"Ozan Sener and Vladlen Koltun. 2018. Multi-task learning as multi-objective optimization. Advances in neural information processing systems , Vol. 31 (2018)."},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/605432.605403"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1109\/MC.2003.1220579"},{"key":"e_1_2_1_60_1","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Somashekar Gagan","year":"2024","unstructured":"Gagan Somashekar, Karan Tandon, Anush Kini, M Das, Petr Husak, CC Chang, R Bhagwan, N Natarajan, and A Gandhi. 2024. Oppertune: Post-deployment configuration tuning of services made easy. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). USENIX Association."},{"key":"e_1_2_1_61_1","volume-title":"GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation. In 2022 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 14--26","author":"S\u1ef3kora Ondvr","year":"2022","unstructured":"Ondvr ej S\u1ef3kora, Phitchaya Mangpo Phothilimthana, Charith Mendis, and Amir Yazdanbakhsh. 2022. GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation. In 2022 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 14--26."},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370816.2370865"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/254180.254184"},{"key":"e_1_2_1_64_1","first-page":"3537","article-title":"Analytical processor performance and power modeling using micro-architecture independent characteristics","volume":"65","author":"den Steen Sam Van","year":"2016","unstructured":"Sam Van den Steen, Stijn Eyerman, Sander De Pestel, Moncef Mechri, Trevor E Carlson, David Black-Schaffer, Erik Hagersten, and Lieven Eeckhout. 2016. Analytical processor performance and power modeling using micro-architecture independent characteristics. IEEE Trans. Comput. , Vol. 65, 12 (2016), 3537--3551.","journal-title":"IEEE Trans. Comput."},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2015.7056063"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.1145\/859618.859629"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","DOI":"10.1145\/337292.337436"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD51958.2021.9643501"},{"key":"e_1_2_1_69_1","first-page":"5824","article-title":"Gradient surgery for multi-task learning","volume":"33","author":"Yu Tianhe","year":"2020","unstructured":"Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. 2020. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems , Vol. 33 (2020), 5824--5836.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_25"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897937.2897977"},{"key":"e_1_2_1_72_1","doi-asserted-by":"publisher","DOI":"10.1145\/1552309.1552310"}],"container-title":["Proceedings of the ACM on Measurement and Analysis of Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3656012","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3656012","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,23]],"date-time":"2025-08-23T18:02:01Z","timestamp":1755972121000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3656012"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,5,21]]},"references-count":72,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2024,5,21]]}},"alternative-id":["10.1145\/3656012"],"URL":"https:\/\/doi.org\/10.1145\/3656012","relation":{},"ISSN":["2476-1249"],"issn-type":[{"value":"2476-1249","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,5,21]]},"assertion":[{"value":"2024-05-29","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}