{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T05:00:15Z","timestamp":1750309215681,"version":"3.41.0"},"reference-count":34,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2014,2,1]],"date-time":"2014-02-01T00:00:00Z","timestamp":1391212800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2014,2]]},"abstract":"<jats:p>Continual Flow Pipelines (CFPs) allow a processor core to process hundreds of in-flight instructions without increasing cycle-critical pipeline resources. When a load misses the data cache, CFP checkpoints the processor register state and then moves all miss-dependent instructions into a low-complexity WB to unblock the pipeline. Meanwhile, miss-independent instructions execute normally and update the processor state. When the miss data return, CFP replays the miss-dependent instructions from the WB and then merges the miss-dependent and miss-independent execution results.<\/jats:p>\n          <jats:p>CFP was initially proposed for cache misses to DRAM. Later work focused on reducing the execution overhead of CFP by avoiding the pipeline flush before replaying miss-dependent instructions and executing dependent and independent instructions concurrently. The goal of these improvements was to gain performance by applying CFP to L1 data cache misses that hit the last level on chip cache. However, many applications or execution phases of applications incur excessive amount of replay and\/or rollbacks to the checkpoint. 
This frequently cancels benefits from CFP and reduces performance.<\/jats:p>\n          <jats:p>In this article, we improve the CFP architecture by using a novel virtual register renaming substrate and by tuning the replay policies to mitigate excessive replays and rollbacks to the checkpoint. We describe these new design optimizations and show, using Spec 2006 benchmarks and microarchitecture performance and power models of our design, that our Tuned-CFP architecture improves performance and energy consumption over previous CFP architectures by \u223c10% and \u223c8%, respectively. We also demonstrate that our proposed architecture gives better performance return on energy per instruction compared to a conventional superscalar as well as previous CFP architectures.<\/jats:p>","DOI":"10.1145\/2579675","type":"journal-article","created":{"date-parts":[[2014,3,18]],"date-time":"2014-03-18T12:09:07Z","timestamp":1395144547000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Tuning the continual flow pipeline architecture with virtual register renaming"],"prefix":"10.1145","volume":"11","author":[{"given":"Komal","family":"Jothi","sequence":"first","affiliation":[{"name":"American University of Beirut, Lebanon"}]},{"given":"Haitham","family":"Akkary","sequence":"additional","affiliation":[{"name":"American University of Beirut, Lebanon"}]}],"member":"320","published-online":{"date-parts":[[2014,2]]},"reference":[{"doi-asserted-by":"publisher","key":"e_1_2_1_1_1","DOI":"10.1145\/339647.339691"},{"volume-title":"Proceedings of the 36th Annual ACM\/IEEE Symposium on Microarchitecture. ACM\/IEEE, 423--434","author":"Akkary Haitham","unstructured":"Haitham Akkary , Ravi Rajwar , and Srikanth T. Srinivasan . 2003. Checkpoint processing and recovery: Towards scalable large instruction window processors . In Proceedings of the 36th Annual ACM\/IEEE Symposium on Microarchitecture. 
ACM\/IEEE, 423--434.","key":"e_1_2_1_2_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_3_1","DOI":"10.1145\/1044823.1044826"},{"volume-title":"Proceedings of the 36th Annual ACM\/IEEE Symposium on Microarchitecture. IEEE, 387--398","author":"Barnes Ronald D.","unstructured":"Ronald D. Barnes, Erik M. Nystrom, John W. Sios, Sanjay J. Patel, Nacho Navarro, and Wen-mei W. Hwu. 2003. Beating in-order stalls with \u201cFlea-flicker\u201d two-pass pipelining. In Proceedings of the 36th Annual ACM\/IEEE Symposium on Microarchitecture. 
IEEE, 387--398.","key":"e_1_2_1_4_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_5_1","DOI":"10.1109\/MICRO.2005.1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_6_1","DOI":"10.1145\/339647.339657"},{"doi-asserted-by":"publisher","key":"e_1_2_1_7_1","DOI":"10.1145\/1555754.1555814"},{"doi-asserted-by":"publisher","key":"e_1_2_1_8_1","DOI":"10.1145\/279358.279378"},{"doi-asserted-by":"publisher","key":"e_1_2_1_9_1","DOI":"10.1109\/HPCA.2004.10008"},{"doi-asserted-by":"publisher","key":"e_1_2_1_10_1","DOI":"10.1145\/1044823.1044825"},{"doi-asserted-by":"publisher","key":"e_1_2_1_12_1","DOI":"10.1145\/263580.263597"},{"doi-asserted-by":"publisher","key":"e_1_2_1_13_1","DOI":"10.1109\/ISCA.2005.46"},{"doi-asserted-by":"publisher","key":"e_1_2_1_14_1","DOI":"10.5555\/822079.822728"},{"doi-asserted-by":"publisher","key":"e_1_2_1_15_1","DOI":"10.1109\/HPCA.2010.5416634"},{"doi-asserted-by":"publisher","key":"e_1_2_1_16_1","DOI":"10.1109\/HPCA.2009.4798281"},{"doi-asserted-by":"publisher","key":"e_1_2_1_17_1","DOI":"10.1145\/30350.30353"},{"doi-asserted-by":"publisher","key":"e_1_2_1_18_1","DOI":"10.1145\/2464996.2465011"},{"doi-asserted-by":"publisher","key":"e_1_2_1_19_1","DOI":"10.1145\/2483028.2483057"},{"doi-asserted-by":"publisher","key":"e_1_2_1_20_1","DOI":"10.1109\/ICCD.2011.6081387"},{"doi-asserted-by":"publisher","key":"e_1_2_1_21_1","DOI":"10.5555\/545215.545223"},{"doi-asserted-by":"publisher","key":"e_1_2_1_22_1","DOI":"10.1145\/279358.279377"},{"doi-asserted-by":"publisher","key":"e_1_2_1_23_1","DOI":"10.5555\/774861.774863"},{"doi-asserted-by":"publisher","key":"e_1_2_1_24_1","DOI":"10.1109\/ISCA.2005.49"},{"doi-asserted-by":"publisher","key":"e_1_2_1_25_1","DOI":"10.1109\/L-CA.2005.1"},{"volume-title":"Proceedings of the 9th International Symposium on High Performance Computer Architecture. IEEE, 129--140","author":"Mutlu Onur","unstructured":"Onur Mutlu , Jared Stark , Chris Wilkerson , and Yale N. Patt . 2003. 
Runahead execution: An alternative to very large instruction windows for out-of-order processors. In Proceedings of the 9th International Symposium on High Performance Computer Architecture. IEEE, 129--140.","key":"e_1_2_1_26_1"},{"doi-asserted-by":"publisher","key":"e_1_2_1_27_1","DOI":"10.1109\/ICCD.2008.4751889"},{"doi-asserted-by":"publisher","key":"e_1_2_1_28_1","DOI":"10.1109\/40.491458"},{"doi-asserted-by":"publisher","key":"e_1_2_1_29_1","DOI":"10.1007\/978-3-642-36424-2_8"},{"doi-asserted-by":"publisher","key":"e_1_2_1_30_1","DOI":"10.1145\/2355585.2355592"},{"doi-asserted-by":"publisher","key":"e_1_2_1_31_1","DOI":"10.1109\/5.476078"},{"doi-asserted-by":"publisher","key":"e_1_2_1_32_1","DOI":"10.1145\/1024393.1024407"},{"doi-asserted-by":"publisher","key":"e_1_2_1_33_1","DOI":"10.1147\/rd.111.0025"},{"doi-asserted-by":"publisher","key":"e_1_2_1_34_1","DOI":"10.1145\/223982.224449"},{"volume-title":"Proceedings of the 9th Annual Workshop on Duplicating, Deconstructing and Debunking.","author":"Sonya","unstructured":"Sonya R. Wolff and Ronald D. Barnes. 2011. Reexamining instruction reuse in preexecution approaches. 
In Proceedings of the 9th Annual Workshop on Duplicating, Deconstructing and Debunking.","key":"e_1_2_1_35_1"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2579675","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2579675","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:43:50Z","timestamp":1750290230000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2579675"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2014,2]]},"references-count":34,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2014,2]]}},"alternative-id":["10.1145\/2579675"],"URL":"https:\/\/doi.org\/10.1145\/2579675","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2014,2]]},"assertion":[{"value":"2013-06-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2013-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2014-02-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}