{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,8]],"date-time":"2026-01-08T00:23:00Z","timestamp":1767831780543,"version":"3.49.0"},"reference-count":64,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2019,3,8]],"date-time":"2019-03-08T00:00:00Z","timestamp":1552003200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2019,3,31]]},"abstract":"<jats:p>\n            Conventional on-chip TLB hierarchies are unable to fully cover the growing application working-set sizes. To make things worse, Last-Level TLB (LLT) misses require multiple accesses to the page table even with the use of page walk caches. Consequently, LLT misses incur long address translation latency and hurt performance. This article proposes two low-overhead hardware mechanisms for reducing the frequency and penalty of on-die LLT misses. The first,\n            <jats:italic>Unified CAche and TLB (UCAT)<\/jats:italic>\n            , enables the conventional on-die Last-Level Cache to store cache lines and TLB entries in a single unified structure and increases on-die TLB capacity significantly. The second,\n            <jats:italic>DRAM-TLB<\/jats:italic>\n            , memoizes virtual to physical address translations in DRAM and reduces LLT miss penalty when UCAT is unable to fully cover total application working-set. DRAM-TLB serves as the next larger level in the TLB hierarchy that significantly increases TLB coverage relative to on-chip TLBs. 
The combination of these two mechanisms,\n            <jats:italic>DUCATI<\/jats:italic>\n            , is an address translation architecture that improves GPU performance by 81% (up to 4.5\u00d7) while requiring minimal changes to the existing system design. We show that DUCATI is within 20%, 5%, and 2% of the performance of a perfect LLT system when using 4KB, 64KB, and 2MB pages, respectively.\n          <\/jats:p>","DOI":"10.1145\/3309710","type":"journal-article","created":{"date-parts":[[2019,3,8]],"date-time":"2019-03-08T13:16:43Z","timestamp":1552051003000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":25,"title":["DUCATI"],"prefix":"10.1145","volume":"16","author":[{"given":"Aamer","family":"Jaleel","sequence":"first","affiliation":[{"name":"NVIDIA"}]},{"given":"Eiman","family":"Ebrahimi","sequence":"additional","affiliation":[{"name":"NVIDIA"}]},{"given":"Sam","family":"Duncan","sequence":"additional","affiliation":[{"name":"NVIDIA"}]}],"member":"320","published-online":{"date-parts":[[2019,3,8]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2694344.2694381"},{"key":"e_1_2_1_2_1","unstructured":"ATS. 2009. PCI Express Address Translation Service. Retrieved from http:\/\/composter.com.ua\/documents\/ats_r1.1_26Jan09.pdf."},
{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815970"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/2024723.2000101"},{"key":"e_1_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485943"},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/1346281.1346286"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540741"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.5555\/2014698.2014896"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/1736020.1736060"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/74850.74854"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/139669.139708"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2749469.2750387"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.63"},{"key":"e_1_2_1_14_1","volume-title":"BATMAN: Maximizing bandwidth utilization for hybrid memory systems. Technical Report for Computer ARchitecture and Emerging Technologies (CARET) Lab, TR-CARET-2015-01.","author":"Chou Chiachen","year":"2015"},{"key":"e_1_2_1_15_1","unstructured":"CORAL. 2014. CORAL Procurement Benchmarks. Retrieved from https:\/\/asc.llnl.gov\/CORAL-benchmarks\/."},{"key":"e_1_2_1_16_1","unstructured":"Jonathan Corbet. 2017. Five-level page tables. Retrieved from https:\/\/lwn.net\/Articles\/717293\/."},
{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2451116.2451157"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCSim.2012.6266938"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2013.49"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.5555\/2643634.2643659"},{"key":"e_1_2_1_21_1","unstructured":"NVIDIA GP100. 2016. P100 GPU Accelerator."},{"key":"e_1_2_1_23_1","unstructured":"HMC Specification 1.0. Retrieved from http:\/\/www.hybridmemorycube.org 2013."},{"key":"e_1_2_1_24_1","volume-title":"HSA Platform System Architecture Specification","author":"HSA"},{"key":"e_1_2_1_25_1","unstructured":"Intel. 2009. Intel 64 and IA-32 Architectures Optimization Reference Manual."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/2.683005"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/1815961.1815971"},{"key":"e_1_2_1_28_1","unstructured":"JEDEC. 2013a. DDR4 SPEC (JESD79-4). JEDEC."},{"key":"e_1_2_1_29_1","unstructured":"JEDEC. 2013b. High Bandwidth Memory (HBM) DRAM (JESD235)."},{"key":"e_1_2_1_30_1","doi-asserted-by":"crossref","unstructured":"James Jeffers, James Reinders, and Avinash Sodani. 2016. Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition. Morgan Kaufmann.","DOI":"10.1016\/B978-0-12-809194-4.00002-8"},
{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2014.51"},{"key":"e_1_2_1_32_1","volume-title":"Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA\u201913)","author":"Jiang Zhipeng","year":"2013"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/2540708.2540733"},{"key":"e_1_2_1_34_1","volume-title":"The compute architecture of intel processor graphics gen9. Intel Whitepaper v1","author":"Junkins Stephen","year":"2015"},{"key":"e_1_2_1_35_1","volume-title":"Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA\u201902)","author":"Gokul"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155624"},{"key":"e_1_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1854273.1854333"},{"key":"e_1_2_1_38_1","volume-title":"Lonestar: A Suite of Parallel Irregular Programs?","author":"Kulkarni Milind A.","year":"2009"},{"key":"e_1_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2012.6168947"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155673"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/1188455.1188677"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/HOTCHIPS.2015.7477461"},{"key":"e_1_2_1_43_1","unstructured":"Khalid Moammer. 2016. AMD Zen Raven Ridge APU Features HBM 128GB\/s of Bandwidth and Large GPU."},
{"key":"e_1_2_1_45_1","volume-title":"Proceedings of the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA\u201914)","author":"Pham Binh"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.32"},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/2830772.2830773"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/2541940.2541942"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA.2014.6835965"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1145\/1250662.1250709"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.30"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/12.2242"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080210"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.1145\/339647.339666"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2012.31"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1145\/2485922.2485958"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2016.25"},{"key":"e_1_2_1_58_1","doi-asserted-by":"publisher","DOI":"10.1145\/195473.195531"},{"key":"e_1_2_1_59_1","doi-asserted-by":"publisher","DOI":"10.1145\/139669.140406"},{"key":"e_1_2_1_60_1","doi-asserted-by":"publisher","DOI":"10.1145\/237090.237205"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS.2016.7482091"},{"key":"e_1_2_1_62_1","doi-asserted-by":"publisher","DOI":"10.1145\/17356.17398"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.1145\/2155620.2155671"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00036"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","DOI":"10.1109\/MICRO.2018.00035"},{"key":"e_1_2_1_66_1","volume-title":"Keckler","auth
or":"Zheng Tianhao","year":"2016"}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3309710","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3309710","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:53:35Z","timestamp":1750204415000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3309710"}},"subtitle":["High-performance Address Translation by Extending TLB Reach of GPU-accelerated Systems"],"short-title":[],"issued":{"date-parts":[[2019,3,8]]},"references-count":64,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2019,3,31]]}},"alternative-id":["10.1145\/3309710"],"URL":"https:\/\/doi.org\/10.1145\/3309710","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"value":"1544-3566","type":"print"},{"value":"1544-3973","type":"electronic"}],"subject":[],"published":{"date-parts":[[2019,3,8]]},"assertion":[{"value":"2018-02-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-01-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2019-03-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}