{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T13:40:11Z","timestamp":1755870011228,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":68,"publisher":"ACM","funder":[{"name":"Self-funded"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,8]]},"DOI":"10.1145\/3721145.3725748","type":"proceedings-article","created":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T12:57:17Z","timestamp":1755867437000},"page":"1190-1205","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["DREAM: Device-Driven Efficient Access to Virtual Memory"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-6229-1493","authenticated-orcid":false,"given":"Nurlan","family":"Nazaraliyev","sequence":"first","affiliation":[{"name":"University of California, Riverside, Riverside, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5834-4346","authenticated-orcid":false,"given":"Elaheh","family":"Sadredini","sequence":"additional","affiliation":[{"name":"UC Riverside, Riverside, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9485-5370","authenticated-orcid":false,"given":"Nael","family":"Abu-Ghazaleh","sequence":"additional","affiliation":[{"name":"Computer Science and Engineering, University of California, Riverside, Riverside, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,8,22]]},"reference":[{"key":"e_1_3_3_2_2_2","unstructured":"NVIDIA 2020. 2020. NVIDIA Tesla A100 Tensor Core GPU Architecture. 
https:\/\/images.nvidia.com\/aem-dam\/en-zz\/Solutions\/data-center\/nvidia-ampere-architecture-whitepaper.pdf."},{"key":"e_1_3_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575713"},{"key":"e_1_3_3_2_4_2","doi-asserted-by":"crossref","unstructured":"Tyler Allen Bennett Cooper and Rong Ge. 2024. Fine-grain Quantitative Analysis of Demand Paging in Unified Virtual Memory. ACM Transactions on Architecture and Code Optimization 21 1 (2024) 1\u201324.","DOI":"10.1145\/3632953"},{"key":"e_1_3_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS49936.2021.00023"},{"key":"e_1_3_3_2_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3480855"},{"key":"e_1_3_3_2_7_2","doi-asserted-by":"crossref","unstructured":"Guillaume Ambal Brijesh Dongol Haggai Eran Vasileios Klimis Ori Lahav and Azalea Raad. 2024. Semantics of Remote Direct Memory Access: Operational and Declarative Models of RDMA on TSO Architectures. Proceedings of the ACM on Programming Languages 8 OOPSLA2 (2024) 1982\u20132009.","DOI":"10.1145\/3689781"},{"key":"e_1_3_3_2_8_2","doi-asserted-by":"crossref","unstructured":"Chen Zheng and Zhang Feng and Guan JiaWei and Zhai Jidong and Shen Xipeng and Zhang Huanchen and Shu Wentong and Du Xiaoyong. 2023. Compressgraph: Efficient parallel graph analytics with rule-based compression. Proceedings of the ACM on Management of Data 1 1 (2023) 1\u201331.","DOI":"10.1145\/3588684"},{"key":"e_1_3_3_2_9_2","first-page":"625","volume-title":"2022 USENIX Annual Technical Conference (USENIX ATC 22)","author":"Choi Sangjin","year":"2022","unstructured":"Sangjin Choi, Taeksoo Kim, Jinwoo Jeong, Rachata Ausavarungnirun, Myeongjae Jeon, Youngjin Kwon, and Jeongseob Ahn. 2022. Memory harvesting in { Multi-GPU} systems with hierarchical unified virtual memory. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 625\u2013638."},{"key":"e_1_3_3_2_10_2","doi-asserted-by":"crossref","unstructured":"Jack Choquette. 2023. 
Nvidia hopper h100 gpu: Scaling performance. IEEE Micro 43 3 (2023) 9\u201317.","DOI":"10.1109\/MM.2023.3256796"},{"key":"e_1_3_3_2_11_2","unstructured":"City of Chicago. 2023. Taxi Trips 2013-2023. https:\/\/data.cityofchicago.org\/Transportation\/Taxi-Trips-2013-2023-\/wrvz-psew."},{"key":"e_1_3_3_2_12_2","unstructured":"Rogan Creswick. 2021. Using CUDA Warp-Level Primitives. https:\/\/developer.nvidia.com\/blog\/using-cuda-warp-level-primitives\/."},{"key":"e_1_3_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/2931088.2931091"},{"key":"e_1_3_3_2_14_2","doi-asserted-by":"crossref","unstructured":"Timothy\u00a0A Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38 1 (2011) 1\u201325.","DOI":"10.1145\/2049662.2049663"},{"key":"e_1_3_3_2_15_2","doi-asserted-by":"crossref","unstructured":"Djenouri Youcef and Belhadi Asma and Srivastava Gautam and Lin Jerry Chun-Wei. 2023. An efficient and accurate GPU-based deep learning model for multimedia recommendation. ACM Transactions on Multimedia Computing Communications and Applications 20 2 (2023) 1\u201318.","DOI":"10.1145\/3524022"},{"key":"e_1_3_3_2_16_2","first-page":"401","volume-title":"11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14)","author":"Dragojevi\u0107 Aleksandar","year":"2014","unstructured":"Aleksandar Dragojevi\u0107, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. 2014. { FaRM} : Fast remote memory. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14). 
401\u2013414."},{"key":"e_1_3_3_2_17_2","first-page":"1","volume-title":"Proceedings of the USENIX Annual Technical Conference (ATC)","author":"Duplyakin Dmitry","year":"2019","unstructured":"Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, Gary Wong, Jonathon Duerig, Eric Eide, Leigh Stoller, Mike Hibler, David Johnson, Kirk Webb, Aditya Akella, Kuangching Wang, Glenn Ricart, Larry Landweber, Chip Elliott, Michael Zink, Emmanuel Cecchet, Snigdhaswin Kar, and Prabodh Mishra. 2019. The Design and Operation of CloudLab. In Proceedings of the USENIX Annual Technical Conference (ATC). 1\u201314. https:\/\/www.flux.utah.edu\/paper\/duplyakin-atc19"},{"key":"e_1_3_3_2_18_2","unstructured":"Luigi Fusco Mikhail Khalilov Marcin Chrapek Giridhar Chukkapalli Thomas Schulthess and Torsten Hoefler. 2024. Understanding Data Movement in Tightly Coupled Heterogeneous Systems: A Case Study with the Grace Hopper Superchip. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2408.11556 (2024)."},{"key":"e_1_3_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1145\/3307650.3322224"},{"key":"e_1_3_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISPASS57527.2023.00032"},{"key":"e_1_3_3_2_21_2","first-page":"649","volume-title":"14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17)","author":"Gu Juncheng","year":"2017","unstructured":"Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang\u00a0G Shin. 2017. Efficient memory disaggregation with infiniswap. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 649\u2013667."},{"key":"e_1_3_3_2_22_2","unstructured":"Yongbin Gu Wenxuan Wu Yunfan Li and Lizhong Chen. 2020. Uvmbench: A comprehensive benchmark suite for researching unified virtual memory in gpus. 
arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2007.09822 (2020)."},{"key":"e_1_3_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3332466.3374544"},{"key":"e_1_3_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1145\/3458817.3476223"},{"key":"e_1_3_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1145\/3627703.3629571"},{"key":"e_1_3_3_2_26_2","first-page":"745","volume-title":"21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)","author":"Jiang Ziheng","year":"2024","unstructured":"Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et\u00a0al. 2024. { MegaScale} : Scaling large language model training to more than 10,000 { GPUs}. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). 745\u2013760."},{"key":"e_1_3_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373376.3378529"},{"key":"e_1_3_3_2_28_2","doi-asserted-by":"crossref","unstructured":"Marcin Knap and Pawe\u0142 Czarnul. 2019. Performance evaluation of Unified Memory with prefetching and oversubscription for selected parallel CUDA applications on NVIDIA Pascal and Volta GPUs. The Journal of Supercomputing 75 11 (2019) 7625\u20137645.","DOI":"10.1007\/s11227-019-02966-8"},{"key":"e_1_3_3_2_29_2","doi-asserted-by":"crossref","unstructured":"Konstantinos Koukos Alberto Ros Erik Hagersten and Stefanos Kaxiras. 2016. Building heterogeneous unified virtual memories (uvms) without the overhead. ACM Transactions on Architecture and Code Optimization (TACO) 13 1 (2016) 1\u201322.","DOI":"10.1145\/2889488"},{"key":"e_1_3_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/2487788.2488173"},{"key":"e_1_3_3_2_31_2","unstructured":"Michael Larabel. 2021. AMD Making Progress on HMM-based SVM Memory Manager for Open-Source Compute. 
https:\/\/www.phoronix.com\/news\/AMD-ROCm-HMM-SVM-Memory."},{"key":"e_1_3_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3295500.3356141"},{"key":"e_1_3_3_2_33_2","first-page":"99","volume-title":"21st USENIX Conference on File and Storage Technologies (FAST 23)","author":"Li Pengfei","year":"2023","unstructured":"Pengfei Li, Yu Hua, Pengfei Zuo, Zhangyu Chen, and Jiajie Sheng. 2023. { ROLEX} : A Scalable { RDMA-oriented} Learned { Key-Value} Store for Disaggregated Memory Systems. In 21st USENIX Conference on File and Storage Technologies (FAST 23). 99\u2013114."},{"key":"e_1_3_3_2_34_2","doi-asserted-by":"crossref","unstructured":"John\u00a0DC Little and Stephen\u00a0C Graves. 2008. Little\u2019s law. Building intuition: insights from basic operations management models and principles (2008) 81\u2013100.","DOI":"10.1007\/978-0-387-73699-0_5"},{"key":"e_1_3_3_2_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/SC.2018.00035"},{"key":"e_1_3_3_2_36_2","unstructured":"Seung\u00a0Won Min Vikram\u00a0Sharma Mailthody Zaid Qureshi Jinjun Xiong Eiman Ebrahimi and Wen-mei Hwu. 2020. EMOGI: Efficient memory-access for out-of-memory graph-traversal in GPUs. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2006.06890 (2020)."},{"key":"e_1_3_3_2_37_2","unstructured":"Nurlan Nazaraliyev Elaheh Sadredini and Nael Abu-Ghazaleh. 2024. GPUVM: GPU-driven Unified Virtual Memory. arXiv preprint arXiv:https:\/\/arXiv.org\/abs\/2411.05309 (2024)."},{"key":"e_1_3_3_2_38_2","unstructured":"NVIDIA. [n. d.]. NVIDIA GPUDirect RDMA. https:\/\/docs.nvidia.com\/cuda\/gpudirect-rdma\/."},{"key":"e_1_3_3_2_39_2","volume-title":"Replayable Faults","unstructured":"NVIDIA. [n. d.]. Replayable Faults. 
https:\/\/github.com\/NVIDIA\/open-gpu-kernel-modules\/blob\/main\/kernel-open\/nvidia-uvm\/uvm_gpu_non_replayable_faults.c."},{"key":"e_1_3_3_2_40_2","unstructured":"NVIDIA. 2017. Unified Memory for CUDA Beginners. https:\/\/developer.nvidia.com\/blog\/unified-memory-cuda-beginners\/."},{"key":"e_1_3_3_2_41_2","unstructured":"NVIDIA. 2020. NVIDIA DGX A100 Datasheet. https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/nvidia-dgx-a100-datasheet.pdf."},{"key":"e_1_3_3_2_42_2","unstructured":"NVIDIA. 2023. NVIDIA H100 Tensor Core GPU. https:\/\/resources.nvidia.com\/en-us-tensor-core\/nvidia-tensor-core-gpu-datasheet."},{"key":"e_1_3_3_2_43_2","unstructured":"NVIDIA. 2023. Simplifying GPU Application Development with Heterogeneous Memory Management. https:\/\/developer.nvidia.com\/blog\/simplifying-gpu-application-development-with-heterogeneous-memory-management\/."},{"key":"e_1_3_3_2_44_2","unstructured":"NVIDIA. 2024. Improving Network Performance of HPC Systems Using NVIDIA Magnum IO NVSHMEM and GPUDirect Async. https:\/\/developer.nvidia.com\/blog\/improving-network-performance-of-hpc-systems-using-nvidia-magnum-io-nvshmem-and-gpudirect-async\/."},{"key":"e_1_3_3_2_45_2","unstructured":"NVIDIA. 2024. NVIDIA Grace Hopper Superchip Architecture Whitepaper. https:\/\/resources.nvidia.com\/en-us-grace-cpu\/nvidia-grace-hopper."},{"key":"e_1_3_3_2_46_2","unstructured":"NVIDIA. 2024. Open GPU Documentation. https:\/\/nvidia.github.io\/open-gpu-doc\/."},{"key":"e_1_3_3_2_47_2","volume-title":"Welcome to the cuDF documentation!","year":"2024","unstructured":"NVIDIA. 2024. 
Welcome to the cuDF documentation! https:\/\/docs.rapids.ai\/api\/cudf\/stable\/\/"},{"key":"e_1_3_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.23919\/DATE56975.2023.10137307"},{"key":"e_1_3_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3575693.3575748"},{"key":"e_1_3_3_2_50_2","first-page":"315","volume-title":"14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)","author":"Ruan Zhenyuan","year":"2020","unstructured":"Zhenyuan Ruan, Malte Schwarzkopf, Marcos\u00a0K Aguilera, and Adam Belay. 2020. { AIFM} :{ High-Performance}, { Application-Integrated} far memory. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 315\u2013332."},{"key":"e_1_3_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3342195.3387537"},{"key":"e_1_3_3_2_52_2","volume-title":"GPU Technology Conference (GTC)","author":"Sakharnykh Nikolay","year":"2019","unstructured":"Nikolay Sakharnykh. 2019. Memory management on modern gpu architectures. In GPU Technology Conference (GTC)."},{"key":"e_1_3_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1145\/3673038.3673110"},{"key":"e_1_3_3_2_54_2","doi-asserted-by":"crossref","unstructured":"Sagi Shahar Shai Bergman and Mark Silberstein. 2016. ActivePointers: a case for software address translation on GPUs. ACM SIGARCH Computer Architecture News 44 3 (2016) 596\u2013608.","DOI":"10.1145\/3007787.3001200"},{"key":"e_1_3_3_2_55_2","first-page":"69","volume-title":"13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18)","author":"Shan Yizhou","year":"2018","unstructured":"Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang. 2018. { LegoOS} : A disseminated, distributed { OS} for hardware resource disaggregation. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 
69\u201387."},{"key":"e_1_3_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3489525.3511691"},{"key":"e_1_3_3_2_57_2","first-page":"31094","volume-title":"International Conference on Machine Learning","author":"Sheng Ying","year":"2023","unstructured":"Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R\u00e9, Ion Stoica, and Ce Zhang. 2023. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning. PMLR, 31094\u201331116."},{"key":"e_1_3_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/2451116.2451169"},{"key":"e_1_3_3_2_59_2","doi-asserted-by":"crossref","unstructured":"Mark Silberstein Sangman Kim Seonggu Huh Xinya Zhang Yige Hu Amir Wated and Emmett Witchel. 2016. GPUnet: Networking abstractions for GPU programs. ACM Transactions on Computer Systems (TOCS) 34 3 (2016) 1\u201331.","DOI":"10.1145\/2963098"},{"key":"e_1_3_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1145\/3097983.3098057"},{"key":"e_1_3_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00075"},{"key":"e_1_3_3_2_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/3649411.3649413"},{"key":"e_1_3_3_2_63_2","doi-asserted-by":"crossref","unstructured":"Pengyu Wang Jing Wang Chao Li Jianzong Wang Haojin Zhu and Minyi Guo. 2021. Grus: Toward unified-memory-efficient high-performance graph processing on gpu. ACM Transactions on Architecture and Code Optimization (TACO) 18 2 (2021) 1\u201325.","DOI":"10.1145\/3444844"},{"key":"e_1_3_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/2350190.2350193"},{"key":"e_1_3_3_2_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPS47924.2020.00056"},{"key":"e_1_3_3_2_66_2","doi-asserted-by":"crossref","unstructured":"Qi Yu Bruce Childers Libo Huang Cheng Qian and Zhiying Wang. 2020. A quantitative evaluation of unified memory in GPUs. 
The Journal of Supercomputing 76 (2020) 2958\u20132985.","DOI":"10.1007\/s11227-019-03079-y"},{"key":"e_1_3_3_2_67_2","doi-asserted-by":"publisher","DOI":"10.1145\/3613424.3614309"},{"key":"e_1_3_3_2_68_2","doi-asserted-by":"crossref","unstructured":"Amir\u00a0Kavyan Ziabari Yifan Sun Yenai Ma Dana Schaa Jos\u00e9\u00a0L Abell\u00e1n Rafael Ubal John Kim Ajay Joshi and David Kaeli. 2016. UMH: A Hardware-Based Unified Memory Hierarchy for Systems with Multiple Discrete GPUs. ACM Transactions on Architecture and Code Optimization (TACO) 13 4 (2016) 1\u201325.","DOI":"10.1145\/2996190"},{"key":"e_1_3_3_2_69_2","first-page":"15","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Zuo Pengfei","year":"2021","unstructured":"Pengfei Zuo, Jiazhao Sun, Liu Yang, Shuangwu Zhang, and Yu Hua. 2021. One-sided { RDMA-Conscious} extendible hashing for disaggregated memory. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 15\u201329."}],"event":{"name":"ICS '25: 2025 International Conference on Supercomputing","location":"Salt Lake City USA","acronym":"ICS '25","sponsor":["SIGARCH ACM Special Interest Group on Computer Architecture"]},"container-title":["Proceedings of the 39th ACM International Conference on 
Supercomputing"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3721145.3725748","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,22]],"date-time":"2025-08-22T13:04:05Z","timestamp":1755867845000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3721145.3725748"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,8]]},"references-count":68,"alternative-id":["10.1145\/3721145.3725748","10.1145\/3721145"],"URL":"https:\/\/doi.org\/10.1145\/3721145.3725748","relation":{},"subject":[],"published":{"date-parts":[[2025,6,8]]},"assertion":[{"value":"2025-08-22","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}