{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,10]],"date-time":"2026-01-10T07:28:14Z","timestamp":1768030094021,"version":"3.49.0"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"5s","license":[{"start":{"date-parts":[[2023,9,9]],"date-time":"2023-09-09T00:00:00Z","timestamp":1694217600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Science Foundation","award":["CNS-1822085"],"award-info":[{"award-number":["CNS-1822085"]}]},{"name":"National Science Foundation IUCRC memberships from Samsung and other companies"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2023,10,31]]},"abstract":"<jats:p>\n            Recommendation systems have been widely embedded into many Internet services. For example, Meta\u2019s deep learning recommendation model (DLRM) shows high predictive accuracy of click-through rate in processing large-scale embedding tables. The SparseLengthSum (SLS) kernel of the DLRM dominates the inference time of the DLRM due to intensive irregular memory accesses to the embedding vectors. Some prior works directly adopt near data processing (NDP) solutions to obtain higher memory bandwidth to accelerate SLS. However, their inferior memory hierarchy induces low performance-cost ratio and fails to fully exploit the data locality. Although some software-managed cache policies were proposed to improve the cache hit rate, the incurred cache miss penalty is unacceptable considering the high overheads of executing the corresponding programs and the communication between the host and the accelerator. 
To address the issues aforementioned, we propose\n            <jats:sc>EMS-i<\/jats:sc>\n            , an efficient memory system design that integrates Solid State Drive (SSD) into the memory hierarchy using Compute Express Link (CXL) for recommendation system inference. We specialize the caching mechanism according to the characteristics of various DLRM workloads and propose a novel prefetching mechanism to further improve the performance. In addition, we delicately design the inference kernel and develop a customized mapping scheme for SLS operation, considering the multi-level parallelism in SLS and the data locality within a batch of queries. Compared to the state-of-the-art NDP solutions,\n            <jats:sc>EMS-i<\/jats:sc>\n            achieves up to 10.9\u00d7 speedup over RecSSD and the performance comparable to RecNMP with 72% energy savings.\n            <jats:sc>EMS-i<\/jats:sc>\n            also saves up to 8.7\u00d7 and 6.6\u00d7 memory cost w.r.t. RecSSD and RecNMP, respectively.\n          <\/jats:p>","DOI":"10.1145\/3609384","type":"journal-article","created":{"date-parts":[[2023,9,9]],"date-time":"2023-09-09T13:33:18Z","timestamp":1694266398000},"page":"1-22","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["<scp>EMS-i<\/scp>\n            : An Efficient Memory System Design with Specialized Caching Mechanism for Recommendation Inference"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-0212-6704","authenticated-orcid":false,"given":"Yitu","family":"Wang","sequence":"first","affiliation":[{"name":"Duke University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1990-7150","authenticated-orcid":false,"given":"Shiyu","family":"Li","sequence":"additional","affiliation":[{"name":"Duke University, 
USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5593-1369","authenticated-orcid":false,"given":"Qilin","family":"Zheng","sequence":"additional","affiliation":[{"name":"Duke University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-1573-1377","authenticated-orcid":false,"given":"Andrew","family":"Chang","sequence":"additional","affiliation":[{"name":"Samsung Semiconductor, Inc., USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3228-6544","authenticated-orcid":false,"given":"Hai","family":"Li","sequence":"additional","affiliation":[{"name":"Duke University, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1486-8412","authenticated-orcid":false,"given":"Yiran","family":"Chen","sequence":"additional","affiliation":[{"name":"Duke University, USA"}]}],"member":"320","published-online":{"date-parts":[[2023,9,9]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Amazon Personalize 2023. https:\/\/aws.amazon.com\/personalize\/"},{"key":"e_1_3_1_3_2","volume-title":"ICDCS","author":"Ardestani Ehsan K.","year":"2022","unstructured":"Ehsan K. Ardestani et\u00a0al. 2022. Supporting massive DLRM inference through software defined memory. In ICDCS. IEEE."},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.1145\/293347.293348"},{"key":"e_1_3_1_5_2","first-page":"2055","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Babenko Artem","year":"2016","unstructured":"Artem Babenko and Victor Lempitsky. 2016. Efficient indexing of billion-scale datasets of deep descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2055\u20132063."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/3460231.3474246"},{"key":"e_1_3_1_7_2","unstructured":"Criteo Kaggle Dataset 2020. https:\/\/www.kaggle.com\/datasets\/mrkmakr\/criteo-dataset"},{"key":"e_1_3_1_8_2","unstructured":"CXL 3.0 Specification 2022. 
https:\/\/www.computeexpresslink.org\/download-the-specification\/"},{"key":"e_1_3_1_9_2","unstructured":"DRAM Market Price 2023. https:\/\/electronics-sourcing.com\/2022\/05\/12\/dram-price-increases-will-ease\/"},{"key":"e_1_3_1_10_2","unstructured":"Facebook DLRM Dataset 2021. https:\/\/github.com\/facebookresearch\/dlrm_datasets"},{"key":"e_1_3_1_11_2","volume-title":"ISCA","author":"Gupta Udit","year":"2020","unstructured":"Udit Gupta et\u00a0al. 2020. DeepRecSys: A system for optimizing end-to-end at-scale neural recommendation inference. In ISCA."},{"key":"e_1_3_1_12_2","unstructured":"HBM Market Price 2023. https:\/\/www.networkworld.com\/article\/3664088\/high-bandwidth-memory-hdm-delivers-impressive-performance-gains.html"},{"key":"e_1_3_1_13_2","first-page":"968","volume-title":"2020 ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA\u201920)","author":"Hwang Ranggi","year":"2020","unstructured":"Ranggi Hwang, Taehun Kim, Youngeun Kwon, and Minsoo Rhu. 2020. Centaur: A chiplet-based, hybrid sparse-dense accelerator for personalized recommendations. In 2020 ACM\/IEEE 47th Annual International Symposium on Computer Architecture (ISCA\u201920). IEEE, 968\u2013981."},{"key":"e_1_3_1_14_2","article-title":"SimpleSSD: Modeling solid state drives for holistic system simulation","author":"Jung Myoungsoo","year":"2017","unstructured":"Myoungsoo Jung et\u00a0al. 2017. SimpleSSD: Modeling solid state drives for holistic system simulation. IEEE Computer Architecture Letters (2017).","journal-title":"IEEE Computer Architecture Letters"},{"key":"e_1_3_1_15_2","unstructured":"Kaggle 2023. https:\/\/www.kaggle.com"},{"key":"e_1_3_1_16_2","volume-title":"ISCA","author":"Ke Liu","year":"2020","unstructured":"Liu Ke et\u00a0al. 2020. Recnmp: Accelerating personalized recommendation with near-memory processing. 
In ISCA."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2021.3097700"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TC.2022.3155956"},{"key":"e_1_3_1_19_2","article-title":"Ramulator: A fast and extensible DRAM simulator","author":"Kim Yoongu","year":"2015","unstructured":"Yoongu Kim et\u00a0al. 2015. Ramulator: A fast and extensible DRAM simulator. IEEE Computer Architecture Letters (2015).","journal-title":"IEEE Computer Architecture Letters"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358284"},{"key":"e_1_3_1_21_2","doi-asserted-by":"crossref","first-page":"235","DOI":"10.1109\/HPCA51647.2021.00029","volume-title":"2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201921)","author":"Kwon Youngeun","year":"2021","unstructured":"Youngeun Kwon, Yunjae Lee, and Minsoo Rhu. 2021. Tensor casting: Co-designing algorithm-architecture for personalized recommendation training. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201921). IEEE, 235\u2013248."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2203.00241"},{"key":"e_1_3_1_23_2","article-title":"The gem5 simulator: Version 20.0+","author":"Lowe-Power Jason","year":"2020","unstructured":"Jason Lowe-Power et\u00a0al. 2020. The gem5 simulator: Version 20.0+. arXiv preprint arXiv:2007.03152 (2020).","journal-title":"arXiv preprint arXiv:2007.03152"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2018.2889473"},{"key":"e_1_3_1_25_2","unstructured":"Meta 2023. 
https:\/\/about.meta.com"},{"key":"e_1_3_1_26_2","article-title":"High-performance, distributed training of large-scale deep learning recommendation models","author":"Mudigere Dheevatsa","year":"2021","unstructured":"Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, et\u00a0al. 2021. High-performance, distributed training of large-scale deep learning recommendation models. arXiv preprint arXiv:2104.05158 (2021).","journal-title":"arXiv preprint arXiv:2104.05158"},{"key":"e_1_3_1_27_2","article-title":"Deep learning recommendation model for personalization and recommendation systems","author":"Naumov Maxim","year":"2019","unstructured":"Maxim Naumov et\u00a0al. 2019. Deep learning recommendation model for personalization and recommendation systems. arXiv (2019).","journal-title":"arXiv"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_1_29_2","unstructured":"PM983 Product Brief 2018. https:\/\/www.samsung.com\/semiconductor\/global.semi.static\/"},{"key":"e_1_3_1_30_2","unstructured":"sift-1b 2022. http:\/\/corpus-texmex.irisa.fr\/"},{"key":"e_1_3_1_31_2","volume-title":"ICPE","author":"Soltaniyeh Mohammadreza","year":"2022","unstructured":"Mohammadreza Soltaniyeh et\u00a0al. 2022. Near-storage processing for solid state drive based recommendation inference with SmartSSDs\u00ae. In ICPE."},{"key":"e_1_3_1_32_2","unstructured":"spacev-1b 2021. https:\/\/github.com\/microsoft\/SPTAG\/tree\/main\/datasets\/SPACEV1B"},{"key":"e_1_3_1_33_2","unstructured":"SSD Market Price 2023. 
https:\/\/www.disctech.com\/Samsung-PM1725B-3.2TB-MZ-PLL3T2C-MZPLK1T6HCHP-00005-Dell-73KJ7-PCIe-NVMe-SSD?partner=1011&gclid=CjwKCAiAzp6eBhByEiwA_gGq5BswRyE1M-T6X7Gjbw9dlC_GAWnrc0kRwddyzN9IQ6mbkMA3mfSvpxoCmvEQAvD_BwE"},{"key":"e_1_3_1_34_2","first-page":"1056","volume-title":"2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201922)","author":"Sun Xuan","year":"2022","unstructured":"Xuan Sun, Hu Wan, Qiao Li, Chia-Lin Yang, Tei-Wei Kuo, and Chun Jason Xue. 2022. Rm-ssd: In-storage computing for large-scale recommendation inference. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA\u201922). IEEE, 1056\u20131070."},{"key":"e_1_3_1_35_2","unstructured":"torchrec 2022. https:\/\/pytorch.org\/torchrec\/"},{"key":"e_1_3_1_36_2","article-title":"A model of a trust-based recommendation system on a social network","author":"Walter Frank Edward","year":"2008","unstructured":"Frank Edward Walter et\u00a0al. 2008. A model of a trust-based recommendation system on a social network. AAMAS (2008).","journal-title":"AAMAS"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCAD51958.2021.9643573"},{"key":"e_1_3_1_38_2","volume-title":"ASPLOS","author":"Wilkening Mark","year":"2021","unstructured":"Mark Wilkening, Gupta, et\u00a0al. 2021. RecSSD: Near data processing for solid state drive based recommendation inference. In ASPLOS."},{"key":"e_1_3_1_39_2","article-title":"Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms","author":"Xiao Han","year":"2017","unstructured":"Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).","journal-title":"arXiv preprint arXiv:1708.07747"},{"key":"e_1_3_1_40_2","unstructured":"Xilinx VU57P HBM 2023. 
https:\/\/www.xilinx.com\/products\/silicon-devices\/fpga\/virtex-ultrascale-plus-vu57p.html"},{"key":"e_1_3_1_41_2","volume-title":"ICMD","author":"Zhou Xiangmin","year":"2015","unstructured":"Xiangmin Zhou et\u00a0al. 2015. Online video recommendation in sharing community. In ICMD."}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3609384","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3609384","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:46:23Z","timestamp":1750178783000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3609384"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,9,9]]},"references-count":40,"journal-issue":{"issue":"5s","published-print":{"date-parts":[[2023,10,31]]}},"alternative-id":["10.1145\/3609384"],"URL":"https:\/\/doi.org\/10.1145\/3609384","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"value":"1539-9087","type":"print"},{"value":"1558-3465","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,9,9]]},"assertion":[{"value":"2023-03-23","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-07-13","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-09-09","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}