{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,7]],"date-time":"2026-04-07T16:14:20Z","timestamp":1775578460194,"version":"3.50.1"},"reference-count":35,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2022,12,22]],"date-time":"2022-12-22T00:00:00Z","timestamp":1671667200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2023,3,31]]},"abstract":"<jats:p>\n            The use of application-specific accelerators in data centers has been the state of the art for at least a decade, starting with the availability of General Purpose GPUs achieving higher performance either overall or per watt. In most cases, these accelerators are coupled via PCIe interfaces to the corresponding hosts, which leads to disadvantages in interoperability, scalability and power consumption. As a viable alternative to PCIe-attached FPGA accelerators this paper proposes standalone FPGAs as\n            <jats:bold>Network-attached Accelerators (NAAs)<\/jats:bold>\n            . To enable reliable communication for decoupled FPGAs we present an\n            <jats:bold>RDMA over Converged Ethernet v2 (RoCEv2)<\/jats:bold>\n            communication stack for high-speed and low-latency data transfer integrated into a hardware framework.\n          <\/jats:p>\n          <jats:p>For NAAs to be used instead of PCIe coupled FPGAs the framework must provide similar throughput and latency with low resource usage. We show that our RoCEv2 stack is capable of achieving 100 Gb\/s throughput with latencies of less than 4\u03bcs while using about 10% of the available resources on a mid-range FPGA. 
To evaluate the energy efficiency of our NAA architecture, we built a demonstrator with 8 NAAs for machine learning based image classification. Based on our measurements, network-attached FPGAs are a great alternative to the more energy-demanding PCIe-attached FPGA accelerators.<\/jats:p>","DOI":"10.1145\/3543176","type":"journal-article","created":{"date-parts":[[2022,12,22]],"date-time":"2022-12-22T11:06:50Z","timestamp":1671707210000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["A High-Throughput, Resource-Efficient Implementation of the RoCEv2 Remote DMA Protocol and its Application"],"prefix":"10.1145","volume":"16","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6597-4928","authenticated-orcid":false,"given":"Niklas","family":"Schelten","sequence":"first","affiliation":[{"name":"Fraunhofer Institute for Telecommunications - Heinrich Hertz Institute (HHI), Einsteinufer, Berlin, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8733-3064","authenticated-orcid":false,"given":"Fritjof","family":"Steinert","sequence":"additional","affiliation":[{"name":"Fraunhofer Institute for Telecommunications - HHI, Germany and University of Potsdam, Potsdam, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3931-4288","authenticated-orcid":false,"given":"Justin","family":"Knapheide","sequence":"additional","affiliation":[{"name":"University of Potsdam, Germany and Fraunhofer Institute for Telecommunications - HHI, Einsteinufer, Berlin, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5281-8144","authenticated-orcid":false,"given":"Anton","family":"Schulte","sequence":"additional","affiliation":[{"name":"Fraunhofer Institute for Telecommunications - HHI, Einsteinufer, Berlin, Germany"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6654-1606","authenticated-orcid":false,"given":"Benno","family":"Stabernack","sequence":"additional","affiliation":[{"name":"Fraunhofer Institute for 
Telecommunications - HHI, Germany and University of Potsdam, Potsdam, Germany"}]}],"member":"320","published-online":{"date-parts":[[2022,12,22]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"2017. IEEE Standard for Ethernet Amendment 10: Media Access Control Parameters Physical Layers and Management Parameters for 200 Gb\/s and 400 Gb\/s Operation."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/IEEESTD.2018.8457469"},{"key":"e_1_3_1_4_2","unstructured":"2021. Linux RDMA. (2021). https:\/\/github.com\/linux-rdma\/rdma-core."},{"key":"e_1_3_1_5_2","unstructured":"2021. NVIDIA Volta Unveiled: GV100 GPU and Tesla V100 accelerator announced. (2021). https:\/\/www.anandtech.com\/show\/11367\/nvidia-volta-unveiled-gv100-gpu-and-tesla-v100-accelerator-announced."},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/HOTI.2017.13"},{"key":"e_1_3_1_7_2","article-title":"AMBA\u00aeAXI \u2122and ACE \u2122 Protocol Specification","unstructured":"ARM. AMBA\u00aeAXI \u2122and ACE \u2122 Protocol Specification. WWW. (n.d.). https:\/\/static.docs.arm.com\/ihi0022\/e\/IHI0022E_amba_axi_and_ace_protocol_spec.pdf.","journal-title":"WWW"},{"key":"e_1_3_1_8_2","first-page":"1","volume-title":"2016 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO)","author":"Caulfield A. M.","year":"2016","unstructured":"A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, M. Humphrey, P. Kaur, J. Kim, D. Lo, T. Massengill, K. Ovtcharov, M. Papamichael, L. Woods, S. Lanka, D. Chiou, and D. Burger. 2016. A cloud-scale acceleration architecture. In 2016 49th Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO). 1\u201313."},{"key":"e_1_3_1_9_2","article-title":"Scalable Network Stack supporting TCP\/IP, RoCEv2, UDP\/IP at 10-100Gbit\/s","author":"Z\u00fcrich ETH","unstructured":"ETH Z\u00fcrich. Scalable Network Stack supporting TCP\/IP, RoCEv2, UDP\/IP at 10-100Gbit\/s. WWW. (n.d.). 
https:\/\/github.com\/fpgasystems\/fpga-network-stack.","journal-title":"WWW"},{"key":"e_1_3_1_10_2","article-title":"Annex A 16: RoCE","author":"Association InfiniBand\u2122 Trade","year":"2010","unstructured":"InfiniBand\u2122 Trade Association. 2010. Annex A 16: RoCE. (WWW). https:\/\/cw.infinibandta.org\/document\/dl\/7148.","journal-title":"(WWW)"},{"key":"e_1_3_1_11_2","article-title":"Annex A 17: RoCEv2","author":"Association InfiniBand\u2122 Trade","year":"2014","unstructured":"InfiniBand\u2122 Trade Association. 2014. Annex A 17: RoCEv2. (WWW). https:\/\/cw.infinibandta.org\/document\/dl\/7781.","journal-title":"(WWW)"},{"key":"e_1_3_1_12_2","article-title":"Architecture Specification Volume 1, Release 1.3","author":"Association InfiniBand\u2122 Trade","year":"2015","unstructured":"InfiniBand\u2122 Trade Association. 2015. Architecture Specification Volume 1, Release 1.3. (WWW). https:\/\/cw.infinibandta.org\/document\/dl\/7859.","journal-title":"(WWW)"},{"key":"e_1_3_1_13_2","article-title":"Intel Arria 10 Device Overview","year":"2018","unstructured":"Intel. 2018. Intel Arria 10 Device Overview. (WWW). https:\/\/www.intel.com\/content\/dam\/www\/programmable\/us\/en\/pdfs\/literature\/hb\/arria-10\/a10_overview.pdf.","journal-title":"(WWW)"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2016.7577381"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL50879.2020.00053"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2016.55"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNS.2019.2904118"},{"key":"e_1_3_1_18_2","article-title":"ConnectX-4 EN Card","author":"Technologies Mellanox","unstructured":"Mellanox Technologies. ConnectX-4 EN Card. WWW. (n.d.). 
https:\/\/www.mellanox.com\/related-docs\/prod_adapter_cards\/PB_ConnectX-4_EN_Card.pdf.","journal-title":"WWW"},{"key":"e_1_3_1_19_2","article-title":"Mellanox Technologies SX1012 12 Port QSFP+ 40GbE 1U Ethernet","author":"Technologies Mellanox","unstructured":"Mellanox Technologies. Mellanox Technologies SX1012 12 Port QSFP+ 40GbE 1U Ethernet. WWW. (n.d.). https:\/\/www.provantage.com\/mellanox-technologies-msx1012b-2brs7MLNX1M3.htm.","journal-title":"WWW"},{"key":"e_1_3_1_20_2","unstructured":"University of Toronto. WWW. (n.d.). https:\/\/github.com\/UofT-HPRC\/GULF-Stream."},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2019.00054"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2019.00053"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_1_24_2","article-title":"Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation","volume":"1801","author":"Sandler Mark","year":"2018","unstructured":"Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR abs\/1801.04381 (2018). arxiv:1801.04381http:\/\/arxiv.org\/abs\/1801.04381.","journal-title":"CoRR"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICFPT51103.2020.00042"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1145\/3342195.3387519"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL53798.2021.00077"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/DSD51259.2020.00033"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11265-021-01727-2"},{"key":"e_1_3_1_30_2","unstructured":"HITEK Systems. WWW. (n.d.). 
https:\/\/hiteksys.com\/fpga-ip-cores\/udp-ip-offload-engine."},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2018.2877290"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/CloudCom.2016.0018"},{"key":"e_1_3_1_33_2","article-title":"UltraScale+ FPGAs Product Tables and Product Selection Guide","year":"2015","unstructured":"Xilinx. 2015. UltraScale+ FPGAs Product Tables and Product Selection Guide. (WWW). https:\/\/www.xilinx.com\/support\/documentation\/selection-guides\/ultrascale-plus-.","journal-title":"(WWW)"},{"key":"e_1_3_1_34_2","unstructured":"Xilinx. 2018. Xilinx embedded target RDMA enabled NIC v1.1. (June2018). https:\/\/www.xilinx.com\/support\/documentation\/ip_documentation\/etrnic\/v1_1\/pg294-etrnic.pdf."},{"key":"e_1_3_1_35_2","unstructured":"Xilinx. 2021. Embedded RDMA enabled NIC v3.1. (June2021). https:\/\/www.xilinx.com\/support\/documentation\/ip_documentation\/ernic\/v3_0\/pg332-ernic.pdf."},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICPADS.2015.65"}],"container-title":["ACM Transactions on Reconfigurable Technology and 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3543176","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3543176","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:09:16Z","timestamp":1750183756000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3543176"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12,22]]},"references-count":35,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,3,31]]}},"alternative-id":["10.1145\/3543176"],"URL":"https:\/\/doi.org\/10.1145\/3543176","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,12,22]]},"assertion":[{"value":"2021-09-06","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-05-24","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-12-22","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}