{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T16:48:12Z","timestamp":1770742092793,"version":"3.49.0"},"reference-count":39,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2022,8,8]],"date-time":"2022-08-08T00:00:00Z","timestamp":1659916800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Reconfigurable Technol. Syst."],"published-print":{"date-parts":[[2022,12,31]]},"abstract":"<jats:p>The advent of AI has driven the exploration of high-density low-precision arithmetic on FPGAs. This has resulted in new methods in mapping both arithmetic functions as well as dataflows onto the fabric, as well as some changes to the embedded DSP Blocks. Technologies outside of the FPGA realm have also evolved, such as the addition of tensor structures for GPUs, as well as the introduction of numerous AI ASSPs, all of which have a higher claimed performance and efficiency than current FPGAs. In this article, we will introduce the Stratix 10 NX device, which is a variant of FPGA specifically optimized for the AI application space. In addition to the computational capabilities of the standard programmable soft-logic fabric, a new type of DSP Block provides the dense arrays of low-precision multipliers typically used in AI implementations. The architecture of the block is tuned for the common matrix-matrix or vector-matrix multiplications in AI, with capabilities designed to work efficiently for both small and large matrix sizes. The base precisions are INT8 and INT4, along with shared exponent support to support block FP16 and block FP12 numerics. 
All additions\/accumulations can be done in INT32 or IEEE-754 single precision floating point (FP32), and multiple blocks can be cascaded together to support larger matrices. We will also describe methods by which the smaller precision multipliers can be aggregated to create larger multipliers that are more applicable to standard signal processing requirements.<\/jats:p>\n          <jats:p>In the AI market, the FPGA must compete directly with other types of devices, rather than occupy a unique niche. Deterministic system performance is as important as the performance of individual FPGA elements, such as logic, memory, and DSP. We will show that the feed-forward datapath structures that are needed to support the typical AI matrix-vector and matrix-matrix multiplication operations can consistently close timing at over 500 MHz on a mid-speed grade device, even if all of the Tensor Blocks on the device are used. We will also show a full-chip NPU processor implementation that outperforms GPUs at the same process node for a variety of AI inferencing workloads, even though it has a lower operating frequency of 365 MHz.<\/jats:p>\n          <jats:p>In terms of overall compute throughput, Stratix 10 NX is specified at 143 INT8\/FP16 TOPs\/FLOPs or 286 INT4\/FP12 TOPs\/FLOPs. 
Depending on the configuration, power efficiency is in the range of 1\u20134 TOPs or TFLOPs\/W.<\/jats:p>","DOI":"10.1145\/3520197","type":"journal-article","created":{"date-parts":[[2022,3,14]],"date-time":"2022-03-14T12:46:51Z","timestamp":1647262011000},"page":"1-32","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":8,"title":["Stratix 10 NX Architecture"],"prefix":"10.1145","volume":"15","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8206-2077","authenticated-orcid":false,"given":"Martin","family":"Langhammer","sequence":"first","affiliation":[{"name":"Intel Corporation, Marlow, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Eriko","family":"Nurvitadhi","sequence":"additional","affiliation":[{"name":"Intel Corporation, Hillsboro, OR, United States"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sergey","family":"Gribok","sequence":"additional","affiliation":[{"name":"Intel Corporation, San Jose, CA, United States"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5454-4375","authenticated-orcid":false,"given":"Bogdan","family":"Pasca","sequence":"additional","affiliation":[{"name":"Intel Corporation, Meudon, France"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,8,8]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"Xilinx. 2017. Deep Learning with INT8 Optimization on Xilinx Devices. Retrieved from https:\/\/www.xilinx.com\/support\/documentation\/white_papers\/wp486-deep-learning-int8.pdf."},{"key":"e_1_3_1_3_2","unstructured":"Nvidia. 2018. NVIDIA-Turing-Architecture-Whitepaper. Retrieved from https:\/\/images.nvidia.com\/aem-dam\/en-zz\/Solutions\/design-visualization\/technologies\/turing-architecture\/NVIDIA-Turing-Architecture-Whitepaper.pdf."},{"key":"e_1_3_1_4_2","unstructured":"Intel. 2019. Agilex F-Series FPGAs and SoC FPGAs. 
Retrieved from https:\/\/www.intel.com\/content\/www\/us\/en\/products\/details\/fpga\/agilex\/f-series.htmf."},{"key":"e_1_3_1_5_2","unstructured":"Graphcore. 2020. Introducing 2nd Generation IPU Systems for AI at Scale. Retrieved from https:\/\/www.graphcore.ai\/posts\/introducing-second-generation-ipu-systems-for-ai-at-scale."},{"key":"e_1_3_1_6_2","unstructured":"Xilinx. 2020. Versal ACAP Packaging and Pinouts Architecture Manual. Retrieved from https:\/\/www.xilinx.com\/support\/documentation\/architecture-manuals\/am013-versal-pkg-pinout.pdf."},{"key":"e_1_3_1_7_2","unstructured":"Xilinx. 2020. Versal: The First Adaptive Compute Acceleration Platform (ACAP). Retrieved from https:\/\/www.xilinx.com\/support\/documentation\/white_papers\/wp505-versal-acap.pdf."},{"key":"e_1_3_1_8_2","unstructured":"Xilinx. 2020. Zynq DPU v3.2. Retrieved from https:\/\/www.xilinx.com\/support\/documentation\/ip_documentation\/dpu\/v3_0\/pg338-dpu.pdf."},{"key":"e_1_3_1_9_2","unstructured":"Wikipedia. 2021. 14 nm process. Retrieved from https:\/\/https:\/\/en.wikipedia.org\/wiki\/14_nm_process."},{"key":"e_1_3_1_10_2","unstructured":"Wikipedia. 2021. 7 nm process. Retrieved from https:\/\/https:\/\/en.wikipedia.org\/wiki\/7_nm_process."},{"key":"e_1_3_1_11_2","unstructured":"Xilinx. 2021. Versal ACAP DSP Engine Architecture Manual. Retrieved from https:\/\/www.xilinx.com\/support\/documentation\/architecture-manuals\/am004-versal-dsp-engine.pdf."},{"key":"e_1_3_1_12_2","unstructured":"Xilinx. 2021. Versal AI Product Selection Guide. 
Retrieved from https:\/\/www.xilinx.com\/support\/documentation\/selection-guides\/versal-ai-core-product-selection-guide.pdf."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00023"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3242897"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICFPT51103.2020.00011"},{"key":"e_1_3_1_16_2","unstructured":"Andre Xian Ming Chang, Aliasger Zaidy, Vinayak Gokhale, and Eugenio Culurciello. 2017. Compiling deep learning models for custom hardware accelerators. Retrieved from http:\/\/arxiv.org\/abs\/1708.00117."},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/HPEC43674.2020.9286183"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.231"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCAS.2017.8050809"},{"key":"e_1_3_1_21_2","doi-asserted-by":"crossref","unstructured":"Dhiraj Kalamkar, Evangelos Georganas, Sudarshan Srinivasan, Jianping Chen, Mikhail Shiryaev, and Alexander Heinecke. 2020. Optimizing deep learning recommender systems training on CPU cluster architectures. 
Retrieved from https:\/\/arXiv:cs.DC\/2005.04680.","DOI":"10.1109\/SC41405.2020.00047"},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1109\/ARITH.2018.8464695"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289602.3293927"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/IPDPSW50202.2020.00025"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL53798.2021.00029"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM48280.2020.00021"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3431920.3439293"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2019.00047"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2019.00027"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM.2019.00013"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2018.2884972"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2018.00015"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/FPL.2019.00061"},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1145\/3079856.3080221"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1109\/DAC18072.2020.9218581"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/FCCM51124.2021.00027"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.23919\/FPL.2017.8056794"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3289602.3293925"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2019.2930577"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3373087.3375311"}],"container-title":["ACM Transactions on Reconfigurable Technology and 
Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3520197","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3520197","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:10:32Z","timestamp":1750183832000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3520197"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,8,8]]},"references-count":39,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2022,12,31]]}},"alternative-id":["10.1145\/3520197"],"URL":"https:\/\/doi.org\/10.1145\/3520197","relation":{},"ISSN":["1936-7406","1936-7414"],"issn-type":[{"value":"1936-7406","type":"print"},{"value":"1936-7414","type":"electronic"}],"subject":[],"published":{"date-parts":[[2022,8,8]]},"assertion":[{"value":"2021-09-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-02-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-08-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}