{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,16]],"date-time":"2026-01-16T02:25:23Z","timestamp":1768530323378,"version":"3.49.0"},"reference-count":29,"publisher":"Association for Computing Machinery (ACM)","issue":"5s","license":[{"start":{"date-parts":[[2021,9,17]],"date-time":"2021-09-17T00:00:00Z","timestamp":1631836800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Science Foundation","award":["NSF IIS-2027546"],"award-info":[{"award-number":["NSF IIS-2027546"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Embed. Comput. Syst."],"published-print":{"date-parts":[[2021,10,31]]},"abstract":"<jats:p>Multi-head self-attention (attention mechanism) has been employed in a variety of fields such as machine translation, language modeling, and image processing due to its superiority in feature extraction and sequential data analysis. This is benefited from a large number of parameters and sophisticated model architecture behind the attention mechanism. To efficiently deploy attention mechanism on resource-constrained devices, existing works propose to reduce the model size by building a customized smaller model or compressing a big standard model. A customized smaller model is usually optimized for the specific task and needs effort in model parameters exploration. Model compression reduces model size without hurting the model architecture robustness, which can be efficiently applied to different tasks. The compressed weights in the model are usually regularly shaped (e.g. rectangle) but the dimension sizes vary (e.g. differs in rectangle height and width). Such compressed attention mechanism can be efficiently deployed on CPU\/GPU platforms as their memory and computing resources can be flexibly assigned with demand. However, for Field Programmable Gate Arrays (FPGAs), the data buffer allocation and computing kernel are fixed at run time to achieve maximum energy efficiency. After compression, weights are much smaller and different in size, which leads to inefficient utilization of FPGA on-chip buffer. Moreover, the different weight heights and widths may lead to inefficient FPGA computing kernel execution. Due to the large number of weights in the attention mechanism, building a unique buffer and computing kernel for each compressed weight on FPGA is not feasible. In this work, we jointly consider the compression impact on buffer allocation and the required computing kernel during the attention mechanism compressing. A novel structural pruning method with memory footprint awareness is proposed and the associated accelerator on FPGA is designed. The experimental results show that our work can compress Transformer (an attention mechanism based model) by 95x. 
The developed accelerator can fully utilize the FPGA resources, processing the sparse attention mechanism with a run-time throughput of 1.87 TOPS on a ZCU102 FPGA.<\/jats:p>","DOI":"10.1145\/3477002","type":"journal-article","created":{"date-parts":[[2021,9,17]],"date-time":"2021-09-17T18:36:51Z","timestamp":1631903811000},"page":"1-24","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":40,"title":["Algorithm-hardware Co-design of Attention Mechanism on FPGA Devices"],"prefix":"10.1145","volume":"20","author":[{"given":"Xinyi","family":"Zhang","sequence":"first","affiliation":[{"name":"University of Pittsburgh, USA"}]},{"given":"Yawen","family":"Wu","sequence":"additional","affiliation":[{"name":"University of Pittsburgh, USA"}]},{"given":"Peipei","family":"Zhou","sequence":"additional","affiliation":[{"name":"University of Pittsburgh, USA"}]},{"given":"Xulong","family":"Tang","sequence":"additional","affiliation":[{"name":"University of Pittsburgh, USA"}]},{"given":"Jingtong","family":"Hu","sequence":"additional","affiliation":[{"name":"University of Pittsburgh, USA"}]}],"member":"320","published-online":{"date-parts":[[2021,9,17]]},"reference":[{"key":"e_1_2_1_1_1","volume-title":"Attention is all you need. arXiv preprint arXiv:1706.03762","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)."},{"key":"e_1_2_1_2_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_2_1_3_1","volume-title":"et\u00a0al","author":"Brown Tom B.","year":"2020","unstructured":"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et\u00a0al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)."},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_2_1_5_1","volume-title":"Lite transformer with long-short range attention. arXiv preprint arXiv:2004.11886","author":"Wu Zhanghao","year":"2020","unstructured":"Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2020. 
Lite transformer with long-short range attention. arXiv preprint arXiv:2004.11886 (2020)."},{"key":"e_1_2_1_7_1","volume-title":"Hat: Hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187","author":"Wang Hanrui","year":"2020","unstructured":"Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020. Hat: Hardware-aware transformers for efficient natural language processing. arXiv preprint arXiv:2005.14187 (2020)."},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3370748.3406567"},{"key":"e_1_2_1_9_1","volume-title":"Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710","author":"Li Hao","year":"2016","unstructured":"Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016)."},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1145\/3005348"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.5555\/3437539.3437784"},{"key":"e_1_2_1_12_1","unstructured":"Xilinx. 2019. Supercharge Your AI and database applications with xilinx\u2019s HBM-enabled ultrascale+ devices featuring samsung HBM2. In Xilinx white paper WP508 (v1.1.2)."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/2684746.2689060"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3007787.3001177"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3140659.3080221"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3358192"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3316781.3317757"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCAD.2020.2986127"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCD50377.2020.00086"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.3390\/s21061955"},{"key":"e_1_2_1_21_1","volume-title":"The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635","author":"Frankle Jonathan","year":"2018","unstructured":"Jonathan Frankle and Michael Carbin. 2018. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635 (2018)."},{"key":"e_1_2_1_22_1","volume-title":"Drawing early-bird tickets: Towards more efficient training of deep networks. arXiv preprint arXiv:1909.11957","author":"You Haoran","year":"2019","unstructured":"
Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. 2019. Drawing early-bird tickets: Towards more efficient training of deep networks. arXiv preprint arXiv:1909.11957 (2019)."},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.5555\/3454287.3454731"},{"key":"e_1_2_1_24_1","volume-title":"International Conference on Machine Learning. PMLR, 6682\u20136691","author":"Malach Eran","year":"2020","unstructured":"Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. 2020. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning. PMLR, 6682\u20136691."},{"key":"e_1_2_1_25_1","volume-title":"Jamie Ryan Kiros, and Geoffrey E. Hinton","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)."},{"key":"e_1_2_1_26_1","unstructured":"[n.d.]. FBGEMM. Website. https:\/\/github.com\/pytorch\/FBGEMM."},{"key":"e_1_2_1_27_1","unstructured":"Xilinx. 2017. Deep Learning with INT8 optimization on xilinx devices. In Xilinx WP486."},{"key":"e_1_2_1_28_1","unstructured":"[n.d.]. Multi30K. Website. https:\/\/github.com\/multi30k\/dataset."},{"key":"e_1_2_1_29_1","unstructured":"[n.d.]. IWSLT. Website. http:\/\/workshop2017.iwslt.org\/."},{"key":"e_1_2_1_30_1","volume-title":"Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer. arXiv preprint arXiv:2009.08605","author":"Lu Siyuan","year":"2020","unstructured":"Siyuan Lu, Meiqi Wang, Shuang Liang, Jun Lin, and Zhongfeng Wang. 2020. Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer. 
arXiv preprint arXiv:2009.08605 (2020)."}],"container-title":["ACM Transactions on Embedded Computing Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3477002","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3477002","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:30:46Z","timestamp":1750188646000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3477002"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,9,17]]},"references-count":29,"journal-issue":{"issue":"5s","published-print":{"date-parts":[[2021,10,31]]}},"alternative-id":["10.1145\/3477002"],"URL":"https:\/\/doi.org\/10.1145\/3477002","relation":{},"ISSN":["1539-9087","1558-3465"],"issn-type":[{"value":"1539-9087","type":"print"},{"value":"1558-3465","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,9,17]]},"assertion":[{"value":"2021-04-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-07-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-09-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}