{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,25]],"date-time":"2025-07-25T10:25:13Z","timestamp":1753439113894,"version":"3.41.0"},"reference-count":36,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2022,1,23]],"date-time":"2022-01-23T00:00:00Z","timestamp":1642896000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"CoCoUnit ERC Advanced Grant of the EU\u2019s Horizon 2020","award":["833057"],"award-info":[{"award-number":["833057"]}]},{"DOI":"10.13039\/501100011033","name":"Spanish State Research Agency","doi-asserted-by":"crossref","award":["PID2020-113172RB-I00"],"award-info":[{"award-number":["PID2020-113172RB-I00"]}],"id":[{"id":"10.13039\/501100011033","id-type":"DOI","asserted-by":"crossref"}]},{"name":"ICREA Academia program"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Archit. Code Optim."],"published-print":{"date-parts":[[2022,3,31]]},"abstract":"<jats:p>Recurrent Neural Network (RNN) inference exhibits low hardware utilization due to the strict data dependencies across time-steps. Batching multiple requests can increase throughput. However, RNN batching requires a large amount of padding since the batched input sequences may vastly differ in length. Schemes that dynamically update the batch every few time-steps avoid padding. However, they require executing different RNN layers in a short time span, decreasing energy efficiency. Hence, we propose E-BATCH, a low-latency and energy-efficient batching scheme tailored to RNN accelerators. It consists of a runtime system and effective hardware support. The runtime concatenates multiple sequences to create large batches, resulting in substantial energy savings. 
Furthermore, the accelerator notifies it when the evaluation of an input sequence is done. Hence, a new input sequence can be immediately added to a batch, thus largely reducing the amount of padding. E-BATCH dynamically controls the number of time-steps evaluated per batch to achieve the best trade-off between latency and energy efficiency for the given hardware platform. We evaluate E-BATCH on top of E-PUR and TPU. E-BATCH improves throughput by 1.8\u00d7 and energy efficiency by 3.6\u00d7 in E-PUR, whereas in TPU, it improves throughput by 2.1\u00d7 and energy efficiency by 1.6\u00d7, over the state-of-the-art.<\/jats:p>","DOI":"10.1145\/3499757","type":"journal-article","created":{"date-parts":[[2022,1,24]],"date-time":"2022-01-24T05:49:00Z","timestamp":1643003340000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["E-BATCH: Energy-Efficient and High-Throughput RNN Batching"],"prefix":"10.1145","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1134-9908","authenticated-orcid":false,"given":"Franyell","family":"Silfa","sequence":"first","affiliation":[{"name":"Universitat Polit\u00e8cnica de Catalunya, Barcelona, Spain"}]},{"given":"Jose Maria","family":"Arnau","sequence":"additional","affiliation":[{"name":"Universitat Polit\u00e8cnica de Catalunya, Barcelona, Spain"}]},{"given":"Antonio","family":"Gonz\u00e1lez","sequence":"additional","affiliation":[{"name":"Universitat Polit\u00e8cnica de Catalunya, Barcelona, Spain"}]}],"member":"320","published-online":{"date-parts":[[2022,1,23]]},"reference":[{"key":"e_1_3_1_2_2","doi-asserted-by":"publisher","DOI":"10.5555\/3026877.3026899"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISQED.2006.102"},{"key":"e_1_3_1_4_2","doi-asserted-by":"publisher","DOI":"10.5555\/3045390.3045410"},{"key":"e_1_3_1_5_2","doi-asserted-by":"crossref","unstructured":"Denny Britz Anna Goldie Minh-Thang Luong and Quoc 
V. Le. 2017. Massive exploration of neural machine translation architectures. CoRR abs\/1703.03906 (2017). arXiv:1703.03906 http:\/\/arxiv.org\/abs\/1703.03906","DOI":"10.18653\/v1\/D17-1151"},{"key":"e_1_3_1_6_2","unstructured":"Tianqi Chen Mu Li Yutian Li Min Lin Naiyan Wang Minjie Wang Tianjun Xiao Bing Xu Chiyuan Zhang and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs\/1512.01274 (2015). arXiv:1512.01274 http:\/\/arxiv.org\/abs\/1512.01274"},{"key":"e_1_3_1_7_2","unstructured":"Sharan Chetlur Cliff Woolley Philippe Vandermersch Jonathan Cohen John Tran Bryan Catanzaro and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. CoRR abs\/1410.0759 (2014). arXiv:1410.0759 http:\/\/arxiv.org\/abs\/1410.0759"},{"key":"e_1_3_1_8_2","unstructured":"Kyunghyun Cho Bart van Merrienboer \u00c7aglar G\u00fcl\u00e7ehre Fethi Bougares Holger Schwenk and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs\/1406.1078 (2014). 
arXiv:1406.1078 http:\/\/arxiv.org\/abs\/1406.1078"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00012"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1145\/3190508.3190541"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASPDAC.2017.7858394"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/PACT.2019.00009"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3020078.3021745"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1145\/3302424.3303949"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3140659.3080246"},{"key":"e_1_3_1_17_2","volume-title":"2016 IEEE First International Conference on Data Stream Mining & Processing (DSMP)","author":"Khomenko Viacheslav","year":"2017","unstructured":"Viacheslav Khomenko, Oleg Shyshkov, Olga Radyvonenko, and Kostiantyn Bokhan. 2017. Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization. In 2016 IEEE First International Conference on Data Stream Mining & Processing (DSMP). IEEE, 100\u2013103."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.5555\/1661445.1661531"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.5555\/822080.822803"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2015.7404790"},{"key":"e_1_3_1_21_2","article-title":"DDR4 SDRAM","author":"Inc. Micron","unstructured":"Micron Inc.[n. d.]. DDR4 SDRAM. Retrieved 15 October 2021 from https:\/\/www.micron.com\/-\/media\/client\/global\/documents\/products\/data-sheet\/dram\/ddr4\/8gb_ddr4_sdram.pdf.","journal-title":"https:\/\/www.micron.com\/-\/media\/client\/global\/documents\/products\/data-sheet\/dram\/ddr4\/8gb_ddr4_sdram.pdf"},{"key":"e_1_3_1_22_2","article-title":"TN-53-01: LPDDR4 System Power Calculator","author":"Inc. 
Micron","unstructured":"Micron Inc.[n. d.]. TN-53-01: LPDDR4 System Power Calculator. Retrieved 15 September 2021 from https:\/\/www.micron.com\/support\/tools-and-utilities\/power-calc.","journal-title":"Retrieved 15 September 2021 from https:\/\/www.micron.com\/support\/tools-and-utilities\/power-calc"},{"key":"e_1_3_1_23_2","first-page":"22","article-title":"CACTI 6.0: A tool to model large caches","author":"Muralimanohar Naveen","year":"2009","unstructured":"Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. 2009. CACTI 6.0: A tool to model large caches. HP Laboratories (2009), 22\u201331.","journal-title":"HP Laboratories"},{"key":"e_1_3_1_24_2","volume-title":"Proceedings of the NIPS-W","author":"Paszke Adam","year":"2017","unstructured":"Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In Proceedings of the NIPS-W."},{"key":"e_1_3_1_25_2","volume-title":"Proceedings of the NIPS Autodiff Workshop","author":"Paszke Adam","year":"2017","unstructured":"Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. 
In Proceedings of the NIPS Autodiff Workshop."},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2019.2929742"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA.2018.00016"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.5555\/1153923.1154520"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2945397"},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1145\/3243176.3243184"},{"key":"e_1_3_1_31_2","doi-asserted-by":"publisher","DOI":"10.1145\/3352460.3358309"},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2587640"},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/2019608.2019612"},{"key":"e_1_3_1_34_2","volume-title":"Operations Research: Applications and Algorithms","author":"Winston Wayne L.","year":"2004","unstructured":"Wayne L. Winston and Jeffrey B. Goldberg. 2004. Operations Research: Applications and Algorithms. Vol. 3. Thomson Brooks\/Cole Belmont."},{"key":"e_1_3_1_35_2","unstructured":"Yonghui Wu Mike Schuster Zhifeng Chen Quoc V. Le Mohammad Norouzi Wolfgang Macherey Maxim Krikun Yuan Cao Qin Gao Klaus Macherey et\u00a0al. 2016. Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs\/1609.08144 (2016). arXiv:1609.08144 http:\/\/arxiv.org\/abs\/1609.08144"},{"key":"e_1_3_1_36_2","unstructured":"Reza Yazdani Olatunji Ruwase Minjia Zhang Yuxiong He Jose-Maria Arnau and Antonio Gonz\u00e1lez. 2019. LSTM-Sharp: An Adaptable Energy-Efficient Hardware Accelerator for Long Short-Term Memory. CoRR abs\/1911.01258 (2019). arXiv:1911.01258 http:\/\/arxiv.org\/abs\/1911.01258"},{"key":"e_1_3_1_37_2","first-page":"951","volume-title":"Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference","author":"Zhang Minjia","year":"2018","unstructured":"Minjia Zhang, Samyam Rajbhandari, Wenhan Wang, and Yuxiong He. 2018. 
DeepCPU: Serving RNN-based deep learning models 10\u00d7 faster. In Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference. USENIX Association, 951\u2013965. Retrieved from http:\/\/dl.acm.org\/citation.cfm?id=3277355.3277446."}],"container-title":["ACM Transactions on Architecture and Code Optimization"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3499757","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3499757","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:30:38Z","timestamp":1750188638000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3499757"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,1,23]]},"references-count":36,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2022,3,31]]}},"alternative-id":["10.1145\/3499757"],"URL":"https:\/\/doi.org\/10.1145\/3499757","relation":{},"ISSN":["1544-3566","1544-3973"],"issn-type":[{"type":"print","value":"1544-3566"},{"type":"electronic","value":"1544-3973"}],"subject":[],"published":{"date-parts":[[2022,1,23]]},"assertion":[{"value":"2021-05-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2021-11-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-01-23","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}