{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,16]],"date-time":"2026-06-16T15:07:59Z","timestamp":1781622479162,"version":"3.54.5"},"reference-count":142,"publisher":"Association for Computing Machinery (ACM)","issue":"14s","license":[{"start":{"date-parts":[[2023,7,17]],"date-time":"2023-07-17T00:00:00Z","timestamp":1689552000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Natural Sciences and Engineering Research Council of Canada (NSERC), Prompt, Ericsson, Ciena, and EfficiOS"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Comput. Surv."],"published-print":{"date-parts":[[2023,12,31]]},"abstract":"<jats:p>Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et\u00a0al. introduced the Transformer, a model solely based on the attention mechanism that is able to relate any two positions of the input sequence, hence modelling arbitrary long dependencies. The Transformer has improved the state-of-the-art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the models\u2019 efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed-precision, and knowledge distillation. Recently, researchers have directly addressed the Transformer\u2019s limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. 
However, due to the wide range of solutions, it has become challenging for researchers and practitioners to determine which methods to apply in practice to meet the desired tradeoff between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make Transformers faster and lighter and by providing a comprehensive explanation of the methods\u2019 strengths, limitations, and underlying assumptions.<\/jats:p>","DOI":"10.1145\/3586074","type":"journal-article","created":{"date-parts":[[2023,3,4]],"date-time":"2023-03-04T11:36:45Z","timestamp":1677929805000},"page":"1-40","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":109,"title":["A Practical Survey on Faster and Lighter Transformers"],"prefix":"10.1145","volume":"55","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1036-0777","authenticated-orcid":false,"given":"Quentin","family":"Fournier","sequence":"first","affiliation":[{"name":"Polytechnique Montr\u00e9al, Canada"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-7590-7421","authenticated-orcid":false,"given":"Ga\u00e9tan Marceau","family":"Caron","sequence":"additional","affiliation":[{"name":"Mila - Quebec AI Institute, Canada"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9876-2921","authenticated-orcid":false,"given":"Daniel","family":"Aloise","sequence":"additional","affiliation":[{"name":"Polytechnique Montr\u00e9al, Canada"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2023,7,17]]},"reference":[{"key":"e_1_3_3_2_2","unstructured":"Mart\u00edn Abadi Ashish Agarwal Paul Barham Eugene Brevdo Zhifeng Chen Craig Citro et\u00a0al. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 
Retrieved from https:\/\/www.tensorflow.org\/."},{"key":"e_1_3_3_3_2","volume-title":"NIPS","author":"Ba Jimmy","year":"2014","unstructured":"Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In NIPS, Vol. 27."},{"key":"e_1_3_3_4_2","volume-title":"ICLR","author":"Bahdanau Dzmitry","year":"2015","unstructured":"Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR."},{"key":"e_1_3_3_5_2","volume-title":"ICLR","author":"Bello Irwan","year":"2021","unstructured":"Irwan Bello. 2021. LambdaNetworks: Modeling long-range interactions without attention. In ICLR."},{"key":"e_1_3_3_6_2","first-page":"arXiv:2004.0515","article-title":"Longformer: The long-document transformer","author":"Beltagy Iz","year":"2020","unstructured":"Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv e-prints (2020), arXiv:2004.05150.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_7_2","first-page":"1","volume-title":"SLSP","author":"Bengio Yoshua","year":"2013","unstructured":"Yoshua Bengio. 2013. Deep learning of representations: Looking forward. In SLSP, Vol. 7978, 1\u201337."},{"key":"e_1_3_3_8_2","article-title":"Estimating or propagating gradients through stochastic neurons","volume":"1305","author":"Bengio Yoshua","year":"2013","unstructured":"Yoshua Bengio. 2013. Estimating or propagating gradients through stochastic neurons. CoRR abs\/1305.2982 (2013).","journal-title":"CoRR"},{"key":"e_1_3_3_9_2","first-page":"1533","volume-title":"EMNLP","author":"Berant Jonathan","year":"2013","unstructured":"Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on freebase from question-answer pairs. In EMNLP. 
1533\u20131544."},{"key":"e_1_3_3_10_2","first-page":"12","volume-title":"SIGMT","author":"Bojar Ondrej","year":"2014","unstructured":"Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, et\u00a0al. 2014. Findings of the 2014 workshop on statistical machine translation. In SIGMT. 12\u201358."},{"key":"e_1_3_3_11_2","first-page":"1877","volume-title":"NeurIPS","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, et\u00a0al. 2020. Language models are few-shot learners. In NeurIPS, Vol. 33, 1877\u20131901."},{"key":"e_1_3_3_12_2","first-page":"503","volume-title":"ICPP","author":"Buluc A.","year":"2008","unstructured":"A. Buluc and J. R. Gilbert. 2008. Challenges and advances in parallel sparse matrix-matrix multiplication. In ICPP. 503\u2013510."},{"key":"e_1_3_3_13_2","first-page":"213","volume-title":"ECCV","author":"Carion Nicolas","year":"2020","unstructured":"Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV. 213\u2013229."},{"key":"e_1_3_3_14_2","first-page":"4960","volume-title":"ICASSP","author":"Chan William","year":"2016","unstructured":"William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP. 4960\u20134964."},{"key":"e_1_3_3_15_2","article-title":"Training deep nets with sublinear memory cost","volume":"1604","author":"Chen Tianqi","year":"2016","unstructured":"Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. CoRR abs\/1604.06174 (2016).","journal-title":"CoRR"},{"key":"e_1_3_3_16_2","first-page":"551","volume-title":"EMNLP","author":"Cheng Jianpeng","year":"2016","unstructured":"Jianpeng Cheng, Li Dong, and Mirella Lapata. 
2016. Long short-term memory-networks for machine reading. In EMNLP. 551\u2013561."},{"key":"e_1_3_3_17_2","article-title":"Generating long sequences with sparse transformers","volume":"1904","author":"Child Rewon","year":"2019","unstructured":"Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. CoRR abs\/1904.10509 (2019).","journal-title":"CoRR"},{"key":"e_1_3_3_18_2","first-page":"1724","volume-title":"EMNLP","author":"Cho Kyunghyun","year":"2014","unstructured":"Kyunghyun Cho, Bart van Merri\u00ebnboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, et\u00a0al. 2014. Learning phrase representations using RNN encoder\u2013decoder for statistical machine translation. In EMNLP. 1724\u20131734."},{"key":"e_1_3_3_19_2","volume-title":"ICLR","author":"Choromanski Krzysztof Marcin","year":"2021","unstructured":"Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, et\u00a0al. 2021. Rethinking attention with performers. In ICLR."},{"key":"e_1_3_3_20_2","volume-title":"ICLR","author":"Clark Kevin","year":"2020","unstructured":"Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR."},{"key":"e_1_3_3_21_2","first-page":"2174","volume-title":"EMNLP-IJCNLP","author":"Correia Gon\u00e7alo M.","year":"2019","unstructured":"Gon\u00e7alo M. Correia, Vlad Niculae, and Andr\u00e9 F. T. Martins. 2019. Adaptively sparse transformers. In EMNLP-IJCNLP. 2174\u20132184."},{"key":"e_1_3_3_22_2","volume-title":"NeurIPS","author":"Dai Zihang","year":"2020","unstructured":"Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le. 2020. Funnel-transformer: Filtering out sequential redundancy for efficient language processing. 
In NeurIPS."},{"key":"e_1_3_3_23_2","first-page":"2978","volume-title":"ACL","author":"Dai Zihang","year":"2019","unstructured":"Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL. 2978\u20132988."},{"key":"e_1_3_3_24_2","volume-title":"ICLR","author":"Dehghani Mostafa","year":"2019","unstructured":"Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In ICLR."},{"key":"e_1_3_3_25_2","first-page":"4171","volume-title":"NAACL-HLT","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT. 4171\u20134186."},{"key":"e_1_3_3_26_2","volume-title":"ICLR","author":"Dinh Laurent","year":"2015","unstructured":"Laurent Dinh, David Krueger, and Yoshua Bengio. 2015. NICE: Non-linear independent components estimation. In ICLR."},{"key":"e_1_3_3_27_2","volume-title":"ICLR","author":"Dinh Laurent","year":"2017","unstructured":"Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. 2017. Density estimation using real NVP. In ICLR."},{"key":"e_1_3_3_28_2","volume-title":"ICLR","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, et\u00a0al. 2021. An image is worth 16 \u00d7 16 words: Transformers for image recognition at scale. In ICLR."},{"key":"e_1_3_3_29_2","volume-title":"ICLR","author":"Elbayad Maha","year":"2020","unstructured":"Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2020. Depth-adaptive transformer. 
In ICLR."},{"issue":"55","key":"e_1_3_3_30_2","first-page":"1","article-title":"Neural architecture search: A survey","volume":"20","author":"Elsken Thomas","year":"2019","unstructured":"Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. 2019. Neural architecture search: A survey. J. Mach. Learn. Res. 20, 55 (2019), 1\u201321.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_3_31_2","first-page":"arXiv:2101.0396","article-title":"Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity","author":"Fedus William","year":"2021","unstructured":"William Fedus, Barret Zoph, and Noam Shazeer. 2021. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv e-prints (2021), arXiv:2101.03961.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_32_2","first-page":"120","volume-title":"MSR","author":"Fournier Quentin","year":"2021","unstructured":"Quentin Fournier, Daniel Aloise, Seyed Vahid Azhari, and Fran\u00e7ois Tetreault. 2021. On improving deep learning trace analysis with system call arguments. In MSR. 120\u2013130."},{"key":"e_1_3_3_33_2","volume-title":"ICLR","author":"Frankle Jonathan","year":"2019","unstructured":"Jonathan Frankle and Michael Carbin. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR."},{"key":"e_1_3_3_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2020.3019893"},{"key":"e_1_3_3_35_2","first-page":"249","volume-title":"AISTATS","author":"Glorot Xavier","year":"2010","unstructured":"Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, Vol. 9, 249\u2013256."},{"key":"e_1_3_3_36_2","volume-title":"NeurIPS","author":"Gomez Aidan N.","year":"2017","unstructured":"Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. 2017. The reversible residual network: Backpropagation without storing activations. In NeurIPS, Vol. 
30."},{"key":"e_1_3_3_37_2","volume-title":"Deep Learning","author":"Goodfellow Ian","year":"2016","unstructured":"Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Retrieved from http:\/\/www.deeplearningbook.org."},{"key":"e_1_3_3_38_2","article-title":"Recurrent independent mechanisms","volume":"1909","author":"Goyal Anirudh","year":"2019","unstructured":"Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, et\u00a0al. 2019. Recurrent independent mechanisms. CoRR abs\/1909.10893 (2019).","journal-title":"CoRR"},{"key":"e_1_3_3_39_2","article-title":"Adaptive computation time for recurrent neural networks","volume":"1603","author":"Graves Alex","year":"2016","unstructured":"Alex Graves. 2016. Adaptive computation time for recurrent neural networks. CoRR abs\/1603.08983 (2016).","journal-title":"CoRR"},{"key":"e_1_3_3_40_2","unstructured":"Scott Gray Alec Radford and Diederik P. Kingma. 2017. GPU kernels for block-sparse weights. https:\/\/cdn.openai.com\/blocksparse\/blocksparsepaper.pdf."},{"key":"e_1_3_3_41_2","first-page":"5036","volume-title":"Interspeech","author":"Gulati Anmol","year":"2020","unstructured":"Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, et\u00a0al. 2020. Conformer: Convolution-augmented transformer for speech recognition. In Interspeech. 5036\u20135040."},{"key":"e_1_3_3_42_2","first-page":"1315","volume-title":"NAACL-HLT","author":"Guo Qipeng","year":"2019","unstructured":"Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-transformer. In NAACL-HLT. 1315\u20131325."},{"key":"e_1_3_3_43_2","first-page":"770","volume-title":"CVPR","author":"He K.","year":"2016","unstructured":"K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In CVPR. 
770\u2013778."},{"key":"e_1_3_3_44_2","first-page":"arXiv:1503.0253","article-title":"Distilling the knowledge in a neural network","author":"Hinton Geoffrey","year":"2015","unstructured":"Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv e-prints (2015), arXiv:1503.02531.","journal-title":"arXiv e-prints"},{"issue":"8","key":"e_1_3_3_45_2","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. Neural Computat. 9, 8 (1997), 1735\u20131780.","journal-title":"Neural Computat."},{"key":"e_1_3_3_46_2","article-title":"The hardware lottery","volume":"2009","author":"Hooker Sara","year":"2020","unstructured":"Sara Hooker. 2020. The hardware lottery. CoRR abs\/2009.06489 (2020).","journal-title":"CoRR"},{"key":"e_1_3_3_47_2","volume-title":"ICLR","author":"Huang Cheng-Zhi Anna","year":"2019","unstructured":"Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, et\u00a0al. 2019. Music transformer. In ICLR."},{"key":"e_1_3_3_48_2","first-page":"4475","volume-title":"ICML","author":"Huang Xiao Shi","year":"2020","unstructured":"Xiao Shi Huang, Felipe P\u00e9rez, Jimmy Ba, and Maksims Volkovs. 2020. Improving transformer optimization through better initialization. In ICML, Vol. 119, 4475\u20134483."},{"key":"e_1_3_3_49_2","volume-title":"NeurIPS","author":"Huang Yanping","year":"2019","unstructured":"Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, et\u00a0al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In NeurIPS, Vol. 32."},{"key":"e_1_3_3_50_2","unstructured":"IEA. 2018. World gross electricity production by source 2018. 
Retrieved from https:\/\/www.iea.org\/data-and-statistics\/charts\/world-gross-electricity-production-by-source-2018."},{"key":"e_1_3_3_51_2","first-page":"448","volume-title":"ICML","author":"Ioffe Sergey","year":"2015","unstructured":"Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, Vol. 37, 448\u2013456."},{"key":"e_1_3_3_52_2","first-page":"2704","volume-title":"IEEE\/CVF","author":"Jacob B.","year":"2018","unstructured":"B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, et\u00a0al. 2018. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In IEEE\/CVF. 2704\u20132713."},{"key":"e_1_3_3_53_2","doi-asserted-by":"crossref","first-page":"79","DOI":"10.1162\/neco.1991.3.1.79","article-title":"Adaptive mixtures of local experts","volume":"3","author":"Jacobs R. A.","year":"1991","unstructured":"R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. 1991. Adaptive mixtures of local experts. Neural Computat. 3 (1991), 79\u201387.","journal-title":"Neural Computat."},{"key":"e_1_3_3_54_2","article-title":"Attention is not explanation","volume":"1902","author":"Jain Sarthak","year":"2019","unstructured":"Sarthak Jain and Byron C. Wallace. 2019. Attention is not explanation. CoRR abs\/1902.10186 (2019).","journal-title":"CoRR"},{"key":"e_1_3_3_55_2","first-page":"4163","volume-title":"EMNLP","author":"Jiao Xiaoqi","year":"2020","unstructured":"Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, et\u00a0al. 2020. TinyBERT: Distilling BERT for natural language understanding. In EMNLP. 4163\u20134174."},{"key":"e_1_3_3_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU46091.2019.9003750"},{"key":"e_1_3_3_57_2","first-page":"5156","volume-title":"ICML","author":"Katharopoulos Angelos","year":"2020","unstructured":"Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Fran\u00e7ois Fleuret. 2020. 
Transformers are RNNs: Fast autoregressive transformers with linear attention. In ICML, Vol. 119, 5156\u20135165."},{"key":"e_1_3_3_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3505244"},{"key":"e_1_3_3_59_2","first-page":"284","volume-title":"ACL","author":"Khandelwal Urvashi","year":"2018","unstructured":"Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In ACL. 284\u2013294."},{"key":"e_1_3_3_60_2","volume-title":"ICLR","author":"Kitaev Nikita","year":"2020","unstructured":"Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In ICLR."},{"key":"e_1_3_3_61_2","first-page":"491","volume-title":"ECCV","author":"Kolesnikov Alexander","year":"2020","unstructured":"Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, et\u00a0al. 2020. Big transfer (BiT): General visual representation learning. In ECCV, Vol. 12350, 491\u2013507."},{"key":"e_1_3_3_62_2","first-page":"arXiv:2103.0033","article-title":"Transformers with competitive ensembles of independent mechanisms","author":"Lamb Alex","year":"2021","unstructured":"Alex Lamb, Di He, Anirudh Goyal, Guolin Ke, Chien-Feng Liao, Mirco Ravanelli, et\u00a0al. 2021. Transformers with competitive ensembles of independent mechanisms. arXiv e-prints (2021), arXiv:2103.00336.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_63_2","volume-title":"ICLR","author":"Lan Zhenzhong","year":"2020","unstructured":"Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR."},{"key":"e_1_3_3_64_2","first-page":"598","volume-title":"NIPS","author":"LeCun Yann","year":"1990","unstructured":"Yann LeCun, John S. Denker, and Sara A. Solla. 1990. Optimal brain damage. In NIPS. 
598\u2013605."},{"key":"e_1_3_3_65_2","first-page":"arXiv:1607.0645","article-title":"Layer normalization","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv e-prints (2016), arXiv:1607.06450.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_66_2","first-page":"arXiv:2011.1383","article-title":"Transformer-based online speech recognition with decoder-end adaptive computation steps","author":"Li Mohan","year":"2020","unstructured":"Mohan Li, Catalin Zorila, and Rama Doddipatla. 2020. Transformer-based online speech recognition with decoder-end adaptive computation steps. arXiv e-prints (2020), arXiv:2011.13834.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_67_2","first-page":"5244","volume-title":"NeurIPS","author":"Li Shiyang","year":"2019","unstructured":"Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, et\u00a0al. 2019. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In NeurIPS. 5244\u20135254."},{"key":"e_1_3_3_68_2","first-page":"14544","volume-title":"NeurIPS","author":"Li Zhiyuan","year":"2020","unstructured":"Zhiyuan Li, Kaifeng Lyu, and Sanjeev Arora. 2020. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate. In NeurIPS, Vol. 33, 14544\u201314555."},{"key":"e_1_3_3_69_2","unstructured":"Tianyang Lin Yuxin Wang Xiangyang Liu and Xipeng Qiu. 2021. A Survey of Transformers. arxiv:2106.04554."},{"key":"e_1_3_3_70_2","first-page":"172","volume-title":"SLT","author":"Liu Chunxi","year":"2021","unstructured":"Chunxi Liu, Frank Zhang, Duc Le, Suyoun Kim, Yatharth Saraf, and Geoffrey Zweig. 2021. Improving RNN transducer based ASR with auxiliary tasks. In SLT. 172\u2013179."},{"key":"e_1_3_3_71_2","article-title":"Pay attention to MLPs","volume":"2105","author":"Liu Hanxiao","year":"2021","unstructured":"Hanxiao Liu, Zihang Dai, David R. 
So, and Quoc V. Le. 2021. Pay attention to MLPs. ArXiv abs\/2105.08050 (2021).","journal-title":"ArXiv"},{"key":"e_1_3_3_72_2","volume-title":"ICLR","author":"Liu Hanxiao","year":"2019","unstructured":"Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019. DARTS: Differentiable architecture search. In ICLR."},{"key":"e_1_3_3_73_2","volume-title":"ICLR","author":"Liu Liyuan","year":"2020","unstructured":"Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, et\u00a0al. 2020. On the variance of the adaptive learning rate and beyond. In ICLR."},{"key":"e_1_3_3_74_2","first-page":"5747","volume-title":"EMNLP","author":"Liu Liyuan","year":"2020","unstructured":"Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. 2020. Understanding the difficulty of training transformers. In EMNLP. 5747\u20135763."},{"key":"e_1_3_3_75_2","article-title":"RoBERTa: A robustly optimized BERT pretraining approach","volume":"1907","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, et\u00a0al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs\/1907.11692 (2019).","journal-title":"CoRR"},{"issue":"2","key":"e_1_3_3_76_2","doi-asserted-by":"crossref","first-page":"48","DOI":"10.1109\/MTS.2020.2991496","article-title":"Estimating carbon emissions of artificial intelligence [opinion]","volume":"39","author":"Luccioni Alexandra","year":"2020","unstructured":"Alexandra Luccioni, Alexandre Lacoste, and Victor Schmidt. 2020. Estimating carbon emissions of artificial intelligence [opinion]. IEEE Technol. Societ. Mag. 39, 2 (2020), 48\u201351.","journal-title":"IEEE Technol. Societ. Mag."},{"key":"e_1_3_3_77_2","volume-title":"ICLR","author":"Maddison Chris J.","year":"2017","unstructured":"Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. 
In ICLR."},{"key":"e_1_3_3_78_2","unstructured":"Matt Mahoney. 2011. Large Text Compression Benchmark. Retrieved from http:\/\/mattmahoney.net\/dc\/text.html."},{"key":"e_1_3_3_79_2","volume-title":"ICLR","author":"Merity Stephen","year":"2017","unstructured":"Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2017. Pointer Sentinel Mixture models. In ICLR."},{"key":"e_1_3_3_80_2","volume-title":"NeurIPS","author":"Michel Paul","year":"2019","unstructured":"Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In NeurIPS, Vol. 32."},{"key":"e_1_3_3_81_2","volume-title":"ICLR","author":"Micikevicius Paulius","year":"2018","unstructured":"Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, et\u00a0al. 2018. Mixed precision training. In ICLR."},{"key":"e_1_3_3_82_2","volume-title":"NAACL","author":"Nangia Nikita","year":"2018","unstructured":"Nikita Nangia and Samuel R. Bowman. 2018. ListOps: A diagnostic dataset for latent tree learning. In NAACL."},{"key":"e_1_3_3_83_2","first-page":"arXiv:2102.1197","article-title":"Do transformer modifications transfer across implementations and applications?","author":"Narang Sharan","year":"2021","unstructured":"Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, et\u00a0al. 2021. Do transformer modifications transfer across implementations and applications? arXiv e-prints (2021), arXiv:2102.11972.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_84_2","first-page":"1797","volume-title":"EMNLP","author":"Narayan Shashi","year":"2018","unstructured":"Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don\u2019t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In EMNLP. 1797\u20131807."},{"key":"e_1_3_3_85_2","unstructured":"OpenAI. 2013. Saving memory using gradient-checkpointing. 
Retrieved from https:\/\/github.com\/openai\/gradient-checkpointing."},{"key":"e_1_3_3_86_2","doi-asserted-by":"crossref","unstructured":"Myle Ott Sergey Edunov David Grangier and Michael Auli. 2018. Scaling neural machine translation. In ML . 1\u20139.","DOI":"10.18653\/v1\/W18-6301"},{"key":"e_1_3_3_87_2","first-page":"6879","volume-title":"ICASSP","author":"Park Daniel S.","year":"2020","unstructured":"Daniel S. Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, et\u00a0al. 2020. SpecAugment on large scale datasets. In ICASSP. 6879\u20136883."},{"key":"e_1_3_3_88_2","first-page":"8024","volume-title":"NeurIPS","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, et\u00a0al. 2019. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS. 8024\u20138035."},{"key":"e_1_3_3_89_2","first-page":"4092","volume-title":"ICML","author":"Pham Hieu","year":"2018","unstructured":"Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. 2018. Efficient neural architecture search via parameter sharing. In ICML, Vol. 80, 4092\u20134101."},{"key":"e_1_3_3_90_2","first-page":"3208","volume-title":"EMNLP","author":"Prasanna Sai","year":"2020","unstructured":"Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT plays the lottery, all tickets are winning. In EMNLP. 3208\u20133229."},{"key":"e_1_3_3_91_2","first-page":"2555","volume-title":"EMNLP","author":"Qiu Jiezhong","year":"2020","unstructured":"Jiezhong Qiu, Hao Ma, Omer Levy, Wen-tau Yih, Sinong Wang, and Jie Tang. 2020. Blockwise self-attention for long document understanding. In EMNLP. 2555\u20132565."},{"key":"e_1_3_3_92_2","unstructured":"Alec Radford and Karthik Narasimhan. 2018. Improving language understanding by generative pre-training. 
https:\/\/cdn.openai.com\/research-covers\/language-unsupervised\/language_understanding_paper.pdf."},{"key":"e_1_3_3_93_2","unstructured":"Alec Radford Jeffrey Wu Rewon Child David Luan Dario Amodei and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. (2019). https:\/\/cdn.openai.com\/better-language-models\/language_models_are_unsupervised_multitask_learners.pdf."},{"key":"e_1_3_3_94_2","volume-title":"ICLR","author":"Rae Jack W.","year":"2020","unstructured":"Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In ICLR."},{"key":"e_1_3_3_95_2","first-page":"2383","volume-title":"EMNLP","author":"Rajpurkar Pranav","year":"2016","unstructured":"Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP. 2383\u20132392."},{"key":"e_1_3_3_96_2","first-page":"4780","volume-title":"AAAI","author":"Real Esteban","year":"2019","unstructured":"Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. 2019. Regularized evolution for image classifier architecture search. In AAAI. 4780\u20134789."},{"key":"e_1_3_3_97_2","first-page":"53","article-title":"Efficient content-based sparse attention with routing transformers","volume":"9","author":"Roy Aurko","year":"2021","unstructured":"Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2021. Efficient content-based sparse attention with routing transformers. Trans. Assoc. Computat. Ling. 9 (2021), 53\u201368.","journal-title":"Trans. Assoc. Computat. Ling."},{"key":"e_1_3_3_98_2","first-page":"7","volume-title":"Parallel Distributed Processing: Explorations in the Microstructure, Vol. 2: Psychological and Biological Models","author":"Rumelhart D. E.","year":"1986","unstructured":"D. E. Rumelhart, P. Smolensky, J. L. McClelland, and G. E. Hinton. 1986. 
Schemata and sequential thought processes in PDP models. Parallel Distributed Processing: Explorations in the Microstructure, Vol. 2: Psychological and Biological Models, MIT Press, Cambridge, MA, 7-57."},{"key":"e_1_3_3_99_2","first-page":"arXiv:2004.0384","article-title":"On the effect of dropping layers of pre-trained transformer models","author":"Sajjad Hassan","year":"2020","unstructured":"Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. 2020. On the effect of dropping layers of pre-trained transformer models. arXiv e-prints (2020), arXiv:2004.03844.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_100_2","article-title":"DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter","volume":"1910","author":"Sanh Victor","year":"2019","unstructured":"Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. CoRR abs\/1910.01108 (2019).","journal-title":"CoRR"},{"key":"e_1_3_3_101_2","volume-title":"NeurIPS","author":"Santurkar Shibani","year":"2018","unstructured":"Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. 2018. How does batch normalization help optimization? In NeurIPS, Vol. 31."},{"key":"e_1_3_3_102_2","first-page":"2931","volume-title":"ACL","author":"Serrano Sofia","year":"2019","unstructured":"Sofia Serrano and Noah A. Smith. 2019. Is attention interpretable? In ACL. 2931\u20132951."},{"key":"e_1_3_3_103_2","first-page":"464","volume-title":"NAACL-HLT","author":"Shaw Peter","year":"2018","unstructured":"Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In NAACL-HLT. 464\u2013468."},{"key":"e_1_3_3_104_2","volume-title":"ICLR","author":"Shazeer Noam","year":"2017","unstructured":"Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, et\u00a0al. 2017. 
Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR."},{"key":"e_1_3_3_105_2","first-page":"9547","volume-title":"ICML","author":"Shi Han","year":"2021","unstructured":"Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, et\u00a0al. 2021. SparseBERT: Rethinking the importance analysis in self-attention. In ICML, Vol. 139, 9547\u20139557."},{"key":"e_1_3_3_106_2","first-page":"5877","volume-title":"ICML","author":"So David R.","year":"2019","unstructured":"David R. So, Quoc V. Le, and Chen Liang. 2019. The evolved transformer. In ICML, Vol. 97, 5877\u20135886."},{"issue":"1","key":"e_1_3_3_107_2","first-page":"1929","article-title":"Dropout: A simple way to prevent neural networks from overfitting","volume":"15","author":"Srivastava Nitish","year":"2014","unstructured":"Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1 (2014), 1929\u20131958.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_3_3_108_2","volume-title":"ICLR","author":"Stock Pierre","year":"2021","unstructured":"Pierre Stock, Angela Fan, Benjamin Graham, Edouard Grave, R\u00e9mi Gribonval, Herve Jegou, et\u00a0al. 2021. Training with quantization noise for extreme model compression. In ICLR."},{"key":"e_1_3_3_109_2","article-title":"Energy and policy considerations for deep learning in NLP","volume":"1906","author":"Strubell Emma","year":"2019","unstructured":"Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. CoRR abs\/1906.02243 (2019).","journal-title":"CoRR"},{"issue":"09","key":"e_1_3_3_110_2","first-page":"13693","article-title":"Energy and policy considerations for modern deep learning research","volume":"34","author":"Strubell Emma","year":"2020","unstructured":"Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2020. 
Energy and policy considerations for modern deep learning research. Proc. AAAI Conf. Artif. Intell. 34, 09 (2020), 13693\u201313696.","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"e_1_3_3_111_2","first-page":"21","volume-title":"ICASSP","author":"Subakan Cem","year":"2021","unstructured":"Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong. 2021. Attention is all you need in speech separation. In ICASSP. 21\u201325."},{"key":"e_1_3_3_112_2","first-page":"331","volume-title":"ACL","author":"Sukhbaatar Sainbayar","year":"2019","unstructured":"Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. In ACL. 331\u2013335."},{"key":"e_1_3_3_113_2","first-page":"3104","volume-title":"NIPS","author":"Sutskever Ilya","year":"2014","unstructured":"Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS. 3104\u20133112."},{"key":"e_1_3_3_114_2","first-page":"arXiv:2005.0074","article-title":"Synthesizer: Rethinking self-attention in transformer models","author":"Tay Yi","year":"2020","unstructured":"Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, and Che Zheng. 2020. Synthesizer: Rethinking self-attention in transformer models. arXiv e-prints (2020), arXiv:2005.00743.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_115_2","first-page":"9438","volume-title":"ICML","author":"Tay Yi","year":"2020","unstructured":"Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. 2020. Sparse sinkhorn attention. In ICML, Vol. 119, 9438\u20139447."},{"key":"e_1_3_3_116_2","volume-title":"ICLR","author":"Tay Yi","year":"2021","unstructured":"Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, et\u00a0al. 2021. Long range arena: A benchmark for efficient transformers. 
In ICLR."},{"key":"e_1_3_3_117_2","article-title":"Efficient transformers: A survey","volume":"2009","author":"Tay Yi","year":"2020","unstructured":"Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. CoRR abs\/2009.06732 (2020).","journal-title":"CoRR"},{"issue":"4","key":"e_1_3_3_118_2","doi-asserted-by":"crossref","first-page":"415","DOI":"10.1177\/107769905303000401","article-title":"\u201cCloze Procedure\u201d: A new tool for measuring readability","volume":"30","author":"Taylor Wilson L.","year":"1953","unstructured":"Wilson L. Taylor. 1953. \u201cCloze Procedure\u201d: A new tool for measuring readability. Journal. Quart. 30, 4 (1953), 415\u2013433.","journal-title":"Journal. Quart."},{"key":"e_1_3_3_119_2","article-title":"MLP-Mixer: An all-MLP architecture for vision","volume":"2105","author":"Tolstikhin Ilya O.","year":"2021","unstructured":"Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, et\u00a0al. 2021. MLP-Mixer: An all-MLP architecture for vision. CoRR abs\/2105.01601 (2021).","journal-title":"CoRR"},{"key":"e_1_3_3_120_2","first-page":"arXiv:2008.0680","article-title":"Finding fast transformers: One-shot neural architecture search by component composition","author":"Tsai Henry","year":"2020","unstructured":"Henry Tsai, Jayden Ooi, Chun-Sung Ferng, Hyung Won Chung, and Jason Riesa. 2020. Finding fast transformers: One-shot neural architecture search by component composition. arXiv e-prints (2020), arXiv:2008.06808.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_121_2","first-page":"3632","volume-title":"EMNLP-IJCNLP","author":"Tsai Henry","year":"2019","unstructured":"Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. 2019. Small and practical BERT models for sequence labeling. In EMNLP-IJCNLP. 
3632\u20133636."},{"key":"e_1_3_3_122_2","first-page":"5998","volume-title":"NIPS","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et\u00a0al. 2017. Attention is all you need. In NIPS. 5998\u20136008."},{"key":"e_1_3_3_123_2","volume-title":"NeurIPS","author":"Vyas Apoorv","year":"2020","unstructured":"Apoorv Vyas, Angelos Katharopoulos, and Fran\u00e7ois Fleuret. 2020. Fast transformers with clustered attention. In NeurIPS."},{"key":"e_1_3_3_124_2","volume-title":"NeurIPS","author":"Wang Alex","year":"2019","unstructured":"Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, et\u00a0al. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, Vol. 32."},{"key":"e_1_3_3_125_2","first-page":"353","volume-title":"EMNLP","author":"Wang Alex","year":"2018","unstructured":"Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In EMNLP. 353\u2013355."},{"key":"e_1_3_3_126_2","first-page":"arXiv:2002.0617","article-title":"Transformer on a diet","author":"Wang Chenguang","year":"2020","unstructured":"Chenguang Wang, Zihao Ye, Aston Zhang, Zheng Zhang, and Alexander J. Smola. 2020. Transformer on a diet. arXiv e-prints (2020), arXiv:2002.06170.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_127_2","first-page":"arXiv:2006.0476","article-title":"Linformer: Self-attention with linear complexity","author":"Wang Sinong","year":"2020","unstructured":"Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. 
arXiv e-prints (2020), arXiv:2006.04768.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_128_2","first-page":"744","volume-title":"CCGRID","author":"Wang Yuxin","year":"2020","unstructured":"Yuxin Wang, Qiang Wang, Shaohuai Shi, Xin He, Zhenheng Tang, Kaiyong Zhao, et\u00a0al. 2020. Benchmarking the performance and energy efficiency of AI accelerators for AI training. In CCGRID. 744\u2013751."},{"key":"e_1_3_3_129_2","unstructured":"CoRR"},{"key":"e_1_3_3_130_2","first-page":"625","article-title":"Neural network acceptability judgments","volume":"7","author":"Warstadt Alex","year":"2019","unstructured":"Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural network acceptability judgments. Trans. Assoc. Computat. Ling. 7 (2019), 625\u2013641.","journal-title":"Trans. Assoc. Computat. Ling."},{"key":"e_1_3_3_131_2","unstructured":"Lilian Weng. 2018. Attention? Attention! Retrieved from http:\/\/lilianweng.github.io\/lil-log\/2018\/06\/24\/attention-attention.html."},{"key":"e_1_3_3_132_2","first-page":"11","volume-title":"EMNLP-IJCNLP","author":"Wiegreffe Sarah","year":"2019","unstructured":"Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In EMNLP-IJCNLP. 11\u201320."},{"key":"e_1_3_3_133_2","first-page":"38","volume-title":"EMNLP","author":"Wolf Thomas","year":"2020","unstructured":"Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, et\u00a0al. 2020. Transformers: State-of-the-art natural language processing. In EMNLP. 38\u201345."},{"key":"e_1_3_3_134_2","volume-title":"ICLR","author":"Wu Zhanghao","year":"2020","unstructured":"Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2020. Lite transformer with long-short range attention. In ICLR."},{"key":"e_1_3_3_135_2","first-page":"10684","volume-title":"CVPR","author":"Xie Qizhe","year":"2020","unstructured":"Qizhe Xie, Minh-Thang Luong, Eduard H. Hovy, and Quoc V. Le. 2020. 
Self-training with noisy student improves ImageNet classification. In CVPR. 10684\u201310695."},{"key":"e_1_3_3_136_2","first-page":"10524","volume-title":"ICML","author":"Xiong Ruibin","year":"2020","unstructured":"Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, et\u00a0al. 2020. On layer normalization in the transformer architecture. In ICML, Vol. 119, 10524\u201310533."},{"key":"e_1_3_3_137_2","first-page":"arXiv:2102.0390","article-title":"Nystr\u00f6mformer: A Nystr\u00f6m-based algorithm for approximating self-attention","author":"Xiong Yunyang","year":"2021","unstructured":"Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, et\u00a0al. 2021. Nystr\u00f6mformer: A Nystr\u00f6m-based algorithm for approximating self-attention. arXiv e-prints (2021), arXiv:2102.03902.","journal-title":"arXiv e-prints"},{"key":"e_1_3_3_138_2","first-page":"10819","volume-title":"CVPR","author":"Yu Weihao","year":"2022","unstructured":"Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. 2022. MetaFormer is actually what you need for vision. In CVPR. 10819\u201310829."},{"key":"e_1_3_3_139_2","article-title":"Q8BERT: Quantized 8Bit BERT","author":"Zafrir Ofir","year":"2019","unstructured":"Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8Bit BERT. In EMC2-NIPS.","journal-title":"EMC2-NIPS"},{"key":"e_1_3_3_140_2","first-page":"17283","volume-title":"NeurIPS","author":"Zaheer Manzil","year":"2020","unstructured":"Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, et\u00a0al. 2020. Big bird: Transformers for longer sequences. In NeurIPS, Vol. 33, 17283\u201317297."},{"key":"e_1_3_3_141_2","first-page":"7829","volume-title":"ICASSP","author":"Zhang Qian","year":"2020","unstructured":"Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, et\u00a0al. 2020. 
Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-T loss. In ICASSP. 7829\u20137833."},{"key":"e_1_3_3_142_2","volume-title":"ICLR","author":"Zoph Barret","year":"2017","unstructured":"Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. In ICLR."},{"key":"e_1_3_3_143_2","first-page":"8697","volume-title":"CVPR","author":"Zoph Barret","year":"2018","unstructured":"Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. 2018. Learning transferable architectures for scalable image recognition. In CVPR. 8697\u20138710."}],"container-title":["ACM Computing Surveys"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3586074","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3586074","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T18:08:11Z","timestamp":1750183691000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3586074"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,17]]},"references-count":142,"journal-issue":{"issue":"14s","published-print":{"date-parts":[[2023,12,31]]}},"alternative-id":["10.1145\/3586074"],"URL":"https:\/\/doi.org\/10.1145\/3586074","relation":{},"ISSN":["0360-0300","1557-7341"],"issn-type":[{"value":"0360-0300","type":"print"},{"value":"1557-7341","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,7,17]]},"assertion":[{"value":"2022-01-27","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-02-23","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2023-07-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}