{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,14]],"date-time":"2026-01-14T18:37:27Z","timestamp":1768415847169,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":44,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,8,4]],"date-time":"2023-08-04T00:00:00Z","timestamp":1691107200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,8,6]]},"DOI":"10.1145\/3580305.3599284","type":"proceedings-article","created":{"date-parts":[[2023,8,4]],"date-time":"2023-08-04T18:10:58Z","timestamp":1691172658000},"page":"1280-1290","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-3557-4922","authenticated-orcid":false,"given":"Junyan","family":"Li","sequence":"first","affiliation":[{"name":"Zhejiang University, Hangzhou, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4465-1628","authenticated-orcid":false,"given":"Li Lyna","family":"Zhang","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9186-619X","authenticated-orcid":false,"given":"Jiahang","family":"Xu","sequence":"additional","affiliation":[{"name":"Microsoft Research, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7940-5216","authenticated-orcid":false,"given":"Yujing","family":"Wang","sequence":"additional","affiliation":[{"name":"Microsoft, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-1990-5743","authenticated-orcid":false,"given":"Shaoguang","family":"Yan","sequence":"additional","affiliation":[{"name":"Microsoft, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-8608-574X","authenticated-orcid":false,"given":"Yunqing","family":"Xia","sequence":"additional","affiliation":[{"name":"Microsoft, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3518-5212","authenticated-orcid":false,"given":"Yuqing","family":"Yang","sequence":"additional","affiliation":[{"name":"Microsoft Research, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9107-013X","authenticated-orcid":false,"given":"Ting","family":"Cao","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-5027-7478","authenticated-orcid":false,"given":"Hao","family":"Sun","sequence":"additional","affiliation":[{"name":"Microsoft, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-4793-9715","authenticated-orcid":false,"given":"Weiwei","family":"Deng","sequence":"additional","affiliation":[{"name":"Microsoft, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-7438-7248","authenticated-orcid":false,"given":"Qi","family":"Zhang","sequence":"additional","affiliation":[{"name":"Microsoft, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-6455-3898","authenticated-orcid":false,"given":"Mao","family":"Yang","sequence":"additional","affiliation":[{"name":"Microsoft Research, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2023,8,4]]},"reference":[{"key":"e_1_3_2_2_1_1","unstructured":"2022. thop. 
https:\/\/github.com\/Lyken17\/pytorch-OpCounter"},{"key":"e_1_3_2_2_2_1","volume-title":"Layer normalization. arXiv preprint arXiv:1607.06450","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)."},{"key":"e_1_3_2_2_3_1","volume-title":"Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150","author":"Beltagy Iz","year":"2020","unstructured":"Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020)."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"crossref","unstructured":"Wentao Chen, Hailong Qiu, Jian Zhuang, Chutong Zhang, Yu Hu, Qing Lu, Tianchen Wang, Yiyu Shi, Meiping Huang, and Xiaowei Xu. 2021. Quantization of Deep Neural Networks for Accurate Edge Computing. arXiv:2104.12046 [cs.CV]","DOI":"10.1145\/3451211"},{"key":"e_1_3_2_2_5_1","volume-title":"Generating Long Sequences with Sparse Transformers. URL https:\/\/openai.com\/blog\/sparse-transformers","author":"Child Rewon","year":"2019","unstructured":"Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. URL https:\/\/openai.com\/blog\/sparse-transformers (2019)."},{"key":"e_1_3_2_2_6_1","volume-title":"Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.","author":"Clark Kevin","unstructured":"Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What Does BERT Look at? An Analysis of BERT's Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP."},{"key":"e_1_3_2_2_7_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"crossref","unstructured":"Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. 2020. Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning. arXiv:2002.08307 [cs.CL]","DOI":"10.18653\/v1\/2020.repl4nlp-1.18"},
{"key":"e_1_3_2_2_9_1","volume-title":"Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research","author":"Goyal Saurabh","year":"2020","unstructured":"Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma. 2020. PoWER-BERT: Accelerating BERT Inference via Progressive Word-vector Elimination. In Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 119), Hal Daum\u00e9 III and Aarti Singh (Eds.). PMLR, 3690--3699."},{"key":"e_1_3_2_2_10_1","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 7275--7286","author":"Guan Yue","year":"2022","unstructured":"Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, and Minyi Guo. 2022. Transkimmer: Transformer Learns to Layer-wise Skim. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 7275--7286."},{"key":"e_1_3_2_2_11_1","volume-title":"Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415","author":"Hendrycks Dan","year":"2016","unstructured":"Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016)."},{"key":"e_1_3_2_2_12_1","volume-title":"Distilling the Knowledge in a Neural Network. ArXiv abs\/1503.02531","author":"Hinton Geoffrey E.","year":"2015","unstructured":"Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the Knowledge in a Neural Network. ArXiv abs\/1503.02531 (2015)."},{"key":"e_1_3_2_2_13_1","volume-title":"Categorical Reparameterization with Gumbel-Softmax. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=rkE3y85ee","author":"Jang Eric","year":"2017","unstructured":"Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=rkE3y85ee"},
{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1356"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"crossref","unstructured":"Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for Natural Language Understanding.","DOI":"10.18653\/v1\/2020.findings-emnlp.372"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"crossref","unstructured":"Gyuwan Kim and Kyunghyun Cho. 2021. Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 6501--6511.","DOI":"10.18653\/v1\/2021.acl-long.508"},{"key":"e_1_3_2_2_17_1","volume-title":"I-bert: Integer-only bert quantization. arXiv preprint arXiv:2101.01321","author":"Kim Sehoon","year":"2021","unstructured":"Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. 2021. I-bert: Integer-only bert quantization. arXiv preprint arXiv:2101.01321 (2021)."},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539260"},{"key":"e_1_3_2_2_19_1","volume-title":"Block Pruning For Faster Transformers. In EMNLP","author":"Lagunas Fran\u00e7ois","year":"2021","unstructured":"Fran\u00e7ois Lagunas, Ella Charlaix, Victor Sanh, and Alexander M. Rush. 2021. Block Pruning For Faster Transformers. In EMNLP."},{"key":"e_1_3_2_2_20_1","volume-title":"Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942","author":"Lan Zhenzhong","year":"2019","unstructured":"Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)."},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"crossref","unstructured":"Ken Lang. 1995. NewsWeeder: Learning to Filter Netnews. (1995), 331--339.","DOI":"10.1016\/B978-1-55860-377-6.50048-7"},
{"key":"e_1_3_2_2_22_1","volume-title":"Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)."},{"key":"e_1_3_2_2_23_1","volume-title":"International Conference on Learning Representations.","author":"Louizos Christos","unstructured":"Christos Louizos, Max Welling, and Diederik P. Kingma. 2018. Learning Sparse Neural Networks through L0 Regularization. In International Conference on Learning Representations."},{"key":"e_1_3_2_2_24_1","unstructured":"Microsoft. 2022. onnxruntime. https:\/\/onnxruntime.ai\/"},{"key":"e_1_3_2_2_25_1","first-page":"140","article-title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1--67. http:\/\/jmlr.org\/papers\/v21\/20-074.html","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-2124"},{"key":"e_1_3_2_2_27_1","volume-title":"Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems 34","author":"Rao Yongming","year":"2021","unstructured":"Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. 2021. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems 34 (2021), 13937--13949."},{"key":"e_1_3_2_2_28_1","volume-title":"A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics","author":"Rogers Anna","year":"2020","unstructured":"Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics (2020)."},
{"key":"e_1_3_2_2_29_1","unstructured":"Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108 [cs.CL]"},{"key":"e_1_3_2_2_30_1","unstructured":"Victor Sanh, Thomas Wolf, and Alexander M Rush. 2020. Movement pruning: Adaptive sparsity by fine-tuning. In NeurIPS."},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6409"},{"key":"e_1_3_2_2_32_1","volume-title":"Patient Knowledge Distillation for BERT Model Compression. In Conference on Empirical Methods in Natural Language Processing.","author":"Sun S.","year":"2019","unstructured":"S. Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient Knowledge Distillation for BERT Model Compression. In Conference on Empirical Methods in Natural Language Processing."},{"key":"e_1_3_2_2_33_1","volume-title":"Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation. ArXiv abs\/1908.08962","author":"Turc Iulia","year":"2019","unstructured":"Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation. ArXiv abs\/1908.08962 (2019)."},{"key":"e_1_3_2_2_34_1","volume-title":"GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=rJ4km2R5t7","author":"Wang Alex","unstructured":"Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=rJ4km2R5t7"},{"key":"e_1_3_2_2_35_1","volume-title":"HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. In Annual Conference of the Association for Computational Linguistics.","author":"Wang Hanrui","year":"2020","unstructured":"Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. 2020. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing. In Annual Conference of the Association for Computational Linguistics."},
{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/HPCA51647.2021.00018"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/3269206.3271784"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3269206.3271784"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.496"},{"key":"e_1_3_2_2_40_1","volume-title":"Structured Pruning Learns Compact and Accurate Models","author":"Xia Mengzhou","unstructured":"Mengzhou Xia, Zexuan Zhong, and Danqi Chen. 2022. Structured Pruning Learns Compact and Accurate Models. In Association for Computational Linguistics (ACL)."},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467262"},{"key":"e_1_3_2_2_42_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.naacl-main.463"},{"key":"e_1_3_2_2_43_1","volume":"33","article-title":"Big bird: Transformers for longer sequences","author":"Zaheer Manzil","year":"2020","unstructured":"Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems 33 (2020).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_44_1","doi-asserted-by":"publisher","DOI":"10.1145\/3511808.3557139"}],"event":{"name":"KDD '23: The 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining","location":"Long Beach CA USA","acronym":"KDD '23","sponsor":["SIGMOD ACM Special Interest Group on Management of Data","SIGKDD ACM Special Interest Group on Knowledge Discovery in Data"]},"container-title":["Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3580305.3599284","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3580305.3599284","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:16Z","timestamp":1750182676000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3580305.3599284"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,4]]},"references-count":44,"alternative-id":["10.1145\/3580305.3599284","10.1145\/3580305"],"URL":"https:\/\/doi.org\/10.1145\/3580305.3599284","relation":{},"subject":[],"published":{"date-parts":[[2023,8,4]]},"assertion":[{"value":"2023-08-04","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}