{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,21]],"date-time":"2025-08-21T16:48:29Z","timestamp":1755794909113,"version":"3.44.0"},"publisher-location":"New York, NY, USA","reference-count":60,"publisher":"ACM","funder":[{"name":"Guangdong Basic and Applied Basic Research Foundation","award":["2024A1515011667"],"award-info":[{"award-number":["2024A1515011667"]}]},{"name":"Hong Kong RGC TRS","award":["T41-603\/20R"],"award-info":[{"award-number":["T41-603\/20R"]}]},{"DOI":"10.13039\/https:\/\/doi.org\/10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62441220"],"award-info":[{"award-number":["62441220"]}],"id":[{"id":"10.13039\/https:\/\/doi.org\/10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,8,3]]},"DOI":"10.1145\/3711896.3736949","type":"proceedings-article","created":{"date-parts":[[2025,8,3]],"date-time":"2025-08-03T20:54:17Z","timestamp":1754254457000},"page":"3055-3066","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Exploiting Student Parallelism for Low-latency GPU Inference of BERT-like Models in Online Services"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4105-0691","authenticated-orcid":false,"given":"Weiyan","family":"Wang","sequence":"first","affiliation":[{"name":"Tencent, Beijing, China and Hong Kong University of Science and Technology, Hong Kong SAR, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9502-7622","authenticated-orcid":false,"given":"Yilun","family":"Jin","sequence":"additional","affiliation":[{"name":"Hong Kong University of Science and Technology, Hong Kong SAR, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6450-8485","authenticated-orcid":false,"given":"Yiming","family":"Zhang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5548-7301","authenticated-orcid":false,"given":"Victor Junqiu","family":"Wei","sequence":"additional","affiliation":[{"name":"Macau University of Science and Technology, Macau SAR, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3238-8500","authenticated-orcid":false,"given":"Han","family":"Tian","sequence":"additional","affiliation":[{"name":"Hong Kong University of Science and Technology, Hong Kong SAR, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4228-7885","authenticated-orcid":false,"given":"Li","family":"Chen","sequence":"additional","affiliation":[{"name":"Zhongguancun Laboratory, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-4087-9873","authenticated-orcid":false,"given":"Jinbao","family":"Xue","sequence":"additional","affiliation":[{"name":"Tencent, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0536-4321","authenticated-orcid":false,"given":"Yangyu","family":"Tao","sequence":"additional","affiliation":[{"name":"Tencent, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-6713-1638","authenticated-orcid":false,"given":"Di","family":"Wang","sequence":"additional","affiliation":[{"name":"Tencent, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2587-6028","authenticated-orcid":false,"given":"Kai","family":"Chen","sequence":"additional","affiliation":[{"name":"Hong Kong University of Science and Technology, Hong Kong SAR, China"}]}],"member":"320","published-online":{"date-parts":[[2025,8,3]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1145\/2951913.2976746"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/SC41405.2020.00073"},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1093\/qje\/qjab032"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.3233\/FAIA200188"},{"key":"e_1_3_2_2_5_1","unstructured":"Microsoft Azure. Machine learning inference during deployment. https:\/\/github.com\/triton-inference-server\/server."},{"key":"e_1_3_2_2_6_1","volume-title":"Language models are few-shot learners. Advances in neural information processing systems, 33:1877-1901","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877-1901, 2020."},{"key":"e_1_3_2_2_7_1","volume-title":"Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791","author":"Cai Han","year":"2019","unstructured":"Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791, 2019."},{"key":"e_1_3_2_2_8_1","first-page":"613","volume-title":"14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017","author":"Crankshaw Daniel","year":"2017","unstructured":"Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, and Ion Stoica. Clipper: A low-latency online prediction serving system. In 14th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2017, Boston, MA, USA, 2017, pages 613-627. USENIX Association, 2017."},{"key":"e_1_3_2_2_9_1","first-page":"183","volume-title":"2022 USENIX Annual Technical Conference, USENIX ATC 2022","author":"Cui Weihao","year":"2022","unstructured":"Weihao Cui, Han Zhao, Quan Chen, Hao Wei, Zirui Li, Deze Zeng, Chao Li, and Minyi Guo. Dvabatch: Diversity-aware multi-entry multi-exit batching for efficient processing of DNN services on gpus. In Jiri Schindler and Noa Zilberman, editors, 2022 USENIX Annual Technical Conference, USENIX ATC 2022, Carlsbad, CA, USA, July 11-13, 2022, pages 183-198. USENIX Association, 2022."},{"key":"e_1_3_2_2_10_1","first-page":"43525","volume-title":"Advances in Neural Information Processing Systems","author":"Dennis Don","year":"2023","unstructured":"Don Dennis, Abhishek Shetty, Anish Prasad Sevekari, Kazuhito Koishida, and Virginia Smith. Progressive ensemble distillation: Building ensembles for efficient inference. Advances in Neural Information Processing Systems, pages 43525-43543, 2023."},{"key":"e_1_3_2_2_11_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018."},{"key":"e_1_3_2_2_12_1","volume-title":"Proceedings of Machine Learning and Systems 2022","author":"Fegade Pratik","year":"2022","unstructured":"Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, and Todd C. Mowry. The cora tensor compiler: Compilation for ragged tensors with minimal padding. In Proceedings of Machine Learning and Systems 2022, MLSys 2022, Santa Clara, CA, USA, August 29 - September 1, 2022. mlsys.org, 2022."},{"key":"e_1_3_2_2_13_1","first-page":"143","volume-title":"Proceedings of the 5th Workshop on Representation Learning for NLP, RepL4NLP@ACL 2020","author":"Gordon Mitchell A.","year":"2020","unstructured":"Mitchell A. Gordon, Kevin Duh, and Nicholas Andrews. Compressing BERT: studying the effects of weight pruning on transfer learning. In Proceedings of the 5th Workshop on Representation Learning for NLP, RepL4NLP@ACL 2020, Online, July 9, 2020, pages 143-155. Association for Computational Linguistics, 2020."},{"key":"e_1_3_2_2_14_1","volume-title":"The llama 3 herd of models. arXiv preprint arXiv:2407.21783","author":"Grattafiori Aaron","year":"2024","unstructured":"Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024."},{"key":"e_1_3_2_2_15_1","first-page":"1041","volume-title":"19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)","author":"Gunasekaran Jashwant Raj","year":"2022","unstructured":"Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran, Bikash Sharma, Mahmut Taylan Kandemir, and Chita R. Das. Cocktail: A multidimensional optimization for model serving in cloud. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 1041-1057, Renton, WA, April 2022. USENIX Association."},{"key":"e_1_3_2_2_16_1","first-page":"1041","volume-title":"19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22)","author":"Gunasekaran Jashwant Raj","year":"2022","unstructured":"Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran, Bikash Sharma, Mahmut Taylan Kandemir, and Chita R Das. Cocktail: A multidimensional optimization for model serving in cloud. In 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22), pages 1041-1057, 2022."},{"key":"e_1_3_2_2_17_1","volume-title":"International Conference on Learning Representations","author":"He Pengcheng","year":"2021","unstructured":"Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations, 2021."},{"key":"e_1_3_2_2_18_1","first-page":"9782","article-title":"Dynamic bert with adaptive width and depth","volume":"33","author":"Hou Lu","year":"2020","unstructured":"Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert: Dynamic bert with adaptive width and depth. Advances in Neural Information Processing Systems, 33:9782-9793, 2020.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_19_1","first-page":"2058","volume-title":"International Conference on Machine Learning","author":"Huang Furong","year":"2018","unstructured":"Furong Huang, Jordan Ash, John Langford, and Robert Schapire. Learning deep resnet blocks sequentially using boosting theory. In International Conference on Machine Learning, pages 2058-2067. PMLR, 2018."},{"key":"e_1_3_2_2_20_1","volume-title":"GPU Technology Conference (GTC)","volume":"2","author":"Jeaugey Sylvain","year":"2017","unstructured":"Sylvain Jeaugey. Nccl 2.0. In GPU Technology Conference (GTC), volume 2, 2017."},{"key":"e_1_3_2_2_21_1","volume-title":"Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351","author":"Jiao Xiaoqi","year":"2019","unstructured":"Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351, 2019."},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.391"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/GHTC.2018.8601887"},{"key":"e_1_3_2_2_24_1","first-page":"5506","volume-title":"International conference on machine learning","author":"Kim Sehoon","year":"2021","unstructured":"Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. I-bert: Integer-only bert quantization. In International conference on machine learning, pages 5506-5518. PMLR, 2021."},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539260"},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1145\/3600006.3613165"},{"key":"e_1_3_2_2_27_1","volume-title":"Deep learning. nature, 521(7553):436-444","author":"LeCun Yann","year":"2015","unstructured":"Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436-444, 2015."},{"key":"e_1_3_2_2_28_1","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive nlp tasks","volume":"33","author":"Lewis Patrick","year":"2020","unstructured":"Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K\u00fcttler, Mike Lewis, Wen-tau Yih, Tim Rockt\u00e4schel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459-9474, 2020.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.537"},{"key":"e_1_3_2_2_30_1","volume-title":"Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019."},{"key":"e_1_3_2_2_31_1","first-page":"6682","volume-title":"International Conference on Machine Learning","author":"Malach Eran","year":"2020","unstructured":"Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, pages 6682-6691. PMLR, 2020."},{"key":"e_1_3_2_2_32_1","volume-title":"Boosting algorithms as gradient descent. Advances in neural information processing systems, 12","author":"Mason Llew","year":"1999","unstructured":"Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting algorithms as gradient descent. Advances in neural information processing systems, 12, 1999."},{"key":"e_1_3_2_2_33_1","first-page":"221","volume-title":"Advances in Neural Information Processing Systems","author":"Mason Llew","year":"1999","unstructured":"Llew Mason, Jonathan Baxter, Peter L Bartlett, Marcus Frean, et al. Functional gradient techniques for combining hypotheses. Advances in Neural Information Processing Systems, pages 221-246, 1999."},{"key":"e_1_3_2_2_34_1","volume-title":"Are sixteen heads really better than one? Advances in neural information processing systems, 32","author":"Michel Paul","year":"2019","unstructured":"Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? Advances in neural information processing systems, 32, 2019."},{"key":"e_1_3_2_2_35_1","unstructured":"Nvidia. Multi-process service. https:\/\/docs.nvidia.com\/deploy\/mps\/index.html."},{"key":"e_1_3_2_2_36_1","unstructured":"Automatic Differentiation In Pytorch. Pytorch 2018."},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISCA45697.2020.00045"},{"key":"e_1_3_2_2_38_1","first-page":"397","volume-title":"2021 USENIX Annual Technical Conference, USENIX ATC 2021","author":"Romero Francisco","year":"2021","unstructured":"Francisco Romero, Qian Li, Neeraja J. Yadwadkar, and Christos Kozyrakis. Infaas: Automated model-less inference serving. In Irina Calciu and Geoff Kuenning, editors, 2021 USENIX Annual Technical Conference, USENIX ATC 2021, July 14-16, 2021, pages 397-411. USENIX Association, 2021."},{"key":"e_1_3_2_2_39_1","volume-title":"a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108","author":"Sanh Victor","year":"2019","unstructured":"Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019."},{"key":"e_1_3_2_2_40_1","volume-title":"5th International Conference on Learning Representations, ICLR 2017","author":"Shazeer Noam","year":"2017","unstructured":"Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 2017."},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6409"},{"key":"e_1_3_2_2_42_1","volume-title":"Collaborative teacher-student learning via multiple knowledge transfer. arXiv preprint arXiv:2101.08471","author":"Sun Liyuan","year":"2021","unstructured":"Liyuan Sun, Jianping Gou, Baosheng Yu, Lan Du, and Dacheng Tao. Collaborative teacher-student learning via multiple knowledge transfer. arXiv preprint arXiv:2101.08471, 2021."},{"key":"e_1_3_2_2_43_1","volume-title":"Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355","author":"Sun Siqi","year":"2019","unstructured":"Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for bert model compression. arXiv preprint arXiv:1908.09355, 2019."},{"key":"e_1_3_2_2_44_1","volume-title":"Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984","author":"Sun Zhiqing","year":"2020","unstructured":"Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020."},{"key":"e_1_3_2_2_45_1","volume-title":"Mkq-bert: Quantized bert with 4-bits weights and activations. arXiv preprint arXiv:2203.13483","author":"Tang Hanlin","year":"2022","unstructured":"Hanlin Tang, Xipeng Zhang, Kai Liu, Jianchen Zhu, and Zhanhui Kang. Mkq-bert: Quantized bert with 4-bits weights and activations. arXiv preprint arXiv:2203.13483, 2022."},{"key":"e_1_3_2_2_46_1","volume-title":"Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962","author":"Turc Iulia","year":"2019","unstructured":"Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962, 2019."},{"key":"e_1_3_2_2_47_1","volume-title":"Residual networks behave like ensembles of relatively shallow networks. Advances in neural information processing systems, 29","author":"Veit Andreas","year":"2016","unstructured":"Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. Advances in neural information processing systems, 29, 2016."},{"key":"e_1_3_2_2_48_1","volume-title":"Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461","author":"Wang Alex","year":"2018","unstructured":"Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018."},{"key":"e_1_3_2_2_49_1","first-page":"664","article-title":"Convnets decomposition via class parallelism for fast inference on live data","volume":"3","author":"Wang Guanhua","year":"2021","unstructured":"Guanhua Wang, Zhuang Liu, Brandon Hsieh, Siyuan Zhuang, Joseph Gonzalez, Trevor Darrell, and Ion Stoica. sensai: Convnets decomposition via class parallelism for fast inference on live data. Proceedings of Machine Learning and Systems, 3:664-679, 2021.","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"e_1_3_2_2_50_1","volume-title":"Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771","author":"Wolf Thomas","year":"2019","unstructured":"Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R\u00e9mi Louf, Morgan Funtowicz, et al. Huggingface's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019."},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.204"},{"key":"e_1_3_2_2_52_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467262"},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"publisher","DOI":"10.1145\/3336191.3371792"},{"key":"e_1_3_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/EMC2-NIPS53020.2019.00016"},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-15919-0_43"},{"key":"e_1_3_2_2_56_1","volume-title":"Bytetransformer: A high-performance transformer boosted for variable-length inputs. arXiv preprint arXiv:2210.03052","author":"Zhai Yujia","year":"2022","unstructured":"Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, and Yibo Zhu. Bytetransformer: A high-performance transformer boosted for variable-length inputs. arXiv preprint arXiv:2210.03052, 2022."},{"key":"e_1_3_2_2_57_1","first-page":"1049","volume-title":"USENIX Annual Technical Conference (ATC)","author":"Zhang Chengliang","year":"2019","unstructured":"Chengliang Zhang, Minchen Yu, Wei Wang, and Feng Yan. {MArk}: Exploiting cloud services for {Cost-Effective},{SLO-Aware} machine learning inference serving. In USENIX Annual Technical Conference (ATC), pages 1049-1062, 2019."},{"key":"e_1_3_2_2_58_1","first-page":"18330","article-title":"Bert loses patience: Fast and robust inference with early exit","volume":"33","author":"Zhou Wangchunshu","year":"2020","unstructured":"Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. Bert loses patience: Fast and robust inference with early exit. Advances in Neural Information Processing Systems, 33:18330-18341, 2020.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_2_59_1","volume-title":"Knowledge distillation by on-the-fly native ensemble. Advances in neural information processing systems, 31","author":"Zhu Xiatian","year":"2018","unstructured":"Xiatian Zhu, Shaogang Gong, et al. Knowledge distillation by on-the-fly native ensemble. Advances in neural information processing systems, 31, 2018."},{"key":"e_1_3_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.naacl-main.116"}],"event":{"name":"KDD '25: The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining","sponsor":["SIGMOD ACM Special Interest Group on Management of Data","SIGKDD ACM Special Interest Group on Knowledge Discovery in Data"],"location":"Toronto ON Canada","acronym":"KDD '25"},"container-title":["Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3711896.3736949","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,8,16]],"date-time":"2025-08-16T14:31:01Z","timestamp":1755354661000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3711896.3736949"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,8,3]]},"references-count":60,"alternative-id":["10.1145\/3711896.3736949","10.1145\/3711896"],"URL":"https:\/\/doi.org\/10.1145\/3711896.3736949","relation":{},"subject":[],"published":{"date-parts":[[2025,8,3]]},"assertion":[{"value":"2025-08-03","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}