{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T13:20:06Z","timestamp":1776086406534,"version":"3.50.1"},"reference-count":288,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2023,7,13]],"date-time":"2023-07-13T00:00:00Z","timestamp":1689206400000},"content-version":"vor","delay-in-days":193,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,7,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. 
We aim both to provide guidance for conducting NLP under limited resources and to point toward promising research directions for developing more efficient methods.<\/jats:p>","DOI":"10.1162\/tacl_a_00577","type":"journal-article","created":{"date-parts":[[2023,7,13]],"date-time":"2023-07-13T16:51:52Z","timestamp":1689267112000},"page":"826-860","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":70,"title":["Efficient Methods for Natural Language Processing: A Survey"],"prefix":"10.1162","volume":"11","author":[{"given":"Marcos","family":"Treviso","sequence":"first","affiliation":[{"name":"Stony Brook University, USA"}]},{"given":"Ji-Ung","family":"Lee","sequence":"additional","affiliation":[{"name":"Technical University of Darmstadt, Germany"}]},{"given":"Tianchu","family":"Ji","sequence":"additional","affiliation":[{"name":"Stony Brook University, USA"}]},{"given":"Betty","family":"van Aken","sequence":"additional","affiliation":[{"name":"Berliner Hochschule f\u00fcr Technik, Germany"}]},{"given":"Qingqing","family":"Cao","sequence":"additional","affiliation":[{"name":"University of Washington, USA"}]},{"given":"Manuel R.","family":"Ciosici","sequence":"additional","affiliation":[{"name":"University of Southern California, USA"}]},{"given":"Michael","family":"Hassid","sequence":"additional","affiliation":[{"name":"The Hebrew University of Jerusalem, Israel"}]},{"given":"Kenneth","family":"Heafield","sequence":"additional","affiliation":[{"name":"University of Edinburgh, UK"}]},{"given":"Sara","family":"Hooker","sequence":"additional","affiliation":[{"name":"Cohere For AI, USA"}]},{"given":"Colin","family":"Raffel","sequence":"additional","affiliation":[{"name":"University of North Carolina at Chapel Hill, USA"}]},{"given":"Pedro H.","family":"Martins","sequence":"additional","affiliation":[{"name":"IST\/U. 
of Lisbon and Instituto de Telecomunica\u00e7\u00f5es, Portugal"},{"name":"Unbabel, Portugal"}]},{"given":"Andr\u00e9 F. T.","family":"Martins","sequence":"additional","affiliation":[{"name":"IST\/U. of Lisbon and Instituto de Telecomunica\u00e7\u00f5es, Portugal"},{"name":"Unbabel, Portugal"}]},{"given":"Jessica Zosa","family":"Forde","sequence":"additional","affiliation":[{"name":"Brown University, USA"}]},{"given":"Peter","family":"Milder","sequence":"additional","affiliation":[{"name":"Stony Brook University, USA"}]},{"given":"Edwin","family":"Simpson","sequence":"additional","affiliation":[{"name":"University of Bristol, UK"}]},{"given":"Noam","family":"Slonim","sequence":"additional","affiliation":[{"name":"IBM Research, Israel"}]},{"given":"Jesse","family":"Dodge","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, USA"}]},{"given":"Emma","family":"Strubell","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, USA"},{"name":"Carnegie Mellon University, USA"}]},{"given":"Niranjan","family":"Balasubramanian","sequence":"additional","affiliation":[{"name":"Stony Brook University, USA"}]},{"given":"Leon","family":"Derczynski","sequence":"additional","affiliation":[{"name":"University of Washington, USA"},{"name":"IT University of Copenhagen, Denmark"}]},{"given":"Iryna","family":"Gurevych","sequence":"additional","affiliation":[{"name":"Technical University of Darmstadt, Germany"}]},{"given":"Roy","family":"Schwartz","sequence":"additional","affiliation":[{"name":"The Hebrew University of Jerusalem, Israel"}]}],"member":"281","published-online":{"date-parts":[[2023,7,12]]},"reference":[{"key":"2023071316514707200_bib1","doi-asserted-by":"publisher","first-page":"10368","DOI":"10.1109\/CVPR52688.2022.01012","article-title":"Estimating example difficulty using variance of gradients","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition 
(CVPR)","author":"Agarwal","year":"2022"},{"key":"2023071316514707200_bib2","first-page":"29304","article-title":"Deep reinforcement learning at the edge of the statistical precipice","volume-title":"Advances in Neural Information Processing Systems","author":"Agarwal","year":"2021"},{"key":"2023071316514707200_bib3","doi-asserted-by":"publisher","first-page":"5799","DOI":"10.18653\/v1\/2021.emnlp-main.468","article-title":"Muppet: Massive multi-task representations with pre-finetuning","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Aghajanyan","year":"2021"},{"key":"2023071316514707200_bib4","doi-asserted-by":"publisher","first-page":"7319","DOI":"10.18653\/v1\/2021.acl-long.568","article-title":"Intrinsic dimensionality explains the effectiveness of language model fine-tuning","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Aghajanyan","year":"2021"},{"key":"2023071316514707200_bib5","doi-asserted-by":"crossref","first-page":"142","DOI":"10.18653\/v1\/2021.sustainlp-1.15","article-title":"On the role of corpus ordering in language modeling","volume-title":"Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing","author":"Agrawal","year":"2021"},{"key":"2023071316514707200_bib6","doi-asserted-by":"crossref","first-page":"3316","DOI":"10.18653\/v1\/2021.findings-emnlp.282","article-title":"The low-resource double bind: An empirical study of pruning for low-resource machine translation","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2021","author":"Ahia","year":"2021"},{"key":"2023071316514707200_bib7","article-title":"The de-democratization of AI: Deep learning and the compute divide in artificial intelligence 
research","author":"Ahmed","year":"2020","journal-title":"arXiv preprint arXiv:2010.15581v1"},{"key":"2023071316514707200_bib8","doi-asserted-by":"publisher","first-page":"268","DOI":"10.18653\/v1\/2020.emnlp-main.19","article-title":"ETC: Encoding long and structured inputs in transformers","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Ainslie","year":"2020"},{"key":"2023071316514707200_bib9","doi-asserted-by":"publisher","first-page":"131","DOI":"10.18653\/v1\/2022.acl-short.16","article-title":"How does the pre-training objective affect what large language models learn about linguistic properties?","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Alajrami","year":"2022"},{"key":"2023071316514707200_bib10","first-page":"468","article-title":"Neuro-symbolic language modeling with automaton-augmented retrieval","volume-title":"Proceedings of the 39th International Conference on Machine Learning","author":"Alon","year":"2022"},{"key":"2023071316514707200_bib11","article-title":"CarbonTracker: Tracking and predicting the carbon footprint of training deep learning models","volume-title":"Proceedings of the workshop on Challenges in Deploying and monitoring Machine Learning Systems, ICML","author":"Wolff Anthony","year":"2020"},{"key":"2023071316514707200_bib12","article-title":"ExT5: Towards extreme multi-task scaling for transfer learning","volume-title":"International Conference on Learning Representations","author":"Aribandi","year":"2022"},{"key":"2023071316514707200_bib13","article-title":"Deep batch active learning by diverse, uncertain gradient lower bounds","volume-title":"International Conference on Learning 
Representations","author":"Ash","year":"2020"},{"key":"2023071316514707200_bib14","doi-asserted-by":"publisher","first-page":"93","DOI":"10.18653\/v1\/2022.acl-demo.9","article-title":"PromptSource: An integrated development environment and repository for natural language prompts","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations","author":"Bach","year":"2022"},{"key":"2023071316514707200_bib15","doi-asserted-by":"publisher","first-page":"4334","DOI":"10.18653\/v1\/2021.acl-long.334","article-title":"BinaryBERT: Pushing the limit of BERT quantization","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Bai","year":"2021"},{"key":"2023071316514707200_bib16","first-page":"10876","article-title":"Deep learning through the lens of example difficulty","volume-title":"Advances in Neural Information Processing Systems","author":"Baldock","year":"2021"},{"key":"2023071316514707200_bib17","doi-asserted-by":"crossref","first-page":"1538","DOI":"10.18653\/v1\/D19-1165","article-title":"Simple, scalable adaptation for neural machine translation","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Bapna","year":"2019"},{"key":"2023071316514707200_bib18","first-page":"430","article-title":"Pathways: Asynchronous distributed dataflow for ML","volume":"4","author":"Barham","year":"2022","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"2023071316514707200_bib19","first-page":"1074","article-title":"Pruning neural machine translation for speed using group lasso","volume-title":"Proceedings of the Sixth Conference on Machine 
Translation","author":"Behnke","year":"2021"},{"key":"2023071316514707200_bib20","article-title":"Modeling the machine learning multiverse","volume-title":"Advances in Neural Information Processing Systems","author":"Bell","year":"2022"},{"key":"2023071316514707200_bib21","article-title":"Longformer: The long-document transformer","author":"Beltagy","year":"2020","journal-title":"arXiv preprint arXiv:2004.05150v2"},{"key":"2023071316514707200_bib22","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/2022.acl-short.1","article-title":"BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)","author":"Zaken","year":"2022"},{"key":"2023071316514707200_bib23","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1145\/1553374.1553380","article-title":"Curriculum learning","volume-title":"Proceedings of the 26th Annual International Conference on Machine Learning","author":"Bengio","year":"2009"},{"key":"2023071316514707200_bib24","article-title":"Efficient 8-bit quantization of transformer neural machine language translation model","volume-title":"Proceedings of the Joint Workshop on On-Device Machine Learning & Compact Deep Neural Network Representations, 36th International Conference on Machine Learning","author":"Bhandare","year":"2019"},{"key":"2023071316514707200_bib25","volume-title":"Proceedings of the Fourth Workshop on Neural Generation and Translation","author":"Birch","year":"2020"},{"key":"2023071316514707200_bib26","doi-asserted-by":"publisher","first-page":"3013","DOI":"10.18653\/v1\/2021.findings-emnlp.259","article-title":"Data efficient masked language modeling for vision and language","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 
2021","author":"Bitton","year":"2021"},{"key":"2023071316514707200_bib27","first-page":"129","article-title":"What is the state of neural network pruning?","volume":"2","author":"Blalock","year":"2020","journal-title":"Proceedings of Machine Learning and Systems"},{"key":"2023071316514707200_bib28","first-page":"127","article-title":"Active learning with clustering","volume-title":"Active Learning and Experimental Design Workshop In conjunction with AISTATS 2010","author":"Bod\u00f3","year":"2011"},{"key":"2023071316514707200_bib29","doi-asserted-by":"crossref","first-page":"218","DOI":"10.18653\/v1\/2020.ngt-1.26","article-title":"Edinburgh\u2019s submissions to the 2020 machine translation efficiency task","volume-title":"Proceedings of the Fourth Workshop on Neural Generation and Translation","author":"Bogoychev","year":"2020"},{"key":"2023071316514707200_bib30","first-page":"2206","article-title":"Improving language models by retrieving from trillions of tokens","volume-title":"Proceedings of the 39th International Conference on Machine Learning","author":"Borgeaud","year":"2022"},{"key":"2023071316514707200_bib31","unstructured":"Xavier Bouthillier and Ga\u00ebl Varoquaux. 2020. Survey of machine-learning experimental methods at NeurIPS 2019 and ICLR 2020. 
Research report, Inria Saclay Ile de France."},{"key":"2023071316514707200_bib32","doi-asserted-by":"crossref","first-page":"632","DOI":"10.18653\/v1\/D15-1075","article-title":"A large annotated corpus for learning natural language inference","volume-title":"Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing","author":"Bowman","year":"2015"},{"key":"2023071316514707200_bib33","first-page":"1877","article-title":"Language models are few-shot learners","volume-title":"Advances in Neural Information Processing Systems","author":"Brown","year":"2020"},{"key":"2023071316514707200_bib34","doi-asserted-by":"crossref","first-page":"141","DOI":"10.18653\/v1\/2020.sustainlp-1.19","article-title":"Towards accurate and reliable energy measurement of NLP models","volume-title":"Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing","author":"Cao","year":"2020"},{"issue":"1","key":"2023071316514707200_bib35","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1023\/A:1007379606734","article-title":"Multitask learning","volume":"28","author":"Caruana","year":"1997","journal-title":"Machine Learning"},{"key":"2023071316514707200_bib36","article-title":"Pixelated butterfly: Simple and efficient sparse training for neural network models","volume-title":"International Conference on Learning Representations","author":"Chen","year":"2022"},{"key":"2023071316514707200_bib37","article-title":"Evaluating large language models trained on code","author":"Chen","year":"2021","journal-title":"arXiv preprint arXiv:2107.03374v2"},{"key":"2023071316514707200_bib38","article-title":"Generating long sequences with sparse transformers","author":"Child","year":"2019","journal-title":"arXiv preprint arXiv:1904.10509v1"},{"key":"2023071316514707200_bib39","article-title":"Rethinking attention with performers","volume-title":"International Conference on Learning 
Representations","author":"Choromanski","year":"2021"},{"key":"2023071316514707200_bib40","article-title":"PaLM: Scaling language modeling with pathways","author":"Chowdhery","year":"2022","journal-title":"arXiv preprint arXiv:2204.02311v5"},{"key":"2023071316514707200_bib41","article-title":"ELECTRA: Pre-training text encoders as discriminators rather than generators","volume-title":"International Conference on Learning Representations","author":"Clark","year":"2020"},{"key":"2023071316514707200_bib42","doi-asserted-by":"crossref","first-page":"2174","DOI":"10.18653\/v1\/D19-1223","article-title":"Adaptively sparse transformers","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Correia","year":"2019"},{"key":"2023071316514707200_bib43","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1007\/978-3-540-87987-9_8","article-title":"Sample selection bias correction theory","volume-title":"Algorithmic Learning Theory","author":"Cortes","year":"2008"},{"key":"2023071316514707200_bib44","doi-asserted-by":"crossref","first-page":"24","DOI":"10.18653\/v1\/2020.ngt-1.3","article-title":"Balancing cost and benefit with tied-multi transformers","volume-title":"Proceedings of the Fourth Workshop on Neural Generation and Translation","author":"Dabre","year":"2020"},{"key":"2023071316514707200_bib45","doi-asserted-by":"crossref","first-page":"2978","DOI":"10.18653\/v1\/P19-1285","article-title":"Transformer-XL: Attentive language models beyond a fixed-length context","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Dai","year":"2019"},{"key":"2023071316514707200_bib46","first-page":"4690","article-title":"Monarch: Expressive structured matrices for efficient and accurate training","volume-title":"International Conference on Machine 
Learning","author":"Dao","year":"2022"},{"key":"2023071316514707200_bib47","article-title":"FlashAttention: fast and memory-efficient exact attention with IO-awareness","volume-title":"Advances in Neural Information Processing Systems","author":"Dao","year":"2022"},{"key":"2023071316514707200_bib48","first-page":"6476","article-title":"SMYRF - Efficient attention using asymmetric clustering","volume-title":"Advances in Neural Information Processing Systems","author":"Daras","year":"2020"},{"key":"2023071316514707200_bib49","article-title":"Universal transformers","volume-title":"International Conference on Learning Representations","author":"Dehghani","year":"2019"},{"key":"2023071316514707200_bib50","article-title":"The efficiency misnomer","volume-title":"International Conference on Learning Representations","author":"Dehghani","year":"2022"},{"key":"2023071316514707200_bib51","article-title":"Power consumption variation over activation functions","author":"Derczynski","year":"2020","journal-title":"arXiv preprint arXiv:2006.07237v1"},{"key":"2023071316514707200_bib52","article-title":"GPT3.int8(): 8-bit matrix multiplication for transformers at scale","volume-title":"Advances in Neural Information Processing Systems","author":"Dettmers","year":"2022"},{"key":"2023071316514707200_bib53","article-title":"8-bit optimizers via block-wise quantization","volume-title":"International Conference on Learning Representations","author":"Dettmers","year":"2022"},{"key":"2023071316514707200_bib54","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2023071316514707200_bib55","doi-asserted-by":"publisher","first-page":"2185","DOI":"10.18653\/v1\/D19-1224","article-title":"Show 
your work: Improved reporting of experimental results","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Dodge","year":"2019"},{"key":"2023071316514707200_bib56","article-title":"Fine-tuning pre-trained language models: Weight initializations, data orders, and early stopping","author":"Dodge","year":"2020","journal-title":"arXiv preprint arXiv:2002.06305v1"},{"key":"2023071316514707200_bib57","doi-asserted-by":"publisher","first-page":"1877","DOI":"10.1145\/3531146.3533234","article-title":"Measuring the carbon intensity of AI in cloud instances","volume-title":"2022 ACM Conference on Fairness, Accountability, and Transparency","author":"Dodge","year":"2022"},{"key":"2023071316514707200_bib58","article-title":"Learning to prune deep neural networks via layer-wise optimal brain surgeon","volume-title":"Advances in Neural Information Processing Systems","author":"Dong","year":"2017"},{"key":"2023071316514707200_bib59","article-title":"A tale of two long tails","author":"D\u2019souza","year":"2021","journal-title":"arXiv preprint arXiv:2107.13098v1"},{"key":"2023071316514707200_bib60","first-page":"5547","article-title":"GLaM: Efficient scaling of language models with mixture-of-experts","volume-title":"Proceedings of the 39th International Conference on Machine Learning","author":"Du","year":"2022"},{"key":"2023071316514707200_bib61","doi-asserted-by":"crossref","first-page":"403","DOI":"10.18653\/v1\/2020.acl-main.39","article-title":"Location attention for extrapolation to longer sequences","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Dubois","year":"2020"},{"key":"2023071316514707200_bib62","doi-asserted-by":"publisher","first-page":"7949","DOI":"10.18653\/v1\/2020.emnlp-main.638","article-title":"Active learning for BERT: An empirical 
study","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Ein-Dor","year":"2020"},{"key":"2023071316514707200_bib63","article-title":"Depth-adaptive transformer","volume-title":"International Conference on Learning Representations","author":"Elbayad","year":"2020"},{"issue":"1","key":"2023071316514707200_bib64","doi-asserted-by":"publisher","first-page":"71","DOI":"10.1016\/0010-0277(93)90058-4","article-title":"Learning and development in neural networks: The importance of starting small","volume":"48","author":"Elman","year":"1993","journal-title":"Cognition"},{"key":"2023071316514707200_bib65","first-page":"5988","article-title":"Understanding dataset difficulty with V-usable information","volume-title":"International Conference on Machine Learning","author":"Ethayarajh","year":"2022"},{"key":"2023071316514707200_bib66","article-title":"Reducing transformer depth on demand with structured dropout","volume-title":"International Conference on Learning Representations","author":"Fan","year":"2020"},{"key":"2023071316514707200_bib67","article-title":"A review of sparse expert models in deep learning","author":"Fedus","year":"2022","journal-title":"arXiv preprint arXiv:2209.01667v1"},{"issue":"120","key":"2023071316514707200_bib68","first-page":"1","article-title":"Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity","volume":"23","author":"Fedus","year":"2022","journal-title":"Journal of Machine Learning Research"},{"issue":"261","key":"2023071316514707200_bib69","first-page":"1","article-title":"Auto-Sklearn 2.0: Hands-free autoML via meta-learning","volume":"23","author":"Feurer","year":"2022","journal-title":"Journal of Machine Learning Research"},{"key":"2023071316514707200_bib70","article-title":"Efficient and robust automated machine learning","volume":"28","author":"Feurer","year":"2015","journal-title":"Advances in Neural Information Processing 
Systems"},{"key":"2023071316514707200_bib71","first-page":"1183","article-title":"Deep Bayesian active learning with image data","volume-title":"International Conference on Machine Learning","author":"Gal","year":"2017"},{"key":"2023071316514707200_bib72","article-title":"The state of sparsity in deep neural networks","author":"Gale","year":"2019","journal-title":"arXiv preprint arXiv:1902.09574v1"},{"key":"2023071316514707200_bib73","doi-asserted-by":"crossref","first-page":"10786","DOI":"10.18653\/v1\/2022.emnlp-main.741","article-title":"EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Ge","year":"2022"},{"issue":"12","key":"2023071316514707200_bib74","doi-asserted-by":"publisher","first-page":"86","DOI":"10.1145\/3458723","article-title":"Datasheets for datasets","volume":"64","author":"Gebru","year":"2021","journal-title":"Communications of the ACM"},{"key":"2023071316514707200_bib75","article-title":"Discriminative active learning","author":"Gissin","year":"2019","journal-title":"arXiv preprint arXiv:1907.06347v1"},{"key":"2023071316514707200_bib76","doi-asserted-by":"publisher","first-page":"143","DOI":"10.18653\/v1\/2020.repl4nlp-1.18","article-title":"Compressing BERT: Studying the effects of weight pruning on transfer learning","volume-title":"Proceedings of the 5th Workshop on Representation Learning for NLP","author":"Gordon","year":"2020"},{"issue":"6","key":"2023071316514707200_bib77","doi-asserted-by":"publisher","first-page":"1789","DOI":"10.1007\/s11263-021-01453-z","article-title":"Knowledge distillation: A survey","volume":"129","author":"Gou","year":"2021","journal-title":"International Journal of Computer Vision"},{"key":"2023071316514707200_bib78","article-title":"On the parameterization and initialization of diagonal state space models","volume-title":"Advances in Neural Information Processing 
Systems","author":"Gu","year":"2022"},{"key":"2023071316514707200_bib79","article-title":"Efficiently modeling long sequences with structured state spaces","volume-title":"International Conference on Learning Representations","author":"Gu","year":"2022"},{"key":"2023071316514707200_bib80","article-title":"Search engine guided non-parametric neural machine translation","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Gu","year":"2018"},{"key":"2023071316514707200_bib81","article-title":"Sources of irreproducibility in machine learning: A review","author":"Gundersen","year":"2022","journal-title":"arXiv preprint arXiv:2204.07610v1"},{"key":"2023071316514707200_bib82","doi-asserted-by":"publisher","first-page":"4884","DOI":"10.18653\/v1\/2021.acl-long.378","article-title":"Parameter-efficient transfer learning with diff pruning","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Guo","year":"2021"},{"key":"2023071316514707200_bib83","article-title":"Diagonal state spaces are as effective as structured state spaces","volume-title":"Advances in Neural Information Processing Systems","author":"Gupta","year":"2022"},{"key":"2023071316514707200_bib84","doi-asserted-by":"crossref","first-page":"328","DOI":"10.1109\/HPCA47549.2020.00035","article-title":"A\u22273: Accelerating attention mechanisms in neural networks with approximation","volume-title":"2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)","author":"Ham","year":"2020"},{"key":"2023071316514707200_bib85","doi-asserted-by":"crossref","first-page":"692","DOI":"10.1109\/ISCA52012.2021.00060","article-title":"ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks","volume-title":"2021 ACM\/IEEE 48th Annual International 
Symposium on Computer Architecture (ISCA)","author":"Ham","year":"2021"},{"key":"2023071316514707200_bib86","article-title":"Learning both weights and connections for efficient neural networks","volume":"28","author":"Han","year":"2015","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2023071316514707200_bib87","doi-asserted-by":"crossref","first-page":"1403","DOI":"10.18653\/v1\/2022.findings-emnlp.101","article-title":"How much does attention actually attend? Questioning the importance of attention in pre-trained transformers","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2022","author":"Hassid","year":"2022"},{"key":"2023071316514707200_bib88","doi-asserted-by":"crossref","first-page":"120","DOI":"10.1145\/3503221.3508418","article-title":"FasterMoE: Modeling and optimizing training of large-scale dynamic pre-trained models","volume-title":"Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","author":"He","year":"2022"},{"key":"2023071316514707200_bib89","doi-asserted-by":"crossref","first-page":"5703","DOI":"10.18653\/v1\/2021.emnlp-main.461","article-title":"Efficient nearest neighbor language models","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"He","year":"2021"},{"key":"2023071316514707200_bib90","article-title":"Towards a unified view of parameter-efficient transfer learning","volume-title":"International Conference on Learning Representations","author":"He","year":"2022"},{"key":"2023071316514707200_bib91","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00502","article-title":"Rethinking ImageNet pre-training","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"He","year":"2019"},{"key":"2023071316514707200_bib92","article-title":"DeBERTaV3: Improving DeBERTa using electra-style pre-training with gradient-disentangled 
embedding sharing","volume-title":"The Eleventh International Conference on Learning Representations","author":"He","year":"2023"},{"issue":"248","key":"2023071316514707200_bib93","first-page":"1","article-title":"Towards the systematic reporting of the energy and carbon footprints of machine learning","volume":"21","author":"Henderson","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"2023071316514707200_bib94","doi-asserted-by":"crossref","first-page":"2480","DOI":"10.18653\/v1\/2022.emnlp-main.159","article-title":"Towards climate awareness in NLP research","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Hershcovich","year":"2022"},{"key":"2023071316514707200_bib95","doi-asserted-by":"crossref","first-page":"7817","DOI":"10.18653\/v1\/2022.emnlp-main.533","article-title":"Bridging fairness and environmental sustainability in natural language processing","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Hessenthaler","year":"2022"},{"key":"2023071316514707200_bib96","article-title":"The forward-forward algorithm: Some preliminary investigations","author":"Hinton","year":"2022","journal-title":"arXiv preprint arXiv:2212.13345v1"},{"key":"2023071316514707200_bib97","article-title":"Distilling the knowledge in a neural network","volume-title":"NeurIPS Deep Learning and Representation Learning Workshop","author":"Hinton","year":"2015"},{"issue":"241","key":"2023071316514707200_bib98","first-page":"1","article-title":"Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks","volume":"22","author":"Hoefler","year":"2021","journal-title":"Journal of Machine Learning Research"},{"key":"2023071316514707200_bib99","article-title":"An empirical analysis of compute-optimal large language model training","volume-title":"Advances in Neural Information Processing 
Systems","author":"Hoffmann","year":"2022"},{"key":"2023071316514707200_bib100","doi-asserted-by":"publisher","first-page":"58","DOI":"10.1145\/3467017","article-title":"The hardware lottery","volume":"64","author":"Hooker","year":"2021","journal-title":"Communications of the ACM"},{"key":"2023071316514707200_bib101","article-title":"Characterising bias in compressed models","author":"Hooker","year":"2020","journal-title":"arXiv preprint arXiv:2010.03058v1"},{"key":"2023071316514707200_bib102","article-title":"Parameter-efficient transfer learning for NLP","volume-title":"International Conference on Machine Learning","author":"Houlsby","year":"2019"},{"key":"2023071316514707200_bib103","article-title":"MobileNets: Efficient convolutional neural networks for mobile vision applications","author":"Howard","year":"2017","journal-title":"arXiv preprint arXiv:1704.04861v1"},{"key":"2023071316514707200_bib104","first-page":"8\u2013pp","article-title":"Towards efficient supercomputing: A quest for the right metric","volume-title":"19th IEEE International Parallel and Distributed Processing Symposium","author":"Hsu","year":"2005"},{"key":"2023071316514707200_bib105","article-title":"LoRA: Low-rank adaptation of large language models","volume-title":"International Conference on Learning Representations","author":"Hu","year":"2022"},{"key":"2023071316514707200_bib106","doi-asserted-by":"publisher","first-page":"6512","DOI":"10.18653\/v1\/2021.acl-long.509","article-title":"GhostBERT: Generate more features with cheap operations for BERT","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Huang","year":"2021"},{"key":"2023071316514707200_bib107","first-page":"4466","article-title":"Accurate post training quantization with small calibration sets","volume-title":"Proceedings of the 38th International Conference on 
Machine Learning","author":"Hubara","year":"2021"},{"key":"2023071316514707200_bib108","doi-asserted-by":"publisher","first-page":"124","DOI":"10.18653\/v1\/2020.sustainlp-1.17","article-title":"SqueezeBERT: What can computer vision teach NLP about efficient neural networks?","volume-title":"Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing","author":"Iandola","year":"2020"},{"key":"2023071316514707200_bib109","doi-asserted-by":"publisher","first-page":"12266","DOI":"10.1109\/CVPR52688.2022.01195","article-title":"How well do sparse imagenet models transfer?","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Iofinova","year":"2022"},{"issue":"1","key":"2023071316514707200_bib110","doi-asserted-by":"publisher","first-page":"79","DOI":"10.1162\/neco.1991.3.1.79","article-title":"Adaptive mixtures of local experts","volume":"3","author":"Jacobs","year":"1991","journal-title":"Neural Computation"},{"key":"2023071316514707200_bib111","first-page":"4651","article-title":"Perceiver: General perception with iterative attention","volume-title":"International conference on machine learning","author":"Jaegle","year":"2021"},{"key":"2023071316514707200_bib112","first-page":"240","article-title":"Non-stochastic best arm identification and hyperparameter optimization","volume-title":"Artificial intelligence and statistics","author":"Jamieson","year":"2016"},{"key":"2023071316514707200_bib113","volume-title":"The Coal Question; An Inquiry Concerning the Progress of the Nation, and the Probable Exhaustion of Our Coal Mines","author":"Jevons","year":"1866"},{"key":"2023071316514707200_bib114","doi-asserted-by":"crossref","first-page":"4147","DOI":"10.18653\/v1\/2021.findings-acl.363","article-title":"On the distribution, sparsity, and inference-time quantization of attention values in transformers","volume-title":"Findings of the Association for Computational Linguistics: 
ACL-IJCNLP 2021","author":"Ji","year":"2021"},{"key":"2023071316514707200_bib115","doi-asserted-by":"publisher","first-page":"4163","DOI":"10.18653\/v1\/2020.findings-emnlp.372","article-title":"TinyBERT: Distilling BERT for natural language understanding","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Jiao","year":"2020"},{"key":"2023071316514707200_bib116","article-title":"Scaling laws for neural language models","author":"Kaplan","year":"2020","journal-title":"arXiv preprint arXiv:2001.08361v1"},{"key":"2023071316514707200_bib117","doi-asserted-by":"publisher","first-page":"7265","DOI":"10.18653\/v1\/2021.acl-long.564","article-title":"Mind your outliers! Investigating the negative impact of outliers on active learning for visual question answering","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Karamcheti","year":"2021"},{"key":"2023071316514707200_bib118","article-title":"Compacter: Efficient low-rank hypercomplex adapter layers","volume-title":"Advances in Neural Information Processing Systems","author":"Mahabadi","year":"2021"},{"key":"2023071316514707200_bib119","doi-asserted-by":"publisher","first-page":"3638","DOI":"10.18653\/v1\/2022.acl-long.254","article-title":"Prompt-free and efficient few-shot learning with language models","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Mahabadi","year":"2022"},{"key":"2023071316514707200_bib120","first-page":"5156","article-title":"Transformers are RNNs: Fast autoregressive transformers with linear attention","volume-title":"International Conference on Machine Learning","author":"Katharopoulos","year":"2020"},{"key":"2023071316514707200_bib121","article-title":"Nearest neighbor machine 
translation","volume-title":"International Conference on Learning Representations","author":"Khandelwal","year":"2021"},{"key":"2023071316514707200_bib122","article-title":"Generalization through memorization: Nearest neighbor language models","volume-title":"International Conference on Learning Representations","author":"Khandelwal","year":"2020"},{"key":"2023071316514707200_bib123","first-page":"5506","article-title":"I-BERT: Integer-only BERT quantization","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"Kim","year":"2021"},{"key":"2023071316514707200_bib124","doi-asserted-by":"crossref","first-page":"1317","DOI":"10.18653\/v1\/D16-1139","article-title":"Sequence-level knowledge distillation","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing","author":"Kim","year":"2016"},{"key":"2023071316514707200_bib125","doi-asserted-by":"crossref","first-page":"280","DOI":"10.18653\/v1\/D19-5632","article-title":"From research to production and back: Ludicrously fast neural machine translation","volume-title":"Proceedings of the 3rd Workshop on Neural Generation and Translation","author":"Kim","year":"2019"},{"key":"2023071316514707200_bib126","article-title":"BatchBALD: Efficient and diverse batch acquisition for deep Bayesian active learning","volume-title":"Advances in Neural Information Processing Systems","author":"Kirsch","year":"2019"},{"key":"2023071316514707200_bib127","article-title":"Reformer: The efficient transformer","volume-title":"International Conference on Learning Representations","author":"Kitaev","year":"2020"},{"key":"2023071316514707200_bib128","doi-asserted-by":"crossref","first-page":"6982","DOI":"10.18653\/v1\/2020.acl-main.624","article-title":"From zero to hero: Human-in-the-loop entity linking in low resource domains","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational 
Linguistics","author":"Klie","year":"2020"},{"key":"2023071316514707200_bib129","doi-asserted-by":"publisher","first-page":"50","DOI":"10.1162\/tacl_a_00447","article-title":"Quality at a glance: An audit of web-crawled multilingual datasets","volume":"10","author":"Kreutzer","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2023071316514707200_bib130","article-title":"Self-paced learning for latent variable models","volume-title":"Advances in Neural Information Processing Systems","author":"Kumar","year":"2010"},{"key":"2023071316514707200_bib131","article-title":"FP8 quantization: The power of the exponent","volume-title":"Advances in Neural Information Processing Systems","author":"Kuzmin","year":"2022"},{"key":"2023071316514707200_bib132","doi-asserted-by":"publisher","first-page":"84","DOI":"10.18653\/v1\/2022.bigscience-1.8","article-title":"A holistic assessment of the carbon footprint of Noor, a very large Arabic language model","volume-title":"Proceedings of BigScience Episode #5 \u2013 Workshop on Challenges & Perspectives in Creating Large Language Models","author":"Lakim","year":"2022"},{"key":"2023071316514707200_bib133","article-title":"ALBERT: A lite BERT for self-supervised learning of language representations","volume-title":"International Conference on Learning Representations","author":"Lan","year":"2019"},{"key":"2023071316514707200_bib134","first-page":"1078","article-title":"Adversarial filters of dataset biases","volume-title":"Proceedings of the 37th International Conference on Machine Learning","author":"Bras","year":"2020"},{"key":"2023071316514707200_bib135","article-title":"Optimal brain damage","volume-title":"Advances in Neural Information Processing Systems","author":"LeCun","year":"1989"},{"issue":"2","key":"2023071316514707200_bib136","doi-asserted-by":"publisher","first-page":"343","DOI":"10.1162\/coli_a_00436","article-title":"Annotation curricula to implicitly train non-expert 
annotators","volume":"48","author":"Lee","year":"2022","journal-title":"Computational Linguistics"},{"key":"2023071316514707200_bib137","doi-asserted-by":"crossref","first-page":"4233","DOI":"10.18653\/v1\/2020.acl-main.390","article-title":"Empowering active learning to jointly optimize system and user demands","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Lee","year":"2020"},{"key":"2023071316514707200_bib138","doi-asserted-by":"crossref","first-page":"8424","DOI":"10.18653\/v1\/2022.acl-long.577","article-title":"Deduplicating training data makes language models better","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Lee","year":"2022"},{"key":"2023071316514707200_bib139","doi-asserted-by":"publisher","first-page":"4296","DOI":"10.18653\/v1\/2022.naacl-main.319","article-title":"FNet: Mixing tokens with Fourier transforms","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Lee-Thorp","year":"2022"},{"key":"2023071316514707200_bib140","article-title":"GShard: Scaling giant models with conditional computation and automatic sharding","volume-title":"International Conference on Learning Representations","author":"Lepikhin","year":"2021"},{"key":"2023071316514707200_bib141","doi-asserted-by":"crossref","DOI":"10.1017\/9781108684163","volume-title":"Mining of Massive Data Sets","author":"Leskovec","year":"2020"},{"key":"2023071316514707200_bib142","doi-asserted-by":"crossref","first-page":"3045","DOI":"10.18653\/v1\/2021.emnlp-main.243","article-title":"The power of scale for parameter-efficient prompt tuning","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing","author":"Lester","year":"2021"},{"key":"2023071316514707200_bib143","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1007\/978-1-4471-2099-5_1","article-title":"A sequential algorithm for training text classifiers","volume-title":"SIGIR \u201994","author":"Lewis","year":"1994"},{"key":"2023071316514707200_bib144","doi-asserted-by":"publisher","first-page":"7871","DOI":"10.18653\/v1\/2020.acl-main.703","article-title":"BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Lewis","year":"2020"},{"key":"2023071316514707200_bib145","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive NLP tasks","volume-title":"Advances in Neural Information Processing Systems","author":"Lewis","year":"2020"},{"key":"2023071316514707200_bib146","doi-asserted-by":"crossref","first-page":"8320","DOI":"10.18653\/v1\/2020.acl-main.738","article-title":"Active learning for coreference resolution using discrete annotation","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Li","year":"2020"},{"key":"2023071316514707200_bib147","article-title":"Measuring the intrinsic dimension of objective landscapes","volume-title":"International Conference on Learning Representations","author":"Li","year":"2018"},{"key":"2023071316514707200_bib148","article-title":"A survey on retrieval-augmented text generation","author":"Li","year":"2022","journal-title":"arXiv preprint arXiv:2202.01110v1"},{"key":"2023071316514707200_bib149","article-title":"A system for massively parallel hyperparameter tuning","volume-title":"Third Conference on Systems and Machine
Learning","author":"Li","year":"2020"},{"issue":"7","key":"2023071316514707200_bib150","doi-asserted-by":"publisher","first-page":"1866","DOI":"10.1109\/TPDS.2020.3047371","article-title":"Efficient methods for mapping neural machine translator on FPGAs","volume":"32","author":"Li","year":"2021","journal-title":"IEEE Transactions on Parallel and Distributed Systems"},{"key":"2023071316514707200_bib151","first-page":"4582","article-title":"Prefix-tuning: Optimizing continuous prompts for generation","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Li","year":"2021"},{"key":"2023071316514707200_bib152","article-title":"What makes convolutional models great on long sequence modeling?","author":"Li","year":"2022","journal-title":"arXiv preprint arXiv:2210.09298v1"},{"key":"2023071316514707200_bib153","first-page":"5958","article-title":"Train big, then compress: Rethinking model size for efficient training and inference of transformers","volume-title":"Proceedings of the 37th International Conference on Machine Learning","author":"Li","year":"2020"},{"key":"2023071316514707200_bib154","first-page":"54","article-title":"SMAC3: A versatile Bayesian optimization package for hyperparameter optimization","volume":"23","author":"Lindauer","year":"2022","journal-title":"Journal of Machine Learning Research"},{"key":"2023071316514707200_bib155","article-title":"Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning","volume-title":"Advances in Neural Information Processing Systems","author":"Liu","year":"2022"},{"key":"2023071316514707200_bib156","doi-asserted-by":"crossref","first-page":"334","DOI":"10.18653\/v1\/K18-1033","article-title":"Learning to actively learn neural machine translation","volume-title":"Proceedings of the 22nd Conference on Computational Natural Language
Learning","author":"Liu","year":"2018"},{"issue":"9","key":"2023071316514707200_bib157","doi-asserted-by":"publisher","DOI":"10.1145\/3560815","article-title":"Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing","volume":"55","author":"Liu","year":"2023","journal-title":"ACM Computing Surveys"},{"key":"2023071316514707200_bib158","doi-asserted-by":"crossref","first-page":"6035","DOI":"10.18653\/v1\/2020.acl-main.537","article-title":"FastBERT: A self-distilling BERT with adaptive inference time","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Liu","year":"2020"},{"key":"2023071316514707200_bib159","first-page":"3288","article-title":"Towards efficient NLP: A standard evaluation and a strong baseline","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Liu","year":"2022"},{"key":"2023071316514707200_bib160","article-title":"GPT understands, too","author":"Liu","year":"2021","journal-title":"arXiv preprint arXiv:2103.10385v1"},{"key":"2023071316514707200_bib161","first-page":"2286","article-title":"An empirical study on hyperparameter optimization for fine-tuning pre-trained language models","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Liu","year":"2021"},{"key":"2023071316514707200_bib162","doi-asserted-by":"crossref","DOI":"10.23919\/DATE51398.2021.9474043","article-title":"Hardware acceleration of fully quantized BERT for efficient natural language processing","volume-title":"Design, Automation & Test in Europe Conference & Exhibition (DATE)","author":"Liu","year":"2021"},{"key":"2023071316514707200_bib163","article-title":"Learning sparse neural networks through 
L0 regularization","volume-title":"International Conference on Learning Representations","author":"Louizos","year":"2018"},{"key":"2023071316514707200_bib164","doi-asserted-by":"crossref","first-page":"21","DOI":"10.18653\/v1\/D19-1003","article-title":"Practical obstacles to deploying active learning","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Lowell","year":"2019"},{"key":"2023071316514707200_bib165","first-page":"84","article-title":"Hardware accelerator for multi-head attention and position-wise feed-forward in the transformer","volume-title":"2020 IEEE 33rd International System-on-Chip Conference (SOCC)","author":"Siyuan","year":"2020"},{"key":"2023071316514707200_bib166","article-title":"Quantifying the carbon emissions of machine learning","volume-title":"NeurIPS 2019 Workshop on Tackling Climate Change with Machine Learning","author":"Luccioni","year":"2019"},{"key":"2023071316514707200_bib167","article-title":"Mega: Moving average equipped gated attention","volume-title":"The Eleventh International Conference on Learning Representations","author":"Ma","year":"2023"},{"issue":"4","key":"2023071316514707200_bib168","doi-asserted-by":"publisher","first-page":"1162","DOI":"10.3390\/su10041162","article-title":"Ensuring more sustainable reporting in Europe using non-financial disclosure\u2014De facto and de jure evidence","volume":"10","author":"Manes-Rossi","year":"2018","journal-title":"Sustainability"},{"key":"2023071316514707200_bib169","doi-asserted-by":"crossref","first-page":"650","DOI":"10.18653\/v1\/2021.emnlp-main.51","article-title":"Active learning by acquiring contrastive examples","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing","author":"Margatina","year":"2021"},{"key":"2023071316514707200_bib170","doi-asserted-by":"publisher","first-page":"23","DOI":"10.18653\/v1\/2022.spanlp-1.3","article-title":"Efficient machine translation domain adaptation","volume-title":"Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge","author":"Martins","year":"2022"},{"key":"2023071316514707200_bib171","doi-asserted-by":"crossref","first-page":"5468","DOI":"10.18653\/v1\/2022.acl-long.375","article-title":"\u221e-former: Infinite memory transformer","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Martins","year":"2022"},{"key":"2023071316514707200_bib172","doi-asserted-by":"crossref","first-page":"4228","DOI":"10.18653\/v1\/2022.emnlp-main.284","article-title":"Chunk-based nearest neighbor machine translation","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Martins","year":"2022"},{"key":"2023071316514707200_bib173","article-title":"Long range language modeling via gated state spaces","volume-title":"The Eleventh International Conference on Learning Representations","author":"Mehta","year":"2023"},{"key":"2023071316514707200_bib174","doi-asserted-by":"publisher","first-page":"555","DOI":"10.18653\/v1\/2022.findings-acl.47","article-title":"Fast nearest neighbor machine translation","volume-title":"Findings of the Association for Computational Linguistics: ACL 2022","author":"Meng","year":"2022"},{"key":"2023071316514707200_bib175","first-page":"14014","article-title":"Are sixteen heads really better than one?","volume-title":"Advances in Neural Information Processing Systems","author":"Michel","year":"2019"},{"key":"2023071316514707200_bib176","doi-asserted-by":"publisher","first-page":"169","DOI":"10.18653\/v1\/2020.sustainlp-1.23","article-title":"Do we need to create big datasets to learn a 
task?","volume-title":"Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing","author":"Mishra","year":"2020"},{"key":"2023071316514707200_bib177","doi-asserted-by":"crossref","first-page":"4308","DOI":"10.18653\/v1\/2022.findings-emnlp.317","article-title":"What do compressed multilingual machine translation models forget?","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2022","author":"Mohammadshahi","year":"2022"},{"key":"2023071316514707200_bib178","doi-asserted-by":"publisher","first-page":"3742","DOI":"10.18653\/v1\/2022.naacl-main.274","article-title":"Adaptable adapters","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Moosavi","year":"2022"},{"key":"2023071316514707200_bib179","first-page":"4646","article-title":"Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization","volume-title":"Proceedings of the 36th International Conference on Machine Learning","author":"Mostafa","year":"2019"},{"key":"2023071316514707200_bib180","article-title":"Multimodal contrastive learning with LIMoE: The language-image mixture of experts","volume-title":"Advances in Neural Information Processing Systems","author":"Mustafa","year":"2022"},{"key":"2023071316514707200_bib181","first-page":"512","article-title":"What is being transferred in transfer learning?","volume-title":"Advances in Neural Information Processing Systems","author":"Neyshabur","year":"2020"},{"key":"2023071316514707200_bib182","article-title":"8-bit numerical formats for deep neural networks","author":"Noune","year":"2022","journal-title":"arXiv preprint arXiv:2206.02915v1"},{"key":"2023071316514707200_bib183","doi-asserted-by":"crossref","first-page":"9092","DOI":"10.18653\/v1\/2022.emnlp-main.619","article-title":"Intriguing properties of compression on multilingual 
models","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Ogueji","year":"2022"},{"key":"2023071316514707200_bib184","volume-title":"Cours d\u2019\u00c9conomie Politique profess\u00e9 \u00e0 l\u2019Universit\u00e9 de Lausanne","author":"Pareto","year":"1896"},{"key":"2023071316514707200_bib185","article-title":"Carbon emissions and large neural network training","author":"Patterson","year":"2021","journal-title":"arXiv preprint arXiv: 2104.10350v3"},{"key":"2023071316514707200_bib186","article-title":"Random feature attention","volume-title":"International Conference on Learning Representations","author":"Peng","year":"2020"},{"key":"2023071316514707200_bib187","doi-asserted-by":"publisher","first-page":"2642","DOI":"10.18653\/v1\/2021.naacl-main.210","article-title":"Smoothing and shrinking the sparse seq2seq search space","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Peters","year":"2021"},{"key":"2023071316514707200_bib188","doi-asserted-by":"crossref","first-page":"1504","DOI":"10.18653\/v1\/P19-1146","article-title":"Sparse sequence-to-sequence models","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Peters","year":"2019"},{"key":"2023071316514707200_bib189","doi-asserted-by":"publisher","first-page":"2227","DOI":"10.18653\/v1\/N18-1202","article-title":"Deep contextualized word representations","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Peters","year":"2018"},{"key":"2023071316514707200_bib190","doi-asserted-by":"publisher","first-page":"2463","DOI":"10.18653\/v1\/D19-1250","article-title":"Language models as knowledge bases?","volume-title":"Proceedings of the 
2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Petroni","year":"2019"},{"key":"2023071316514707200_bib191","doi-asserted-by":"publisher","first-page":"46","DOI":"10.18653\/v1\/2020.emnlp-demos.7","article-title":"AdapterHub: A framework for adapting transformers","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations","author":"Pfeiffer","year":"2020"},{"key":"2023071316514707200_bib192","article-title":"Combining modular skills in multitask learning","author":"Ponti","year":"2022","journal-title":"arXiv preprint arXiv: 2202.13914v1"},{"key":"2023071316514707200_bib193","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/2020.findings-emnlp.1","article-title":"Fully quantized transformer for machine translation","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Prato","year":"2020"},{"key":"2023071316514707200_bib194","article-title":"Train short, test long: Attention with linear biases enables input length extrapolation","volume-title":"International Conference on Learning Representations","author":"Press","year":"2022"},{"key":"2023071316514707200_bib195","doi-asserted-by":"publisher","first-page":"5493","DOI":"10.18653\/v1\/2021.acl-long.427","article-title":"Shortformer: Better language modeling using shorter inputs","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Press","year":"2021"},{"key":"2023071316514707200_bib196","doi-asserted-by":"crossref","first-page":"96","DOI":"10.18653\/v1\/2021.sustainlp-1.12","article-title":"Hyperparameter power impact in transformer language model training","volume-title":"Proceedings of the Second Workshop on 
Simple and Efficient Natural Language Processing","author":"de Chavannes","year":"2021"},{"key":"2023071316514707200_bib197","first-page":"14","article-title":"DOTA: Detect and omit weak attentions for scalable transformer acceleration","volume-title":"Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems","author":"Zheng","year":"2022"},{"key":"2023071316514707200_bib198","doi-asserted-by":"publisher","first-page":"114","DOI":"10.18653\/v1\/N18-3014","article-title":"Pieces of eight: 8-bit neural machine translation","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)","author":"Quinn","year":"2018"},{"key":"2023071316514707200_bib199","article-title":"Learning to generate reviews and discovering sentiment","author":"Radford","year":"2017","journal-title":"arXiv preprint arXiv:1704.01444v2"},{"issue":"8","key":"2023071316514707200_bib200","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford","year":"2019","journal-title":"OpenAI blog"},{"key":"2023071316514707200_bib201","article-title":"Scaling language models: Methods, analysis & insights from training gopher","author":"Rae","year":"2021","journal-title":"arXiv preprint arXiv:2112.11446v2"},{"key":"2023071316514707200_bib202","article-title":"Compressive transformers for long-range sequence modelling","volume-title":"International Conference on Learning Representations","author":"Rae","year":"2020"},{"issue":"140","key":"2023071316514707200_bib203","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"Journal of Machine Learning 
Research"},{"key":"2023071316514707200_bib204","first-page":"18332","article-title":"DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale","volume-title":"Proceedings of the 39th International Conference on Machine Learning","author":"Rajbhandari","year":"2022"},{"key":"2023071316514707200_bib205","article-title":"Learning multiple visual domains with residual adapters","volume-title":"Advances in Neural Information Processing Systems","author":"Rebuffi","year":"2017"},{"key":"2023071316514707200_bib206","doi-asserted-by":"publisher","first-page":"4081","DOI":"10.18653\/v1\/2021.findings-emnlp.344","article-title":"Subformer: Exploring weight sharing for parameter efficiency in generative transformers","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2021","author":"Reid","year":"2021"},{"key":"2023071316514707200_bib207","doi-asserted-by":"publisher","first-page":"338","DOI":"10.18653\/v1\/D17-1035","article-title":"Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging","volume-title":"Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing","author":"Reimers","year":"2017"},{"key":"2023071316514707200_bib208","first-page":"551","article-title":"ZeRO-Offload: Democratizing billion-scale model training","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Ren","year":"2021"},{"issue":"9","key":"2023071316514707200_bib209","doi-asserted-by":"publisher","DOI":"10.1145\/3472291","article-title":"A survey of deep active learning","volume":"54","author":"Ren","year":"2021","journal-title":"ACM Computing Surveys"},{"key":"2023071316514707200_bib210","doi-asserted-by":"publisher","first-page":"99","DOI":"10.18653\/v1\/2021.acl-short.15","article-title":"Gender bias amplification during speed-quality optimization in neural machine translation","volume-title":"Proceedings of the 59th 
Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)","author":"Renduchintala","year":"2021"},{"key":"2023071316514707200_bib211","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1162\/tacl_a_00353","article-title":"Efficient content-based sparse attention with routing transformers","volume":"9","author":"Roy","year":"2021","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2023071316514707200_bib212","doi-asserted-by":"publisher","first-page":"7930","DOI":"10.18653\/v1\/2021.emnlp-main.626","article-title":"AdapterDrop: On the efficiency of adapters in transformers","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"R\u00fcckl\u00e9","year":"2021"},{"key":"2023071316514707200_bib213","article-title":"An overview of multi-task learning in deep neural networks","author":"Ruder","year":"2017","journal-title":"arXiv preprint arXiv:1706.05098v1"},{"key":"2023071316514707200_bib214","doi-asserted-by":"publisher","first-page":"101429","DOI":"10.1016\/j.csl.2022.101429","article-title":"On the effect of dropping layers of pre-trained transformer models","volume":"77","author":"Sajjad","year":"2023","journal-title":"Computer Speech & Language"},{"key":"2023071316514707200_bib215","article-title":"DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter","volume-title":"NeurIPS EMC2 Workshop","author":"Sanh","year":"2019"},{"key":"2023071316514707200_bib216","article-title":"Multitask prompted training enables zero-shot task generalization","volume-title":"International Conference on Learning Representations","author":"Sanh","year":"2022"},{"key":"2023071316514707200_bib217","first-page":"20378","article-title":"Movement pruning: Adaptive sparsity by fine-tuning","volume-title":"Advances in Neural Information Processing
Systems","author":"Sanh","year":"2020"},{"key":"2023071316514707200_bib218","doi-asserted-by":"publisher","first-page":"2823","DOI":"10.18653\/v1\/2021.eacl-main.246","article-title":"ProFormer: Towards on-device LSH projection based transformers","volume-title":"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume","author":"Sankar","year":"2021"},{"key":"2023071316514707200_bib219","doi-asserted-by":"publisher","first-page":"2339","DOI":"10.18653\/v1\/2021.naacl-main.185","article-title":"It\u2019s not just size that matters: Small language models are also few-shot learners","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Schick","year":"2021"},{"issue":"12","key":"2023071316514707200_bib220","doi-asserted-by":"publisher","first-page":"54","DOI":"10.1145\/3381831","article-title":"Green AI","volume":"63","author":"Schwartz","year":"2020","journal-title":"Communications of the ACM (CACM)"},{"key":"2023071316514707200_bib221","doi-asserted-by":"publisher","first-page":"6640","DOI":"10.18653\/v1\/2020.acl-main.593","article-title":"The right tool for the job: Matching model and instance complexities","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Schwartz","year":"2020"},{"key":"2023071316514707200_bib222","article-title":"Active learning for convolutional neural networks: A core-set approach","volume-title":"International Conference on Learning Representations","author":"Sener","year":"2018"},{"key":"2023071316514707200_bib223","volume-title":"Active Learning, volume 18 of Synthesis Lectures on Artificial Intelligence and Machine Learning","author":"Settles","year":"2012"},{"key":"2023071316514707200_bib224","article-title":"Active learning with real annotation costs","volume-title":"Proceedings of the NIPS 
workshop on cost-sensitive learning","author":"Settles","year":"2008"},{"key":"2023071316514707200_bib225","doi-asserted-by":"publisher","first-page":"464","DOI":"10.18653\/v1\/N18-2074","article-title":"Self-attention with relative position representations","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)","author":"Shaw","year":"2018"},{"key":"2023071316514707200_bib226","article-title":"Outrageously large neural networks: The sparsely-gated mixture-of-experts layer","volume-title":"International Conference on Learning Representations","author":"Shazeer","year":"2017"},{"issue":"5","key":"2023071316514707200_bib227","doi-asserted-by":"publisher","first-page":"8815","DOI":"10.1609\/aaai.v34i05.6409","article-title":"Q-BERT: Hessian based ultra low precision quantization of BERT","volume":"34","author":"Shen","year":"2020","journal-title":"Proceedings of the AAAI Conference on Artificial Intelligence"},{"key":"2023071316514707200_bib228","doi-asserted-by":"publisher","first-page":"4222","DOI":"10.18653\/v1\/2020.emnlp-main.346","article-title":"AutoPrompt: Eliciting knowledge from language models with automatically generated prompts","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Shin","year":"2020"},{"key":"2023071316514707200_bib229","article-title":"Metadata archaeology: Unearthing data subsets by leveraging training dynamics","author":"Siddiqui","year":"2022","journal-title":"arXiv preprint arXiv:2209.10015v1"},{"key":"2023071316514707200_bib230","doi-asserted-by":"publisher","first-page":"2383","DOI":"10.18653\/v1\/2021.naacl-main.189","article-title":"Towards a comprehensive understanding and accurate evaluation of societal biases in pre-trained transformers","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the 
Association for Computational Linguistics: Human Language Technologies","author":"Silva","year":"2021"},{"key":"2023071316514707200_bib231","article-title":"Practical Bayesian optimization of machine learning algorithms","volume-title":"Advances in Neural Information Processing Systems","author":"Snoek","year":"2012"},{"key":"2023071316514707200_bib232","first-page":"6906","article-title":"Does knowledge distillation really work?","volume-title":"Advances in Neural Information Processing Systems","author":"Stanton","year":"2021"},{"key":"2023071316514707200_bib233","article-title":"Training with quantization noise for extreme model compression","volume-title":"International Conference on Learning Representations","author":"Stock","year":"2021"},{"key":"2023071316514707200_bib234","doi-asserted-by":"publisher","first-page":"3645","DOI":"10.18653\/v1\/P19-1355","article-title":"Energy and policy considerations for deep learning in NLP","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Strubell","year":"2019"},{"key":"2023071316514707200_bib235","doi-asserted-by":"crossref","first-page":"2158","DOI":"10.18653\/v1\/2020.acl-main.195","article-title":"MobileBERT: A compact task-agnostic BERT for resource-limited devices","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Sun","year":"2020"},{"key":"2023071316514707200_bib236","first-page":"24193","article-title":"Training neural networks with fixed sparse masks","volume-title":"Advances in Neural Information Processing Systems","author":"Sung","year":"2021"},{"key":"2023071316514707200_bib237","doi-asserted-by":"publisher","first-page":"9275","DOI":"10.18653\/v1\/2020.emnlp-main.746","article-title":"Dataset cartography: Mapping and diagnosing datasets with training dynamics","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing 
(EMNLP)","author":"Swayamdipta","year":"2020"},{"key":"2023071316514707200_bib238","doi-asserted-by":"publisher","first-page":"830","DOI":"10.1145\/3466752.3480095","article-title":"EdgeBERT: Sentence-level energy optimizations for latency-aware multi-task NLP inference","volume-title":"MICRO-54: 54th Annual IEEE\/ACM International Symposium on Microarchitecture","author":"Tambe","year":"2021"},{"key":"2023071316514707200_bib239","first-page":"120","article-title":"Active learning for statistical natural language parsing","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Tang","year":"2002"},{"key":"2023071316514707200_bib240","article-title":"Long range arena: A benchmark for efficient transformers","volume-title":"International Conference on Learning Representations","author":"Tay","year":"2021"},{"key":"2023071316514707200_bib241","doi-asserted-by":"publisher","DOI":"10.1145\/3530811","article-title":"Efficient transformers: A survey","author":"Tay","year":"2022","journal-title":"ACM Computing Surveys"},{"key":"2023071316514707200_bib242","first-page":"4922","article-title":"Simple and effective curriculum pointer-generator networks for reading comprehension over long narratives","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Tay","year":"2019"},{"key":"2023071316514707200_bib243","article-title":"Keep the gradients flowing: Using gradient flow to study sparse network optimization","author":"Tessera","year":"2021","journal-title":"arXiv preprint arXiv:2102.01670v2"},{"key":"2023071316514707200_bib244","article-title":"The computational limits of deep learning","author":"Thompson","year":"2020","journal-title":"arXiv preprint arXiv:2007.05558v1"},{"key":"2023071316514707200_bib245","doi-asserted-by":"crossref","first-page":"67","DOI":"10.18653\/v1\/2022.spnlp-1.7","article-title":"Predicting attention sparsity in 
transformers","volume-title":"Proceedings of the Sixth Workshop on Structured Prediction for NLP","author":"Treviso","year":"2022"},{"key":"2023071316514707200_bib246","first-page":"1","article-title":"DyLoRA: Parameter efficient tuning of pre-trained models using dynamic search-free low rank adaptation","volume-title":"2nd Workshop on Efficient Natural Language and Speech Processing (NeurIPS workshops)","author":"Valipour","year":"2022"},{"key":"2023071316514707200_bib247","doi-asserted-by":"publisher","first-page":"5797","DOI":"10.18653\/v1\/P19-1580","article-title":"Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Voita","year":"2019"},{"key":"2023071316514707200_bib248","doi-asserted-by":"crossref","first-page":"1074","DOI":"10.18653\/v1\/2020.emnlp-main.80","article-title":"Self-paced learning for neural machine translation","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Wan","year":"2020"},{"key":"2023071316514707200_bib249","doi-asserted-by":"publisher","first-page":"7675","DOI":"10.18653\/v1\/2020.acl-main.686","article-title":"HAT: Hardware-aware transformers for efficient natural language processing","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Wang","year":"2020"},{"key":"2023071316514707200_bib250","doi-asserted-by":"publisher","first-page":"97","DOI":"10.1109\/HPCA51647.2021.00018","article-title":"SpAtten: Efficient sparse attention architecture with cascade token and head pruning","volume-title":"2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)","author":"Wang","year":"2021"},{"key":"2023071316514707200_bib251","article-title":"Robust distillation for worst-class 
performance","author":"Wang","year":"2022","journal-title":"arXiv preprint arXiv:2206.06479v1"},{"key":"2023071316514707200_bib252","article-title":"Faster nearest neighbor machine translation","author":"Wang","year":"2021","journal-title":"arXiv preprint arXiv:2112.08152v1"},{"key":"2023071316514707200_bib253","doi-asserted-by":"crossref","first-page":"5744","DOI":"10.18653\/v1\/2022.emnlp-main.388","article-title":"AdaMix: Mixture-of-adaptations for parameter-efficient model tuning","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Wang","year":"2022"},{"key":"2023071316514707200_bib254","doi-asserted-by":"crossref","first-page":"6151","DOI":"10.18653\/v1\/2020.emnlp-main.496","article-title":"Structured pruning of large language models","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Wang","year":"2020"},{"key":"2023071316514707200_bib255","article-title":"Finetuned language models are zero-shot learners","volume-title":"International Conference on Learning Representations","author":"Wei","year":"2022"},{"key":"2023071316514707200_bib256","article-title":"Emergent abilities of large language models","author":"Wei","year":"2022","journal-title":"Transactions on Machine Learning Research"},{"key":"2023071316514707200_bib257","first-page":"11058","article-title":"Meta-learning hyperparameter performance prediction with neural processes","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"Wei","year":"2021"},{"key":"2023071316514707200_bib258","article-title":"Should you mask 15% in masked language modeling?","author":"Wettig","year":"2022","journal-title":"arXiv preprint arXiv:2202.08005v1"},{"key":"2023071316514707200_bib259","first-page":"795","article-title":"Sustainable AI: Environmental implications, challenges and opportunities","volume-title":"Proceedings of Machine Learning and 
Systems","author":"Wu","year":"2022"},{"key":"2023071316514707200_bib260","article-title":"Extreme compression for pre-trained transformers made simple and efficient","volume-title":"Advances in Neural Information Processing Systems","author":"Wu","year":"2022"},{"key":"2023071316514707200_bib261","article-title":"Lite transformer with long-short range attention","volume-title":"International Conference on Learning Representations","author":"Wu","year":"2020"},{"key":"2023071316514707200_bib262","doi-asserted-by":"crossref","first-page":"1513","DOI":"10.18653\/v1\/2022.acl-long.107","article-title":"Structured pruning learns compact and accurate models","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Xia","year":"2022"},{"key":"2023071316514707200_bib263","doi-asserted-by":"publisher","first-page":"2246","DOI":"10.18653\/v1\/2020.acl-main.204","article-title":"DeeBERT: Dynamic early exiting for accelerating BERT inference","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Xin","year":"2020"},{"key":"2023071316514707200_bib264","first-page":"6095","article-title":"Curriculum learning for natural language understanding","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Xu","year":"2020"},{"key":"2023071316514707200_bib265","article-title":"A survey on dynamic neural networks for natural language processing","volume-title":"Findings of EACL","author":"Xu","year":"2023"},{"key":"2023071316514707200_bib266","first-page":"10653","article-title":"Beyond preserved accuracy: Evaluating loyalty and robustness of BERT compression","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Xu","year":"2021"},{"key":"2023071316514707200_bib267","article-title":"Can model 
compression improve NLP fairness","author":"Xu","year":"2022","journal-title":"arXiv preprint arXiv:2201.08542v1"},{"key":"2023071316514707200_bib268","first-page":"17084","article-title":"Tuning large neural networks via zero-shot hyperparameter transfer","volume-title":"Advances in Neural Information Processing Systems","author":"Yang","year":"2021"},{"key":"2023071316514707200_bib269","doi-asserted-by":"publisher","first-page":"362","DOI":"10.1162\/tacl_a_00371","article-title":"Adaptive semiparametric language models","volume":"9","author":"Yogatama","year":"2021","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2023071316514707200_bib270","doi-asserted-by":"crossref","first-page":"7935","DOI":"10.18653\/v1\/2020.emnlp-main.637","article-title":"Cold-start active learning through self-supervised language modeling","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Yuan","year":"2020"},{"key":"2023071316514707200_bib271","doi-asserted-by":"publisher","first-page":"7533","DOI":"10.18653\/v1\/2022.acl-long.519","article-title":"Adapting coreference resolution models through active learning","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Yuan","year":"2022"},{"key":"2023071316514707200_bib272","doi-asserted-by":"publisher","first-page":"811","DOI":"10.1109\/MICRO50266.2020.00071","article-title":"GOBO: Quantizing attention-based NLP models for low latency and energy efficient inference","volume-title":"2020 53rd Annual IEEE\/ACM International Symposium on Microarchitecture (MICRO)","author":"Zadeh","year":"2020"},{"key":"2023071316514707200_bib273","doi-asserted-by":"publisher","first-page":"888","DOI":"10.1145\/3470496.3527438","article-title":"Mokey: Enabling narrow fixed-point inference for out-of-the-box floating-point transformer 
models","volume-title":"Proceedings of the 49th Annual International Symposium on Computer Architecture","author":"Zadeh","year":"2022"},{"key":"2023071316514707200_bib274","doi-asserted-by":"publisher","first-page":"36","DOI":"10.1109\/EMC2-NIPS53020.2019.00016","article-title":"Q8BERT: Quantized 8bit BERT","volume-title":"2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS)","author":"Zafrir","year":"2019"},{"key":"2023071316514707200_bib275","article-title":"Prune once for all: Sparse pre-trained language models","author":"Zafrir","year":"2021","journal-title":"arXiv preprint arXiv:2111.05754v1"},{"key":"2023071316514707200_bib276","first-page":"17283","article-title":"Big bird: Transformers for longer sequences","volume-title":"Advances in Neural Information Processing Systems","author":"Zaheer","year":"2020"},{"key":"2023071316514707200_bib277","doi-asserted-by":"publisher","first-page":"93","DOI":"10.18653\/v1\/D18-1009","article-title":"SWAG: A large-scale adversarial dataset for grounded commonsense inference","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Zellers","year":"2018"},{"key":"2023071316514707200_bib278","article-title":"An attention free transformer","author":"Zhai","year":"2021","journal-title":"arXiv preprint arXiv:2105.14103v1"},{"key":"2023071316514707200_bib279","article-title":"OPT: Open pre-trained transformer language models","author":"Zhang","year":"2022","journal-title":"arXiv preprint arXiv:2205.01068v4"},{"key":"2023071316514707200_bib280","doi-asserted-by":"publisher","first-page":"509","DOI":"10.18653\/v1\/2020.emnlp-main.37","article-title":"TernaryBERT: Distillation-aware ultra-low bit BERT","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing 
(EMNLP)","author":"Zhang","year":"2020"},{"key":"2023071316514707200_bib281","doi-asserted-by":"crossref","first-page":"393","DOI":"10.1162\/tacl_a_00322","article-title":"Reproducible and efficient benchmarks for hyperparameter optimization of neural machine translation systems","volume":"8","author":"Zhang","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2023071316514707200_bib282","doi-asserted-by":"publisher","first-page":"1903","DOI":"10.18653\/v1\/N19-1189","article-title":"Curriculum learning for domain adaptation in neural machine translation","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Zhang","year":"2019"},{"key":"2023071316514707200_bib283","first-page":"9652","article-title":"Reinforced curriculum learning on pre-trained neural machine translation models","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Zhao","year":"2020"},{"key":"2023071316514707200_bib284","doi-asserted-by":"crossref","first-page":"6934","DOI":"10.18653\/v1\/2020.acl-main.620","article-title":"Uncertainty-aware curriculum learning for neural machine translation","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Zhou","year":"2020"},{"key":"2023071316514707200_bib285","doi-asserted-by":"crossref","first-page":"1284","DOI":"10.18653\/v1\/2021.findings-emnlp.111","article-title":"Combining curriculum learning and knowledge distillation for dialogue generation","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2021","author":"Zhu","year":"2021"},{"key":"2023071316514707200_bib286","article-title":"Teach less, learn more: On the undistillable classes in knowledge distillation","volume-title":"Advances in Neural Information Processing 
Systems","author":"Zhu","year":"2022"},{"issue":"9","key":"2023071316514707200_bib287","doi-asserted-by":"publisher","first-page":"3079","DOI":"10.1109\/TPAMI.2021.3067763","article-title":"Auto-PyTorch: Multi-fidelity metalearning for efficient and robust autoDL","volume":"43","author":"Zimmer","year":"2021","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"2023071316514707200_bib288","doi-asserted-by":"publisher","first-page":"1044","DOI":"10.1109\/IPDPSW55747.2022.00171","article-title":"Designing effective sparse expert models","volume-title":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","author":"Zoph","year":"2022"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00577\/2143614\/tacl_a_00577.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00577\/2143614\/tacl_a_00577.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,12,17]],"date-time":"2023-12-17T01:18:22Z","timestamp":1702775902000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00577\/116725\/Efficient-Methods-for-Natural-Language-Processing"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023]]},"references-count":288,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00577","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023]]},"published":{"date-parts":[[2023]]}}}