{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T09:20:48Z","timestamp":1775640048546,"version":"3.50.1"},"reference-count":66,"publisher":"MIT Press - Journals","license":[{"start":{"date-parts":[[2021,12,23]],"date-time":"2021-12-23T00:00:00Z","timestamp":1640217600000},"content-version":"vor","delay-in-days":356,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,12,17]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer\u2019s multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. ntuitively, our method learns per- head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. he importance variables are learned via stochastic gradient descent. e conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.1<\/jats:p>","DOI":"10.1162\/tacl_a_00436","type":"journal-article","created":{"date-parts":[[2021,12,24]],"date-time":"2021-12-24T05:57:49Z","timestamp":1640325469000},"page":"1442-1459","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":22,"title":["Differentiable Subset Pruning of Transformer Heads"],"prefix":"10.1162","volume":"9","author":[{"given":"Jiaoda","family":"Li","sequence":"first","affiliation":[{"name":"ETH Z\u00fcrich, Switzerland. jiaoda.li@inf.ethz.ch"}]},{"given":"Ryan","family":"Cotterell","sequence":"additional","affiliation":[{"name":"ETH Z\u00fcrich, Switzerland"},{"name":"University of Cambridge, UK. ryan.cotterell@inf.ethz.ch"}]},{"given":"Mrinmaya","family":"Sachan","sequence":"additional","affiliation":[{"name":"ETH Z\u00fcrich, Switzerland. mrinmaya.sachan@inf.ethz.ch"}]}],"member":"281","published-online":{"date-parts":[[2021,12,17]]},"reference":[{"key":"2021122316152495400_bib1","article-title":"Neural machine translation by jointly\n                        learning to align and translate","volume-title":"3rd\n                        International Conference on Learning\n                    Representations","author":"Bahdanau","year":"2015"},{"key":"2021122316152495400_bib2","doi-asserted-by":"crossref","first-page":"2664","DOI":"10.18653\/v1\/2020.emnlp-main.211","article-title":"Losing heads in the lottery: Pruning\n                        transformer attention in neural machine translation","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural\n                        Language Processing (EMNLP)","author":"Behnke","year":"2020"},{"key":"2021122316152495400_bib3","article-title":"Estimating or propagating gradients\n                        through stochastic neurons for conditional computation","author":"Bengio","year":"2013","journal-title":"CoRR"},{"key":"2021122316152495400_bib4","doi-asserted-by":"crossref","first-page":"3909","DOI":"10.18653\/v1\/2020.acl-main.360","article-title":"Successfully applying the stabilized lottery ticket\n                        hypothesis to the transformer architecture","volume-title":"Proceedings of the 58th Annual Meeting of the Association for\n                        Computational Linguistics","author":"Brix","year":"2020"},{"key":"2021122316152495400_bib5","first-page":"1877","article-title":"Language models are few-shot\n                        learners","volume-title":"Advances in Neural Information\n                        Processing Systems","author":"Brown","year":"2020"},{"key":"2021122316152495400_bib6","doi-asserted-by":"crossref","first-page":"3230","DOI":"10.18653\/v1\/2020.emnlp-main.260","article-title":"On the weak link between importance and\n                        prunability of attention heads","volume-title":"Proceedings of\n                        the 2020 Conference on Empirical Methods in Natural Language Processing\n                        (EMNLP)","author":"Budhraja","year":"2020"},{"key":"2021122316152495400_bib7","article-title":"Report on the 11th IWSLT\n                        evaluation campaign","volume-title":"Proceedings of the\n                        International Workshop on Spoken Language Translation, Hanoi,\n                        Vietnam","author":"Cettolo","year":"2014"},{"key":"2021122316152495400_bib8","doi-asserted-by":"crossref","first-page":"276","DOI":"10.18653\/v1\/W19-4828","article-title":"What does BERT look at? An analysis of\n                        BERT\u2019s attention","volume-title":"Proceedings of the 2019\n                        ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for\n                        NLP","author":"Clark","year":"2019"},{"key":"2021122316152495400_bib9","doi-asserted-by":"crossref","first-page":"1182","DOI":"10.18653\/v1\/P17-1109","article-title":"Probabilistic typology: Deep generative\n                        models of vowel inventories","volume-title":"Proceedings of the\n                        55th Annual Meeting of the Association for Computational Linguistics (Volume\n                        1: Long Papers)","author":"Cotterell","year":"2017"},{"key":"2021122316152495400_bib10","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional\n                        transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of\n                        the Association for Computational Linguistics: Human Language Technologies,\n                        Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2021122316152495400_bib11","doi-asserted-by":"crossref","first-page":"34","DOI":"10.1162\/tacl_a_00298","article-title":"What BERT is not: Lessons from a new suite\n                        of psycholinguistic diagnostics for language models","volume":"8","author":"Ettinger","year":"2020","journal-title":"Transactions of the Association for Computational\n                        Linguistics"},{"key":"2021122316152495400_bib12","article-title":"Reducing transformer depth on demand with\n                        structured dropout","volume-title":"8th International Conference\n                        on Learning Representations","author":"Fan","year":"2020"},{"key":"2021122316152495400_bib13","article-title":"The lottery ticket hypothesis: Finding\n                        sparse, trainable neural networks","volume-title":"7th\n                        International Conference on Learning\n                    Representations","author":"Frankle","year":"2019"},{"key":"2021122316152495400_bib14","first-page":"1050","article-title":"Dropout as a Bayesian approximation:\n                        Representing model uncertainty in deep learning","volume-title":"Proceedings of The 33rd International Conference on Machine\n                        Learning","author":"Gal","year":"2016"},{"key":"2021122316152495400_bib15","first-page":"710","article-title":"Discovering diverse and salient threads in\n                        document collections","volume-title":"Proceedings of the 2012\n                        Joint Conference on Empirical Methods in Natural Language Processing and\n                        Computational Natural Language Learning","author":"Gillenwater","year":"2012"},{"key":"2021122316152495400_bib16","article-title":"Assessing BERT\u2019s syntactic\n                        abilities","author":"Goldberg","year":"2019","journal-title":"CoRR"},{"key":"2021122316152495400_bib17","article-title":"Learning sparse networks using targeted\n                        dropout","author":"Gomez","year":"2019","journal-title":"CoRR"},{"key":"2021122316152495400_bib18","doi-asserted-by":"crossref","first-page":"143","DOI":"10.18653\/v1\/2020.repl4nlp-1.18","article-title":"Compressing BERT: Studying the effects of\n                        weight pruning on transfer learning","volume-title":"Proceedings\n                        of the 5th Workshop on Representation Learning for NLP","author":"Gordon","year":"2020"},{"issue":"527","key":"2021122316152495400_bib19","first-page":"792","article-title":"Statistical theory of extreme values and\n                        some practical applications","volume":"58","author":"Gumbel","year":"1954","journal-title":"Journal of the Royal\n                        Aeronautical Society"},{"key":"2021122316152495400_bib20","article-title":"Dynamic network surgery for efficient DNNs","volume-title":"Advances in Neural Information Processing Systems","author":"Guo","year":"2016"},{"key":"2021122316152495400_bib21","doi-asserted-by":"crossref","first-page":"243","DOI":"10.1109\/ISCA.2016.30","article-title":"EIE: Efficient inference engine on compressed\n                        deep neural network","volume-title":"2016 ACM\/IEEE 43rd Annual\n                        International Symposium on Computer Architecture (ISCA)","author":"Han","year":"2016"},{"key":"2021122316152495400_bib22","article-title":"Deep compression: Compressing deep neural\n                        network with pruning, trained quantization and huffman\n                        coding","volume-title":"4th International Conference on Learning\n                        Representations","author":"Han","year":"2016"},{"key":"2021122316152495400_bib23","article-title":"Learning both weights and connections for\n                        efficient neural network","volume-title":"Advances in Neural\n                        Information Processing Systems","author":"Han","year":"2015"},{"key":"2021122316152495400_bib24","article-title":"Optimal brain surgeon: Extensions and\n                        performance comparisons","volume-title":"Advances in Neural\n                        Information Processing Systems","author":"Hassibi","year":"1994"},{"key":"2021122316152495400_bib25","doi-asserted-by":"crossref","first-page":"1398","DOI":"10.1109\/ICCV.2017.155","article-title":"Channel pruning for accelerating very deep neural\n                        networks","volume-title":"2017 IEEE International Conference on\n                        Computer Vision (ICCV)","author":"He","year":"2017"},{"key":"2021122316152495400_bib26","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-030-01270-0_19","article-title":"Data- driven sparse structure selection for deep neural\n                        networks","volume-title":"Proceedings of the European Conference\n                        on Computer Vision (ECCV)","author":"Huang","year":"2018"},{"key":"2021122316152495400_bib27","article-title":"Categorical reparameterization with Gumbel-\n                        softmax","volume-title":"5th International Conference on Learning\n                        Representations","author":"Jang","year":"2017"},{"key":"2021122316152495400_bib28","doi-asserted-by":"crossref","first-page":"3651","DOI":"10.18653\/v1\/P19-1356","article-title":"What does BERT learn about the structure\n                        of language?","volume-title":"Proceedings of the 57th Annual\n                        Meeting of the Association for Computational Linguistics","author":"Jawahar","year":"2019"},{"key":"2021122316152495400_bib29","article-title":"Variational dropout and the local\n                        reparameterization trick","volume-title":"Advances in Neural\n                        Information Processing Systems","author":"Kingma","year":"2015"},{"key":"2021122316152495400_bib30","first-page":"177","article-title":"Moses: Open source toolkit for statistical\n                        machine translation","volume-title":"Proceedings of the 45th\n                        Annual Meeting of the Association for Computational Linguistics Companion\n                        Volume Proceedings of the Demo and Poster Sessions","author":"Koehn","year":"2007"},{"key":"2021122316152495400_bib31","first-page":"3499","article-title":"Stochastic beams and where to find them:\n                        The Gumbel-top-k trick for sampling sequences without\n                        replacement","volume-title":"Proceedings of the 36th\n                        International Conference on Machine Learning, volume 97 of Proceedings of\n                        Machine Learning Research","author":"Kool","year":"2019"},{"key":"2021122316152495400_bib32","article-title":"Optimal brain damage","volume-title":"Advances in Neural Information Processing Systems","author":"LeCun","year":"1990"},{"key":"2021122316152495400_bib33","article-title":"Pruning filters for efficient convnets","volume-title":"5th International Conference on Learning\n                    Representations","author":"Li","year":"2017"},{"key":"2021122316152495400_bib34","first-page":"2736","article-title":"Learning efficient convolutional networks\n                        through network slimming","volume-title":"Proceedings of the IEEE\n                        International Conference on Computer Vision","author":"Liu","year":"2017"},{"key":"2021122316152495400_bib35","article-title":"Learning sparse neural networks through\n                            L0 regularization","volume-title":"6th International Conference on Learning\n                    Representations","author":"Louizos","year":"2018"},{"key":"2021122316152495400_bib36","doi-asserted-by":"crossref","DOI":"10.1109\/ICCV.2017.541","article-title":"Thinet: A filter level pruning method for deep neural network\n                        compression","volume-title":"Proceedings of the IEEE\n                        International Conference on Computer Vision\n                    (ICCV)","author":"Luo","year":"2017"},{"key":"2021122316152495400_bib37","article-title":"The concrete distribution: A continuous relaxation of\n                        discrete random variables","volume-title":"5th International\n                        Conference on Learning Representations","author":"Maddison","year":"2017"},{"key":"2021122316152495400_bib38","article-title":"A* sampling","volume-title":"Advances in Neural Information Processing Systems","author":"Maddison","year":"2014"},{"key":"2021122316152495400_bib39","article-title":"Structured pruning of a BERT-based question answering\n                        model","author":"McCarley","year":"2021","journal-title":"CoRR"},{"key":"2021122316152495400_bib40","article-title":"Are sixteen heads really better than\n                        one?","volume-title":"Advances in Neural Information Processing\n                        Systems","author":"Michel","year":"2019"},{"key":"2021122316152495400_bib41","first-page":"2498","article-title":"Variational dropout sparsifies deep\n                        neural networks","volume-title":"Proceedings of the 34th\n                        International Conference on Machine Learning, volume 70 of Proceedings of\n                        Machine Learning Research","author":"Molchanov","year":"2017"},{"key":"2021122316152495400_bib42","article-title":"Pruning convolutional neural networks for\n                        resource efficient inference","volume-title":"5th International\n                        Conference on Learning Representations","author":"Molchanov","year":"2017"},{"key":"2021122316152495400_bib43","doi-asserted-by":"crossref","first-page":"314","DOI":"10.18653\/v1\/W19-5333","article-title":"Facebook FAIR\u2019s WMT19 news\n                        translation task submission","volume-title":"Proceedings of the\n                        Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day\n                        1)","author":"Ng","year":"2019"},{"key":"2021122316152495400_bib44","first-page":"48","article-title":"fairseq: A fast, extensible toolkit for sequence\n                        modeling","volume-title":"Proceedings of the 2019 Conference of\n                        the North American Chapter of the Association for Computational Linguistics\n                        (Demonstrations)","author":"Ott","year":"2019"},{"key":"2021122316152495400_bib45","article-title":"Neural nearest neighbors networks","volume-title":"Advances in Neural Information Processing Systems","author":"Pl\u00f6tz","year":"2018"},{"key":"2021122316152495400_bib46","doi-asserted-by":"crossref","first-page":"3208","DOI":"10.18653\/v1\/2020.emnlp-main.259","article-title":"When BERT plays the lottery, all tickets\n                        are winning","volume-title":"Proceedings of the 2020 Conference\n                        on Empirical Methods in Natural Language Processing (EMNLP)","author":"Prasanna","year":"2020"},{"key":"2021122316152495400_bib47","article-title":"Language models are unsupervised multitask\n                        learners","author":"Radford","year":"2019"},{"key":"2021122316152495400_bib48","article-title":"On the effect of dropping layers of\n                        pre-trained transformer models","author":"Sajjad","year":"2021","journal-title":"CoRR"},{"key":"2021122316152495400_bib49","first-page":"20378","article-title":"Movement pruning: Adaptive sparsity by\n                        fine-tuning","volume-title":"Advances in Neural Information\n                        Processing Systems","author":"Sanh","year":"2020"},{"key":"2021122316152495400_bib50","first-page":"1929","article-title":"Dropout: A simple way to prevent neural\n                        networks from overfitting","volume":"15","author":"Srivastava","year":"2014","journal-title":"Journal of Machine\n                        Learning Research"},{"key":"2021122316152495400_bib51","article-title":"Faster gaze prediction with dense networks\n                        and Fisher pruning","author":"Theis","year":"2018","journal-title":"CoRR"},{"key":"2021122316152495400_bib52","article-title":"REBAR: Low-variance, unbiased gradient\n                        estimates for discrete latent variable models","volume-title":"Advances in Neural Information Processing Systems","author":"Tucker","year":"2017"},{"key":"2021122316152495400_bib53","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"2021122316152495400_bib54","article-title":"Gumbel-max trick and weighted reservoir\n                        sampling","author":"Vieira","year":"2014"},{"key":"2021122316152495400_bib55","article-title":"On the distribution function of order\n                        statistics","author":"Vieira","year":"2021"},{"key":"2021122316152495400_bib56","article-title":"On the distribution of the smallest\n                        indices","author":"Vieira","year":"2021"},{"issue":"1","key":"2021122316152495400_bib57","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1145\/3147.3165","article-title":"Random sampling with a\n                        reservoir","volume":"11","author":"Vitter","year":"1985","journal-title":"ACM Transactions on Mathematical\n                        Software"},{"key":"2021122316152495400_bib58","doi-asserted-by":"crossref","first-page":"5797","DOI":"10.18653\/v1\/P19-1580","article-title":"Analyzing multi-head self-attention:\n                        Specialized heads do the heavy lifting, the rest can be\n                        pruned","volume-title":"Proceedings of the 57th Annual Meeting of\n                        the Association for Computational Linguistics","author":"Voita","year":"2019"},{"key":"2021122316152495400_bib59","article-title":"Learning structured sparsity in deep neural\n                        networks","volume-title":"Advances in Neural Information\n                        Processing Systems","author":"Wen","year":"2016"},{"key":"2021122316152495400_bib60","first-page":"1112","article-title":"A broad-coverage challenge corpus for\n                        sentence understanding through inference","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of\n                        the Association for Computational Linguistics: Human Language Technologies,\n                        Volume 1 (Long Papers)","author":"Williams","year":"2018"},{"key":"2021122316152495400_bib61","doi-asserted-by":"crossref","first-page":"38","DOI":"10.18653\/v1\/2020.emnlp-demos.6","article-title":"Transformers: State-of-the-art natural language\n                        processing","volume-title":"Proceedings of the 2020 Conference on\n                        Empirical Methods in Natural Language Processing: System\n                        Demonstrations","author":"Wolf","year":"2020"},{"key":"2021122316152495400_bib62","doi-asserted-by":"crossref","DOI":"10.24963\/ijcai.2019\/544","article-title":"Reparameterizable subset sampling via\n                        continuous relaxations","volume-title":"International Joint\n                        Conference on Artificial Intelligence","author":"Xie","year":"2019"},{"key":"2021122316152495400_bib63","article-title":"XLNet: Generalized autoregressive pretraining for language\n                        understanding","volume-title":"Advances in Neural Information\n                        Processing Systems","author":"Yang","year":"2019"},{"issue":"2","key":"2021122316152495400_bib64","doi-asserted-by":"crossref","first-page":"109","DOI":"10.1016\/0022-2496(77)90026-8","article-title":"The relationship between Luce\u2019s\n                        Choice Axiom, Thurstone\u2019s Theory of Comparative Judgment, and the\n                        double exponential distribution","volume":"15","author":"Yellott","year":"1977","journal-title":"Journal of\n                        Mathematical Psychology"},{"key":"2021122316152495400_bib65","doi-asserted-by":"crossref","first-page":"2396","DOI":"10.18653\/v1\/P19-1230","article-title":"Head-Driven Phrase Structure Grammar parsing on Penn\n                        Treebank","volume-title":"Proceedings of the 57th Annual Meeting\n                        of the Association for Computational Linguistics","author":"Zhou","year":"2019"},{"key":"2021122316152495400_bib66","article-title":"To prune, or not to prune: Exploring the\n                        efficacy of pruning for model compression","volume-title":"6th\n                        International Conference on Learning\n                    Representations","author":"Zhu","year":"2018"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00436\/1979279\/tacl_a_00436.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00436\/1979279\/tacl_a_00436.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,12,24]],"date-time":"2021-12-24T05:58:44Z","timestamp":1640325524000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00436\/108868\/Differentiable-Subset-Pruning-of-Transformer-Heads"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021]]},"references-count":66,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00436","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021]]},"published":{"date-parts":[[2021]]}}}