{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T01:57:36Z","timestamp":1771552656838,"version":"3.50.1"},"reference-count":65,"publisher":"MIT Press - Journals","license":[{"start":{"date-parts":[[2021,12,9]],"date-time":"2021-12-09T00:00:00Z","timestamp":1639008000000},"content-version":"vor","delay-in-days":342,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,12,6]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Recent improvements in the predictive quality of natural language processing systems are often dependent on a substantial increase in the number of model parameters. This has led to various attempts of compressing such models, but existing methods have not considered the differences in the predictive power of various model components or in the generalizability of the compressed models. To understand the connection between model compression and out-of-distribution generalization, we define the task of compressing language representation models such that they perform best in a domain adaptation setting. We choose to address this problem from a causal perspective, attempting to estimate the average treatment effect (ATE) of a model component, such as a single layer, on the model\u2019s predictions. Our proposed ATE-guided Model Compression scheme (AMoC), generates many model candidates, differing by the model components that were removed. Then, we select the best candidate through a stepwise regression model that utilizes the ATE to predict the expected performance on the target domain. 
AMoC outperforms strong baselines on dozens of domain pairs across three text classification and sequence tagging tasks.<\/jats:p>","DOI":"10.1162\/tacl_a_00431","type":"journal-article","created":{"date-parts":[[2021,12,9]],"date-time":"2021-12-09T18:17:27Z","timestamp":1639073847000},"page":"1355-1373","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":5,"title":["Model Compression for Domain Adaptation through Causal Effect Estimation"],"prefix":"10.1162","volume":"9","author":[{"given":"Guy","family":"Rotman","sequence":"first","affiliation":[{"name":"Faculty of Industrial Engineering and Management, Technion, IIT, Israel. grotman@campus.technion.ac.il"}]},{"given":"Amir","family":"Feder","sequence":"additional","affiliation":[{"name":"Faculty of Industrial Engineering and Management, Technion, IIT, Israel. feder@campus.technion.ac.il"}]},{"given":"Roi","family":"Reichart","sequence":"additional","affiliation":[{"name":"Faculty of Industrial Engineering and Management, Technion, IIT, Israel. 
roiri@technion.ac.il"}]}],"member":"281","published-online":{"date-parts":[[2021,12,6]]},"reference":[{"key":"2021120918161230700_bib1","doi-asserted-by":"publisher","first-page":"7350","DOI":"10.1609\/aaai.v34i05.6229","article-title":"Knowledge distillation from internal representations.","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Aguilar","year":"2020"},{"key":"2021120918161230700_bib2","doi-asserted-by":"publisher","first-page":"504","DOI":"10.1162\/tacl_a_00328","article-title":"PERL: Pivot-based domain adaptation for pre-trained deep contextualized embedding models","volume":"8","author":"Ben-David","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2021120918161230700_bib3","doi-asserted-by":"publisher","first-page":"120","DOI":"10.3115\/1610075.1610094","article-title":"Domain adaptation with structural correspondence learning","volume-title":"Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing","author":"Blitzer","year":"2006"},{"issue":"1","key":"2021120918161230700_bib4","first-page":"3207","article-title":"Counterfactual reasoning and learning systems: The example of computational advertising","volume":"14","author":"Bottou","year":"2013","journal-title":"The Journal of Machine Learning Research"},{"key":"2021120918161230700_bib5","doi-asserted-by":"publisher","first-page":"632","DOI":"10.18653\/v1\/D15-1075","article-title":"A large annotated corpus for learning natural language inference","volume-title":"Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing","author":"Bowman","year":"2015"},{"key":"2021120918161230700_bib6","article-title":"Language models are few-shot learners","author":"Brown","year":"2020","journal-title":"arXiv preprint 
arXiv:2005.14165"},{"key":"2021120918161230700_bib7","doi-asserted-by":"publisher","first-page":"2463","DOI":"10.24963\/ijcai.2020\/341","article-title":"Adabert: Task-adaptive bert compression with differentiable neural architecture search","volume-title":"Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20","author":"Chen","year":"2020"},{"key":"2021120918161230700_bib8","article-title":"Transfer learning for sequences via learning to collocate","volume-title":"International Conference on Learning Representations","author":"Cui","year":"2018"},{"key":"2021120918161230700_bib9","first-page":"53","article-title":"Frustratingly easy semi-supervised domain adaptation","volume-title":"Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing","author":"Daum\u00e9","year":"2010"},{"key":"2021120918161230700_bib10","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2021120918161230700_bib11","doi-asserted-by":"publisher","DOI":"10.1002\/9781118625590","volume-title":"Applied Regression Analysis","author":"Draper","year":"1998"},{"key":"2021120918161230700_bib12","doi-asserted-by":"publisher","first-page":"2377","DOI":"10.18653\/v1\/2020.emnlp-main.186","article-title":"The secret is in the spectra: Predicting cross-lingual task performance with spectral similarity measures","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing","author":"Dubossarsky","year":"2020"},{"key":"2021120918161230700_bib13","doi-asserted-by":"publisher","first-page":"2163","DOI":"10.18653\/v1\/D19-1222","article-title":"To annotate or not? 
Predicting performance drop under domain shift","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing","author":"Elsahar","year":"2019"},{"key":"2021120918161230700_bib14","article-title":"Reducing transformer depth on demand with structured dropout","volume-title":"International Conference on Learning Representations","author":"Fan","year":"2019"},{"issue":"2","key":"2021120918161230700_bib15","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1162\/coli_a_00404","article-title":"CausaLM: Causal model explanation through counterfactual language models","volume":"47","author":"Feder","year":"2021","journal-title":"Computational Linguistics"},{"key":"2021120918161230700_bib16","article-title":"The lottery ticket hypothesis: Finding sparse, trainable neural networks","volume-title":"International Conference on Learning Representations","author":"Frankle","year":"2018"},{"key":"2021120918161230700_bib17","article-title":"Compressing large-scale transformer-based models: A case study on bert","author":"Ganesh","year":"2020","journal-title":"arXiv preprint arXiv:2002.11985"},{"key":"2021120918161230700_bib18","first-page":"1180","article-title":"Unsupervised domain adaptation by backpropagation","volume-title":"International Conference on Machine Learning","author":"Ganin","year":"2015"},{"issue":"1","key":"2021120918161230700_bib19","first-page":"2096","article-title":"Domain-adversarial training of neural networks","volume":"17","author":"Ganin","year":"2016","journal-title":"The Journal of Machine Learning Research"},{"key":"2021120918161230700_bib20","first-page":"2839","article-title":"Domain adaptation with conditional transferable components","volume-title":"International Conference on Machine Learning","author":"Gong","year":"2016"},{"key":"2021120918161230700_bib21","article-title":"Explaining classifiers with causal concept effect 
(cace)","author":"Goyal","year":"2019","journal-title":"arXiv preprint arXiv:1907.07165"},{"key":"2021120918161230700_bib22","first-page":"3759","article-title":"Robust learning with the Hilbert-Schmidt independence criterion","volume-title":"Proceedings of the 37th International Conference on Machine Learning","author":"Greenfeld","year":"2020"},{"key":"2021120918161230700_bib23","doi-asserted-by":"crossref","first-page":"507","DOI":"10.1145\/2872427.2883037","article-title":"Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering","volume-title":"Proceedings of the 25th International Conference on World Wide Web","author":"He","year":"2016"},{"key":"2021120918161230700_bib24","article-title":"Distilling the knowledge in a neural network","volume-title":"NIPS Deep Learning and Representation Learning Workshop","author":"Hinton","year":"2015"},{"issue":"1","key":"2021120918161230700_bib25","doi-asserted-by":"publisher","first-page":"1","DOI":"10.2307\/2529336","article-title":"A biometrics invited paper. 
The analysis and selection of variables in linear regression","volume":"32","author":"Hocking","year":"1976","journal-title":"Biometrics"},{"key":"2021120918161230700_bib26","doi-asserted-by":"publisher","first-page":"57","DOI":"10.3115\/1614049.1614064","article-title":"Ontonotes: The 90% solution","volume-title":"Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers","author":"Hovy","year":"2006"},{"key":"2021120918161230700_bib27","doi-asserted-by":"publisher","first-page":"4163","DOI":"10.18653\/v1\/2020.findings-emnlp.372","article-title":"TinyBERT: Distilling BERT for natural language understanding","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Jiao","year":"2020"},{"key":"2021120918161230700_bib28","first-page":"3020","article-title":"Learning representations for counterfactual inference","volume-title":"International Conference on Machine Learning","author":"Johansson","year":"2016"},{"key":"2021120918161230700_bib29","article-title":"Learning the difference that makes a difference with counterfactually- augmented data","volume-title":"International Conference on Learning Representations","author":"Kaushik","year":"2019"},{"key":"2021120918161230700_bib30","article-title":"Adam: A method for stochastic optimization","volume-title":"International Conference on Learning Representations","author":"Kingma","year":"2015"},{"key":"2021120918161230700_bib31","doi-asserted-by":"publisher","first-page":"2779","DOI":"10.18653\/v1\/D19-1279","article-title":"75 languages, 1 model: Parsing universal dependencies universally","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing","author":"Kondratyuk","year":"2019"},{"key":"2021120918161230700_bib32","article-title":"Albert: A lite BERT for self-supervised learning of language 
representations","volume-title":"International Conference on Learning Representations","author":"Lan","year":"2020"},{"key":"2021120918161230700_bib33","doi-asserted-by":"crossref","first-page":"2012","DOI":"10.18653\/v1\/D18-1226","article-title":"Neural adaptation layers for cross-domain named entity recognition","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Lin","year":"2018"},{"key":"2021120918161230700_bib34","article-title":"RoBERTa: A robustly optimized bert pretraining approach","author":"Liu","year":"2019","journal-title":"arXiv preprint arXiv:1907.11692"},{"key":"2021120918161230700_bib35","doi-asserted-by":"publisher","first-page":"541","DOI":"10.3115\/1609067.1609127","article-title":"Performance confidence estimation for automatic summarization","volume-title":"Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)","author":"Louis","year":"2009"},{"key":"2021120918161230700_bib36","first-page":"10846","article-title":"Domain adaptation by using causal inference to predict invariant conditional distributions","volume-title":"Advances in Neural Information Processing Systems","author":"Magliacane","year":"2018"},{"key":"2021120918161230700_bib37","first-page":"28","article-title":"Automatic domain adaptation for parsing","volume-title":"Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics","author":"McClosky","year":"2010"},{"key":"2021120918161230700_bib38","doi-asserted-by":"publisher","first-page":"5191","DOI":"10.1609\/aaai.v34i04.5963","article-title":"Improved knowledge distillation via teacher assistant","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Mirzadeh","year":"2020"},{"issue":"4","key":"2021120918161230700_bib39","doi-asserted-by":"crossref","first-page":"669","DOI":"10.1093\/biomet\/82.4.669","article-title":"Causal 
diagrams for empirical research","volume":"82","author":"Pearl","year":"1995","journal-title":"Biometrika"},{"key":"2021120918161230700_bib40","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/82.4.669","volume-title":"Causality","author":"Pearl","year":"2009"},{"key":"2021120918161230700_bib41","doi-asserted-by":"crossref","first-page":"96","DOI":"10.1214\/09-SS057","article-title":"Causal inference in statistics: An overview","volume":"3","author":"Pearl","year":"2009","journal-title":"Statistics Surveys"},{"key":"2021120918161230700_bib42","doi-asserted-by":"publisher","DOI":"10.1214\/09-SS057","volume-title":"Elements of Causal Inference: Foundations and Learning Algorithms","author":"Peters","year":"2017"},{"key":"2021120918161230700_bib43","article-title":"Improving language understanding with unsupervised learning","author":"Radford","year":"2018","journal-title":"Technical report, OpenAI"},{"key":"2021120918161230700_bib44","doi-asserted-by":"publisher","first-page":"887","DOI":"10.3115\/1613715.1613829","article-title":"Automatic prediction of parser accuracy","volume-title":"Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing","author":"Ravi","year":"2008"},{"key":"2021120918161230700_bib45","first-page":"408","article-title":"An ensemble method for selection of high quality parses","volume-title":"Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics","author":"Reichart","year":"2007"},{"key":"2021120918161230700_bib46","doi-asserted-by":"publisher","first-page":"842","DOI":"10.1162\/tacl_a_00349","article-title":"A primer in BERTology: What we know about how BERT works","volume":"8","author":"Rogers","year":"2021","journal-title":"Transactions of the Association for Computational Linguistics"},{"issue":"1","key":"2021120918161230700_bib47","first-page":"1309","article-title":"Invariant models for causal transfer 
learning","volume":"19","author":"Rojas-Carulla","year":"2018","journal-title":"The Journal of Machine Learning Research"},{"key":"2021120918161230700_bib48","doi-asserted-by":"publisher","first-page":"695","DOI":"10.1162\/tacl_a_00294","article-title":"Deep contextualized self-training for low resource dependency parsing","volume":"7","author":"Rotman","year":"2019","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2021120918161230700_bib49","article-title":"Poor man\u2019s BERT: Smaller and faster transformer models","author":"Sajjad","year":"2020","journal-title":"arXiv preprint arXiv:2004.03844"},{"key":"2021120918161230700_bib50","article-title":"Distilbert, a distilled version of BERT: smaller, faster, cheaper and lighter","volume-title":"Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing in Advances in Neural Information Processing Systems","author":"Sanh","year":"2019"},{"key":"2021120918161230700_bib51","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K17-3007","article-title":"Adversarial training for cross-domain universal dependency parsing","volume-title":"Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies","author":"Sato","year":"2017"},{"key":"2021120918161230700_bib52","first-page":"459","article-title":"On causal and anticausal learning","volume-title":"Proceedings of the 29th International Conference on International Conference on Machine Learning","author":"Sch\u00f6lkopf","year":"2012"},{"key":"2021120918161230700_bib53","doi-asserted-by":"crossref","first-page":"2158","DOI":"10.18653\/v1\/2020.acl-main.195","article-title":"MobileBERT: A compact task-agnostic BERT for resource-limited devices","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Sun","year":"2020"},{"key":"2021120918161230700_bib54","first-page":"31","article-title":"Using domain 
similarity for performance estimation","volume-title":"Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing","author":"Asch","year":"2010"},{"key":"2021120918161230700_bib55","first-page":"5998","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"2021120918161230700_bib56","article-title":"On calibration and out-of-domain generalization","author":"Wald","year":"2021","journal-title":"arXiv preprint arXiv:2102.10395"},{"key":"2021120918161230700_bib57","doi-asserted-by":"publisher","first-page":"1","DOI":"10.18653\/v1\/N18-1001","article-title":"Label-aware double transfer learning for cross-specialty medical named entity recognition","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Wang","year":"2018"},{"key":"2021120918161230700_bib58","doi-asserted-by":"publisher","first-page":"1112","DOI":"10.18653\/v1\/N18-1101","article-title":"A broad-coverage challenge corpus for sentence understanding through inference","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Williams","year":"2018"},{"key":"2021120918161230700_bib59","doi-asserted-by":"publisher","first-page":"38","DOI":"10.18653\/v1\/2020.emnlp-demos.6","article-title":"Transformers: State-of-the-art natural language processing","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations","author":"Wolf","year":"2020"},{"key":"2021120918161230700_bib60","doi-asserted-by":"crossref","first-page":"8625","DOI":"10.18653\/v1\/2020.acl-main.764","article-title":"Predicting performance for natural language processing 
tasks","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Xia","year":"2020"},{"key":"2021120918161230700_bib61","first-page":"819","article-title":"Domain adaptation under target and conditional shift","volume-title":"International Conference on Machine Learning","author":"Zhang","year":"2013"},{"key":"2021120918161230700_bib62","doi-asserted-by":"publisher","first-page":"400","DOI":"10.18653\/v1\/K17-1040","article-title":"Neural structural correspondence learning for domain adaptation","volume-title":"Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)","author":"Ziser","year":"2017"},{"key":"2021120918161230700_bib63","doi-asserted-by":"publisher","first-page":"238","DOI":"10.18653\/v1\/D18-1022","article-title":"Deep pivot-based modeling for cross-language cross-domain transfer with minimal guidance","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Ziser","year":"2018"},{"key":"2021120918161230700_bib64","doi-asserted-by":"publisher","first-page":"1241","DOI":"10.18653\/v1\/N18-1112","article-title":"Pivot based language modeling for improved neural domain adaptation","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)","author":"Ziser","year":"2018"},{"key":"2021120918161230700_bib65","doi-asserted-by":"publisher","first-page":"5895","DOI":"10.18653\/v1\/P19-1591","article-title":"Task refinement learning for improved accuracy and stability of unsupervised domain adaptation","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Ziser","year":"2019"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00431\/1976778\/tacl_a_00431.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00431\/1976778\/tacl_a_00431.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,12,9]],"date-time":"2021-12-09T18:18:04Z","timestamp":1639073884000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00431\/108609\/Model-Compression-for-Domain-Adaptation-through"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021]]},"references-count":65,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00431","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2021]]},"published":{"date-parts":[[2021]]}}}