{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,19]],"date-time":"2026-01-19T06:07:10Z","timestamp":1768802830172,"version":"3.49.0"},"reference-count":37,"publisher":"MIT Press","issue":"4","license":[{"start":{"date-parts":[[2024,7,30]],"date-time":"2024-07-30T00:00:00Z","timestamp":1722297600000},"content-version":"vor","delay-in-days":211,"URL":"https:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,12,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Human listeners effortlessly compensate for phonological changes during speech perception, often unconsciously inferring the intended sounds. For example, listeners infer the underlying \/n\/ when hearing an utterance such as \u201cclea[m] pan\u201d, where [m] arises from place assimilation to the following labial [p]. This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds, and identifies the linguistic knowledge that is implemented by the model to compensate for assimilation during Automatic Speech Recognition (ASR). Using psycholinguistic stimuli, we systematically analyze how various linguistic context cues influence compensation patterns in the model\u2019s output. Complementing these behavioral experiments, our probing experiments indicate that the model shifts its interpretation of assimilated sounds from their acoustic form to their underlying form in its final layers. Finally, our causal intervention experiments suggest that the model relies on minimal phonological context cues to accomplish this shift. 
These findings represent a step towards better understanding the similarities and differences in phonological processing between neural ASR models and humans.<\/jats:p>","DOI":"10.1162\/coli_a_00526","type":"journal-article","created":{"date-parts":[[2024,7,30]],"date-time":"2024-07-30T15:04:19Z","timestamp":1722351859000},"page":"1557-1585","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":2,"title":["Perception of Phonological Assimilation by Neural Speech Recognition Models"],"prefix":"10.1162","volume":"50","author":[{"given":"Charlotte","family":"Pouw","sequence":"first","affiliation":[{"name":"University of Amsterdam, Institute for Logic, Language and Computation. c.m.pouw@uva.nl"}]},{"given":"Marianne de Heer","family":"Kloots","sequence":"additional","affiliation":[{"name":"University of Amsterdam, Institute for Logic, Language and Computation. m.l.s.deheerkloots@uva.nl"}]},{"given":"Afra","family":"Alishahi","sequence":"additional","affiliation":[{"name":"Tilburg University, Cognitive Science & Artificial Intelligence. a.alishahi@tilburguniversity.edu"}]},{"given":"Willem","family":"Zuidema","sequence":"additional","affiliation":[{"name":"University of Amsterdam, Institute for Logic, Language and Computation. 
w.h.zuidema@uva.nl"}]}],"member":"281","published-online":{"date-parts":[[2024,12,1]]},"reference":[{"key":"2024122021045357000_bib1","doi-asserted-by":"publisher","first-page":"368","DOI":"10.18653\/v1\/K17-1037","article-title":"Encoding of phonology in a recurrent neural model of grounded speech","volume-title":"Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)","author":"Alishahi","year":"2017"},{"key":"2024122021045357000_bib2","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski","year":"2020","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2024122021045357000_bib3","first-page":"2735","article-title":"Modelling perceptual effects of phonology with ASR systems","volume-title":"CogSci 2020-42nd Annual Virtual Meeting of the Cognitive Science Society","author":"Bing\u2019er","year":"2020"},{"key":"2024122021045357000_bib4","unstructured":"Boersma, Paul and David Weenink. 2023. 
Praat: Doing phonetics by computer [Computer program]."},{"key":"2024122021045357000_bib5","volume-title":"TextGridTools: A TextGrid Processing and Analysis Toolkit for Python","author":"Buschmeier","year":"2013"},{"key":"2024122021045357000_bib6","doi-asserted-by":"publisher","first-page":"379","DOI":"10.18653\/v1\/2023.blackboxnlp-1.29","article-title":"Identifying and adapting transformer-components responsible for gender bias in an English language model","volume-title":"Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP","author":"Chintam","year":"2023"},{"issue":"5\u20136","key":"2024122021045357000_bib7","doi-asserted-by":"publisher","first-page":"535","DOI":"10.1080\/01690960143000155","article-title":"Variation and assimilation in German: Consequences for lexical access and representation","volume":"16","author":"Coenen","year":"2001","journal-title":"Language and Cognitive Processes"},{"key":"2024122021045357000_bib8","doi-asserted-by":"publisher","first-page":"265","DOI":"10.1515\/9783110219326.265","article-title":"Phonological knowledge in compensation for native and non-native assimilation","volume":"14","author":"Darcy","year":"2009","journal-title":"Phonology and Phonetics"},{"issue":"4","key":"2024122021045357000_bib9","doi-asserted-by":"publisher","first-page":"2340","DOI":"10.1121\/1.2772226","article-title":"A study of regressive place assimilation in spontaneous speech and its implications for spoken word recognition","volume":"122","author":"Dilley","year":"2007","journal-title":"Journal of the Acoustical Society of America"},{"issue":"3","key":"2024122021045357000_bib10","doi-asserted-by":"publisher","first-page":"373","DOI":"10.1006\/jpho.2001.0162","article-title":"Categorical and gradient properties of assimilation in alveolar to velar sequences: Evidence from EPG and EMA data","volume":"30","author":"Ellis","year":"2002","journal-title":"Journal of 
Phonetics"},{"key":"2024122021045357000_bib11","article-title":"TIMIT Acoustic-Phonetic Continuous Speech Corpus","author":"Garofolo","year":"1993"},{"issue":"1","key":"2024122021045357000_bib12","doi-asserted-by":"publisher","first-page":"144","DOI":"10.1037\/\/0096-1523.22.1.144","article-title":"Phonological variation and inference in lexical access.","volume":"22","author":"Gaskell","year":"1996","journal-title":"Journal of Experimental Psychology: Human Perception and Performance"},{"issue":"2","key":"2024122021045357000_bib13","doi-asserted-by":"publisher","first-page":"380","DOI":"10.1037\/\/0096-1523.24.2.380","article-title":"Mechanisms of phonological inference in speech perception.","volume":"24","author":"Gaskell","year":"1998","journal-title":"Journal of Experimental Psychology: Human Perception and Performance"},{"issue":"3","key":"2024122021045357000_bib14","doi-asserted-by":"publisher","first-page":"325","DOI":"10.1006\/jmla.2000.2741","article-title":"Lexical ambiguity resolution and spoken word recognition: Bridging the gap","volume":"44","author":"Gaskell","year":"2001","journal-title":"Journal of Memory and Language"},{"issue":"4","key":"2024122021045357000_bib15","doi-asserted-by":"publisher","first-page":"407","DOI":"10.1207\/s15516709cog1904_1","article-title":"A connectionist model of phonological representation in speech perception","volume":"19","author":"Gaskell","year":"1995","journal-title":"Cognitive Science"},{"key":"2024122021045357000_bib16","doi-asserted-by":"publisher","first-page":"240","DOI":"10.18653\/v1\/W18-5426","article-title":"Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information","volume-title":"Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for 
NLP","author":"Giulianelli","year":"2018"},{"issue":"1","key":"2024122021045357000_bib17","doi-asserted-by":"publisher","first-page":"163","DOI":"10.1037\/\/0096-1523.28.1.163-179","article-title":"Does English coronal place assimilation create lexical ambiguity?","volume":"28","author":"Gow Jr","year":"2002","journal-title":"Journal of Experimental Psychology: Human Perception and Performance"},{"key":"2024122021045357000_bib18","doi-asserted-by":"publisher","first-page":"369","DOI":"10.1145\/1143844.1143891","article-title":"Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks","volume-title":"Proceedings of the 23rd International Conference on Machine Learning","author":"Graves","year":"2006"},{"key":"2024122021045357000_bib19","first-page":"76033","article-title":"How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model","volume":"36","author":"Hanna","year":"2024","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"1\u20132","key":"2024122021045357000_bib20","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1177\/002383099203500206","article-title":"On the role of perception in shaping phonological assimilation rules","volume":"35","author":"Hura","year":"1992","journal-title":"Language and Speech"},{"key":"2024122021045357000_bib21","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.wocn.2018.07.001","article-title":"Introducing Parselmouth: A Python interface to Praat","volume":"71","author":"Jadoul","year":"2018","journal-title":"Journal of Phonetics"},{"issue":"3","key":"2024122021045357000_bib22","doi-asserted-by":"publisher","first-page":"245","DOI":"10.1016\/0010-0277(91)90008-R","article-title":"The mental representation of lexical form: A phonological approach to the recognition 
lexicon","volume":"38","author":"Lahiri","year":"1991","journal-title":"Cognition"},{"issue":"1","key":"2024122021045357000_bib23","doi-asserted-by":"publisher","DOI":"10.1561\/116.00000050","article-title":"Recent advances in end-to-end automatic speech recognition","volume":"11","author":"Li","year":"2022","journal-title":"APSIPA Transactions on Signal and Information Processing"},{"key":"2024122021045357000_bib24","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1162\/coli_a_00511","article-title":"Towards faithful model explanation in NLP: A survey","author":"Lyu","year":"2024","journal-title":"Computational Linguistics"},{"issue":"6","key":"2024122021045357000_bib25","doi-asserted-by":"publisher","first-page":"956","DOI":"10.3758\/BF03194826","article-title":"Coping with phonological assimilation in speech perception: Evidence for early compensation","volume":"65","author":"Mitterer","year":"2003","journal-title":"Perception & Psychophysics"},{"issue":"8","key":"2024122021045357000_bib26","doi-asserted-by":"publisher","first-page":"1395","DOI":"10.1080\/17470210500198726","article-title":"The role of perceptual integration in the recognition of assimilated word forms","volume":"59","author":"Mitterer","year":"2006","journal-title":"Quarterly Journal of Experimental Psychology"},{"key":"2024122021045357000_bib27","doi-asserted-by":"publisher","first-page":"261","DOI":"10.1017\/CBO9780511519918.011","article-title":"The descriptive role of segments: Evidence from assimilation","author":"Nolan","year":"1992","journal-title":"Papers in Laboratory Phonology II: Gesture, Segment, Prosody"},{"key":"2024122021045357000_bib28","doi-asserted-by":"publisher","first-page":"5206","DOI":"10.1109\/ICASSP.2015.7178964","article-title":"LibriSpeech: An ASR corpus based on public domain audio books","volume-title":"2015 IEEE International Conference on Acoustics, Speech and Signal Processing 
(ICASSP)","author":"Panayotov","year":"2015"},{"key":"2024122021045357000_bib29","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00656","article-title":"What do self-supervised speech models know about words?","author":"Pasad","year":"2023","journal-title":"arXiv preprint arXiv:2307.00162"},{"key":"2024122021045357000_bib30","doi-asserted-by":"publisher","first-page":"914","DOI":"10.1109\/ASRU51503.2021.9688093","article-title":"Layer-wise analysis of a self-supervised speech representation model","volume-title":"2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","author":"Pasad","year":"2021"},{"key":"2024122021045357000_bib31","first-page":"28492","article-title":"Robust speech recognition via large-scale weak supervision","volume-title":"International Conference on Machine Learning","author":"Radford","year":"2023"},{"key":"2024122021045357000_bib32","doi-asserted-by":"publisher","first-page":"1259","DOI":"10.21437\/Interspeech.2023-679","article-title":"Wave to syntax: Probing spoken language models for syntax","volume-title":"24th International Speech Communication Association, Interspeech 2023","author":"Shen","year":"2023"},{"key":"2024122021045357000_bib33","first-page":"12388","article-title":"Investigating gender bias in language models using causal mediation analysis","volume":"33","author":"Vig","year":"2020","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2024122021045357000_bib34","article-title":"Interpretability in the wild: A circuit for indirect object identification in GPT-2 small","author":"Wang","year":"2022","journal-title":"arXiv preprint arXiv:2211.00593"},{"issue":"1","key":"2024122021045357000_bib35","doi-asserted-by":"publisher","first-page":"95","DOI":"10.1177\/00238309010440010401","article-title":"Help or hindrance: How violation of different assimilation rules affects spoken-language processing","volume":"44","author":"Weber","year":"2001","journal-title":"Language and 
Speech"},{"key":"2024122021045357000_bib36","doi-asserted-by":"publisher","first-page":"38","DOI":"10.18653\/v1\/2020.emnlp-demos.6","article-title":"HuggingFace\u2019s Transformers: State-of-the-art natural language processing","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations","author":"Wolf","year":"2020"},{"key":"2024122021045357000_bib37","doi-asserted-by":"publisher","first-page":"13265","DOI":"10.18653\/v1\/2023.findings-acl.839","article-title":"Causal interventions expose implicit situation models for commonsense language understanding","volume-title":"Findings of the Association for Computational Linguistics: ACL 2023","author":"Yamakoshi","year":"2023"}],"container-title":["Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/coli\/article-pdf\/50\/4\/1557\/2470044\/coli_a_00526.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/coli\/article-pdf\/50\/4\/1557\/2470044\/coli_a_00526.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,12,20]],"date-time":"2024-12-20T21:05:13Z","timestamp":1734728713000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/coli\/article\/50\/4\/1557\/123790\/Perception-of-Phonological-Assimilation-by-Neural"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":37,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2024,12,1]]},"published-print":{"date-parts":[[2024,12,1]]}},"URL":"https:\/\/doi.org\/10.1162\/coli_a_00526","relation":{},"ISSN":["0891-2017","1530-9312"],"issn-type":[{"value":"0891-2017","type":"print"},{"value":"1530-9312","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]}}}