{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,1]],"date-time":"2026-07-01T14:58:46Z","timestamp":1782917926593,"version":"3.54.5"},"reference-count":106,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2024,4,12]],"date-time":"2024-04-12T00:00:00Z","timestamp":1712880000000},"content-version":"vor","delay-in-days":102,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2024,4,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is learned during pre-training. Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information, but we still lack a proper understanding of knowledge encoded at the word level and beyond. In this work, we use lightweight analysis methods to study segment-level linguistic properties\u2014word identity, boundaries, pronunciation, syntactic features, and semantic features\u2014encoded in S3Ms. We present a comparative study of layer-wise representations from ten S3Ms and find that (i) the frame-level representations within each word segment are not all equally informative, and (ii) the pre-training objective and model size heavily influence the accessibility and distribution of linguistic information across layers. We also find that on several tasks\u2014word discrimination, word segmentation, and semantic sentence similarity\u2014S3Ms trained with visual grounding outperform their speech-only counterparts. 
Finally, our task-based analyses demonstrate improved performance on word segmentation and acoustic word discrimination while using simpler methods than prior work.<\/jats:p>","DOI":"10.1162\/tacl_a_00656","type":"journal-article","created":{"date-parts":[[2024,4,12]],"date-time":"2024-04-12T19:17:55Z","timestamp":1712949475000},"page":"372-391","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":33,"title":["What Do Self-Supervised Speech Models Know About Words?"],"prefix":"10.1162","volume":"12","author":[{"given":"Ankita","family":"Pasad","sequence":"first","affiliation":[{"name":"Toyota Technological Institute at Chicago, USA. ankitap@ttic.edu"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chung-Ming","family":"Chien","sequence":"additional","affiliation":[{"name":"Toyota Technological Institute at Chicago, USA. cmchien@ttic.edu"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Shane","family":"Settle","sequence":"additional","affiliation":[{"name":"Toyota Technological Institute at Chicago, USA. settle.shane@ttic.edu"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Karen","family":"Livescu","sequence":"additional","affiliation":[{"name":"Toyota Technological Institute at Chicago, USA. 
klivescu@ttic.edu"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"281","published-online":{"date-parts":[[2024,4,12]]},"reference":[{"key":"2024041219174549200_bib1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2023-2131","article-title":"An information-theoretic analysis of self-supervised discrete representations of speech","volume-title":"Interspeech","author":"Abdullah","year":"2023"},{"key":"2024041219174549200_bib2","article-title":"LRS3-TED: A large-scale dataset for visual speech recognition","author":"Afouras","year":"2018","journal-title":"arXiv preprint arXiv:1809.00496"},{"key":"2024041219174549200_bib3","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00505","article-title":"DP-Parse: Finding word boundaries from raw speech with an instance lexicon","author":"Algayres","year":"2022","journal-title":"Transactions of the Association for Computational Linguistics (TACL)"},{"key":"2024041219174549200_bib4","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-2362","article-title":"Evaluating the reliability of acoustic speech embeddings","volume-title":"Interspeech","author":"Algayres","year":"2020"},{"key":"2024041219174549200_bib5","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2023-1823","article-title":"SpeechGLUE: How well can self-supervised speech models capture linguistic knowledge?","volume-title":"Interspeech","author":"Ashihara","year":"2023"},{"key":"2024041219174549200_bib6","article-title":"Unsupervised speech recognition","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Baevski","year":"2021"},{"key":"2024041219174549200_bib7","article-title":"Data2vec: A general framework for self-supervised learning in speech, vision and language","volume-title":"International Conference on Machine Learning (ICML)","author":"Baevski","year":"2022"},{"key":"2024041219174549200_bib8","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech 
representations","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Baevski","year":"2020"},{"key":"2024041219174549200_bib9","doi-asserted-by":"publisher","DOI":"10.1109\/SLT54892.2023.10023019","article-title":"Proficiency assessment of l2 spoken English using wav2vec 2.0","volume-title":"IEEE Spoken Language Technology Workshop (SLT)","author":"Bann\u00f2","year":"2023"},{"key":"2024041219174549200_bib10","doi-asserted-by":"publisher","DOI":"10.1162\/coli_a_00422","article-title":"Probing classifiers: Promises, shortcomings, and advances","author":"Belinkov","year":"2022","journal-title":"Computational Linguistics"},{"key":"2024041219174549200_bib11","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00254","article-title":"Analysis methods in neural language processing: A survey","author":"Belinkov","year":"2019","journal-title":"Transactions of the Association for Computational Linguistics (TACL)"},{"key":"2024041219174549200_bib12","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2021-1874","article-title":"Segmental contrastive predictive coding for unsupervised word segmentation","volume-title":"Interspeech","author":"Bhati","year":"2021"},{"key":"2024041219174549200_bib13","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2011-304","article-title":"Rapid evaluation of speech representations for spoken term discovery","volume-title":"Interspeech","author":"Carlin","year":"2011"},{"key":"2024041219174549200_bib14","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9747490","article-title":"DistilHuBERT: Speech representation learning by layer-wise distillation of hidden-unit BERT","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Chang","year":"2022"},{"key":"2024041219174549200_bib15","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2021-1965","article-title":"GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 
hours of transcribed audio","volume-title":"Interspeech","author":"Chen","year":"2021"},{"key":"2024041219174549200_bib16","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2022.3188113","article-title":"WavLM: Large-scale self-supervised pre-training for full stack speech processing","author":"Chen","year":"2022","journal-title":"IEEE Journal of Selected Topics in Signal Processing (JSTSP)"},{"key":"2024041219174549200_bib17","article-title":"Neural analysis and synthesis: Reconstructing speech from self-supervised representations","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Choi","year":"2021"},{"key":"2024041219174549200_bib18","article-title":"SentEval: An evaluation toolkit for universal sentence representations","volume-title":"International Conference on Language Resources and Evaluation (LREC)","author":"Conneau","year":"2018"},{"key":"2024041219174549200_bib19","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9746102","article-title":"Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words","volume-title":"International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Cuervo","year":"2022"},{"key":"2024041219174549200_bib20","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"North American Chapter of the Association for Computational Linguistics (NAACL)","author":"Devlin","year":"2019"},{"key":"2024041219174549200_bib21","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-2743","article-title":"The zero resource speech challenge 2020: Discovering discrete subword and word units","volume-title":"Interspeech","author":"Dunbar","year":"2020"},{"key":"2024041219174549200_bib22","article-title":"Exploring wav2vec 2.0 on speaker verification and language 
identification","volume-title":"Interspeech","author":"Fan","year":"2021"},{"key":"2024041219174549200_bib23","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-5004","article-title":"Community evaluation and exchange of word vectors at wordvectors.org","volume-title":"Association for Computational Linguistics (ACL): System Demonstrations","author":"Faruqui","year":"2014"},{"key":"2024041219174549200_bib24","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W16-2506","article-title":"Problems with evaluation of word embeddings using word similarity tasks","volume-title":"1st Workshop on Evaluating Vector-Space Representations for NLP","author":"Faruqui","year":"2016"},{"key":"2024041219174549200_bib25","article-title":"Silence is sweeter than speech: Self-supervised model using silence to store speaker information","author":"Feng","year":"2022","journal-title":"arXiv preprint arXiv:2205.03759"},{"key":"2024041219174549200_bib26","doi-asserted-by":"crossref","DOI":"10.1109\/ICASSP49357.2023.10095363","article-title":"Unsupervised word segmentation using temporal gradient pseudo-labels","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Fuchs","year":"2023"},{"key":"2024041219174549200_bib27","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.1992.225858","article-title":"Switchboard: Telephone speech corpus for research and development","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Godfrey","year":"1992"},{"key":"2024041219174549200_bib28","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-3015","article-title":"Conformer: Convolution-augmented transformer for speech recognition","volume-title":"Interspeech","author":"Gulati","year":"2020"},{"key":"2024041219174549200_bib29","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P17-1047","article-title":"Learning word-like units from joint audio-visual 
analysis","volume-title":"Association for Computational Linguistics (ACL)","author":"Harwath","year":"2017"},{"key":"2024041219174549200_bib30","article-title":"Multi-view recurrent neural acoustic word embeddings","volume-title":"International Conference on Learning Representations (ICLR)","author":"He","year":"2017"},{"key":"2024041219174549200_bib31","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1275","article-title":"Designing and interpreting probes with control tasks","volume-title":"Empirical Methods in Natural Language Processing (EMNLP)","author":"Hewitt","year":"2019"},{"key":"2024041219174549200_bib32","doi-asserted-by":"publisher","DOI":"10.1093\/biomet\/28.3-4.321","article-title":"Relations between two sets of variates","author":"Hotelling","year":"1936","journal-title":"Biometrika"},{"key":"2024041219174549200_bib33","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2021.acl-long.411","article-title":"Text-free image-to-speech synthesis using learned segmental units","volume-title":"Association for Computational Linguistics (ACL)","author":"Hsu","year":"2021"},{"key":"2024041219174549200_bib34","article-title":"Robust wav2vec 2.0: Analyzing domain shift in self-supervised pre-training","volume-title":"Interspeech","author":"Hsu","year":"2021"},{"key":"2024041219174549200_bib35","doi-asserted-by":"crossref","DOI":"10.1109\/ICASSP39728.2021.9414460","article-title":"HuBERT: How much can a bad teacher benefit asr pre-training?","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Hsu","year":"2021"},{"key":"2024041219174549200_bib36","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-2828","article-title":"Multilingual jointly trained acoustic and written word embeddings","volume-title":"Interspeech","author":"Yushi","year":"2020"},{"key":"2024041219174549200_bib37","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2022-11034","article-title":"Pseudo label is better 
than human label","volume-title":"Interspeech","author":"Hwang","year":"2022"},{"key":"2024041219174549200_bib38","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2013.6639241","article-title":"Weak top-down constraints for unsupervised acoustic model training","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Jansen","year":"2013"},{"key":"2024041219174549200_bib39","article-title":"Predicting within and across language phoneme recognition performance of self-supervised learning speech pre-trained models","author":"Ji","year":"2022","journal-title":"arXiv preprint arXiv:2206.12489"},{"key":"2024041219174549200_bib40","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9052942","article-title":"Libri-Light: A benchmark for ASR with limited or no supervision","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Kahn","year":"2020"},{"key":"2024041219174549200_bib41","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2022.3229264","article-title":"Word segmentation on discovered phone units with dynamic programming and self-supervised scoring","author":"Kamper","year":"2022","journal-title":"IEEE\/ACM Transactions on Audio, Speech, and Language Processing (TASLP)"},{"key":"2024041219174549200_bib42","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2016.7472619","article-title":"Deep convolutional acoustic word embeddings using word-pair side information","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Kamper","year":"2016"},{"key":"2024041219174549200_bib43","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2022-10245","article-title":"Automatic pronunciation assessment using self-supervised speech representation learning","volume-title":"Interspeech","author":"Kim","year":"2022"},{"key":"2024041219174549200_bib44","article-title":"Similarity of neural network 
representations revisited","volume-title":"International Conference on Machine Learning (ICML)","author":"Kornblith","year":"2019"},{"key":"2024041219174549200_bib45","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2020-2398","article-title":"Self-supervised contrastive learning for unsupervised phoneme segmentation","volume-title":"Interspeech","author":"Kreuk","year":"2020"},{"key":"2024041219174549200_bib46","article-title":"On generative spoken language modeling from raw audio","author":"Lakhotia","year":"2021","journal-title":"Transactions of the Association for Computational Linguistics (TACL)"},{"key":"2024041219174549200_bib47","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2013.6707765","article-title":"Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings","volume-title":"IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","author":"Levin","year":"2013"},{"key":"2024041219174549200_bib48","doi-asserted-by":"crossref","DOI":"10.1109\/SLT54892.2023.10023428","article-title":"Exploration of a self-supervised speech model: A study on emotional corpora","volume-title":"IEEE Spoken Language Technology Workshop (SLT)","author":"Li","year":"2023"},{"key":"2024041219174549200_bib49","doi-asserted-by":"crossref","DOI":"10.1109\/ASRU57964.2023.10389795","article-title":"Parameter-efficient cross-language transfer learning for a language-modular audiovisual speech recognition","volume-title":"IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","author":"Li","year":"2023"},{"key":"2024041219174549200_bib50","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2023-871","article-title":"Self-supervised predictive coding models encode speaker and phonetic information in orthogonal subspaces","volume-title":"Interspeech","author":"Liu","year":"2023"},{"key":"2024041219174549200_bib51","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2019-2396","article-title":"Speech 
model pre-training for end-to-end spoken language understanding","volume-title":"Interspeech","author":"Lugosch","year":"2019"},{"key":"2024041219174549200_bib52","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9414776","article-title":"Probing acoustic representations for phonetic properties","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Ma","year":"2021"},{"key":"2024041219174549200_bib53","doi-asserted-by":"publisher","DOI":"10.21236\/ADA273556","article-title":"Building a large annotated corpus of English: The Penn treebank","author":"Marcus","year":"1993","journal-title":"Computational Linguistics"},{"key":"2024041219174549200_bib54","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-1386","article-title":"Montreal forced aligner: Trainable text-speech alignment using kaldi.","volume-title":"Interspeech","author":"McAuliffe","year":"2017"},{"key":"2024041219174549200_bib55","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2021-1464","article-title":"Semantic sentence similarity: Size does not always matter","volume-title":"Interspeech","author":"Merkx","year":"2021"},{"key":"2024041219174549200_bib56","doi-asserted-by":"publisher","DOI":"10.1007\/s12559-022-10059-7","article-title":"Modelling human word learning and recognition using visually grounded speech","author":"Merkx","year":"2023","journal-title":"Cognitive Computation"},{"key":"2024041219174549200_bib57","doi-asserted-by":"publisher","DOI":"10.3115\/1075671.1075742","article-title":"A semantic concordance","volume-title":"Human Language Technology","author":"Miller","year":"1993"},{"key":"2024041219174549200_bib58","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2022.3207050","article-title":"Self-supervised speech representation learning: A review","author":"Mohamed","year":"2022","journal-title":"IEEE Journal of Selected Topics in Signal Processing 
(JSTSP)"},{"key":"2024041219174549200_bib59","article-title":"Insights on representational similarity in neural networks with canonical correlation","volume-title":"Advances in Neural Information Processing Systems (NeurIPS)","author":"Morcos","year":"2018"},{"key":"2024041219174549200_bib60","article-title":"Are word boundaries useful for unsupervised language learning?","author":"Anh Nguyen","year":"2022","journal-title":"arXiv preprint arXiv:2210.02956"},{"key":"2024041219174549200_bib61","article-title":"The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling","volume-title":"NeurIPS Workshop on Self-Supervised Learning for Speech and Audio Processing","author":"Anh Nguyen","year":"2020"},{"key":"2024041219174549200_bib62","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00545","article-title":"Generative spoken dialogue language modeling","author":"Anh Nguyen","year":"2023","journal-title":"Transactions of the Association for Computational Linguistics (TACL)"},{"key":"2024041219174549200_bib63","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683868","article-title":"Learned in speech recognition: Contextual acoustic word embeddings","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Palaskar","year":"2019"},{"key":"2024041219174549200_bib64","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178964","article-title":"LibriSpeech: An ASR corpus based on public domain audio books","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Panayotov","year":"2015"},{"key":"2024041219174549200_bib65","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU51503.2021.9688093","article-title":"Layer-wise analysis of a self-supervised speech representation model","volume-title":"IEEE Automatic Speech Recognition and Understanding Workshop 
(ASRU)","author":"Pasad","year":"2021"},{"key":"2024041219174549200_bib66","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49357.2023.10096149","article-title":"Comparative layer-wise analysis of self-supervised speech models","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Pasad","year":"2023"},{"key":"2024041219174549200_bib67","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.naacl-main.53","article-title":"On the use of external data for spoken named entity recognition","volume-title":"North American Chapter of the Association for Computational Linguistics (NAACL)","author":"Pasad","year":"2022"},{"key":"2024041219174549200_bib68","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2022-10652","article-title":"Self-supervised representation learning for speech using visual grounding and masked language modeling","volume-title":"AAAI Workshop on Self-supervised Learning for Audio and Speech Processing","author":"Peng","year":"2022"},{"key":"2024041219174549200_bib69","doi-asserted-by":"crossref","DOI":"10.21437\/Interspeech.2022-10652","article-title":"Word discovery in visually grounded, self-supervised speech models","volume-title":"Interspeech","author":"Peng","year":"2022"},{"key":"2024041219174549200_bib70","article-title":"A correspondence variational autoencoder for unsupervised acoustic word embeddings","volume-title":"NeurIPS Workshop on Self-Supervised Learning for Speech and Audio Processing","author":"Peng","year":"2020"},{"key":"2024041219174549200_bib71","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162","article-title":"GloVe: Global vectors for word representation","volume-title":"Empirical Methods in Natural Language Processing (EMNLP)","author":"Pennington","year":"2014"},{"key":"2024041219174549200_bib72","doi-asserted-by":"publisher","DOI":"10.1016\/j.specom.2004.09.001","article-title":"The Buckeye corpus of conversational speech: Labeling conventions 
and a test of transcriber reliability","author":"Pitt","year":"2005","journal-title":"Speech Communication"},{"key":"2024041219174549200_bib73","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.345","article-title":"How accents confound: Probing for accent information in end-to-end speech recognition systems","volume-title":"Association for Computational Linguistics (ACL)","author":"Prasad","year":"2020"},{"key":"2024041219174549200_bib74","article-title":"SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability","volume-title":"Neural Information Processing Systems (NIPS)","author":"Raghu","year":"2017"},{"key":"2024041219174549200_bib75","doi-asserted-by":"publisher","DOI":"10.5772\/16433","article-title":"Blind segmentation of speech using non-linear filtering methods","author":"R\u00e4s\u00e4nen","year":"2011","journal-title":"Speech Technologies"},{"key":"2024041219174549200_bib76","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.eacl-main.295","article-title":"Probing the probing paradigm: Does probing accuracy entail task relevance?","volume-title":"European Chapter of the Association for Computational Linguistics (EACL)","author":"Ravichander","year":"2021"},{"key":"2024041219174549200_bib77","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.insights-1.11","article-title":"On the difficulty of segmenting words with attention","volume-title":"Second Workshop on Insights from Negative Results in NLP","author":"Sanabria","year":"2021"},{"key":"2024041219174549200_bib78","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49357.2023.10096099","article-title":"Analyzing acoustic word embeddings from pre-trained self-supervised speech models","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Sanabria","year":"2023"},{"key":"2024041219174549200_bib79","article-title":"Understanding learning dynamics of language models with 
SVCCA","volume-title":"North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)","author":"Saphra","year":"2019"},{"key":"2024041219174549200_bib80","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8682903","article-title":"Acoustically grounded word embeddings for improved acoustics-to-word speech recognition","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Settle","year":"2019"},{"key":"2024041219174549200_bib81","doi-asserted-by":"publisher","DOI":"10.1109\/SLT.2016.7846310","article-title":"Discriminative acoustic word embeddings: Recurrent neural network-based approaches","volume-title":"IEEE Spoken Language Technology Workshop (SLT)","author":"Settle","year":"2016"},{"key":"2024041219174549200_bib82","article-title":"What all do audio transformer models hear? Probing acoustic representations for language delivery and its structure","volume-title":"IEEE International Conference on Data Mining Workshops (ICDMW)","author":"Shah","year":"2021"},{"key":"2024041219174549200_bib83","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2023-679","article-title":"Wave to syntax: Probing spoken language models for syntax","volume-title":"Interspeech","author":"Shen","year":"2023"},{"key":"2024041219174549200_bib84","article-title":"Learning audio-visual speech representation by masked multimodal cluster prediction","volume-title":"International Conference on Learning Representations (ICLR)","author":"Shi","year":"2022"},{"key":"2024041219174549200_bib85","doi-asserted-by":"publisher","DOI":"10.1109\/SLT48900.2021.9383578","article-title":"Whole-word segmental speech recognition with acoustic word embeddings","volume-title":"IEEE Spoken Language Technology Workshop (SLT)","author":"Shi","year":"2021"},{"key":"2024041219174549200_bib86","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.496","article-title":"SLUE Phase-2: A 
benchmark suite of diverse spoken language understanding tasks","volume-title":"Association for Computational Linguistics (ACL)","author":"Shon","year":"2023"},{"key":"2024041219174549200_bib87","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9746137","article-title":"SLUE: New benchmark tasks for spoken language understanding evaluation on natural speech","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Shon","year":"2022"},{"key":"2024041219174549200_bib88","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2007-429","article-title":"A computational model for unsupervised word discovery","volume-title":"Interspeech","author":"Bosch","year":"2007"},{"key":"2024041219174549200_bib89","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1452","article-title":"BERT rediscovers the classical NLP pipeline","volume-title":"Association for Computational Linguistics (ACL)","author":"Tenney","year":"2019"},{"key":"2024041219174549200_bib90","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2022.acl-long.580","article-title":"SUPERB-SG: Enhanced speech processing universal performance benchmark for semantic and generative capabilities","volume-title":"Association for Computational Linguistics (ACL)","author":"Tsai","year":"2022"},{"key":"2024041219174549200_bib91","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W16-2520","article-title":"Correlation-based intrinsic evaluation of word vector representations","volume-title":"1st Workshop on Evaluating Vector-Space Representations for NLP","author":"Tsvetkov","year":"2016"},{"key":"2024041219174549200_bib92","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D15-1243","article-title":"Evaluation of word vector representations by subspace alignment","volume-title":"Empirical Methods in Natural Language Processing 
(EMNLP)","author":"Tsvetkov","year":"2015"},{"key":"2024041219174549200_bib93","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2021-1182","article-title":"Analyzing speaker information in self-supervised models to improve zero-resource speech processing","volume-title":"Interspeech","author":"van Niekerk","year":"2021"},{"key":"2024041219174549200_bib94","doi-asserted-by":"publisher","DOI":"10.1109\/SLT48900.2021.9383625","article-title":"A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings","volume-title":"IEEE Spoken Language Technology Workshop (SLT)","author":"Van Staden","year":"2021"},{"key":"2024041219174549200_bib95","article-title":"Attention is all you need","author":"Vaswani","year":"2017","journal-title":"Advances in Neural Information Processing Systems (NIPS)"},{"key":"2024041219174549200_bib96","doi-asserted-by":"publisher","DOI":"10.1038\/s41592-020-0772-5","article-title":"SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python","author":"Virtanen","year":"2020","journal-title":"Nature Methods"},{"key":"2024041219174549200_bib97","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1448","article-title":"The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives","volume-title":"North American Chapter of the Association for Computational Linguistics (NAACL)","author":"Voita","year":"2019"},{"key":"2024041219174549200_bib98","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.80","article-title":"VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation","volume-title":"Association for Computational Linguistics (ACL)","author":"Wang","year":"2021"},
{"key":"2024041219174549200_bib99","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49357.2023.10096988","article-title":"Wav2seq: Pre-training speech-to-text encoder-decoder models using pseudo languages","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Felix","year":"2023"},{"key":"2024041219174549200_bib100","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.findings-emnlp.422","article-title":"Hidden state variability of pretrained language models can guide computation reduction for transfer learning","volume-title":"Findings of Empirical Methods in Natural Language Processing (EMNLP)","author":"Xie","year":"2022"},{"key":"2024041219174549200_bib101","doi-asserted-by":"crossref","DOI":"10.21437\/Interspeech.2023-2362","article-title":"On-device constrained self-supervised speech representation learning for keyword spotting via knowledge distillation","volume-title":"Interspeech","author":"Yang","year":"2023"},{"key":"2024041219174549200_bib102","article-title":"What can an accent identifier learn? Probing phonetic and prosodic information in a wav2vec2-based accent identification model","volume-title":"Interspeech","author":"Yang","year":"202"},{"key":"2024041219174549200_bib103","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2021-1775","article-title":"SUPERB: Speech processing universal performance benchmark","volume-title":"Interspeech","author":"Yang","year":"2021"},{"key":"2024041219174549200_bib104","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSPW59220.2023.10193042","article-title":"Fine-tuning strategies for faster inference using speech self-supervised models: A comparative study","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Zaiem","year":"2023"},{"key":"2024041219174549200_bib105","doi-asserted-by":"publisher","DOI":"10.2139\/ssrn.4733627","article-title":"Speech self-supervised representations benchmarking: A case for larger probing heads","author":"Zaiem","year":"2023","journal-title":"arXiv preprint 
arXiv:2308.14456"},{"key":"2024041219174549200_bib106","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.findings-emnlp.81","article-title":"Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings","volume-title":"Findings of Empirical Methods in Natural Language Processing (EMNLP)","author":"Zhu","year":"2022"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00656\/2362252\/tacl_a_00656.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00656\/2362252\/tacl_a_00656.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,4,12]],"date-time":"2024-04-12T19:18:10Z","timestamp":1712949490000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00656\/120586\/What-Do-Self-Supervised-Speech-Models-Know-About"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024]]},"references-count":106,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00656","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2024]]},"published":{"date-parts":[[2024]]}}}