{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T20:40:07Z","timestamp":1751488807712,"version":"3.41.0"},"reference-count":53,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T00:00:00Z","timestamp":1751414400000},"content-version":"vor","delay-in-days":182,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,6,27]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Curating datasets that span multiple languages is challenging. To make the collection more scalable, researchers often incorporate one or more imperfect classifiers in the process, like language identification models. These models, however, are prone to failure, resulting in some language partitions being unreliable for downstream tasks. We introduce a statistical test, the Preference Proportion Test, for identifying such unreliable partitions. By annotating only 20 samples for a language partition, we are able to identify systematic transcription errors for 10 language partitions in a recent large multilingual transcribed audio archive, X-IPAPack (Zhu et al., 2024). We find that filtering these low-quality partitions out when training models for the downstream task of phonetic transcription brings substantial benefits, most notably a 25.7% relative improvement on transcribing recordings in out-of-distribution languages. 
Our work contributes an effective method for auditing multilingual audio archives.<\/jats:p>","DOI":"10.1162\/tacl_a_00759","type":"journal-article","created":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T20:07:14Z","timestamp":1751486834000},"page":"595-612","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":0,"title":["A Comparative Approach for Auditing Multilingual Phonetic Transcript Archives"],"prefix":"10.1162","volume":"13","author":[{"given":"Farhan","family":"Samir","sequence":"first","affiliation":[{"name":"University of British Columbia, Canada. fsamir@mail.ubc.ca"},{"name":"Vector Institute for AI, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Emily P.","family":"Ahn","sequence":"additional","affiliation":[{"name":"University of Washington, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shreya","family":"Prakash","sequence":"additional","affiliation":[{"name":"University of Washington, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"M\u00e1rton","family":"Soskuthy","sequence":"additional","affiliation":[{"name":"University of British Columbia, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Vered","family":"Shwartz","sequence":"additional","affiliation":[{"name":"University of British Columbia, Canada"},{"name":"Vector Institute for AI, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jian","family":"Zhu","sequence":"additional","affiliation":[{"name":"University of British Columbia, Canada"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"281","published-online":{"date-parts":[[2025,6,27]]},"reference":[{"key":"2025070216070546500_bib1","first-page":"5286","article-title":"Voxcommunis: A corpus for cross-linguistic phonetic analysis","volume-title":"Proceedings of the Thirteenth Language Resources and Evaluation 
Conference","author":"Ahn","year":"2022"},{"key":"2025070216070546500_bib2","doi-asserted-by":"publisher","first-page":"1771","DOI":"10.18653\/v1\/2024.naacl-long.100","article-title":"BUFFET: Benchmarking large language models for few-shot cross-lingual transfer","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)","author":"Asai","year":"2024"},{"key":"2025070216070546500_bib3","doi-asserted-by":"publisher","first-page":"2278","DOI":"10.21437\/Interspeech.2022-143","article-title":"XLS-R: Self-supervised cross-lingual speech representation learning at scale","volume-title":"Interspeech 2022","author":"Babu","year":"2022"},{"key":"2025070216070546500_bib4","doi-asserted-by":"publisher","first-page":"610","DOI":"10.1145\/3442188.3445922","article-title":"On the dangers of stochastic parrots: Can language models be too big?","volume-title":"Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency","author":"Bender","year":"2021"},{"key":"2025070216070546500_bib5","volume-title":"Race after Technology: Abolitionist Tools for the New Jim Code","author":"Benjamin","year":"2019"},{"issue":"4","key":"2025070216070546500_bib6","doi-asserted-by":"publisher","first-page":"713","DOI":"10.1162\/coli_a_00387","article-title":"Sparse transcription","volume":"46","author":"Bird","year":"2021","journal-title":"Computational Linguistics"},{"key":"2025070216070546500_bib7","doi-asserted-by":"publisher","first-page":"5486","DOI":"10.18653\/v1\/2022.acl-long.376","article-title":"Systematic inequalities in language technology performance across the world\u2019s languages","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long 
Papers)","author":"Blasi","year":"2022"},{"issue":"3\/4","key":"2025070216070546500_bib8","doi-asserted-by":"publisher","first-page":"324","DOI":"10.1093\/biomet\/39.3-4.324","article-title":"Rank analysis of incomplete block designs: I. the method of paired comparisons","volume":"39","author":"Bradley","year":"1952","journal-title":"Biometrika"},{"key":"2025070216070546500_bib9","doi-asserted-by":"publisher","first-page":"9263","DOI":"10.18653\/v1\/2020.emnlp-main.745","article-title":"With little power comes great responsibility","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Card","year":"2020"},{"key":"2025070216070546500_bib10","article-title":"Deep reinforcement learning from human preferences","volume-title":"Advances in Neural Information Processing Systems","author":"Christiano","year":"2017"},{"issue":"3","key":"2025070216070546500_bib11","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1111\/1467-8721.ep10768783","article-title":"Statistical power analysis","volume":"1","author":"Cohen","year":"1992","journal-title":"Current Directions in Psychological Science"},{"key":"2025070216070546500_bib12","doi-asserted-by":"publisher","first-page":"798","DOI":"10.1109\/SLT54892.2023.10023141","article-title":"FLEURS: Few-shot learning evaluation of universal representations of speech","volume-title":"2022 IEEE Spoken Language Technology Workshop (SLT)","author":"Conneau","year":"2023"},{"issue":"4","key":"2025070216070546500_bib13","doi-asserted-by":"publisher","DOI":"10.1016\/j.patter.2024.100966","article-title":"An archival perspective on pretraining data","volume":"5","author":"Desai","year":"2024","journal-title":"Patterns"},{"key":"2025070216070546500_bib14","doi-asserted-by":"publisher","first-page":"1286","DOI":"10.18653\/v1\/2021.emnlp-main.98","article-title":"Documenting large webtext corpora: A case study on the colossal clean crawled 
corpus","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Dodge","year":"2021"},{"key":"2025070216070546500_bib15","doi-asserted-by":"publisher","first-page":"1383","DOI":"10.18653\/v1\/P18-1128","article-title":"The hitchhiker\u2019s guide to testing statistical significance in natural language processing","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Dror","year":"2018"},{"key":"2025070216070546500_bib16","doi-asserted-by":"publisher","DOI":"10.5281\/zenodo.7385533","volume-title":"WALS Online (v2020.3)","author":"Dryer","year":"2013"},{"issue":"5","key":"2025070216070546500_bib17","doi-asserted-by":"publisher","first-page":"1322","DOI":"10.1073\/pnas.1417413112","article-title":"Climate, vocal folds, and tonal languages: Connecting the physiological and geographic dots","volume":"112","author":"Everett","year":"2015","journal-title":"Proceedings of the National Academy of Sciences"},{"key":"2025070216070546500_bib18","doi-asserted-by":"publisher","first-page":"13246","DOI":"10.1109\/ICASSP48485.2024.10446286","article-title":"Updated corpora and benchmarks for long-form speech recognition","volume-title":"ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Fox","year":"2024"},{"key":"2025070216070546500_bib19","doi-asserted-by":"publisher","first-page":"2258","DOI":"10.21437\/Interspeech.2023-772","article-title":"Allophant: Cross-lingual phoneme recognition with articulatory attributes","volume-title":"Interspeech 2023","author":"Glocker","year":"2023"},{"key":"2025070216070546500_bib20","article-title":"Data mixture inference: What do BPE tokenizers reveal about their training data?","volume-title":"The Thirty-eighth Annual Conference on Neural Information Processing 
Systems","author":"Hayase","year":"2024"},{"key":"2025070216070546500_bib21","article-title":"A material lens on coloniality in NLP","author":"Held","year":"2023","journal-title":"arXiv preprint arXiv:2311.08391"},{"key":"2025070216070546500_bib22","article-title":"Training compute-optimal large language models","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"Hoffmann","year":"2022"},{"key":"2025070216070546500_bib23","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3689904.3694702","article-title":"Who\u2019s in and who\u2019s out? A case study of multimodal CLIP-filtering in DataComp","volume-title":"Proceedings of the 4th ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization","author":"Hong","year":"2024"},{"issue":"9","key":"2025070216070546500_bib24","doi-asserted-by":"publisher","first-page":"1518","DOI":"10.1016\/j.lingua.2011.04.003","article-title":"A panchronic study of aspirated fricatives, with new evidence from Pumi","volume":"121","author":"Jacques","year":"2011","journal-title":"Lingua"},{"key":"2025070216070546500_bib25","doi-asserted-by":"publisher","DOI":"10.4324\/9780203945315","volume-title":"The Indo-Aryan Languages","author":"Jain","year":"2007"},{"key":"2025070216070546500_bib26","volume-title":"Algorithm design","author":"Kleinberg","year":"2006"},{"key":"2025070216070546500_bib27","first-page":"726","article-title":"Findings of the WMT 2020 shared task on parallel corpus filtering and alignment","volume-title":"Proceedings of the Fifth Conference on Machine Translation","author":"Koehn","year":"2020"},{"key":"2025070216070546500_bib28","doi-asserted-by":"publisher","first-page":"50","DOI":"10.1162\/tacl_a_00447","article-title":"Quality at a glance: An audit of web-crawled multilingual datasets","volume":"10","author":"Kreutzer","year":"2022","journal-title":"Transactions of the Association for Computational 
Linguistics"},{"key":"2025070216070546500_bib29","article-title":"MADLAD-400: A multilingual and document-level large audited dataset","volume":"36","author":"Kudugunta","year":"2024","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2025070216070546500_bib30","doi-asserted-by":"publisher","first-page":"2058","DOI":"10.18653\/v1\/2021.emnlp-main.157","article-title":"Local word discovery for interactive transcription","volume-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing","author":"Lane","year":"2021"},{"key":"2025070216070546500_bib31","doi-asserted-by":"publisher","first-page":"8249","DOI":"10.1109\/ICASSP40776.2020.9054362","article-title":"Universal phone recognition with a multilingual allophone system","volume-title":"ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Li","year":"2020"},{"key":"2025070216070546500_bib32","doi-asserted-by":"publisher","first-page":"2461","DOI":"10.21437\/Interspeech.2021-1803","article-title":"Hierarchical phone recognition with compositional phonetics","volume-title":"Interspeech 2021","author":"Li","year":"2021"},{"key":"2025070216070546500_bib33","article-title":"LingPy. 
A Python library for historical linguistics","author":"List","year":"2016","journal-title":"Max Planck Institute for the Science of Human History: Jena"},{"key":"2025070216070546500_bib34","doi-asserted-by":"publisher","DOI":"10.3389\/fpsyg.2020.570895","article-title":"Re-evaluating phoneme frequencies","volume":"11","author":"Macklin-Cordes","year":"2020","journal-title":"Frontiers in Psychology"},{"key":"2025070216070546500_bib35","doi-asserted-by":"publisher","first-page":"4214","DOI":"10.21437\/Interspeech.2021-1966","article-title":"Few-shot keyword spotting in any language","volume-title":"Interspeech 2021","author":"Mazumder","year":"2021"},{"key":"2025070216070546500_bib36","article-title":"PHOIBLE 2.0","volume":"10","author":"Moran","year":"2019","journal-title":"Jena: Max Planck Institute for the Science of Human History"},{"key":"2025070216070546500_bib37","first-page":"3475","article-title":"PanPhon: A resource for mapping IPA segments to articulatory feature vectors","volume-title":"Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers","author":"Mortensen","year":"2016"},{"key":"2025070216070546500_bib38","doi-asserted-by":"publisher","first-page":"15991","DOI":"10.18653\/v1\/2023.acl-long.891","article-title":"Crosslingual generalization through multitask finetuning","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Muennighoff","year":"2023"},{"issue":"3","key":"2025070216070546500_bib39","doi-asserted-by":"publisher","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","article-title":"A general method applicable to the search for similarities in the amino acid sequence of two proteins","volume":"48","author":"Needleman","year":"1970","journal-title":"Journal of Molecular 
Biology"},{"key":"2025070216070546500_bib40","doi-asserted-by":"publisher","first-page":"17753","DOI":"10.18653\/v1\/2024.emnlp-main.983","article-title":"The zeno\u2019s paradox of \u2018low-resource\u2019 languages","volume-title":"Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing","author":"Nigatu","year":"2024"},{"key":"2025070216070546500_bib41","first-page":"2657","article-title":"Building a time-aligned cross-linguistic reference corpus from language documentation data (DoReCo)","volume-title":"Proceedings of the Twelfth Language Resources and Evaluation Conference","author":"Paschen","year":"2020"},{"key":"2025070216070546500_bib42","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2024-1194","article-title":"Owsm v3.1: Better and faster open whisper-style speech models based on e-branchformer","volume":"abs\/2401.16658","author":"Peng","year":"2024","journal-title":"Interspeech"},{"key":"2025070216070546500_bib43","first-page":"28492","article-title":"Robust speech recognition via large-scale weak supervision","volume-title":"Proceedings of the 40th International Conference on Machine Learning","author":"Radford","year":"2023"},{"key":"2025070216070546500_bib44","doi-asserted-by":"publisher","first-page":"4526","DOI":"10.18653\/v1\/2020.acl-main.415","article-title":"A corpus for large-scale phonetic typology","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Salesky","year":"2020"},{"issue":"CSCW2","key":"2025070216070546500_bib45","doi-asserted-by":"publisher","DOI":"10.1145\/3476058","article-title":"Do datasets have politics? 
Disciplinary values in computer vision dataset development","volume":"5","author":"Scheuerman","year":"2021","journal-title":"Proceedings of the ACM on Human-Computer Interaction"},{"key":"2025070216070546500_bib46","doi-asserted-by":"publisher","first-page":"2548","DOI":"10.21437\/Interspeech.2023-2584","article-title":"Universal automatic phonetic transcription into the international phonetic alphabet","volume-title":"Interspeech 2023","author":"Taguchi","year":"2023"},{"key":"2025070216070546500_bib47","doi-asserted-by":"publisher","first-page":"3959","DOI":"10.21437\/Interspeech.2024-1938","article-title":"On the effects of heterogeneous data sources on speech-to-text foundation models","volume-title":"Interspeech 2024","author":"Tian","year":"2024"},{"key":"2025070216070546500_bib48","doi-asserted-by":"publisher","first-page":"993","DOI":"10.18653\/v1\/2021.acl-long.80","article-title":"VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Wang","year":"2021"},{"key":"2025070216070546500_bib49","doi-asserted-by":"publisher","first-page":"2113","DOI":"10.21437\/Interspeech.2022-60","article-title":"Simple and effective zero-shot cross-lingual phoneme recognition","volume-title":"Interspeech 2022","author":"Xu","year":"2022"},{"key":"2025070216070546500_bib50","doi-asserted-by":"publisher","first-page":"483","DOI":"10.18653\/v1\/2021.naacl-main.41","article-title":"mT5: A massively multilingual pre-trained text-to-text transformer","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language 
Technologies","author":"Xue","year":"2021"},{"key":"2025070216070546500_bib51","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/ICASSP49660.2025.10888774","article-title":"Scaling a simple approach to zero-shot speech recognition","volume-title":"ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Zhao","year":"2025"},{"key":"2025070216070546500_bib52","doi-asserted-by":"crossref","first-page":"750","DOI":"10.18653\/v1\/2024.naacl-long.43","article-title":"The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)","author":"Zhu","year":"2024"},{"key":"2025070216070546500_bib53","doi-asserted-by":"publisher","first-page":"446","DOI":"10.21437\/Interspeech.2022-538","article-title":"ByT5 model for massively multilingual grapheme-to-phoneme conversion","volume-title":"Interspeech 2022","author":"Zhu","year":"2022"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00759\/2534926\/tacl_a_00759.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00759\/2534926\/tacl_a_00759.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,7,2]],"date-time":"2025-07-02T20:07:19Z","timestamp":1751486839000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00759\/131563\/A-Comparative-Approach-for-Auditing-Multilingual"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025]]},"references-count":53,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00759","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2025]]},"published":{"date-parts":[[2025]]}}}