{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T22:18:48Z","timestamp":1757629128811,"version":"3.44.0"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"9","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:p>Mother tongues and regional dialects have a substantial impact on pronunciation, giving rise to a range of complex and distinctive accents. This complexity increases in a linguistically diverse country such as India, where code-mixed languages are common, necessitating an Automatic Speech Recognition (ASR) system capable of accommodating these variations effectively. To address this, we propose a cross-accented Hinglish speech recognition task that uses Hindi+English code-mixed conversational data to evaluate how well a model can adapt to different accents. Our accent-agnostic method extends the model-agnostic meta-learning (MAML) technique to enable quick adaptation to unseen accents. Extensive experiments demonstrate the effectiveness of our method: it outperforms joint multi-accent training in both mixed and cross-region settings, improving word error rate (WER) by 3-5%.
Further, we investigate the effects of few-shot fine-tuning on mixed and cross-region samples, ultimately revealing important insights into the pronunciation of Hindi-accented speech across different geographical regions of the country.<\/jats:p>","DOI":"10.1145\/3748322","type":"journal-article","created":{"date-parts":[[2025,7,10]],"date-time":"2025-07-10T15:45:57Z","timestamp":1752162357000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Hinglish Cross-Accent Model Agnostic Meta-Learning Automatic Speech Recognition"],"prefix":"10.1145","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-1899-0211","authenticated-orcid":false,"given":"Sanskar","family":"Singh","sequence":"first","affiliation":[{"name":"IIIT Naya Raipur","place":["Naya Raipur, India"]}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-2838-8622","authenticated-orcid":false,"given":"Shivam","family":"Kushwaha","sequence":"additional","affiliation":[{"name":"IIIT Naya Raipur","place":["Naya Raipur, India"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2606-5959","authenticated-orcid":false,"given":"Avantika","family":"Singh","sequence":"additional","affiliation":[{"name":"IIIT Naya Raipur","place":["Naya Raipur, India"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7146-9012","authenticated-orcid":false,"given":"Shaifu","family":"Gupta","sequence":"additional","affiliation":[{"name":"IIT Jammu","place":["Jammu, India"]}]}],"member":"320","published-online":{"date-parts":[[2025,9,10]]},"reference":[{"key":"e_1_3_2_2_2","article-title":"Whisper turns stronger: Augmenting Wav2Vec 2.0 for superior ASR in low-resource languages","volume":"2501","author":"Anidjar Or Haim","year":"2025","unstructured":"Or Haim Anidjar, Revital Marbel, and Roi Yozevitch. 2025. Whisper turns stronger: Augmenting Wav2Vec 2.0 for superior ASR in low-resource languages. 
CoRR abs\/2501.00425 (2025).","journal-title":"CoRR"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2023.3300790"},{"key":"e_1_3_2_4_2","unstructured":"Kaushal Santosh Bhogale Sai Sundaresan Abhigyan Raman Tahir Javed Mitesh M. Khapra and Pratyush Kumar. 2023. Vistaar: Diverse benchmarks and training sets for indian language ASR. arxiv:2305.15386 [cs.CL] https:\/\/arxiv.org\/abs\/2305.15386"},{"key":"e_1_3_2_5_2","unstructured":"Common Voice by Mozilla is a project that provides open datasets of voice recordings to help make speech recognition technologies accessible to all. [n. d.]. https:\/\/commonvoice.mozilla.org\/en\/datasets"},{"key":"e_1_3_2_6_2","first-page":"5884","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE","author":"Dong L.","year":"2018","unstructured":"L. Dong, S. Xu, and B. Xu. 2018. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 5884\u20135888."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.5555\/3305381.3305498"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.5555\/3327546.3327622"},{"key":"e_1_3_2_9_2","unstructured":"FutureBeeAI is a platform offering diverse AI datasets and services for machine learning and artificial intelligence applications. [n.d.]. https:\/\/www.futurebeeai.com\/"},{"key":"e_1_3_2_10_2","doi-asserted-by":"crossref","first-page":"3622","DOI":"10.18653\/v1\/D18-1398","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Gu J.","year":"2018","unstructured":"J. Gu, Y. Wang, Y. Chen, V. O. Li, and K. Cho. 2018. Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). 
3622\u20133631."},{"key":"e_1_3_2_11_2","unstructured":"Anirudh Gupta Harveen Singh Chadha Priyanshi Shah Neeraj Chhimwal Ankur Dhuriya Rishabh Gaur and Vivek Raghavan. 2022. CLSRIL-23: Cross Lingual Speech Representations for Indic Languages. arxiv:2107.07402 [cs.CL] https:\/\/arxiv.org\/abs\/2107.07402"},{"key":"e_1_3_2_12_2","first-page":"949","volume-title":"Proceedings of Interspeech","author":"Hori T.","year":"2017","unstructured":"T. Hori, S. Watanabe, Y. Zhang, and W. Chan. 2017. Advances in Joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. In Proceedings of Interspeech. 949\u2013953."},{"key":"e_1_3_2_13_2","article-title":"Meta learning for end-to-end low-resource speech recognition","author":"Hsu J.-Y.","year":"2019","unstructured":"J.-Y. Hsu, Y.-J. Chen, and H.-y. Lee. 2019. Meta learning for end-to-end low-resource speech recognition. arXiv preprint arXiv:1910.12094 (2019).","journal-title":"arXiv preprint arXiv:1910.12094"},{"key":"e_1_3_2_14_2","first-page":"732","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)","author":"Huang P.-S.","year":"2018","unstructured":"P.-S. Huang, C. Wang, R. Singh, W.-t. Yih, and X. He. 2018. Natural language to structured query generation via meta-learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). 732\u2013738."},{"key":"e_1_3_2_15_2","first-page":"2454","volume-title":"Proceedings of Interspeech","author":"Jain A.","year":"2018","unstructured":"A. Jain, M. Upreti, and P. Jyothi. 2018. Improved accented speech recognition using accent embeddings and multi-task learning. In Proceedings of Interspeech. 
2454\u20132458."},{"key":"e_1_3_2_16_2","first-page":"779","volume-title":"Proceedings of Interspeech","author":"Jain A.","year":"2019","unstructured":"A. Jain, M. Upreti, and P. Jyothi. 2019. A multi-accent acoustic model using mixture of experts for speech recognition. In Proceedings of Interspeech. 779\u2013783."},{"key":"e_1_3_2_17_2","doi-asserted-by":"crossref","unstructured":"T. Javed J. A. Nawale E. I. George S. Joshi K. S. Bhogale D. Mehendale I. V. Sethi A. Ananthanarayanan H. Faquih P. Palit et al. 2024. IndicVoices: Towards Building an Inclusive Multilingual Speech Dataset for Indian Languages. arxiv:2403.01926","DOI":"10.18653\/v1\/2024.findings-acl.639"},{"key":"e_1_3_2_18_2","unstructured":"Jiaaro. 2023. Pydub: Python Library for Audio Processing. https:\/\/github.com\/jiaaro\/pydub"},{"key":"e_1_3_2_19_2","first-page":"221","volume-title":"IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)","author":"Kat L. W.","year":"1999","unstructured":"L. W. Kat and P. Fung. 1999. Fast accent identification and accented speech recognition. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 221\u2013224."},{"key":"e_1_3_2_20_2","first-page":"867","volume-title":"Proc. Interspeech","author":"Klejch O.","year":"2018","unstructured":"O. Klejch, J. Fainberg, and P. Bell. 2018. Learning to adapt: A meta-learning approach for speaker adaptation. In Proc. Interspeech. 867\u2013871."},{"key":"e_1_3_2_21_2","article-title":"Speaker adaptive training using model agnostic meta-learning","author":"Klejch O.","year":"2019","unstructured":"O. Klejch, J. Fainberg, P. Bell, and S. Renals. 2019. Speaker adaptive training using model agnostic meta-learning. 
arXiv preprint arXiv:1910.10605 (2019).","journal-title":"arXiv preprint arXiv:1910.10605"},{"key":"e_1_3_2_22_2","first-page":"47","volume-title":"Proceedings of the First Workshop on Financial Technology and Natural Language Processing","author":"Lin Z.","year":"2019","unstructured":"Z. Lin, A. Madotto, G. I. Winata, Z. Liu, Y. Xu, C. Gao, and P. Fung. 2019. Learning to learn sales prediction with social media sentiment. In Proceedings of the First Workshop on Financial Technology and Natural Language Processing. 47\u201353."},{"key":"e_1_3_2_23_2","first-page":"1270","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Madotto Andrea","year":"2019","unstructured":"Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, and Pascale Fung. 2019. Personalizing dialogue agents via meta-learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1270\u20131280."},{"key":"e_1_3_2_24_2","volume-title":"Fifteenth annual conference of the international speech communication association","author":"Najafian M.","year":"2014","unstructured":"M. Najafian, A. DeMarco, S. Cox, and M. Russell. 2014. Unsupervised model selection for recognition of regional accented speech. In Fifteenth annual conference of the international speech communication association."},{"key":"e_1_3_2_25_2","first-page":"2504","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Narayanan A.","year":"2014","unstructured":"A. Narayanan and D. Wang. 2014. Joint noise adaptive training for robust automatic speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2504\u20132508."},{"key":"e_1_3_2_26_2","unstructured":"Alex Nichol Joshua Achiam and John Schulman. 2018. On First-Order Meta-Learning Algorithms. 
arxiv:1803.02999 [cs.LG] https:\/\/arxiv.org\/abs\/1803.02999"},{"key":"e_1_3_2_27_2","unstructured":"OpenSLR provides speech and language resources including datasets for acoustic models and speech recognition. [n. d.]. https:\/\/www.openslr.org\/103\/"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1253"},{"key":"e_1_3_2_29_2","volume-title":"Journal of Systemics, Cybernetics and Informatics","author":"Rao K.","year":"2011","unstructured":"K. Rao and Shashidhar Koolagudi. 2011. Identification of hindi dialects and emotions using spectral and prosodic features of speech. In Journal of Systemics, Cybernetics and Informatics, Vol. 9."},{"key":"e_1_3_2_30_2","unstructured":"Mirco Ravanelli Titouan Parcollet Peter Plantinga Aku Rouhe Samuele Cornell Loren Lugosch Cem Subakan Nauman Dawalatabad Abdelwahab Heba Jianyuan Zhong et al. 2021. SpeechBrain: A General-Purpose Speech Toolkit. arxiv:2106.04624 [eess.AS] arXiv:2106.04624."},{"key":"e_1_3_2_31_2","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR)","author":"Ravi S.","year":"2016","unstructured":"S. Ravi and H. Larochelle. 2016. Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_2_32_2","first-page":"1842","volume-title":"Proceedings of the International Conference on Machine Learning (ICML)","author":"Santoro A.","year":"2016","unstructured":"A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In Proceedings of the International Conference on Machine Learning (ICML). 1842\u20131850."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1992.4.1.131"},{"key":"e_1_3_2_34_2","unstructured":"Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. 
arxiv:1409.1556 [cs.CV]"},{"key":"e_1_3_2_35_2","series-title":"Advances in Intelligent Systems and Computing","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1007\/978-3-319-04960-1_14","volume-title":"Advances in Signal Processing and Intelligent Recognition Systems","author":"Sinha Shweta","year":"2014","unstructured":"Shweta Sinha, Aruna Jain, and Shyam S. Agrawal. 2014. Speech Processing for Hindi Dialect Recognition. In Advances in Signal Processing and Intelligent Recognition Systems(Advances in Intelligent Systems and Computing, Vol. 264), Sabu M. Thampi, Alexander F. Gelbukh, and Jayanta Mukhopadhyay (Eds.). Springer, 161\u2013169."},{"key":"e_1_3_2_36_2","first-page":"4854","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","author":"Sun S.","year":"2018","unstructured":"S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie. 2018. Domain adversarial training for accented speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 4854\u20134858."},{"key":"e_1_3_2_37_2","article-title":"Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier","author":"Team Silero","year":"2021","unstructured":"Silero Team. 2021. Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier. https:\/\/github.com\/snakers4\/silero-vad.","journal-title":"https:\/\/github.com\/snakers4\/silero-vad"},{"key":"e_1_3_2_38_2","doi-asserted-by":"crossref","first-page":"2382","DOI":"10.1109\/ICACCI.2018.8554413","article-title":"Code-mixing: A brief survey","author":"Thara S.","year":"2018","unstructured":"S. Thara and Prabaharan Poornachandran. 2018. Code-mixing: A brief survey. 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI) (2018), 2382\u20132388. 
https:\/\/api.semanticscholar.org\/CorpusID:54437427","journal-title":"2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI)"},{"key":"e_1_3_2_39_2","unstructured":"This Hugging Face model asr-whisper-large-v2-commonvoice-hi is a pre-trained speech recognition model fine-tuned for Hindi based on whisper architecture. [n.d.]. https:\/\/huggingface.co\/speechbrain\/asr-whisper-large-v2-commonvoice-hi"},{"key":"e_1_3_2_40_2","unstructured":"This Hugging Face model xlsr-53-wav2vec-hi is a pre-trained speech recognition model fine-tuned for Hindi. [n.d.]. https:\/\/huggingface.co\/harshit345\/xlsr-53-wav2vec-hi"},{"key":"e_1_3_2_41_2","volume-title":"Learning to Learn","author":"Thrun S.","year":"2012","unstructured":"S. Thrun and L. Pratt. 2012. Learning to Learn. Springer Science & Business Media."},{"key":"e_1_3_2_42_2","first-page":"5998","article-title":"Attention Is All You Need","volume":"30","author":"Vaswani A.","year":"2017","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \u0141. Kaiser, and I. Polosukhin. 2017. Attention Is All You Need. Advances in Neural Information Processing Systems 30 (2017), 5998\u20136008.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2023-757"},{"key":"e_1_3_2_44_2","first-page":"2140","volume-title":"Proc. Interspeech","author":"Viglino T.","year":"2019","unstructured":"T. Viglino, P. Motlicek, and M. Cernak. 2019. End-to-end accented speech recognition. In Proc. Interspeech. 2140\u20132144."},{"key":"e_1_3_2_45_2","first-page":"3630","volume-title":"Advances in Neural Information Processing Systems","author":"Vinyals O.","year":"2016","unstructured":"O. Vinyals, C. Blundell, T. Lillicrap, and D. Wierstra. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems. 
3630\u20133638."},{"key":"e_1_3_2_46_2","unstructured":"Voice Activity Detector from Google which is reportedly one of the best available: it\u2019s fast modern and free. [n.d.]. https:\/\/github.com\/wiseman\/py-webrtcvad"},{"key":"e_1_3_2_47_2","unstructured":"Disong Wang Jianwei Yu Xixin Wu Lifa Sun Xunying Liu and Helen Meng. 2020. Improved End-to-End Dysarthric Speech Recognition via Meta-learning Based Model Re-initialization. arxiv:2011.01686 [eess.AS] https:\/\/arxiv.org\/abs\/2011.01686"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2024.3414340"},{"key":"e_1_3_2_49_2","article-title":"Lightweight and efficient end-to-end speech recognition using low-rank transformer","author":"Winata G. I.","year":"2019","unstructured":"G. I. Winata, S. Cahyawijaya, Z. Lin, Z. Liu, and P. Fung. 2019. Lightweight and efficient end-to-end speech recognition using low-rank transformer. arXiv preprint arXiv:1910.13923 (2019).","journal-title":"arXiv preprint arXiv:1910.13923"},{"key":"e_1_3_2_50_2","doi-asserted-by":"crossref","unstructured":"Genta Indra Winata Samuel Cahyawijaya Zihan Liu Zhaojiang Lin Andrea Madotto Peng Xu and Pascale Fung. 2020. Learning Fast Adaptation on Cross-Accented Speech Recognition. arxiv:2003.01901 [eess.AS] https:\/\/arxiv.org\/abs\/2003.01901","DOI":"10.21437\/Interspeech.2020-45"},{"key":"e_1_3_2_51_2","doi-asserted-by":"crossref","first-page":"271","DOI":"10.18653\/v1\/K19-1026","volume-title":"Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)","author":"Winata G. I.","year":"2019","unstructured":"G. I. Winata, A. Madotto, C.-S. Wu, and P. Fung. 2019. Code-switched language models using neural based synthetic data from parallel sentences. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). 
271\u2013280."},{"key":"e_1_3_2_52_2","first-page":"1206","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)","author":"Yu M.","year":"2018","unstructured":"M. Yu, X. Guo, J. Yi, S. Chang, S. Potdar, Y. Cheng, G. Tesauro, H. Wang, and B. Zhou. 2018. Diverse few-shot text classification with multiple metrics. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 1206\u20131215."},{"key":"e_1_3_2_53_2","volume-title":"Ninth European Conference on Speech Communication and Technology","author":"Zheng Y.","year":"2005","unstructured":"Y. Zheng, R. Sproat, L. Gu, I. Shafran, H. Zhou, Y. Su, D. Jurafsky, R. Starr, and S.-Y. Yoon. 2005. Accent detection and speech recognition for shanghai-accented mandarin. In Ninth European Conference on Speech Communication and Technology."}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information 
Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3748322","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,10]],"date-time":"2025-09-10T13:29:39Z","timestamp":1757510979000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3748322"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,10]]},"references-count":52,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,9,30]]}},"alternative-id":["10.1145\/3748322"],"URL":"https:\/\/doi.org\/10.1145\/3748322","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2025,9,10]]},"assertion":[{"value":"2024-08-02","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-15","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}