{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,24]],"date-time":"2026-02-24T06:31:37Z","timestamp":1771914697329,"version":"3.50.1"},"reference-count":56,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2023,3,24]],"date-time":"2023-03-24T00:00:00Z","timestamp":1679616000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62001315 and U20A20161"],"award-info":[{"award-number":["62001315 and U20A20161"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Open Fund of Key Laboratory of Flight Techniques and Flight Safety, Civil Aviation Administration of China","award":["FZ2021KF04"],"award-info":[{"award-number":["FZ2021KF04"]}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"crossref","award":["2021SCU12050"],"award-info":[{"award-number":["2021SCU12050"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2023,4,30]]},"abstract":"<jats:p>Automatic spoken instruction understanding (SIU) of the controller-pilot conversations in the air traffic control (ATC) requires not only recognizing the words and semantics of the speech but also determining the role of the speaker. However, few of the published works on the automatic understanding systems in air traffic communication focus on speaker role identification (SRI). In this article, we formulate the SRI task of controller-pilot communication as a binary classification problem. 
Furthermore, the text-based, speech-based, and speech-and-text-based multi-modal methods are proposed to achieve a comprehensive comparison of the SRI task. To ablate the impacts of the comparative approaches, various advanced neural network architectures are applied to optimize the implementation of text-based and speech-based methods. Most importantly, a multi-modal speaker role identification network (MMSRINet) is designed to achieve the SRI task by considering both the speech and textual modality features. To aggregate modality features, the modal fusion module is proposed to fuse and squeeze acoustic and textual representations by modal attention mechanism and self-attention pooling layer, respectively. Finally, the comparative approaches are validated on the ATCSpeech corpus collected from a real-world ATC environment. The experimental results demonstrate that all the comparative approaches worked for the SRI task, and the proposed MMSRINet shows competitive performance and robustness compared with the other methods on both seen and unseen data, achieving 98.56% and 98.08% accuracy, respectively.<\/jats:p>","DOI":"10.1145\/3572792","type":"journal-article","created":{"date-parts":[[2022,11,24]],"date-time":"2022-11-24T11:45:40Z","timestamp":1669290340000},"page":"1-17","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":13,"title":["A Comparative Study of Speaker Role Identification in Air Traffic Communication Using Deep Learning Approaches"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-0393-5197","authenticated-orcid":false,"given":"Dongyue","family":"Guo","sequence":"first","affiliation":[{"name":"National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5491-1745","authenticated-orcid":false,"given":"Jianwei","family":"Zhang","sequence":"additional","affiliation":[{"name":"College of Computer Science, Sichuan University, Chengdu, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5866-6492","authenticated-orcid":false,"given":"Bo","family":"Yang","sequence":"additional","affiliation":[{"name":"College of Computer Science, Sichuan University, Chengdu, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7194-5023","authenticated-orcid":false,"given":"Yi","family":"Lin","sequence":"additional","affiliation":[{"name":"College of Computer Science, Sichuan University, Chengdu, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,3,24]]},"reference":[{"key":"e_1_3_1_2_1","volume-title":"International Conference on Advances in Neural Information Processing Systems","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. In International Conference on Advances in Neural Information Processing Systems."},{"issue":"2","key":"e_1_3_1_3_1","doi-asserted-by":"crossref","first-page":"423","DOI":"10.1109\/TPAMI.2018.2798607","article-title":"Multimodal machine learning: A survey and taxonomy","volume":"41","author":"Baltru\u0161aitis Tadas","year":"2019","unstructured":"Tadas Baltru\u0161aitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2019. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 2 (Feb.2019), 423\u2013443. DOI:https:\/\/doi.org\/10.1109\/TPAMI.2018.2798607","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"issue":"6","key":"e_1_3_1_4_1","doi-asserted-by":"crossref","first-page":"1291","DOI":"10.1109\/TASLP.2017.2690575","article-title":"Convolutional recurrent neural networks for polyphonic sound event detection","volume":"25","author":"Cak\u0131r Emre","year":"2017","unstructured":"Emre Cak\u0131r, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen. 2017. Convolutional recurrent neural networks for polyphonic sound event detection. IEEE\/ACM Trans. Audio, Speech Lang. Process. 25, 6 (2017), 1291\u20131303.","journal-title":"IEEE\/ACM Trans. Audio, Speech Lang. Process."},{"key":"e_1_3_1_5_1","first-page":"2392","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201917)","author":"Choi Keunwoo","year":"2017","unstructured":"Keunwoo Choi, Gy\u00f6rgy Fazekas, Mark Sandler, and Kyunghyun Cho. 2017. Convolutional recurrent neural networks for music classification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201917). IEEE, 2392\u20132396."},{"key":"e_1_3_1_6_1","first-page":"4171","volume-title":"Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 4171\u20134186. DOI:https:\/\/doi.org\/10.18653\/v1\/n19-1423"},{"key":"e_1_3_1_7_1","first-page":"457","volume-title":"Conference on Empirical Methods in Natural Language Processing","author":"Fukui Akira","year":"2016","unstructured":"Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. 
Multimodal compact bilinear pooling for visual question answering and visual grounding. In Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 457\u2013468. DOI:https:\/\/doi.org\/10.18653\/v1\/d16-1044"},{"key":"e_1_3_1_8_1","first-page":"1","volume-title":"Digital Image Computing: Techniques and Applications (DICTA\u201918)","author":"Gallo Ignazio","year":"2018","unstructured":"Ignazio Gallo, Alessandro Calefati, Shah Nawaz, and Muhammad Kamran Janjua. 2018. Image and encoded text fusion for multi-modal classification. In Digital Image Computing: Techniques and Applications (DICTA\u201918). 1\u20137. DOI:https:\/\/doi.org\/10.1109\/DICTA.2018.8615789"},{"issue":"11","key":"e_1_3_1_9_1","doi-asserted-by":"crossref","first-page":"348","DOI":"10.3390\/aerospace8110348","article-title":"A context-aware language model to improve the speech recognition in air traffic control","volume":"8","author":"Guo Dongyue","year":"2021","unstructured":"Dongyue Guo, Zichen Zhang, Peng Fan, Jianwei Zhang, and Bo Yang. 2021. A context-aware language model to improve the speech recognition in air traffic control. Aerospace 8, 11 (2021), 348.","journal-title":"Aerospace"},{"issue":"1","key":"e_1_3_1_10_1","doi-asserted-by":"crossref","first-page":"208","DOI":"10.1109\/TASLP.2016.2632307","article-title":"Deep convolutional neural networks for predominant instrument recognition in polyphonic music","volume":"25","author":"Han Yoonchang","year":"2017","unstructured":"Yoonchang Han, Jae-Hun Kim, and Kyogu Lee. 2017. Deep convolutional neural networks for predominant instrument recognition in polyphonic music. IEEE ACM Trans. Audio Speech Lang. Process. 25, 1 (2017), 208\u2013221. DOI:https:\/\/doi.org\/10.1109\/TASLP.2016.2632307","journal-title":"IEEE ACM Trans. Audio Speech Lang. 
Process."},{"key":"e_1_3_1_11_1","doi-asserted-by":"crossref","first-page":"19","DOI":"10.1016\/j.specom.2015.12.004","article-title":"Unsupervised accent classification for deep data fusion of accent and language information","volume":"78","author":"Hansen John H. L.","year":"2016","unstructured":"John H. L. Hansen and Gang Liu. 2016. Unsupervised accent classification for deep data fusion of accent and language information. Speech Commun. 78 (2016), 19\u201333.","journal-title":"Speech Commun."},{"key":"e_1_3_1_12_1","first-page":"770","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916)","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201916). IEEE, 770\u2013778. DOI:https:\/\/doi.org\/10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_13_1","volume-title":"11th USA\/Europe Air Traffic Management Research and Development Seminar (ATM\u201915)","author":"Helmke Hartmut","year":"2015","unstructured":"Hartmut Helmke, J\u00fcrgen Rataj, Thorsten M\u00fchlhausen, Oliver Ohneiser, Heiko Ehr, Matthias Kleinert, Youssef Oualil, Marc Schulder, and D. Klakow. 2015. Assistant-based speech recognition for ATM applications. In 11th USA\/Europe Air Traffic Management Research and Development Seminar (ATM\u201915)."},{"key":"e_1_3_1_14_1","first-page":"131","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201917)","author":"Hershey Shawn","year":"2017","unstructured":"Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin Wilson. 2017. CNN architectures for large-scale audio classification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201917). 
131\u2013135. DOI:https:\/\/doi.org\/10.1109\/ICASSP.2017.7952132"},{"issue":"4","key":"e_1_3_1_15_1","doi-asserted-by":"crossref","first-page":"799","DOI":"10.1007\/s10772-020-09690-2","article-title":"Pattern recognition and features selection for speech emotion recognition model using deep learning","volume":"23","author":"Jermsittiparsert Kittisak","year":"2020","unstructured":"Kittisak Jermsittiparsert, Abdurrahman Abdurrahman, Parinya Siriattakul, Ludmila A. Sundeeva, Wahidah Hashim, Robbi Rahim, and Andino Maseleno. 2020. Pattern recognition and features selection for speech emotion recognition model using deep learning. Int. J. Speech Technol. 23, 4 (2020), 799\u2013806.","journal-title":"Int. J. Speech Technol."},{"key":"e_1_3_1_16_1","article-title":"RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification","author":"Jung Jee-weon","year":"2019","unstructured":"Jee-weon Jung, Hee-Soo Heo, Ju-ho Kim, Hye-jin Shim, and Ha-Jin Yu. 2019. RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. arXiv:1904.08104 [cs, eess] (July2019).","journal-title":"arXiv:1904.08104 [cs, eess]"},{"key":"e_1_3_1_17_1","first-page":"36","volume-title":"Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914)","author":"Kiela Douwe","year":"2014","unstructured":"Douwe Kiela and L\u00e9on Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Conference on Empirical Methods in Natural Language Processing (EMNLP\u201914). 36\u201345."},{"key":"e_1_3_1_18_1","first-page":"5198","volume-title":"32nd AAAI Conference on Artificial Intelligence, (AAAI\u201918)","author":"Kiela Douwe","year":"2018","unstructured":"Douwe Kiela, Edouard Grave, Armand Joulin, and Tom\u00e1s Mikolov. 2018. Efficient large-scale multi-modal classification. 
In 32nd AAAI Conference on Artificial Intelligence, (AAAI\u201918). AAAI Press, 5198\u20135204."},{"key":"e_1_3_1_19_1","first-page":"1746","volume-title":"Conference on Empirical Methods in Natural Language Processing","author":"Kim Yoon","year":"2014","unstructured":"Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Conference on Empirical Methods in Natural Language Processing. ACL, 1746\u20131751. DOI:https:\/\/doi.org\/10.3115\/v1\/d14-1181"},{"key":"e_1_3_1_20_1","first-page":"1097","article-title":"ImageNet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 (2012), 1097\u20131105.","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_3_1_21_1","doi-asserted-by":"crossref","first-page":"1038","DOI":"10.1145\/2964284.2964310","volume-title":"24th ACM International Conference on Multimedia","author":"Kumar Anurag","year":"2016","unstructured":"Anurag Kumar and Bhiksha Raj. 2016. Audio event detection using weakly labeled data. In 24th ACM International Conference on Multimedia. 1038\u20131047."},{"issue":"11","key":"e_1_3_1_22_1","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun Yann","year":"1998","unstructured":"Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278\u20132324. DOI:https:\/\/doi.org\/10.1109\/5.726791","journal-title":"Proc. 
IEEE"},{"issue":"3","key":"e_1_3_1_23_1","doi-asserted-by":"crossref","first-page":"65","DOI":"10.3390\/aerospace8030065","article-title":"Spoken instruction understanding in air traffic control: Challenge, technique, and application","volume":"8","author":"Lin Yi","year":"2021","unstructured":"Yi Lin. 2021. Spoken instruction understanding in air traffic control: Challenge, technique, and application. Aerospace 8, 3 (2021), 65.","journal-title":"Aerospace"},{"issue":"11","key":"e_1_3_1_24_1","doi-asserted-by":"crossref","first-page":"4572","DOI":"10.1109\/TITS.2019.2940992","article-title":"A real-time ATC safety monitoring framework using a deep learning approach","volume":"21","author":"Lin Yi","year":"2020","unstructured":"Yi Lin, Linjie Deng, Zhengmao Chen, Xiping Wu, Jianwei Zhang, and Bo Yang. 2020. A real-time ATC safety monitoring framework using a deep learning approach. IEEE Trans. Intell. Transp. Syst. 21, 11 (2020), 4572\u20134581. DOI:https:\/\/doi.org\/10.1109\/TITS.2019.2940992","journal-title":"IEEE Trans. Intell. Transp. Syst."},{"issue":"8","key":"e_1_3_1_25_1","doi-asserted-by":"crossref","first-page":"3608","DOI":"10.1109\/TNNLS.2020.3015830","article-title":"A unified framework for multilingual speech recognition in air traffic control systems","volume":"32","author":"Lin Yi","year":"2021","unstructured":"Yi Lin, Dongyue Guo, Jianwei Zhang, Zhengmao Chen, and Bo Yang. 2021a. A unified framework for multilingual speech recognition in air traffic control systems. IEEE Trans. Neural Netw. Learn. Syst. 32, 8 (2021), 3608\u20133620. DOI:https:\/\/doi.org\/10.1109\/TNNLS.2020.3015830","journal-title":"IEEE Trans. Neural Netw. Learn. 
Syst."},{"key":"e_1_3_1_26_1","doi-asserted-by":"crossref","DOI":"10.1016\/j.cja.2022.08.020","article-title":"Identifying and managing risks of AI-driven operations: A case study of automatic speech recognition for improving air traffic safety","author":"Lin Yi","year":"2022","unstructured":"Yi Lin, Min Ruan, Kunjie Cai, Dan Li, Ziqiang Zeng, Fan Li, and Bo Yang. 2022. Identifying and managing risks of AI-driven operations: A case study of automatic speech recognition for improving air traffic safety. Chinese J. Aeron. (2022). DOI:https:\/\/doi.org\/10.1016\/j.cja.2022.08.020","journal-title":"Chinese J. Aeron."},{"issue":"3","key":"e_1_3_1_27_1","doi-asserted-by":"crossref","first-page":"679","DOI":"10.3390\/s19030679","article-title":"Real-time controlling dynamics sensing in air traffic system","volume":"19","author":"Lin Yi","year":"2019","unstructured":"Yi Lin, Xianlong Tan, Bo Yang, Kai Yang, Jianwei Zhang, and Jing Yu. 2019. Real-time controlling dynamics sensing in air traffic system. Sensors 19, 3 (2019), 679. DOI:https:\/\/doi.org\/10.3390\/s19030679","journal-title":"Sensors"},{"key":"e_1_3_1_28_1","first-page":"1","article-title":"A deep learning framework of autonomous pilot agent for air traffic controller training","author":"Lin Yi","year":"2021","unstructured":"Yi Lin, YuanKai Wu, Dongyue Guo, Pan Zhang, Changyu Yin, Bo Yang, and Jianwei Zhang. 2021b. A deep learning framework of autonomous pilot agent for air traffic controller training. IEEE Trans. Hum.-mach. Syst. (2021), 1\u20139. DOI:https:\/\/doi.org\/10.1109\/THMS.2021.3102827","journal-title":"IEEE Trans. Hum.-mach. Syst."},{"key":"e_1_3_1_29_1","doi-asserted-by":"crossref","first-page":"107847","DOI":"10.1016\/j.asoc.2021.107847","article-title":"ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems","volume":"112","author":"Lin Yi","year":"2021","unstructured":"Yi Lin, Bo Yang, Linchao Li, Dongyue Guo, Jianwei Zhang, Hu Chen, and Yi Zhang. 
2021c. ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems. Appl. Soft Comput. 112 (2021), 107847. DOI:https:\/\/doi.org\/10.1016\/j.asoc.2021.107847","journal-title":"Appl. Soft Comput."},{"key":"e_1_3_1_30_1","first-page":"2326","volume-title":"Conference on Empirical Methods in Natural Language Processing","author":"Liu Pengfei","year":"2015","unstructured":"Pengfei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuanjing Huang. 2015. Multi-timescale long short-term memory neural network for modelling sentences and documents. In Conference on Empirical Methods in Natural Language Processing. 2326\u20132335. DOI:https:\/\/doi.org\/10.18653\/v1\/d15-1280"},{"key":"e_1_3_1_31_1","first-page":"5337","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201914)","author":"Lopez-Moreno Ignacio","year":"2014","unstructured":"Ignacio Lopez-Moreno, Javier Gonzalez-Dominguez, Oldrich Plchot, David Martinez, Joaquin Gonzalez-Rodriguez, and Pedro Moreno. 2014. Automatic language identification using deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201914). 5337\u20135341. DOI:https:\/\/doi.org\/10.1109\/ICASSP.2014.6854622"},{"key":"e_1_3_1_32_1","first-page":"1045","volume-title":"11th Annual Conference of the International Speech Communication Association","author":"Mikolov Tom\u00e1s","year":"2010","unstructured":"Tom\u00e1s Mikolov, Martin Karafi\u00e1t, Luk\u00e1s Burget, Jan Cernock\u00fd, and Sanjeev Khudanpur. 2010. Recurrent neural network-based language model. In 11th Annual Conference of the International Speech Communication Association. ISCA, 1045\u20131048."},{"key":"e_1_3_1_33_1","article-title":"Deep learning-based text classification: A comprehensive review","author":"Minaee Shervin","year":"2021","unstructured":"Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 
2021. Deep learning-based text classification: A comprehensive review. arXiv:2004.03705 [cs, stat] (Jan.2021).","journal-title":"arXiv:2004.03705 [cs, stat]"},{"key":"e_1_3_1_34_1","first-page":"1359","volume-title":"AAAI Conference on Artificial Intelligence","author":"Mittal Trisha","year":"2020","unstructured":"Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, and Dinesh Manocha. 2020. M3er: Multiplicative multimodal emotion recognition using facial, textual, and speech cues. In AAAI Conference on Artificial Intelligence. 1359\u20131367."},{"key":"e_1_3_1_35_1","volume-title":"54th Annual Meeting of the Association for Computational Linguistics","author":"Mou Lili","year":"2016","unstructured":"Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. 2016. Natural language inference by tree-based convolution and heuristic matching. In 54th Annual Meeting of the Association for Computational Linguistics. The Association for Computer Linguistics. DOI:https:\/\/doi.org\/10.18653\/v1\/p16-2022"},{"key":"e_1_3_1_36_1","first-page":"8427","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201918)","author":"Nagrani Arsha","year":"2018","unstructured":"Arsha Nagrani, Samuel Albanie, and Andrew Zisserman. 2018. Seeing voices and hearing faces: Cross-modal biometric matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR\u201918). 8427\u20138436."},{"key":"e_1_3_1_37_1","first-page":"689","volume-title":"28th International Conference on Machine Learning","author":"Ngiam Jiquan","year":"2011","unstructured":"Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. 2011. Multimodal deep learning. In 28th International Conference on Machine Learning. 
Omnipress, 689\u2013696."},{"key":"e_1_3_1_38_1","first-page":"404","volume-title":"IEEE Automatic Speech Recognition and Understanding Workshop","author":"Oualil Youssef","year":"2017","unstructured":"Youssef Oualil, Dietrich Klakow, Gy\u00f6rgy Szasz\u00e1k, Ajay Srinivasamurthy, Hartmut Helmke, and Petr Motl\u00edcek. 2017. A context-aware speech recognition and understanding system for air traffic control domain. In IEEE Automatic Speech Recognition and Understanding Workshop. IEEE, 404\u2013408. DOI:https:\/\/doi.org\/10.1109\/ASRU.2017.8268964"},{"key":"e_1_3_1_39_1","volume-title":"5th International Conference on Learning Representations","author":"Ovalle John Edison Arevalo","year":"2017","unstructured":"John Edison Arevalo Ovalle, Thamar Solorio, Manuel Montes-y-G\u00f3mez, and Fabio A. Gonz\u00e1lez. 2017. Gated multimodal units for information fusion. In 5th International Conference on Learning Representations. OpenReview.net. Retrieved from https:\/\/openreview.net\/forum?id=S12_nquOe."},{"issue":"4","key":"e_1_3_1_40_1","doi-asserted-by":"crossref","first-page":"2709","DOI":"10.1109\/TAES.2011.6034660","article-title":"Automatic understanding of ATC speech: Study of prospectives and field experiments for several controller positions","volume":"47","author":"Pardo Jos\u00e9 Manuel","year":"2011","unstructured":"Jos\u00e9 Manuel Pardo, Javier Ferreiros, Fernando Fern\u00e1ndez Mart\u00ednez, Valent\u00edn Sama Rojo, Ricardo de C\u00f3rdoba, Javier Mac\u00edas Guarasa, Juan Manuel Montero, Rub\u00e9n San-Segundo-Hern\u00e1ndez, Luis Fernando D\u2019Haro, and Germ\u00e1n Gonz\u00e1lez. 2011. Automatic understanding of ATC speech: Study of prospectives and field experiments for several controller positions. IEEE Trans. Aerosp. Electron. Syst. 47, 4 (2011), 2709\u20132730. DOI:https:\/\/doi.org\/10.1109\/TAES.2011.6034660","journal-title":"IEEE Trans. Aerosp. Electron. 
Syst."},{"key":"e_1_3_1_41_1","unstructured":"Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Retrieved from https:\/\/s3-us-west-2.amazonaws.com\/openai-assets\/research-covers\/language-unsupervised\/language_understanding_paper.pdf."},{"key":"e_1_3_1_42_1","first-page":"1021","volume-title":"IEEE Spoken Language Technology Workshop (SLT\u201918)","author":"Ravanelli Mirco","year":"2018","unstructured":"Mirco Ravanelli and Yoshua Bengio. 2018a. Speaker recognition from raw waveform with SincNet. In IEEE Spoken Language Technology Workshop (SLT\u201918). IEEE, 1021\u20131028. DOI:https:\/\/doi.org\/10.1109\/SLT.2018.8639585"},{"key":"e_1_3_1_43_1","article-title":"Speech and speaker recognition from raw waveform with SincNet","volume":"1812","author":"Ravanelli Mirco","year":"2018","unstructured":"Mirco Ravanelli and Yoshua Bengio. 2018b. Speech and speaker recognition from raw waveform with SincNet. CoRR abs\/1812.05920 (2018).","journal-title":"CoRR"},{"key":"e_1_3_1_44_1","article-title":"Very deep convolutional networks for large-scale image recognition","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).","journal-title":"arXiv preprint arXiv:1409.1556"},{"issue":"3","key":"e_1_3_1_45_1","doi-asserted-by":"crossref","first-page":"449","DOI":"10.1007\/s10579-019-09449-5","article-title":"Air traffic control communication (ATCC) speech corpora and their use for ASR and TTS development","volume":"53","author":"Sm\u00eddl Lubos","year":"2019","unstructured":"Lubos Sm\u00eddl, Jan Svec, Daniel Tihelka, Jindrich Matousek, Jan Romportl, and Pavel Ircing. 2019. Air traffic control communication (ATCC) speech corpora and their use for ASR and TTS development. Lang. Resour. Eval. 53, 3 (2019), 449\u2013464. 
DOI:https:\/\/doi.org\/10.1007\/s10579-019-09449-5","journal-title":"Lang. Resour. Eval."},{"key":"e_1_3_1_46_1","first-page":"5329","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201918)","author":"Snyder David","year":"2018","unstructured":"David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. 2018. X-Vectors: Robust DNN embeddings for speaker recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201918). IEEE, 5329\u20135333. DOI:https:\/\/doi.org\/10.1109\/ICASSP.2018.8461375"},{"key":"e_1_3_1_47_1","series-title":"Chinese Computational Linguistics - 18th China National Conference, CCL 2019","first-page":"194","volume":"11856","author":"Sun Chi","year":"2019","unstructured":"Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In Chinese Computational Linguistics - 18th China National Conference, CCL 2019(Lecture Notes in Computer Science, Vol. 11856). Springer, 194\u2013206. DOI:https:\/\/doi.org\/10.1007\/978-3-030-32381-3_16"},{"key":"e_1_3_1_48_1","first-page":"1556","volume-title":"53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing","author":"Tai Kai Sheng","year":"2015","unstructured":"Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing. 1556\u20131566. 
DOI:https:\/\/doi.org\/10.3115\/v1\/p15-1150"},{"key":"e_1_3_1_49_1","first-page":"5998","volume-title":"International Conference on Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In International Conference on Advances in Neural Information Processing Systems. 5998\u20136008."},{"key":"e_1_3_1_50_1","first-page":"167","volume-title":"22nd ACM International Conference on Multimedia","author":"Wu Zuxuan","year":"2014","unstructured":"Zuxuan Wu, Yu-Gang Jiang, Jun Wang, Jian Pu, and Xiangyang Xue. 2014. Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In 22nd ACM International Conference on Multimedia. 167\u2013176."},{"key":"e_1_3_1_51_1","first-page":"121","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201918)","author":"Xu Yong","year":"2018","unstructured":"Yong Xu, Qiuqiang Kong, Wenwu Wang, and Mark D. Plumbley. 2018. Large-scale weakly supervised audio classification using gated convolutional neural network. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201918). 121\u2013125. DOI:https:\/\/doi.org\/10.1109\/ICASSP.2018.8461975"},{"key":"e_1_3_1_52_1","first-page":"399","volume-title":"Annual Conference of the International Speech Communication Association","author":"Yang Bo","year":"2020","unstructured":"Bo Yang, Xianlong Tan, Zhengmao Chen, Bing Wang, Min Ruan, Dan Li, Zhongping Yang, Xiping Wu, and Yi Lin. 2020. ATCSpeech: A multilingual pilot-controller speech corpus from real air traffic control environment. In Annual Conference of the International Speech Communication Association. ISCA, 399\u2013403. 
DOI:https:\/\/doi.org\/10.21437\/Interspeech.2020-1020"},{"issue":"3","key":"e_1_3_1_53_1","doi-asserted-by":"crossref","first-page":"3705","DOI":"10.1007\/s11042-017-5539-3","article-title":"Spectrogram-based multi-task audio classification","volume":"78","author":"Zeng Yuni","year":"2019","unstructured":"Yuni Zeng, Hua Mao, Dezhong Peng, and Zhang Yi. 2019. Spectrogram-based multi-task audio classification. Multim. Tools Appl. 78, 3 (2019), 3705\u20133722. DOI:https:\/\/doi.org\/10.1007\/s11042-017-5539-3","journal-title":"Multim. Tools Appl."},{"key":"e_1_3_1_54_1","first-page":"649","volume-title":"International Conference on Advances in Neural Information Processing Systems","author":"Zhang Xiang","year":"2015","unstructured":"Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In International Conference on Advances in Neural Information Processing Systems. 649\u2013657."},{"key":"e_1_3_1_55_1","volume-title":"54th Annual Meeting of the Association for Computational Linguistics","author":"Zhou Peng","year":"2016","unstructured":"Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In 54th Annual Meeting of the Association for Computational Linguistics. The Association for Computer Linguistics. DOI:https:\/\/doi.org\/10.18653\/v1\/p16-2034"},{"key":"e_1_3_1_56_1","first-page":"6565","volume-title":"IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201919)","author":"Zhou Pan","year":"2019","unstructured":"Pan Zhou, Wenwen Yang, Wei Chen, Yanfeng Wang, and Jia Jia. 2019. Modality attention for end-to-end audio-visual speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP\u201919). 6565\u20136569. 
DOI:https:\/\/doi.org\/10.1109\/ICASSP.2019.8683733"},{"key":"e_1_3_1_57_1","first-page":"2297","volume-title":"21st Annual Conference of the International Speech Communication Association","author":"Zuluaga-Gomez Juan","year":"2020","unstructured":"Juan Zuluaga-Gomez, Petr Motl\u00edcek, Qingran Zhan, Karel Vesel\u00fd, and Rudolf A. Braun. 2020. Automatic speech recognition benchmark for air-traffic communications. In 21st Annual Conference of the International Speech Communication Association. ISCA, 2297\u20132301. DOI:https:\/\/doi.org\/10.21437\/Interspeech.2020-2173"}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3572792","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3572792","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:51:08Z","timestamp":1750182668000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3572792"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3,24]]},"references-count":56,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,4,30]]}},"alternative-id":["10.1145\/3572792"],"URL":"https:\/\/doi.org\/10.1145\/3572792","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"value":"2375-4699","type":"print"},{"value":"2375-4702","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,3,24]]},"assertion":[{"value":"2021-09-21","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2022-11-18","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2023-03-24","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}