{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:16:10Z","timestamp":1750220170104,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":39,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,12,1]],"date-time":"2022-12-01T00:00:00Z","timestamp":1669852800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Research is funded by Viettel Cyberspace Center - Viettel Group"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,12]]},"DOI":"10.1145\/3568562.3568645","type":"proceedings-article","created":{"date-parts":[[2022,11,29]],"date-time":"2022-11-29T00:25:01Z","timestamp":1669681501000},"page":"270-275","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Improving Self-supervised Audio Representation based on Contrastive Learning with Conformer Encoder"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2575-5701","authenticated-orcid":false,"given":"Quang Tien","family":"Duong","sequence":"first","affiliation":[{"name":"vlsp, VTCC, Viet Nam"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0215-687X","authenticated-orcid":false,"given":"Duc Huy","family":"Nguyen","sequence":"additional","affiliation":[{"name":"VTCC, Viet Nam"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0167-1263","authenticated-orcid":false,"given":"Bao Thang","family":"Ta","sequence":"additional","affiliation":[{"name":"VTCC, Viet Nam"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2498-9314","authenticated-orcid":false,"given":"Nhat Minh","family":"Le","sequence":"additional","affiliation":[{"name":"VTCC, Viet Nam"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9554-5171","authenticated-orcid":false,"given":"Van 
Hai","family":"Do","sequence":"additional","affiliation":[{"name":"VTCC, Viet Nam"}]}],"member":"320","published-online":{"date-parts":[[2022,12]]},"reference":[
{"key":"e_1_3_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Moustafa Alzantot, Ziqi Wang, and Mani\u00a0B Srivastava. 2019. Deep residual neural networks for audio spoofing detection. arXiv preprint arXiv:1907.00501 (2019).","DOI":"10.21437\/Interspeech.2019-3174"},
{"key":"e_1_3_2_1_2_1","first-page":"12449","article-title":"wav2vec 2.0: A framework for self-supervised learning of speech representations","volume":"33","author":"Baevski Alexei","year":"2020","unstructured":"Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33 (2020), 12449\u201312460.","journal-title":"Advances in Neural Information Processing Systems"},
{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-70136-3_93"},
{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2017.2690575"},
{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9414337"},
{"key":"e_1_3_2_1_6_1","volume-title":"International conference on machine learning. PMLR, 1597\u20131607","author":"Chen Ting","year":"2020","unstructured":"Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597\u20131607."},
{"key":"e_1_3_2_1_7_1","volume-title":"FMA: A Dataset for Music Analysis. In 18th International Society for Music Information Retrieval Conference (ISMIR). arxiv:1612","author":"Defferrard Micha\u00ebl","year":"2017","unstructured":"Micha\u00ebl Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. 2017. FMA: A Dataset for Music Analysis. In 18th International Society for Music Information Retrieval Conference (ISMIR). arxiv:1612.01840 https:\/\/arxiv.org\/abs\/1612.01840"},
{"key":"e_1_3_2_1_8_1","first-page":"571","article-title":"AST: Audio Spectrogram Transformer","volume":"2021","author":"Gong Yuan","year":"2021","unstructured":"Yuan Gong, Yu-An Chung, and James Glass. 2021. AST: Audio Spectrogram Transformer. In Proc. Interspeech 2021. 571\u2013575. https:\/\/doi.org\/10.21437\/Interspeech.2021-698","journal-title":"Proc. Interspeech"},
{"key":"e_1_3_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i10.21315"},
{"key":"e_1_3_2_1_10_1","first-page":"5036","article-title":"Conformer: Convolution-augmented Transformer for Speech Recognition","volume":"2020","author":"Gulati Anmol","year":"2020","unstructured":"Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech 2020. 5036\u20135040. https:\/\/doi.org\/10.21437\/Interspeech.2020-3015","journal-title":"Proc. Interspeech"},
{"key":"e_1_3_2_1_11_1","volume-title":"CNN Architectures for Large-Scale Audio Classification. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). https:\/\/arxiv.org\/abs\/1609","author":"Hershey Shawn","year":"2017","unstructured":"Shawn Hershey, Sourish Chaudhuri, Daniel P.\u00a0W. Ellis, Jort\u00a0F. Gemmeke, Aren Jansen, Channing Moore, Manoj Plakal, Devin Platt, Rif\u00a0A. Saurous, Bryan Seybold, Malcolm Slaney, Ron Weiss, and Kevin Wilson. 2017. CNN Architectures for Large-Scale Audio Classification. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). https:\/\/arxiv.org\/abs\/1609.09430"},
{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2021.3122291"},
{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/TBDATA.2019.2921572"},
{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICBME49163.2019.9030421"},
{"key":"e_1_3_2_1_15_1","first-page":"18661","article-title":"Supervised contrastive learning","volume":"33","author":"Khosla Prannay","year":"2020","unstructured":"Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. Advances in Neural Information Processing Systems 33 (2020), 18661\u201318673.","journal-title":"Advances in Neural Information Processing Systems"},
{"key":"e_1_3_2_1_16_1","doi-asserted-by":"crossref","unstructured":"Khaled Koutini, Jan Schl\u00fcter, Hamid Eghbal-zadeh, and Gerhard Widmer. 2021. Efficient training of audio transformers with patchout. arXiv preprint arXiv:2110.05069 (2021).","DOI":"10.21437\/Interspeech.2022-227"},
{"key":"e_1_3_2_1_17_1","unstructured":"Alex Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. CoRR abs\/1404.5997 (2014). arxiv:1404.5997 http:\/\/arxiv.org\/abs\/1404.5997"},
{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU46091.2019.9003906"},
{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2021.3095662"},
{"key":"e_1_3_2_1_20_1","volume-title":"Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983(2016).","author":"Loshchilov Ilya","year":"2016","unstructured":"Ilya Loshchilov and Frank Hutter. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)."},
{"key":"e_1_3_2_1_21_1","doi-asserted-by":"crossref","unstructured":"Danqing Luo, Yuexian Zou, and Dongyan Huang. 2018. Investigation on Joint Representation Learning for Robust Feature Extraction in Speech Emotion Recognition. In Interspeech. 152\u2013156.","DOI":"10.21437\/Interspeech.2018-1832"},
{"key":"e_1_3_2_1_22_1","unstructured":"Koichi Miyazaki, Tatsuya Komatsu, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, and Kazuya Takeda. 2020. Conformer-based sound event detection with semi-supervised learning and data augmentation. dim 1(2020) 4."},
{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN52387.2021.9534474"},
{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-993"},
{"key":"e_1_3_2_1_25_1","unstructured":"Kamalesh Palanisamy, Dipika Singhania, and Angela Yao. 2020. Rethinking CNN models for audio classification. arXiv preprint arXiv:2007.11154 (2020)."},
{"key":"e_1_3_2_1_26_1","unstructured":"Yanzhen Ren, Dengkai Liu, Qiaochu Xiong, Jianming Fu, and Lina Wang. 2019. Spec-resnet: a general audio steganalysis scheme based on deep residual network of spectrogram. arXiv preprint arXiv:1901.06838 (2019)."},
{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP39728.2021.9413528"},
{"key":"e_1_3_2_1_28_1","volume-title":"An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition","author":"Shi Baoguang","year":"2016","unstructured":"Baoguang Shi, Xiang Bai, and Cong Yao. 2016. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence 39, 11 (2016), 2298\u20132304."},
{"key":"e_1_3_2_1_29_1","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},
{"key":"e_1_3_2_1_30_1","volume":"38","author":"Son Dang\u00a0Dinh","year":"2022","unstructured":"Dang\u00a0Dinh Son, Dang\u00a0Xuan Vuong, Duong\u00a0Quang Tien, Ta\u00a0Bao Thang, 2022. ASR-VLSP 2021: Conformer with Gradient Mask and Stochastic Weight Averaging for Vietnamese Automatic Speech Recognition. VNU Journal of Science: Computer Science and Communication Engineering 38, 1 (2022).","journal-title":"VNU Journal of Science: Computer Science and Communication Engineering"},
{"key":"e_1_3_2_1_31_1","unstructured":"Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the Inception Architecture for Computer Vision. CoRR abs\/1512.00567 (2015). arxiv:1512.00567 http:\/\/arxiv.org\/abs\/1512.00567"},
{"key":"e_1_3_2_1_32_1","volume":"38","author":"Thang Ta\u00a0Bao","year":"2022","unstructured":"Ta\u00a0Bao Thang, Dang\u00a0Dinh Son, Dang\u00a0Xuan Vuong, Duong\u00a0Quang Tien, 2022. ASR-VLSP 2021: Automatic Speech Recognition with Blank Label Re-weighting. VNU Journal of Science: Computer Science and Communication Engineering 38, 1 (2022).","journal-title":"VNU Journal of Science: Computer Science and Communication Engineering"},
{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00828"},
{"key":"e_1_3_2_1_34_1","unstructured":"Xuenan Xu, Heinrich Dinkel, Mengyue Wu, and Kai Yu. 2020. A CRNN-GRU Based Reinforcement Learning Approach to Audio Captioning. In DCASE. 225\u2013229."},
{"key":"e_1_3_2_1_35_1","first-page":"5812","article-title":"Graph contrastive learning with augmentations","volume":"33","author":"You Yuning","year":"2020","unstructured":"Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems 33 (2020), 5812\u20135823.","journal-title":"Advances in Neural Information Processing Systems"},
{"key":"e_1_3_2_1_36_1","unstructured":"Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2019. Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962 (2019)."},
{"key":"e_1_3_2_1_37_1","doi-asserted-by":"crossref","unstructured":"Yang Zhang, Zhiqiang Lv, Haibin Wu, Shanshan Zhang, Pengfei Hu, Zhiyong Wu, Hung-yi Lee, and Helen Meng. 2022. MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification. arXiv preprint arXiv:2203.15249 (2022).","DOI":"10.21437\/Interspeech.2022-563"},
{"key":"e_1_3_2_1_38_1","unstructured":"Yu Zhang, James Qin, Daniel\u00a0S Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc\u00a0V Le, and Yonghui Wu. 2020. Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504 (2020)."},
{"key":"e_1_3_2_1_39_1","volume-title":"Twenty-Fourth International Joint Conference on Artificial Intelligence.","author":"Zhuang Fuzhen","year":"2015","unstructured":"Fuzhen Zhuang, Xiaohu Cheng, Ping Luo, Sinno\u00a0Jialin Pan, and Qing He. 2015. Supervised representation learning: Transfer learning with deep autoencoders. In Twenty-Fourth International Joint Conference on Artificial Intelligence."}
],"event":{"name":"SoICT 2022: The 11th International Symposium on Information and Communication Technology","acronym":"SoICT 2022","location":"Hanoi Vietnam"},"container-title":["The 11th International Symposium on Information and Communication Technology"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3568562.3568645","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3568562.3568645","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:40Z","timestamp":1750186840000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3568562.3568645"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,12]]},"references-count":39,"alternative-id":["10.1145\/3568562.3568645","10.1145\/3568562"],"URL":"https:\/\/doi.org\/10.1145\/3568562.3568645","relation":{},"subject":[],"published":{"date-parts":[[2022,12]]},"assertion":[{"value":"2022-12-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}