{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T21:25:58Z","timestamp":1761427558258,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":43,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,17]],"date-time":"2021-10-17T00:00:00Z","timestamp":1634428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Natural Science Foundation of China","award":["61976196"],"award-info":[{"award-number":["61976196"]}]},{"DOI":"10.13039\/501100002341","name":"Academy of Finland","doi-asserted-by":"publisher","award":["331883"],"award-info":[{"award-number":["331883"]}],"id":[{"id":"10.13039\/501100002341","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Outstanding Talents of \u201cTen Thousand Talents Plan\u201d","award":["2018R51001"],"award-info":[{"award-number":["2018R51001"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,17]]},"DOI":"10.1145\/3474085.3475415","type":"proceedings-article","created":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T06:35:51Z","timestamp":1634538951000},"page":"2456-2464","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Cross-modal Self-Supervised Learning for Lip Reading: When Contrastive Learning meets Adversarial Training"],"prefix":"10.1145","author":[{"given":"Changchong","family":"Sheng","sequence":"first","affiliation":[{"name":"University of Oulu, Oulu, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Matti","family":"Pietik\u00e4inen","sequence":"additional","affiliation":[{"name":"University of Oulu, Oulu, 
Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Qi","family":"Tian","sequence":"additional","affiliation":[{"name":"Xidian University, Xi'an, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Li","family":"Liu","sequence":"additional","affiliation":[{"name":"University of Oulu, Oulu, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2021,10,17]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Andrew Senior, Oriol Vinyals, and Andrew Zisserman.","author":"Afouras Triantafyllos","year":"2018","unstructured":"Triantafyllos Afouras , Joon Son Chung , Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018 . Deep audio-visual speech recognition. IEEE transactions on pattern analysis and machine intelligence (2018). Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018. Deep audio-visual speech recognition. IEEE transactions on pattern analysis and machine intelligence (2018)."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9054253"},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01246-5_27"},{"key":"e_1_3_2_1_4_1","volume-title":"Lipnet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599","author":"Assael Yannis M","year":"2016","unstructured":"Yannis M Assael , Brendan Shillingford , Shimon Whiteson , and Nando De Freitas . 2016 . Lipnet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599 (2016). Yannis M Assael, Brendan Shillingford, Shimon Whiteson, and Nando De Freitas. 2016. Lipnet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599 (2016)."},{"key":"e_1_3_2_1_5_1","volume-title":"Neural machine translation by jointly learning to align and translate. 
arXiv preprint arXiv:1409.0473","author":"Bahdanau Dzmitry","year":"2014","unstructured":"Dzmitry Bahdanau , Kyunghyun Cho , and Yoshua Bengio . 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 ( 2014 ). Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)."},{"key":"e_1_3_2_1_6_1","volume-title":"Thinking the voice: neural correlates of voice perception. Trends in cognitive sciences","author":"Belin Pascal","year":"2004","unstructured":"Pascal Belin , Shirley Fecteau , and Catherine Bedard . 2004. Thinking the voice: neural correlates of voice perception. Trends in cognitive sciences , Vol. 8 , 3 ( 2004 ), 129--135. Pascal Belin, Shirley Fecteau, and Catherine Bedard. 2004. Thinking the voice: neural correlates of voice perception. Trends in cognitive sciences, Vol. 8, 3 (2004), 129--135."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.116"},{"key":"e_1_3_2_1_8_1","volume-title":"Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531","author":"Chatfield Ken","year":"2014","unstructured":"Ken Chatfield , Karen Simonyan , Andrea Vedaldi , and Andrew Zisserman . 2014. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531 ( 2014 ). Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531 (2014)."},{"key":"e_1_3_2_1_9_1","volume-title":"2020 a. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709","author":"Chen Ting","year":"2020","unstructured":"Ting Chen , Simon Kornblith , Mohammad Norouzi , and Geoffrey Hinton . 2020 a. A simple framework for contrastive learning of visual representations. 
arXiv preprint arXiv:2002.05709 ( 2020 ). Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020 a. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709 (2020)."},{"key":"e_1_3_2_1_10_1","volume-title":"2020 b. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029","author":"Chen Ting","year":"2020","unstructured":"Ting Chen , Simon Kornblith , Kevin Swersky , Mohammad Norouzi , and Geoffrey Hinton . 2020 b. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029 ( 2020 ). Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. 2020 b. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029 (2020)."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.367"},{"key":"e_1_3_2_1_12_1","volume-title":"Asian Conference on Computer Vision. Springer, 87--103","author":"Chung Joon Son","year":"2016","unstructured":"Joon Son Chung and Andrew Zisserman . 2016 a. Lip reading in the wild . In Asian Conference on Computer Vision. Springer, 87--103 . Joon Son Chung and Andrew Zisserman. 2016a. Lip reading in the wild. In Asian Conference on Computer Vision. Springer, 87--103."},{"key":"e_1_3_2_1_13_1","volume-title":"Asian conference on computer vision. Springer, 251--263","author":"Chung Joon Son","year":"2016","unstructured":"Joon Son Chung and Andrew Zisserman . 2016 b. Out of time: automated lip sync in the wild . In Asian conference on computer vision. Springer, 251--263 . Joon Son Chung and Andrew Zisserman. 2016b. Out of time: automated lip sync in the wild. In Asian conference on computer vision. Springer, 251--263."},{"volume-title":"Lip Reading in Profile. In British Machine Vision Conference.","author":"Chung J. S.","key":"e_1_3_2_1_14_1","unstructured":"J. S. Chung and A. Zisserman . 2017 . Lip Reading in Profile. 
In British Machine Vision Conference. J. S. Chung and A. Zisserman. 2017. Lip Reading in Profile. In British Machine Vision Conference."},{"key":"e_1_3_2_1_15_1","volume-title":"Joon Son Chung, and Hong Goo Kang. 2020 a. Perfect Match: Self-Supervised Embeddings for Cross-modal Retrieval","author":"Chung Soo-Whan","year":"2020","unstructured":"Soo-Whan Chung , Joon Son Chung, and Hong Goo Kang. 2020 a. Perfect Match: Self-Supervised Embeddings for Cross-modal Retrieval . IEEE Journal of Selected Topics in Signal Processing ( 2020 ). Soo-Whan Chung, Joon Son Chung, and Hong Goo Kang. 2020 a. Perfect Match: Self-Supervised Embeddings for Cross-modal Retrieval. IEEE Journal of Selected Topics in Signal Processing (2020)."},{"key":"e_1_3_2_1_16_1","volume-title":"Hong Goo Kang, and Joon Son Chung. 2020 b. Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision. arXiv preprint arXiv:2004.14326","author":"Chung Soo-Whan","year":"2020","unstructured":"Soo-Whan Chung , Hong Goo Kang, and Joon Son Chung. 2020 b. Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision. arXiv preprint arXiv:2004.14326 ( 2020 ). Soo-Whan Chung, Hong Goo Kang, and Joon Son Chung. 2020 b. Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision. arXiv preprint arXiv:2004.14326 (2020)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_1_18_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. 
Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.5555\/2946645.2946704"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/1143844.1143891"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2013.6638947"},{"key":"e_1_3_2_1_22_1","volume-title":"Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 297--304","author":"Gutmann Michael","year":"2010","unstructured":"Michael Gutmann and Aapo Hyv\u00e4rinen. 2010 . Noise-contrastive estimation: A new estimation principle for unnormalized statistical models . In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 297--304 . Michael Gutmann and Aapo Hyv\u00e4rinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 297--304."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2006.100"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00975"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_26_1","volume-title":"Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord.","author":"H\u00e9naff Olivier J","year":"2019","unstructured":"Olivier J H\u00e9naff , Aravind Srinivas , Jeffrey De Fauw , Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. 2019 . Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272 (2019). Olivier J H\u00e9naff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord. 2019. Data-efficient image recognition with contrastive predictive coding. 
arXiv preprint arXiv:1905.09272 (2019)."},{"key":"e_1_3_2_1_27_1","volume-title":"Adversarial Self-Supervised Contrastive Learning. In Thirty-fourth Conference on Neural Information Processing Systems, NeurIPS","author":"Kim Minseon","year":"2020","unstructured":"Minseon Kim , Jihoon Tack , and Sung Ju Hwang . 2020 . Adversarial Self-Supervised Contrastive Learning. In Thirty-fourth Conference on Neural Information Processing Systems, NeurIPS 2020. NeurIPS. Minseon Kim, Jihoon Tack, and Sung Ju Hwang. 2020. Adversarial Self-Supervised Contrastive Learning. In Thirty-fourth Conference on Neural Information Processing Systems, NeurIPS 2020. NeurIPS."},{"key":"e_1_3_2_1_28_1","unstructured":"Bruno Korbar Du Tran and Lorenzo Torresani. 2018. Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems. 7763--7774.  Bruno Korbar Du Tran and Lorenzo Torresani. 2018. Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems. 7763--7774."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-019-01247-4"},{"key":"e_1_3_2_1_30_1","volume-title":"Nature","volume":"264","author":"McGurk Harry","year":"1976","unstructured":"Harry McGurk and John MacDonald . 1976 . Hearing lips and seeing voices . Nature , Vol. 264 , 5588 (1976), 746--748. Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature, Vol. 264, 5588 (1976), 746--748."},{"key":"e_1_3_2_1_31_1","volume-title":"Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748","author":"van den Oord Aaron","year":"2018","unstructured":"Aaron van den Oord , Yazhe Li , and Oriol Vinyals . 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 ( 2018 ). Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. 
Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01231-1_39"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01105"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461326"},{"key":"e_1_3_2_1_35_1","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.  Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training."},{"key":"e_1_3_2_1_36_1","volume-title":"Language models are unsupervised multitask learners. OpenAI blog","author":"Radford Alec","year":"2019","unstructured":"Alec Radford , Jeffrey Wu , Rewon Child , David Luan , Dario Amodei , and Ilya Sutskever . 2019. Language models are unsupervised multitask learners. OpenAI blog , Vol. 1 , 8 ( 2019 ), 9. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, Vol. 1, 8 (2019), 9."},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00458"},{"key":"e_1_3_2_1_38_1","volume-title":"Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv:1703.04105","author":"Stafylakis Themos","year":"2017","unstructured":"Themos Stafylakis and Georgios Tzimiropoulos . 2017. Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv:1703.04105 ( 2017 ). Themos Stafylakis and Georgios Tzimiropoulos. 2017. Combining residual networks with LSTMs for lipreading. 
arXiv preprint arXiv:1703.04105 (2017)."},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.5555\/2969033.2969173"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00393"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00080"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33019299"}],"event":{"name":"MM '21: ACM Multimedia Conference","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Virtual Event China","acronym":"MM '21"},"container-title":["Proceedings of the 29th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475415","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3474085.3475415","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:32Z","timestamp":1750193312000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3474085.3475415"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,17]]},"references-count":43,"alternative-id":["10.1145\/3474085.3475415","10.1145\/3474085"],"URL":"https:\/\/doi.org\/10.1145\/3474085.3475415","relation":{},"subject":[],"published":{"date-parts":[[2021,10,17]]},"assertion":[{"value":"2021-10-17","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}