{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,3]],"date-time":"2026-06-03T18:21:17Z","timestamp":1780510877253,"version":"3.54.1"},"publisher-location":"New York, NY, USA","reference-count":29,"publisher":"ACM","license":[{"start":{"date-parts":[[2020,10,12]],"date-time":"2020-10-12T00:00:00Z","timestamp":1602460800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2020,10,12]]},"DOI":"10.1145\/3394171.3413532","type":"proceedings-article","created":{"date-parts":[[2020,10,12]],"date-time":"2020-10-12T13:10:18Z","timestamp":1602508218000},"page":"484-492","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":776,"title":["A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild"],"prefix":"10.1145","author":[{"given":"K R","family":"Prajwal","sequence":"first","affiliation":[{"name":"International Institute of Information Technology, Hyderabad, Hyderabad, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Rudrabha","family":"Mukhopadhyay","sequence":"additional","affiliation":[{"name":"International Institute of Information Technology, Hyderabad, Hyderabad, India"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Vinay P.","family":"Namboodiri","sequence":"additional","affiliation":[{"name":"University of Bath, Bath, United Kingdom"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"C.V.","family":"Jawahar","sequence":"additional","affiliation":[{"name":"International Institute of Information Technology, Hyderabad, Hyderabad, 
India"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2020,10,12]]},"reference":[{"key":"e_1_3_2_2_1_1","unstructured":"T. Afouras J. S. Chung A. Senior O. Vinyals and A. Zisserman. 2018c. Deep Audio-Visual Speech Recognition. In arXiv:1809.02108.  T. Afouras J. S. Chung A. Senior O. Vinyals and A. Zisserman. 2018c. Deep Audio-Visual Speech Recognition. In arXiv:1809.02108."},{"key":"e_1_3_2_2_2_1","volume-title":"The Conversation: Deep Audio-Visual Speech Enhancement. In INTERSPEECH.","author":"Afouras T.","year":"2018","unstructured":"T. Afouras , J. S. Chung , and A. Zisserman . 2018 a. The Conversation: Deep Audio-Visual Speech Enhancement. In INTERSPEECH. T. Afouras, J. S. Chung, and A. Zisserman. 2018a. The Conversation: Deep Audio-Visual Speech Enhancement. In INTERSPEECH."},{"key":"e_1_3_2_2_3_1","volume-title":"Joon Son Chung, and Andrew Zisserman","author":"Afouras Triantafyllos","year":"2018","unstructured":"Triantafyllos Afouras , Joon Son Chung, and Andrew Zisserman . 2018 b. LRS 3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496 (2018). Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. 2018b. LRS3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496 (2018)."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_32"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00802"},{"key":"e_1_3_2_2_6_1","volume-title":"IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops.","author":"Chen Lele","year":"2019","unstructured":"Lele Chen , Haitian Zheng , Ross K Maddox , Zhiyao Duan , and Chenliang Xu . 2019 b. Sound to Visual: Hierarchical Cross-Modal Talking Face Video Generation . In IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops. 
Lele Chen, Haitian Zheng, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2019 b. Sound to Visual: Hierarchical Cross-Modal Talking Face Video Generation. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops."},{"key":"e_1_3_2_2_7_1","volume-title":"You said that? arXiv preprint arXiv:1705.02966","author":"Chung Joon Son","year":"2017","unstructured":"Joon Son Chung , Amir Jamaludin , and Andrew Zisserman . 2017. You said that? arXiv preprint arXiv:1705.02966 ( 2017 ). Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. 2017. You said that? arXiv preprint arXiv:1705.02966 (2017)."},{"key":"e_1_3_2_2_8_1","volume-title":"Asian Conference on Computer Vision. Springer, 87--103","author":"Chung Joon Son","year":"2016","unstructured":"Joon Son Chung and Andrew Zisserman . 2016 a. Lip reading in the wild . In Asian Conference on Computer Vision. Springer, 87--103 . Joon Son Chung and Andrew Zisserman. 2016a. Lip reading in the wild. In Asian Conference on Computer Vision. Springer, 87--103."},{"key":"e_1_3_2_2_9_1","volume-title":"Workshop on Multi-view Lip-reading, ACCV.","author":"Chung Joon Son","year":"2016","unstructured":"Joon Son Chung and Andrew Zisserman . 2016 b. Out of time: automated lip sync in the wild . In Workshop on Multi-view Lip-reading, ACCV. Joon Son Chung and Andrew Zisserman. 2016b. Out of time: automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV."},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1121\/1.2229005"},{"key":"e_1_3_2_2_11_1","volume-title":"The DeepFake Detection Challenge Dataset. arxiv","author":"Dolhansky Brian","year":"2020","unstructured":"Brian Dolhansky , Joanna Bitton , Ben Pflaum , Jikuo Lu , Russ Howes , Menglin Wang , and Cristian Canton Ferrer . 2020. The DeepFake Detection Challenge Dataset. arxiv : 2006 .07397 [cs.CV] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. 2020. 
The DeepFake Detection Challenge Dataset. arxiv: 2006.07397 [cs.CV]"},{"key":"e_1_3_2_2_12_1","article-title":"Adaptive subgradient methods for online learning and stochastic optimization","volume":"12","author":"Duchi John","year":"2011","unstructured":"John Duchi , Elad Hazan , and Yoram Singer . 2011 . Adaptive subgradient methods for online learning and stochastic optimization . Journal of machine learning research , Vol. 12 , 7 (2011). John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, Vol. 12, 7 (2011).","journal-title":"Journal of machine learning research"},{"key":"e_1_3_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3306346.3323028"},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2015.2407694"},{"key":"e_1_3_2_2_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.3390\/app10010370"},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-019-01150-y"},{"key":"e_1_3_2_2_18_1","volume-title":"Proceedings of the 27th ACM International Conference on Multimedia. ACM, 1428--1436","author":"Rudrabha Mukhopadhyay Prajwal KR","year":"2019","unstructured":"Prajwal KR , Rudrabha Mukhopadhyay , Jerin Philip , Abhishek Jha , Vinay Namboodiri , and CV Jawahar . 2019 . Towards Automatic Face-to-Face Translation . In Proceedings of the 27th ACM International Conference on Multimedia. ACM, 1428--1436 . Prajwal KR, Rudrabha Mukhopadhyay, Jerin Philip, Abhishek Jha, Vinay Namboodiri, and CV Jawahar. 2019. Towards Automatic Face-to-Face Translation. In Proceedings of the 27th ACM International Conference on Multimedia. ACM, 1428--1436."},{"key":"e_1_3_2_2_19_1","volume-title":"Obamanet: Photo-realistic lip-sync from text. 
arXiv preprint arXiv:1801.01442","author":"Kumar Rithesh","year":"2017","unstructured":"Rithesh Kumar , Jose Sotelo , Kundan Kumar , Alexandre de Br\u00e9bisson , and Yoshua Bengio . 2017 . Obamanet: Photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442 (2017). Rithesh Kumar, Jose Sotelo, Kundan Kumar, Alexandre de Br\u00e9bisson, and Yoshua Bengio. 2017. Obamanet: Photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442 (2017)."},{"key":"e_1_3_2_2_20_1","volume-title":"Proc. icml","volume":"30","author":"Maas Andrew L","year":"2013","unstructured":"Andrew L Maas , Awni Y Hannun , and Andrew Y Ng . 2013 . Rectifier nonlinearities improve neural network acoustic models . In Proc. icml , Vol. 30 . 3. Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, Vol. 30. 3."},{"key":"e_1_3_2_2_21_1","unstructured":"NPD. 2016. 52 Percent of Millennial Smartphone Owners Use their Device for Video Calling According to The NPD Group. https:\/\/www.npd.com\/wps\/portal\/npd\/us\/news\/press-releases\/2016\/52-percent-of-millennial-smartphone-owners-use-their-device-for-video-calling-according-to-the-npd-group\/  NPD. 2016. 52 Percent of Millennial Smartphone Owners Use their Device for Video Calling According to The NPD Group. https:\/\/www.npd.com\/wps\/portal\/npd\/us\/news\/press-releases\/2016\/52-percent-of-millennial-smartphone-owners-use-their-device-for-video-calling-according-to-the-npd-group\/"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073640"},{"key":"e_1_3_2_2_23_1","volume-title":"Neural Voice Puppetry: Audio-driven Facial Reenactment. arXiv preprint arXiv:1912.05566","author":"Thies Justus","year":"2019","unstructured":"Justus Thies , Mohamed Elgharib , Ayush Tewari , Christian Theobalt , and Matthias Nie\u00dfner . 2019. Neural Voice Puppetry: Audio-driven Facial Reenactment. arXiv preprint arXiv:1912.05566 ( 2019 ). 
Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nie\u00dfner. 2019. Neural Voice Puppetry: Audio-driven Facial Reenactment. arXiv preprint arXiv:1912.05566 (2019)."},{"key":"e_1_3_2_2_24_1","volume-title":"DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection. arxiv","author":"Tolosana Ruben","year":"2020","unstructured":"Ruben Tolosana , Ruben Vera-Rodriguez , Julian Fierrez , Aythami Morales , and Javier Ortega-Garcia . 2020. DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection. arxiv : 2001 .00179 [cs.CV] Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Aythami Morales, and Javier Ortega-Garcia. 2020. DeepFakes and Beyond: A Survey of Face Manipulation and Fake Detection. arxiv: 2001.00179 [cs.CV]"},{"key":"e_1_3_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW50498.2020.00335"},{"key":"e_1_3_2_2_26_1","volume-title":"Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022","author":"Ulyanov Dmitry","year":"2016","unstructured":"Dmitry Ulyanov , Andrea Vedaldi , and Victor Lempitsky . 2016. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 ( 2016 ). Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2016. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016)."},{"key":"e_1_3_2_2_27_1","volume-title":"International Journal of Computer Vision","author":"Vougioukas Konstantinos","year":"2019","unstructured":"Konstantinos Vougioukas , Stavros Petridis , and Maja Pantic . 2019. Realistic speech-driven facial animation with gans . International Journal of Computer Vision ( 2019 ), 1--16. Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. 2019. Realistic speech-driven facial animation with gans. 
International Journal of Computer Vision (2019), 1--16."},{"key":"e_1_3_2_2_28_1","volume-title":"et al.","author":"Wang Zhou","year":"2004","unstructured":"Zhou Wang , Alan C Bovik , Hamid R Sheikh , Eero P Simoncelli , et al. 2004 . Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, Vol. 13 , 4 (2004), 600--612. Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, Vol. 13, 4 (2004), 600--612."},{"key":"e_1_3_2_2_29_1","volume-title":"Talking Face Generation by Adversarially Disentangled Audio-Visual Representation. arXiv preprint arXiv:1807.07860","author":"Zhou Hang","year":"2018","unstructured":"Hang Zhou , Yu Liu , Ziwei Liu , Ping Luo , and Xiaogang Wang . 2018. Talking Face Generation by Adversarially Disentangled Audio-Visual Representation. arXiv preprint arXiv:1807.07860 ( 2018 ). Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. 2018. Talking Face Generation by Adversarially Disentangled Audio-Visual Representation. 
arXiv preprint arXiv:1807.07860 (2018)."}],"event":{"name":"MM '20: The 28th ACM International Conference on Multimedia","location":"Seattle WA USA","acronym":"MM '20","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 28th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394171.3413532","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3394171.3413532","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:47:13Z","timestamp":1750193233000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3394171.3413532"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,10,12]]},"references-count":29,"alternative-id":["10.1145\/3394171.3413532","10.1145\/3394171"],"URL":"https:\/\/doi.org\/10.1145\/3394171.3413532","relation":{},"subject":[],"published":{"date-parts":[[2020,10,12]]},"assertion":[{"value":"2020-10-12","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}