{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,13]],"date-time":"2026-05-13T08:32:51Z","timestamp":1778661171929,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":35,"publisher":"ACM","license":[{"start":{"date-parts":[[2019,10,15]],"date-time":"2019-10-15T00:00:00Z","timestamp":1571097600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2019,10,15]]},"DOI":"10.1145\/3343031.3351066","type":"proceedings-article","created":{"date-parts":[[2019,10,21]],"date-time":"2019-10-21T16:32:26Z","timestamp":1571675546000},"page":"1428-1436","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":88,"title":["Towards Automatic Face-to-Face Translation"],"prefix":"10.1145","author":[{"given":"Prajwal","family":"K R","sequence":"first","affiliation":[{"name":"IIIT Hyderabad, Hyderabad, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rudrabha","family":"Mukhopadhyay","sequence":"additional","affiliation":[{"name":"IIIT Hyderabad, Hyderabad, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jerin","family":"Philip","sequence":"additional","affiliation":[{"name":"IIIT Hyderabad, Hyderabad, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Abhishek","family":"Jha","sequence":"additional","affiliation":[{"name":"IIIT Hyderabad, Hyderabad, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Vinay","family":"Namboodiri","sequence":"additional","affiliation":[{"name":"IIT Kanpur, Kanpur, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"C V","family":"Jawahar","sequence":"additional","affiliation":[{"name":"IIIT Hyderabad, Hyderabad, 
India"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2019,10,15]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Andrew Senior, Oriol Vinyals, and Andrew Zisserman.","author":"Afouras Triantafyllos","year":"2018","unstructured":"Triantafyllos Afouras , Joon Son Chung , Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018 . Deep audio-visual speech recognition. IEEE transactions on pattern analysis and machine intelligence (2018). Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018. Deep audio-visual speech recognition. IEEE transactions on pattern analysis and machine intelligence (2018)."},{"key":"e_1_3_2_1_2_1","volume-title":"Massively Multilingual Neural Machine Translation. arXiv preprint arXiv:1903.00089","author":"Aharoni Roee","year":"2019","unstructured":"Roee Aharoni , Melvin Johnson , and Orhan Firat . 2019. Massively Multilingual Neural Machine Translation. arXiv preprint arXiv:1903.00089 ( 2019 ). Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively Multilingual Neural Machine Translation. arXiv preprint arXiv:1903.00089 (2019)."},{"key":"e_1_3_2_1_3_1","volume-title":"International conference on machine learning. 173--182","author":"Amodei Dario","year":"2016","unstructured":"Dario Amodei , Sundaram Ananthanarayanan , Rishita Anubhai , Jingliang Bai , Eric Battenberg , Carl Case , Jared Casper , Bryan Catanzaro , Qiang Cheng , Guoliang Chen , 2016 . Deep speech 2: End-to-end speech recognition in english and mandarin . In International conference on machine learning. 173--182 . Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. 2016. Deep speech 2: End-to-end speech recognition in english and mandarin. In International conference on machine learning. 
173--182."},{"key":"e_1_3_2_1_4_1","unstructured":"Sercan Arik Jitong Chen Kainan Peng Wei Ping and Yanqi Zhou. 2018. Neural voice cloning with a few samples. In Advances in Neural Information Processing Systems. 10040--10050.  Sercan Arik Jitong Chen Kainan Peng Wei Ping and Yanqi Zhou. 2018. Neural voice cloning with a few samples. In Advances in Neural Information Processing Systems. 10040--10050."},{"key":"e_1_3_2_1_5_1","volume-title":"Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473","author":"Bahdanau Dzmitry","year":"2014","unstructured":"Dzmitry Bahdanau , Kyunghyun Cho , and Yoshua Bengio . 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 ( 2014 ). Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/258734.258880"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_32"},{"key":"e_1_3_2_1_8_1","volume-title":"You said that? arXiv preprint arXiv:1705.02966","author":"Chung Joon Son","year":"2017","unstructured":"Joon Son Chung , Amir Jamaludin , and Andrew Zisserman . 2017. You said that? arXiv preprint arXiv:1705.02966 ( 2017 ). Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. 2017. You said that? arXiv preprint arXiv:1705.02966 (2017)."},{"key":"e_1_3_2_1_9_1","volume-title":"Asian Conference on Computer Vision. Springer, 87--103","author":"Chung Joon Son","year":"2016","unstructured":"Joon Son Chung and Andrew Zisserman . 2016 . Lip reading in the wild . In Asian Conference on Computer Vision. Springer, 87--103 . Joon Son Chung and Andrew Zisserman. 2016. Lip reading in the wild. In Asian Conference on Computer Vision. 
Springer, 87--103."},{"key":"e_1_3_2_1_10_1","volume-title":"International Workshop on Spoken Language Translation .","author":"Federmann Christian","year":"2016","unstructured":"Christian Federmann and William D Lewis . 2016 . Microsoft speech language translation (mslt) corpus: The iwslt 2016 release for english, french and german . In International Workshop on Spoken Language Translation . Christian Federmann and William D Lewis. 2016. Microsoft speech language translation (mslt) corpus: The iwslt 2016 release for english, french and german. In International Workshop on Spoken Language Translation ."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASSP.1984.1164317"},{"key":"e_1_3_2_1_12_1","unstructured":"Abhishek Jha Vinay Namboodiri and C V Jawahar. 2019. Cross-Language Speech Dependent Lip-Synchronization. (2019). To appear in 2019 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP).  Abhishek Jha Vinay Namboodiri and C V Jawahar. 2019. Cross-Language Speech Dependent Lip-Synchronization. (2019). To appear in 2019 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP)."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00065"},{"key":"e_1_3_2_1_14_1","volume-title":"Parallel-data-free voice conversion using cycle-consistent adversarial networks. arXiv preprint arXiv:1711.11293","author":"Kaneko Takuhiro","year":"2017","unstructured":"Takuhiro Kaneko and Hirokazu Kameoka . 2017. Parallel-data-free voice conversion using cycle-consistent adversarial networks. arXiv preprint arXiv:1711.11293 ( 2017 ). Takuhiro Kaneko and Hirokazu Kameoka. 2017. Parallel-data-free voice conversion using cycle-consistent adversarial networks. arXiv preprint arXiv:1711.11293 (2017)."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.5555\/1577069.1755843"},{"key":"e_1_3_2_1_16_1","volume-title":"Adam: A method for stochastic optimization. 
arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba . 2014 . Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_1_17_1","volume-title":"Obamanet: Photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442","author":"Kumar Rithesh","year":"2017","unstructured":"Rithesh Kumar , Jose Sotelo , Kundan Kumar , Alexandre de Br\u00e9bisson , and Yoshua Bengio . 2017 . Obamanet: Photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442 (2017). Rithesh Kumar, Jose Sotelo, Kundan Kumar, Alexandre de Br\u00e9bisson, and Yoshua Bengio. 2017. Obamanet: Photo-realistic lip-sync from text. arXiv preprint arXiv:1801.01442 (2017)."},{"key":"e_1_3_2_1_18_1","volume-title":"Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018)","author":"Kunchukuttan Anoop","year":"2018","unstructured":"Anoop Kunchukuttan , Pratik Mehta , and Pushpak Bhattacharyya . 2018 . The IIT Bombay English-Hindi Parallel Corpus . In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018) . Anoop Kunchukuttan, Pratik Mehta, and Pushpak Bhattacharyya. 2018. The IIT Bombay English-Hindi Parallel Corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018) ."},{"key":"e_1_3_2_1_19_1","volume-title":"Proceedings of Translating and the Computer (TC37)","author":"Lewis Will","year":"2015","unstructured":"Will Lewis . 2015 . Skype Translator: Breaking Down Language and Hearing Barriers . In Proceedings of Translating and the Computer (TC37) . https:\/\/www.microsoft.com\/en-us\/research\/publication\/skype-translator-breaking-down-language-and-hearing-barriers\/ Will Lewis. 2015. 
Skype Translator: Breaking Down Language and Hearing Barriers. In Proceedings of Translating and the Computer (TC37) . https:\/\/www.microsoft.com\/en-us\/research\/publication\/skype-translator-breaking-down-language-and-hearing-barriers\/"},{"key":"e_1_3_2_1_20_1","volume-title":"Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025","author":"Luong Minh-Thang","year":"2015","unstructured":"Minh-Thang Luong , Hieu Pham , and Christopher D Manning . 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 ( 2015 ). Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015)."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1103"},{"key":"e_1_3_2_1_22_1","unstructured":"NPD. 2016. 52 Percent of Millennial Smartphone Owners Use their Device for Video Calling According to The NPD Group. https:\/\/www.npd.com\/wps\/portal\/npd\/us\/news\/press-releases\/2016\/52-percent-of-millennial-smartphone-owners-use-their-device-for-video-calling-according-to-the-npd-group\/  NPD. 2016. 52 Percent of Millennial Smartphone Owners Use their Device for Video Calling According to The NPD Group. https:\/\/www.npd.com\/wps\/portal\/npd\/us\/news\/press-releases\/2016\/52-percent-of-millennial-smartphone-owners-use-their-device-for-video-calling-according-to-the-npd-group\/"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2015.7178964"},{"key":"e_1_3_2_1_24_1","volume-title":"A Baseline Neural Machine Translation System for Indian Languages. arXiv preprint arXiv:1907.12437","author":"Philip Jerin","year":"2019","unstructured":"Jerin Philip , Vinay P Namboodiri , and CV Jawahar . 2019. A Baseline Neural Machine Translation System for Indian Languages. arXiv preprint arXiv:1907.12437 ( 2019 ). 
Jerin Philip, Vinay P Namboodiri, and CV Jawahar. 2019. A Baseline Neural Machine Translation System for Indian Languages. arXiv preprint arXiv:1907.12437 (2019)."},{"key":"e_1_3_2_1_25_1","volume-title":"Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654","author":"Ping Wei","year":"2017","unstructured":"Wei Ping , Kainan Peng , Andrew Gibiansky , Sercan O Arik , Ajay Kannan , Sharan Narang , Jonathan Raiman , and John Miller . 2017. Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654 ( 2017 ). Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. 2017. Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654 (2017)."},{"key":"e_1_3_2_1_26_1","unstructured":"Anthony Rousseau Paul Del\u00e9glise and Yannick Esteve. 2012. TED-LIUM: an Automatic Speech Recognition dedicated corpus.. In LREC. 125--129.  Anthony Rousseau Paul Del\u00e9glise and Yannick Esteve. 2012. TED-LIUM: an Automatic Speech Recognition dedicated corpus.. In LREC. 125--129."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461368"},{"key":"e_1_3_2_1_28_1","unstructured":"Ilya Sutskever Oriol Vinyals and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104--3112.  Ilya Sutskever Oriol Vinyals and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104--3112."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/3072959.3073640"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2018.8461829"},{"key":"e_1_3_2_1_31_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 
2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008.  Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008."},{"key":"e_1_3_2_1_32_1","volume-title":"et al.","author":"Wang Zhou","year":"2004","unstructured":"Zhou Wang , Alan C Bovik , Hamid R Sheikh , Eero P Simoncelli , et al. 2004 . Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing , Vol. 13 , 4 (2004), 600--612. Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing , Vol. 13, 4 (2004), 600--612."},{"key":"e_1_3_2_1_33_1","volume-title":"et al.","author":"Wu Yonghui","year":"2016","unstructured":"Yonghui Wu , Mike Schuster , Zhifeng Chen , Quoc V Le , Mohammad Norouzi , Wolfgang Macherey , Maxim Krikun , Yuan Cao , Qin Gao , Klaus Macherey , et al. 2016 . Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016). Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016)."},{"key":"e_1_3_2_1_34_1","volume-title":"Statistical parametric speech synthesis. speech communication","author":"Zen Heiga","year":"2009","unstructured":"Heiga Zen , Keiichi Tokuda , and Alan W Black . 2009. Statistical parametric speech synthesis. speech communication , Vol. 51 , 11 ( 2009 ), 1039--1064. Heiga Zen, Keiichi Tokuda, and Alan W Black. 2009. Statistical parametric speech synthesis. 
speech communication , Vol. 51, 11 (2009), 1039--1064."},{"key":"e_1_3_2_1_35_1","volume-title":"Talking Face Generation by Adversarially Disentangled Audio-Visual Representation. arXiv preprint arXiv:1807.07860","author":"Zhou Hang","year":"2018","unstructured":"Hang Zhou , Yu Liu , Ziwei Liu , Ping Luo , and Xiaogang Wang . 2018. Talking Face Generation by Adversarially Disentangled Audio-Visual Representation. arXiv preprint arXiv:1807.07860 ( 2018 ). Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. 2018. Talking Face Generation by Adversarially Disentangled Audio-Visual Representation. arXiv preprint arXiv:1807.07860 (2018)."}],"event":{"name":"MM '19: The 27th ACM International Conference on Multimedia","location":"Nice France","acronym":"MM '19","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 27th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3343031.3351066","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3343031.3351066","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:13:12Z","timestamp":1750201992000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3343031.3351066"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,10,15]]},"references-count":35,"alternative-id":["10.1145\/3343031.3351066","10.1145\/3343031"],"URL":"https:\/\/doi.org\/10.1145\/3343031.3351066","relation":{},"subject":[],"published":{"date-parts":[[2019,10,15]]},"assertion":[{"value":"2019-10-15","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}