{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:21:21Z","timestamp":1750220481129,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":29,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,10,18]],"date-time":"2021-10-18T00:00:00Z","timestamp":1634515200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"the Major Scientific Research Project of Zhejiang Lab","award":["No. 2019KD0AC01"],"award-info":[{"award-number":["No. 2019KD0AC01"]}]},{"name":"Ningbo Natural Science Foundation","award":["202003N4318"],"award-info":[{"award-number":["202003N4318"]}]},{"name":"the Fundamental Research Funds for the Central Universities","award":["2021FZZX001-23"],"award-info":[{"award-number":["2021FZZX001-23"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["U20B2066, 61976186"],"award-info":[{"award-number":["U20B2066, 61976186"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,10,18]]},"DOI":"10.1145\/3462244.3479952","type":"proceedings-article","created":{"date-parts":[[2021,10,15]],"date-time":"2021-10-15T14:41:47Z","timestamp":1634308907000},"page":"687-691","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Speech Guided Disentangled Visual Representation Learning for Lip Reading"],"prefix":"10.1145","author":[{"given":"Ya","family":"Zhao","sequence":"first","affiliation":[{"name":"Zhejiang University, China"}]},{"given":"Cheng","family":"Ma","sequence":"additional","affiliation":[{"name":"Zhejiang University, China"}]},{"given":"Zunlei","family":"Feng","sequence":"additional","affiliation":[{"name":"Zhejiang University, China"}]},{"given":"Mingli","family":"Song","sequence":"additional","affiliation":[{"name":"Zhejiang University, China"}]}],"member":"320","published-online":{"date-parts":[[2021,10,18]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/FG.2015.7163155"},{"key":"e_1_3_2_1_2_1","unstructured":"Yoshua Bengio Nicholas L\u00e9onard and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432(2013).  Yoshua Bengio Nicholas L\u00e9onard and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432(2013)."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01264-9_9"},{"key":"e_1_3_2_1_4_1","first-page":"1086","article-title":"VoxCeleb2","volume":"2018","author":"Chung Joon\u00a0Son","year":"2018","unstructured":"Joon\u00a0Son Chung , Arsha Nagrani , and Andrew Zisserman . 2018 . VoxCeleb2 : Deep Speaker Recognition. Proc. Interspeech 2018 (2018), 1086 \u2013 1090 . Joon\u00a0Son Chung, Arsha Nagrani, and Andrew Zisserman. 2018. VoxCeleb2: Deep Speaker Recognition. Proc. Interspeech 2018(2018), 1086\u20131090.","journal-title":"Deep Speaker Recognition. Proc. Interspeech"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.367"},{"key":"e_1_3_2_1_6_1","volume-title":"Asian Conference on Computer Vision. Springer, 87\u2013103","author":"Chung Joon\u00a0Son","year":"2016","unstructured":"Joon\u00a0Son Chung and Andrew Zisserman . 2016 . Lip reading in the wild . In Asian Conference on Computer Vision. Springer, 87\u2013103 . Joon\u00a0Son Chung and Andrew Zisserman. 2016. Lip reading in the wild. In Asian Conference on Computer Vision. Springer, 87\u2013103."},{"key":"e_1_3_2_1_7_1","volume-title":"Asian conference on computer vision. Springer, 251\u2013263","author":"Chung Joon\u00a0Son","year":"2016","unstructured":"Joon\u00a0Son Chung and Andrew Zisserman . 2016 . Out of time: automated lip sync in the wild . In Asian conference on computer vision. Springer, 251\u2013263 . Joon\u00a0Son Chung and Andrew Zisserman. 2016. Out of time: automated lip sync in the wild. In Asian conference on computer vision. Springer, 251\u2013263."},{"key":"e_1_3_2_1_8_1","volume-title":"Active shape models-their training and application. Computer vision and image understanding 61, 1","author":"Cootes F","year":"1995","unstructured":"Timothy\u00a0 F Cootes , Christopher\u00a0 J Taylor , David\u00a0 H Cooper , and Jim Graham . 1995. Active shape models-their training and application. Computer vision and image understanding 61, 1 ( 1995 ), 38\u201359. Timothy\u00a0F Cootes, Christopher\u00a0J Taylor, David\u00a0H Cooper, and Jim Graham. 1995. Active shape models-their training and application. Computer vision and image understanding 61, 1 (1995), 38\u201359."},{"key":"e_1_3_2_1_9_1","volume-title":"Feature extraction using discrete cosine transform and discrimination power analysis with a face recognition technology. Pattern recognition 43, 4","author":"Dabbaghchian Saeed","year":"2010","unstructured":"Saeed Dabbaghchian , Masoumeh\u00a0 P Ghaemmaghami , and Ali Aghagolzadeh . 2010. Feature extraction using discrete cosine transform and discrimination power analysis with a face recognition technology. Pattern recognition 43, 4 ( 2010 ), 1431\u20131440. Saeed Dabbaghchian, Masoumeh\u00a0P Ghaemmaghami, and Ali Aghagolzadeh. 2010. Feature extraction using discrete cosine transform and discrimination power analysis with a face recognition technology. Pattern recognition 43, 4 (2010), 1431\u20131440."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/3326943.3327009"},{"key":"e_1_3_2_1_11_1","first-page":"5894","article-title":"Dual swap disentangling","volume":"31","author":"Feng Zunlei","year":"2018","unstructured":"Zunlei Feng , Xinchao Wang , Chenglong Ke , An-Xiang Zeng , Dacheng Tao , and Mingli Song . 2018 . Dual swap disentangling . Advances in Neural Information Processing Systems 31 (2018), 5894 \u2013 5904 . Zunlei Feng, Xinchao Wang, Chenglong Ke, An-Xiang Zeng, Dacheng Tao, and Mingli Song. 2018. Dual swap disentangling. Advances in Neural Information Processing Systems 31 (2018), 5894\u20135904.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_12_1","first-page":"3444","article-title":"Interpretable Partitioned Embedding for Intelligent Multi-item Fashion Outfit Composition","volume":"15","author":"Feng Zunlei","year":"2019","unstructured":"Zunlei Feng , Zhenyun Yu , Yongcheng Jing , Sai Wu , Mingli Song , Yezhou Yang , and Junxiao Jiang . 2019 . Interpretable Partitioned Embedding for Intelligent Multi-item Fashion Outfit Composition . ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15 , 2s (2019), 3444 \u2013 3453 . Zunlei Feng, Zhenyun Yu, Yongcheng Jing, Sai Wu, Mingli Song, Yezhou Yang, and Junxiao Jiang. 2019. Interpretable Partitioned Embedding for Intelligent Multi-item Fashion Outfit Composition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15, 2s (2019), 3444\u20133453.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)"},{"key":"e_1_3_2_1_13_1","first-page":"1287","article-title":"Image-to-image translation for cross-domain disentanglement","volume":"31","author":"Gonzalez-Garcia Abel","year":"2018","unstructured":"Abel Gonzalez-Garcia , Joost Van De\u00a0Weijer , and Yoshua Bengio . 2018 . Image-to-image translation for cross-domain disentanglement . Advances in Neural Information Processing Systems 31 (2018), 1287 \u2013 1298 . Abel Gonzalez-Garcia, Joost Van De\u00a0Weijer, and Yoshua Bengio. 2018. Image-to-image translation for cross-domain disentanglement. Advances in Neural Information Processing Systems 31 (2018), 1287\u20131298.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_14_1","volume-title":"International Conference on Learning Representations","author":"Higgins Irina","year":"2017","unstructured":"Irina Higgins , Loic Matthey , Arka Pal , Christopher Burgess , Xavier Glorot , Matthew Botvinick , Shakir Mohamed , and Alexander Lerchner . 2017 . beta-vae: Learning basic visual concepts with a constrained variational framework . International Conference on Learning Representations (2017). Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-vae: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations (2017)."},{"key":"e_1_3_2_1_15_1","volume-title":"Proceedings of the 32nd International Conference on Neural Information Processing Systems. 4485\u20134495","author":"Jia Ye","year":"2018","unstructured":"Ye Jia , Yu Zhang , Ron\u00a0 J Weiss , Quan Wang , Jonathan Shen , Fei Ren , Zhifeng Chen , Patrick Nguyen , Ruoming Pang , Ignacio\u00a0Lopez Moreno , 2018 . Transfer learning from speaker verification to multispeaker text-to-speech synthesis . In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 4485\u20134495 . Ye Jia, Yu Zhang, Ron\u00a0J Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio\u00a0Lopez Moreno, 2018. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 4485\u20134495."},{"key":"e_1_3_2_1_16_1","volume-title":"Snakes: Active contour models. International journal of computer vision 1, 4","author":"Kass Michael","year":"1988","unstructured":"Michael Kass , Andrew Witkin , and Demetri Terzopoulos . 1988 . Snakes: Active contour models. International journal of computer vision 1, 4 (1988), 321\u2013331. Michael Kass, Andrew Witkin, and Demetri Terzopoulos. 1988. Snakes: Active contour models. International journal of computer vision 1, 4 (1988), 321\u2013331."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9054057"},{"key":"e_1_3_2_1_18_1","first-page":"2616","article-title":"VoxCeleb","volume":"2017","author":"Nagrani Arsha","year":"2017","unstructured":"Arsha Nagrani , Joon\u00a0Son Chung , and Andrew Zisserman . 2017 . VoxCeleb : A Large-Scale Speaker Identification Dataset. Proc. Interspeech 2017 (2017), 2616 \u2013 2620 . Arsha Nagrani, Joon\u00a0Son Chung, and Andrew Zisserman. 2017. VoxCeleb: A Large-Scale Speaker Identification Dataset. Proc. Interspeech 2017(2017), 2616\u20132620.","journal-title":"A Large-Scale Speaker Identification Dataset. Proc. Interspeech"},{"key":"e_1_3_2_1_19_1","volume-title":"Learning factorial codes by predictability minimization. Neural computation 4, 6","author":"Schmidhuber J\u00fcrgen","year":"1992","unstructured":"J\u00fcrgen Schmidhuber . 1992. Learning factorial codes by predictability minimization. Neural computation 4, 6 ( 1992 ), 863\u2013879. J\u00fcrgen Schmidhuber. 1992. Learning factorial codes by predictability minimization. Neural computation 4, 6 (1992), 863\u2013879."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2017-85"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1121\/1.1912174"},{"key":"e_1_3_2_1_22_1","volume-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems. 6309\u20136318","author":"van\u00a0den Oord Aaron","year":"2017","unstructured":"Aaron van\u00a0den Oord , Oriol Vinyals , and Koray Kavukcuoglu . 2017 . Neural discrete representation learning . In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6309\u20136318 . Aaron van\u00a0den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 6309\u20136318."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.193"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683120"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/FG.2019.8756582"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00080"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1145\/3338533.3366579"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i04.6174"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.33019299"}],"event":{"name":"ICMI '21: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction"],"location":"Montr\u00e9al QC Canada","acronym":"ICMI '21"},"container-title":["Proceedings of the 2021 International Conference on Multimodal Interaction"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3462244.3479952","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3462244.3479952","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:48:55Z","timestamp":1750193335000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3462244.3479952"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,10,18]]},"references-count":29,"alternative-id":["10.1145\/3462244.3479952","10.1145\/3462244"],"URL":"https:\/\/doi.org\/10.1145\/3462244.3479952","relation":{},"subject":[],"published":{"date-parts":[[2021,10,18]]},"assertion":[{"value":"2021-10-18","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}