{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,23]],"date-time":"2026-01-23T14:33:28Z","timestamp":1769178808549,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":34,"publisher":"ACM","license":[{"start":{"date-parts":[[2021,8,24]],"date-time":"2021-08-24T00:00:00Z","timestamp":1629763200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Sichuan Science and Technology Program","award":["2019YFG0535"],"award-info":[{"award-number":["2019YFG0535"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61832001"],"award-info":[{"award-number":["61832001"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2021,8,24]]},"DOI":"10.1145\/3460426.3463624","type":"proceedings-article","created":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T22:50:29Z","timestamp":1630536629000},"page":"394-401","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Multi-Attention Audio-Visual Fusion Network for Audio Spatialization"],"prefix":"10.1145","author":[{"given":"Wen","family":"Zhang","sequence":"first","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Jie","family":"Shao","sequence":"additional","affiliation":[{"name":"Sichuan Artificial Intelligence Research Institute, Yibin, China"}]}],"member":"320","published-online":{"date-parts":[[2021,9]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"19th Annual Conference of the International Speech Communication Association","author":"Afouras 
Triantafyllos","year":"2018","unstructured":"Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. 2018. The Conversation: Deep Audio-Visual Speech Enhancement. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2--6 September 2018. 3244--3248."},{"key":"e_1_3_2_1_2_1","volume-title":"Listen and Learn. In IEEE International Conference on Computer Vision, ICCV 2017","author":"Arandjelovic Relja","year":"2017","unstructured":"Relja Arandjelovic and Andrew Zisserman. 2017. Look, Listen and Learn. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22--29, 2017. 609--617."},{"key":"e_1_3_2_1_3_1","volume-title":"Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016","author":"Aytar Yusuf","year":"2016","unstructured":"Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. SoundNet: Learning Sound Representations from Unlabeled Video. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5--10, 2016, Barcelona, Spain. 
892--900."},{"key":"e_1_3_2_1_4_1","volume-title":"Anh Huy Phan, and Shun-ichi Amari","author":"Cichocki Andrzej","year":"2009","unstructured":"Andrzej Cichocki, Rafal Zdunek, Anh Huy Phan, and Shun-ichi Amari. 2009. Nonnegative Matrix and Tensor Factorizations - Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASLP.2017.2716178"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/2601097.2601119"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8682970"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3197517.3201357"},{"key":"e_1_3_2_1_9_1","volume-title":"Seeing Through Noise: Visually Driven Speaker Separation And Enhancement. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018","author":"Gabbay Aviv","year":"2018","unstructured":"Aviv Gabbay, Ariel Ephrat, Tavi Halperin, and Shmuel Peleg. 2018. Seeing Through Noise: Visually Driven Speaker Separation And Enhancement. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15--20, 2018. 3051--3055."},{"key":"e_1_3_2_1_10_1","volume-title":"Music Gesture for Visual Sound Separation. 
In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020","author":"Gan Chuang","year":"2020","unstructured":"Chuang Gan, Deng Huang, Hang Zhao, Joshua B. Tenenbaum, and Antonio Torralba. 2020. Music Gesture for Visual Sound Separation. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13--19, 2020. 10475--10484."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00041"},{"key":"e_1_3_2_1_12_1","volume-title":"Co-Separating Sounds of Visual Objects. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019","author":"Gao Ruohan","year":"2019","unstructured":"Ruohan Gao and Kristen Grauman. 2019b. Co-Separating Sounds of Visual Objects. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. 3878--3887."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1162\/0899766054322964"},{"key":"e_1_3_2_1_14_1","volume-title":"Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016","author":"He Kaiming","year":"2016","unstructured":"
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016. 770--778."},{"key":"e_1_3_2_1_15_1","volume-title":"Cross-Modal Retrieval of Videos and Music. CoRR","author":"Hong Sungeun","year":"2017","unstructured":"Sungeun Hong, Woobin Im, and Hyun Seung Yang. 2017. Deep Learning for Content-Based, Cross-Modal Retrieval of Videos and Music. CoRR, Vol. abs\/1704.06761 (2017)."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ASRU.2017.8268967"},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3197517.3201391"},{"key":"e_1_3_2_1_18_1","volume-title":"Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018","author":"Morgado Pedro","year":"2018","unstructured":"Pedro Morgado, Nuno Vasconcelos, Timothy R. Langlois, and Oliver Wang. 2018. Self-Supervised Generation of Spatial Audio for 360\u00b0 Video. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3--8, 2018, Montr\u00e9al, Canada. 360--370."},{"key":"e_1_3_2_1_19_1","volume-title":"Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching. 
In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018","author":"Nagrani Arsha","year":"2018","unstructured":"Arsha Nagrani, Samuel Albanie, and Andrew Zisserman. 2018. Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18--22, 2018. 8427--8436."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00772"},{"key":"e_1_3_2_1_21_1","volume-title":"Proceedings, Part VI. 639--658","author":"Owens Andrew","unstructured":"Andrew Owens and Alexei A. Efros. 2018. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8--14, 2018, Proceedings, Part VI. 639--658."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7952687"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_2_1_25_1","volume-title":"19th Annual Conference of the International Speech Communication Association","author":"Sriskandaraja Kaavya","year":"2018","unstructured":"
Kaavya Sriskandaraja, Vidhyasaharan Sethu, and Eliathamby Ambikairajah. 2018. Deep Siamese Architecture Based Replay Detection for Secure Voice Biometric. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2--6 September 2018. 671--675."},{"key":"e_1_3_2_1_26_1","volume-title":"Proceedings, Part II. 252--268","author":"Tian Yapeng","year":"2018","unstructured":"Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. 2018. Audio-Visual Event Localization in Unconstrained Videos. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8--14, 2018, Proceedings, Part II. 252--268."},{"key":"e_1_3_2_1_27_1","volume-title":"Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4--9, 2017, Long Beach, CA, USA. 
5998--6008."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/TASL.2006.885253"},{"key":"e_1_3_2_1_29_1","volume-title":"Disjoint Mapping Network for Cross-modal Matching of Voices and Faces. In 7th International Conference on Learning Representations, ICLR 2019","author":"Wen Yandong","year":"2019","unstructured":"Yandong Wen, Mahmoud Al Ismail, Weiyang Liu, Bhiksha Raj, and Rita Singh. 2019. Disjoint Mapping Network for Cross-modal Matching of Voices and Faces. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6--9, 2019."},{"key":"e_1_3_2_1_30_1","volume-title":"Recursive Visual Sound Separation Using Minus-Plus Net. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019","author":"Xu Xudong","year":"2019","unstructured":"Xudong Xu, Bo Dai, and Dahua Lin. 2019. Recursive Visual Sound Separation Using Minus-Plus Net. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. 882--891."},{"key":"e_1_3_2_1_31_1","volume-title":"The Sound of Motions. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019","author":"Zhao Hang","year":"2019","unstructured":"
Hang Zhao, Chuang Gan, Wei-Chiu Ma, and Antonio Torralba. 2019. The Sound of Motions. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. 1735--1744."},{"key":"e_1_3_2_1_32_1","volume-title":"Proceedings, Part I. 587--604","author":"Zhao Hang","year":"2018","unstructured":"Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh H. McDermott, and Antonio Torralba. 2018. The Sound of Pixels. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8--14, 2018, Proceedings, Part I. 587--604."},{"key":"e_1_3_2_1_33_1","volume-title":"Proceedings, Part XII. 52--69","author":"Zhou Hang","year":"2020","unstructured":"Hang Zhou, Xudong Xu, Dahua Lin, Xiaogang Wang, and Ziwei Liu. 2020. Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part XII. 
52--69."},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11633-021-1293-0"}],"event":{"name":"ICMR '21: International Conference on Multimedia Retrieval","location":"Taipei Taiwan","acronym":"ICMR '21","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2021 International Conference on Multimedia Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460426.3463624","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3460426.3463624","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T20:17:03Z","timestamp":1750191423000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3460426.3463624"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,24]]},"references-count":34,"alternative-id":["10.1145\/3460426.3463624","10.1145\/3460426"],"URL":"https:\/\/doi.org\/10.1145\/3460426.3463624","relation":{},"subject":[],"published":{"date-parts":[[2021,8,24]]},"assertion":[{"value":"2021-09-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}