{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,7]],"date-time":"2026-05-07T16:25:19Z","timestamp":1778171119210,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":51,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["No. 62172101, No. 61976057"],"award-info":[{"award-number":["No. 62172101, No. 61976057"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100003399","name":"Science and Technology Commission of Shanghai Municipality","doi-asserted-by":"publisher","award":["No. 21511101000, No. 21511100602"],"award-info":[{"award-number":["No. 21511101000, No. 
21511100602"]}],"id":[{"id":"10.13039\/501100003399","id-type":"DOI","asserted-by":"publisher"}]},{"name":"SPMI Innovation and Technology Fund Projects","award":["SAST2020-110"],"award-info":[{"award-number":["SAST2020-110"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3547869","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:35Z","timestamp":1665416555000},"page":"6241-6249","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":68,"title":["MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing"],"prefix":"10.1145","author":[{"given":"Jiashuo","family":"Yu","sequence":"first","affiliation":[{"name":"Fudan University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ying","family":"Cheng","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rui-Wei","family":"Zhao","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rui","family":"Feng","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yuejie","family":"Zhang","sequence":"additional","affiliation":[{"name":"Fudan University, Shanghai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Joon Son Chung, and Andrew Zisserman","author":"Afouras Triantafyllos","year":"2020","unstructured":"Triantafyllos Afouras , Andrew Owens , Joon Son Chung, and Andrew Zisserman . 2020 . 
Self-Supervised Learning of Audio-Visual Objects from Video. In ECCV. Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. 2020. Self-Supervised Learning of Audio-Visual Objects from Video. In ECCV."},{"key":"e_1_3_2_2_2_1","volume-title":"Lucas Smaira, Sander Dieleman, and Andrew Zisserman.","author":"Alayrac Jean-Baptiste","year":"2020","unstructured":"Jean-Baptiste Alayrac , Adri\u00e0 Recasens , Rosalia Schneider , Relja Arandjelovi\u0107, Jason Ramapuram , Jeffrey De Fauw , Lucas Smaira, Sander Dieleman, and Andrew Zisserman. 2020 . Self-supervised multimodal versatile networks. arXiv preprint arXiv:2006.16228 (2020). Jean-Baptiste Alayrac, Adri\u00e0 Recasens, Rosalia Schneider, Relja Arandjelovi\u0107, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. 2020. Self-supervised multimodal versatile networks. arXiv preprint arXiv:2006.16228 (2020)."},{"key":"e_1_3_2_2_3_1","volume-title":"NeurIPS","volume":"33","author":"Alwassel Humam","year":"2020","unstructured":"Humam Alwassel , Dhruv Mahajan , Bruno Korbar , Lorenzo Torresani , Bernard Ghanem , and Du Tran . 2020 . Self-supervised learning by cross-modal audio-video clustering . In NeurIPS , Vol. 33 . Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, and Du Tran. 2020. Self-supervised learning by cross-modal audio-video clustering. In NeurIPS, Vol. 33."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"crossref","unstructured":"Relja Arandjelovic and Andrew Zisserman. 2017. Look listen and learn. In ICCV. 609--617.  Relja Arandjelovic and Andrew Zisserman. 2017. Look listen and learn. In ICCV. 609--617.","DOI":"10.1109\/ICCV.2017.73"},{"key":"e_1_3_2_2_5_1","volume-title":"Jamie Ryan Kiros, and Geoffrey E Hinton","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba , Jamie Ryan Kiros, and Geoffrey E Hinton . 2016 . Layer normalization. arXiv preprint arXiv:1607.06450 (2016). 
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)."},{"key":"e_1_3_2_2_6_1","volume-title":"Seeing sounds: visual and auditory interactions in the brain. Current opinion in neurobiology 16, 4","author":"Bulkin David A","year":"2006","unstructured":"David A Bulkin and Jennifer M Groh . 2006. Seeing sounds: visual and auditory interactions in the brain. Current opinion in neurobiology 16, 4 ( 2006 ), 415--419. David A Bulkin and Jennifer M Groh. 2006. Seeing sounds: visual and auditory interactions in the brain. Current opinion in neurobiology 16, 4 (2006), 415--419."},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"crossref","unstructured":"Ying Cheng Ruize Wang Zhihao Pan Rui Feng and Yuejie Zhang. 2020. Look listen and attend: Co-attention network for self-supervised audio-visual representation learning. In ACM MM. 3884--3892.  Ying Cheng Ruize Wang Zhihao Pan Rui Feng and Yuejie Zhang. 2020. Look listen and attend: Co-attention network for self-supervised audio-visual representation learning. In ACM MM. 3884--3892.","DOI":"10.1145\/3394171.3413869"},{"key":"e_1_3_2_2_8_1","volume-title":"Imagenet: A large-scale hierarchical image database. In CVPR. 248--255.","author":"Deng Jia","year":"2009","unstructured":"Jia Deng , Wei Dong , Richard Socher , Li-Jia Li , Kai Li , and Li Fei-Fei . 2009 . Imagenet: A large-scale hierarchical image database. In CVPR. 248--255. Jia Deng,Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In CVPR. 248--255."},{"key":"e_1_3_2_2_9_1","volume-title":"Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In CVPR. 3575--3584.","author":"Farha Yazan Abu","year":"2019","unstructured":"Yazan Abu Farha and Jurgen Gall . 2019 . Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In CVPR. 3575--3584. Yazan Abu Farha and Jurgen Gall. 2019. 
Ms-tcn: Multi-stage temporal convolutional network for action segmentation. In CVPR. 3575--3584."},{"key":"e_1_3_2_2_10_1","unstructured":"Chuang Gan Deng Huang Hang Zhao Joshua B Tenenbaum and Antonio Torralba. 2020. Music gesture for visual sound separation. In CVPR. 10478--10487.  Chuang Gan Deng Huang Hang Zhao Joshua B Tenenbaum and Antonio Torralba. 2020. Music gesture for visual sound separation. In CVPR. 10478--10487."},{"key":"e_1_3_2_2_11_1","unstructured":"Chuang Gan Hang Zhao Peihao Chen David Cox and Antonio Torralba. 2019. Self-supervised moving vehicle tracking with stereo sound. In ICCV. 7053--7062.  Chuang Gan Hang Zhao Peihao Chen David Cox and Antonio Torralba. 2019. Self-supervised moving vehicle tracking with stereo sound. In ICCV. 7053--7062."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"crossref","unstructured":"Jort F Gemmeke Daniel PW Ellis Dylan Freedman Aren Jansen Wade Lawrence R Channing Moore Manoj Plakal and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP. 776--780.  Jort F Gemmeke Daniel PW Ellis Dylan Freedman Aren Jansen Wade Lawrence R Channing Moore Manoj Plakal and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP. 776--780.","DOI":"10.1109\/ICASSP.2017.7952261"},{"key":"e_1_3_2_2_13_1","volume-title":"Proceedings of the fourteenth international conference on artificial intelligence and statistics. 315--323","author":"Glorot Xavier","year":"2011","unstructured":"Xavier Glorot , Antoine Bordes , and Yoshua Bengio . 2011 . Deep sparse rectifier neural networks . In Proceedings of the fourteenth international conference on artificial intelligence and statistics. 315--323 . Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics. 
315--323."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1162\/jocn.2009.21134"},{"key":"e_1_3_2_2_15_1","unstructured":"Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770--778.  Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770--778."},{"key":"e_1_3_2_2_16_1","volume-title":"Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al.","author":"Hershey Shawn","year":"2017","unstructured":"Shawn Hershey , Sourish Chaudhuri , Daniel PW Ellis , Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. 2017 . CNN architectures for large-scale audio classification. In ICASSP. 131--135. Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. 2017. CNN architectures for large-scale audio classification. In ICASSP. 131--135."},{"key":"e_1_3_2_2_17_1","doi-asserted-by":"crossref","unstructured":"Di Hu Feiping Nie and Xuelong Li. 2019. Deep multimodal clustering for unsupervised audiovisual learning. In CVPR. 9248--9257.  Di Hu Feiping Nie and Xuelong Li. 2019. Deep multimodal clustering for unsupervised audiovisual learning. In CVPR. 9248--9257.","DOI":"10.1109\/CVPR.2019.00947"},{"key":"e_1_3_2_2_18_1","volume-title":"NeurIPS","volume":"33","author":"Hu Di","year":"2020","unstructured":"Di Hu , Rui Qian , Minyue Jiang , Xiao Tan , Shilei Wen , Errui Ding , Weiyao Lin , and Dejing Dou . 2020 . Discriminative Sounding Objects Localization via Selfsupervised Audiovisual Matching . In NeurIPS , Vol. 33 . Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, and Dejing Dou. 2020. Discriminative Sounding Objects Localization via Selfsupervised Audiovisual Matching. In NeurIPS, Vol. 
33."},{"key":"e_1_3_2_2_19_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba . 2014 . Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"crossref","unstructured":"Qiuqiang Kong Yong Xu Wenwu Wang and Mark D Plumbley. 2018. Audio set classification with attention model: A probabilistic perspective. In ICASSP. 316--320.  Qiuqiang Kong Yong Xu Wenwu Wang and Mark D Plumbley. 2018. Audio set classification with attention model: A probabilistic perspective. In ICASSP. 316--320.","DOI":"10.1109\/ICASSP.2018.8461392"},{"key":"e_1_3_2_2_21_1","unstructured":"Bruno Korbar Du Tran and Lorenzo Torresani. 2018. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization. In NeurIPS. 7774--7785.  Bruno Korbar Du Tran and Lorenzo Torresani. 2018. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization. In NeurIPS. 7774--7785."},{"key":"e_1_3_2_2_22_1","volume-title":"Austin Reiter, and Gregory D Hager","author":"Lea Colin","year":"2017","unstructured":"Colin Lea , Michael D Flynn , Rene Vidal , Austin Reiter, and Gregory D Hager . 2017 . Temporal convolutional networks for action segmentation and detection. In CVPR. 156--165. Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. 2017. Temporal convolutional networks for action segmentation and detection. In CVPR. 156--165."},{"key":"e_1_3_2_2_23_1","volume-title":"MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation. TPAMI","author":"Li Shi-Jie","year":"2020","unstructured":"Shi-Jie Li , Yazan AbuFarha , Yun Liu , Ming-Ming Cheng , and Juergen Gall . 2020. 
MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation. TPAMI ( 2020 ). Shi-Jie Li, Yazan AbuFarha, Yun Liu, Ming-Ming Cheng, and Juergen Gall. 2020. MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation. TPAMI (2020)."},{"key":"e_1_3_2_2_24_1","unstructured":"Yan-Bo Lin Yu-Jhe Li and Yu-Chiang Frank Wang. 2019. Dual-modality seq2seq network for audio-visual event localization. In ICASSP. 2002--2006.  Yan-Bo Lin Yu-Jhe Li and Yu-Chiang Frank Wang. 2019. Dual-modality seq2seq network for audio-visual event localization. In ICASSP. 2002--2006."},{"key":"e_1_3_2_2_25_1","unstructured":"Yan-Bo Lin and Yu-Chiang Frank Wang. 2020. Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization. In ACCV.  Yan-Bo Lin and Yu-Chiang Frank Wang. 2020. Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization. In ACCV."},{"key":"e_1_3_2_2_26_1","unstructured":"Daochang Liu Tingting Jiang and Yizhou Wang. 2019. Completeness modeling and context separation for weakly supervised temporal action localization. In CVPR. 1298--1307.  Daochang Liu Tingting Jiang and Yizhou Wang. 2019. Completeness modeling and context separation for weakly supervised temporal action localization. In CVPR. 1298--1307."},{"key":"e_1_3_2_2_27_1","unstructured":"Shuang Ma Zhaoyang Zeng Daniel McDuff and Yale Song. 2021. Active Contrastive Learning of Audio-Visual Video Representations. In ICLR. https:\/\/openreview.net\/forum?id=OMizHuea_HB  Shuang Ma Zhaoyang Zeng Daniel McDuff and Yale Song. 2021. Active Contrastive Learning of Audio-Visual Video Representations. In ICLR. https:\/\/openreview.net\/forum?id=OMizHuea_HB"},{"key":"e_1_3_2_2_28_1","volume-title":"NeurIPS","volume":"33","author":"Morgado Pedro","year":"2020","unstructured":"Pedro Morgado , Yi Li , and Nuno Vasconcelos . 2020 . Learning Representations from Audio-Visual Spatial Alignment . In NeurIPS , Vol. 33 . 
Pedro Morgado, Yi Li, and Nuno Vasconcelos. 2020. Learning Representations from Audio-Visual Spatial Alignment. In NeurIPS, Vol. 33."},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"crossref","unstructured":"Phuc Nguyen Ting Liu Gautam Prasad and Bohyung Han. 2018. Weakly supervised action localization by sparse temporal pooling network. In CVPR. 6752--6761.  Phuc Nguyen Ting Liu Gautam Prasad and Bohyung Han. 2018. Weakly supervised action localization by sparse temporal pooling network. In CVPR. 6752--6761.","DOI":"10.1109\/CVPR.2018.00706"},{"key":"e_1_3_2_2_30_1","volume-title":"Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499","author":"van den Oord Aaron","year":"2016","unstructured":"Aaron van den Oord , Sander Dieleman , Heiga Zen , Karen Simonyan , Oriol Vinyals , Alex Graves , Nal Kalchbrenner , Andrew Senior , and Koray Kavukcuoglu . 2016 . Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016). Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016)."},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"crossref","unstructured":"Andrew Owens and Alexei A Efros. 2018. Audio-visual scene analysis with self-supervised multisensory features. In ECCV. 631--648.  Andrew Owens and Alexei A Efros. 2018. Audio-visual scene analysis with self-supervised multisensory features. In ECCV. 631--648.","DOI":"10.1007\/978-3-030-01231-1_39"},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"crossref","unstructured":"Janani Ramaswamy. 2020. What Makes the Sound?: A Dual-Modality Interacting Network for Audio-Visual Event Localization. In ICASSP. 4372--4376.  Janani Ramaswamy. 2020. What Makes the Sound?: A Dual-Modality Interacting Network for Audio-Visual Event Localization. In ICASSP. 
4372--4376.","DOI":"10.1109\/ICASSP40776.2020.9053895"},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"crossref","unstructured":"Janani Ramaswamy and Sukhendu Das. 2020. See the sound hear the pixels. In WACV. 2970--2979.  Janani Ramaswamy and Sukhendu Das. 2020. See the sound hear the pixels. In WACV. 2970--2979.","DOI":"10.1109\/WACV45572.2020.9093616"},{"key":"e_1_3_2_2_34_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 ( 2014 ). Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"crossref","unstructured":"Christian Szegedy Vincent Vanhoucke Sergey Ioffe Jon Shlens and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In CVPR. 2818--2826.  Christian Szegedy Vincent Vanhoucke Sergey Ioffe Jon Shlens and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In CVPR. 2818--2826.","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"crossref","unstructured":"Yapeng Tian Dingzeyu Li and Chenliang Xu. 2020. Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing. In ECCV.  Yapeng Tian Dingzeyu Li and Chenliang Xu. 2020. Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing. In ECCV.","DOI":"10.1007\/978-3-030-58580-8_26"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"crossref","unstructured":"Yapeng Tian Jing Shi Bochen Li Zhiyao Duan and Chenliang Xu. 2018. Audiovisual event localization in unconstrained videos. In ECCV. 247--263.  Yapeng Tian Jing Shi Bochen Li Zhiyao Duan and Chenliang Xu. 2018. 
Audiovisual event localization in unconstrained videos. In ECCV. 247--263.","DOI":"10.1007\/978-3-030-01216-8_16"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"crossref","unstructured":"Du Tran Heng Wang Lorenzo Torresani Jamie Ray Yann LeCun and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In CVPR. 6450--6459.  Du Tran Heng Wang Lorenzo Torresani Jamie Ray Yann LeCun and Manohar Paluri. 2018. A closer look at spatiotemporal convolutions for action recognition. In CVPR. 6450--6459.","DOI":"10.1109\/CVPR.2018.00675"},{"key":"e_1_3_2_2_39_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS. 5998--6008.  Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS. 5998--6008."},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"crossref","unstructured":"Yun Wang Juncheng Li and Florian Metze. 2019. A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. In ICASSP. 31--35.  Yun Wang Juncheng Li and Florian Metze. 2019. A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling. In ICASSP. 31--35.","DOI":"10.1109\/ICASSP.2019.8682847"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"crossref","unstructured":"Yunbo Wang Mingsheng Long Jianmin Wang and Philip S Yu. 2017. Spatiotemporal pyramid network for video action recognition. In CVPR. 1529--1538.  Yunbo Wang Mingsheng Long Jianmin Wang and Philip S Yu. 2017. Spatiotemporal pyramid network for video action recognition. In CVPR. 1529--1538.","DOI":"10.1109\/CVPR.2017.226"},{"key":"e_1_3_2_2_42_1","doi-asserted-by":"crossref","unstructured":"Yu Wu and Yi Yang. 2021. Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing. In CVPR. 
1326--1335.  Yu Wu and Yi Yang. 2021. Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing. In CVPR. 1326--1335.","DOI":"10.1109\/CVPR46437.2021.00138"},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"crossref","unstructured":"Yu Wu Linchao Zhu Yan Yan and Yi Yang. 2019. Dual attention matching for audio-visual event localization. In ICCV. 6292--6300.  Yu Wu Linchao Zhu Yan Yan and Yi Yang. 2019. Dual attention matching for audio-visual event localization. In ICCV. 6292--6300.","DOI":"10.1109\/ICCV.2019.00639"},{"key":"e_1_3_2_2_44_1","unstructured":"Haoming Xu Runhao Zeng Qingyao Wu Mingkui Tan and Chuang Gan. 2020. Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization. In ACM MM.  Haoming Xu Runhao Zeng Qingyao Wu Mingkui Tan and Chuang Gan. 2020. Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization. In ACM MM."},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i01.5361"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"crossref","unstructured":"Ceyuan Yang Yinghao Xu Jianping Shi Bo Dai and Bolei Zhou. 2020. Temporal pyramid network for action recognition. In CVPR. 591--600.  Ceyuan Yang Yinghao Xu Jianping Shi Bo Dai and Bolei Zhou. 2020. Temporal pyramid network for action recognition. In CVPR. 591--600.","DOI":"10.1109\/CVPR42600.2020.00067"},{"key":"e_1_3_2_2_47_1","volume-title":"MPN: Multimodal Parallel Network for Audio-Visual Event Localization. ICME","author":"Yu Jiashuo","year":"2021","unstructured":"Jiashuo Yu , Ying Cheng , and Rui Feng . 2021 . MPN: Multimodal Parallel Network for Audio-Visual Event Localization. ICME (2021). Jiashuo Yu, Ying Cheng, and Rui Feng. 2021. MPN: Multimodal Parallel Network for Audio-Visual Event Localization. ICME (2021)."},{"key":"e_1_3_2_2_48_1","doi-asserted-by":"crossref","unstructured":"Da Zhang Xiyang Dai and Yuan-Fang Wang. 2018. Dynamic temporal pyramid network: A closer look at multi-scale modeling for activity detection. In ACCV. 
712--728.  Da Zhang Xiyang Dai and Yuan-Fang Wang. 2018. Dynamic temporal pyramid network: A closer look at multi-scale modeling for activity detection. In ACCV. 712--728.","DOI":"10.1007\/978-3-030-20870-7_44"},{"key":"e_1_3_2_2_49_1","doi-asserted-by":"crossref","unstructured":"Hang Zhao Chuang Gan Wei-Chiu Ma and Antonio Torralba. 2019. The sound of motions. In ICCV. 1735--1744.  Hang Zhao Chuang Gan Wei-Chiu Ma and Antonio Torralba. 2019. The sound of motions. In ICCV. 1735--1744.","DOI":"10.1109\/ICCV.2019.00182"},{"key":"e_1_3_2_2_50_1","doi-asserted-by":"crossref","unstructured":"Hang Zhao Chuang Gan Andrew Rouditchenko Carl Vondrick Josh McDermott and Antonio Torralba. 2018. The Sound of Pixels. In ECCV.  Hang Zhao Chuang Gan Andrew Rouditchenko Carl Vondrick Josh McDermott and Antonio Torralba. 2018. The Sound of Pixels. In ECCV.","DOI":"10.1007\/978-3-030-01246-5_35"},{"key":"e_1_3_2_2_51_1","doi-asserted-by":"crossref","unstructured":"Jinxing Zhou Liang Zheng Yiran Zhong Shijie Hao and Meng Wang. 2021. Positive Sample Propagation along the Audio-Visual Event Line. In CVPR.  Jinxing Zhou Liang Zheng Yiran Zhong Shijie Hao and Meng Wang. 2021. Positive Sample Propagation along the Audio-Visual Event Line. 
In CVPR.","DOI":"10.1109\/CVPR46437.2021.00833"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547869","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3547869","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:35Z","timestamp":1750186955000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547869"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":51,"alternative-id":["10.1145\/3503161.3547869","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3547869","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}