{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,10]],"date-time":"2025-12-10T09:00:20Z","timestamp":1765357220105,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":47,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key Research and Development Program of China","doi-asserted-by":"publisher","award":["2018AAA0102200"],"award-info":[{"award-number":["2018AAA0102200"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3548309","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:12Z","timestamp":1665416592000},"page":"719-727","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":35,"title":["DHHN: Dual Hierarchical Hybrid Network for Weakly-Supervised Audio-Visual Video Parsing"],"prefix":"10.1145","author":[{"given":"Xun","family":"Jiang","sequence":"first","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Xing","family":"Xu","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Zhiguo","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Jingran","family":"Zhang","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, 
China"}]},{"given":"Jingkuan","family":"Song","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Fumin","family":"Shen","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China, Chengdu, China"}]},{"given":"Huimin","family":"Lu","sequence":"additional","affiliation":[{"name":"Kyushu Institute of Technology, Kitakyushu, China"}]},{"given":"Heng Tao","family":"Shen","sequence":"additional","affiliation":[{"name":"University of Electronic Science and Technology of China and Peng Cheng Laboratory, Chengdu, China"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58523-5_13"},{"key":"e_1_3_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.73"},{"key":"e_1_3_2_2_3_1","volume-title":"Proceedings of Annual Conference on Neural Information Processing Systems. 892--900","author":"Aytar Yusuf","year":"2016","unstructured":"Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. SoundNet: Learning Sound Representations from Unlabeled Video. In Proceedings of Annual Conference on Neural Information Processing Systems. 
892--900."},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00124"},{"key":"e_1_3_2_2_5_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2021-120"},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413869"},{"key":"e_1_3_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1214\/009053604000000201"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01524"},{"key":"e_1_3_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01047"},{"key":"e_1_3_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2017.7952261"},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_2_13_1","volume-title":"Modeling Two-Stream Correspondence for Visual Sound Separation","author":"He Yixuan","year":"2021","unstructured":"Yixuan He, Xing Xu, Jingran Zhang, Fumin Shen, Yang Yang, and Heng Tao Shen. 2021. Modeling Two-Stream Correspondence for Visual Sound Separation. IEEE Transactions on Circuits and Systems for Video Technology (2021)."},{"key":"e_1_3_2_2_14_1","volume-title":"Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al.","author":"Hershey Shawn","year":"2017","unstructured":"
Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. 2017. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 131--135."},{"key":"e_1_3_2_2_15_1","unstructured":"Will Kay Joao Carreira Karen Simonyan Brian Zhang Chloe Hillier Sudheendra Vijayanarasimhan Fabio Viola Tim Green Trevor Back Paul Natsev et al. 2017. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)."},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00559"},{"key":"e_1_3_2_2_17_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_2_18_1","volume-title":"Proceedings of Annual Conference on Neural Information Processing Systems. 7774--7785","author":"Korbar Bruno","year":"2018","unstructured":"Bruno Korbar, Du Tran, and Lorenzo Torresani. 2018. Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization. In Proceedings of Annual Conference on Neural Information Processing Systems. 
7774--7785."},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2021-2135"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2019.8683226"},{"key":"e_1_3_2_2_21_1","unstructured":"Yan-Bo Lin Hung-Yu Tseng Hsin-Ying Lee Yen-Yu Lin and Ming-Hsuan Yang. 2021. Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing. Proceedings of the Advances in Neural Information Processing Systems 34."},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3132229"},{"key":"e_1_3_2_2_23_1","volume-title":"Proceedings of International Conference on Learning Representations.","author":"Ma Shuang","year":"2021","unstructured":"Shuang Ma, Zhaoyang Zeng, Daniel J. McDuff, and Yale Song. 2021. Active Contrastive Learning of Audio-Visual Video Representations. In Proceedings of International Conference on Learning Representations."},{"key":"e_1_3_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00034"},{"key":"e_1_3_2_2_25_1","volume-title":"Hinton","author":"M\u00fcller Rafael","year":"2019","unstructured":"Rafael M\u00fcller, Simon Kornblith, and Geoffrey E. Hinton. 2019. When does label smoothing help? In Proceedings of the Advances in Neural Information Processing Systems. 
4696--4705."},{"key":"e_1_3_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01033"},{"volume-title":"Proceedings of the European Conference on Computer Vision. 639--658","author":"Owens Andrew","key":"e_1_3_2_2_27_1","unstructured":"Andrew Owens and Alexei A. Efros. 2018. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. In Proceedings of the European Conference on Computer Vision. 639--658."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP40776.2020.9053895"},{"key":"e_1_3_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00640"},{"key":"e_1_3_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00277"},{"key":"e_1_3_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58580-8_26"},{"key":"e_1_3_2_2_32_1","volume-title":"Proceedings of the European Conference on Computer Vision. 247--263","author":"Tian Yapeng","year":"2018","unstructured":"Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. 2018. Audiovisual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision. 247--263."},{"key":"e_1_3_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00675"},{"key":"e_1_3_2_2_34_1","unstructured":"
Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. Proceedings of the Advances in Neural Information Processing Systems 30."},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3142420"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00138"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00639"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413581"},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCYB.2020.3009004"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2020.3045530"},{"key":"e_1_3_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2020.2986029"},{"key":"e_1_3_2_2_42_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence. 279--286","author":"Xuan Hanyu","year":"2020","unstructured":"Hanyu Xuan, Zhenyu Zhang, Shuo Chen, Jian Yang, and Yan Yan. 2020. Crossmodal attention network for temporal inconsistent audio-visual event localization. In Proceedings of the AAAI Conference on Artificial Intelligence. 279--286."},{"key":"e_1_3_2_2_43_1","volume-title":"Self-supervised contrastive cross-modality representation learning for spoken question answering. arXiv preprint arXiv:2109.03381","author":"You Chenyu","year":"2021","unstructured":"Chenyu You, Nuo Chen, and Yuexian Zou. 2021. 
Self-supervised contrastive cross-modality representation learning for spoken question answering. arXiv preprint arXiv:2109.03381 (2021)."},{"key":"e_1_3_2_2_44_1","volume-title":"MMPyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing. arXiv preprint arXiv:2111.12374","author":"Yu Jiashuo","year":"2021","unstructured":"Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, and Yuejie Zhang. 2021. MMPyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing. arXiv preprint arXiv:2111.12374 (2021)."},{"key":"e_1_3_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i4.16447"},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00182"},{"key":"e_1_3_2_2_47_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00833"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Lisboa Portugal","acronym":"MM '22"},"container-title":["Proceedings of the 30th ACM International Conference on 
Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548309","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548309","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:43Z","timestamp":1750186843000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548309"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":47,"alternative-id":["10.1145\/3503161.3548309","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3548309","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}