{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,7,27]],"date-time":"2025-07-27T07:22:59Z","timestamp":1753600979860,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":55,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,11,7]],"date-time":"2022-11-07T00:00:00Z","timestamp":1667779200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,11,7]]},"DOI":"10.1145\/3536221.3556625","type":"proceedings-article","created":{"date-parts":[[2022,11,4]],"date-time":"2022-11-04T15:54:14Z","timestamp":1667577254000},"page":"48-56","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Does Audio help in deep Audio-Visual Saliency prediction models?"],"prefix":"10.1145","author":[{"given":"Ritvik","family":"Agrawal","sequence":"first","affiliation":[{"name":"CVIT, KCIS, International Institute for Information Technology, Hyderabad, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Shreyank","family":"Jyoti","sequence":"additional","affiliation":[{"name":"CVIT, KCIS, International Institute for Information Technology, Hyderabad, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Rohit","family":"Girmaji","sequence":"additional","affiliation":[{"name":"CVIT, KCIS, International Institute for Information Technology, Hyderabad, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sarath","family":"Sivaprasad","sequence":"additional","affiliation":[{"name":"CVIT, KCIS, International Institute for Information Technology, Hyderabad, India and TCS Research, Pune, India"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Vineet","family":"Gandhi","sequence":"additional","affiliation":[{"name":"CVIT, KCIS, International Institute for Information Technology, Hyderabad, India"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,11,7]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Soundnet: Learning sound representations from unlabeled video. Advances in neural information processing systems","author":"Aytar Yusuf","year":"2016","unstructured":"Yusuf Aytar , Carl Vondrick , and Antonio Torralba . 2016 . Soundnet: Learning sound representations from unlabeled video. Advances in neural information processing systems (2016). Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. Soundnet: Learning sound representations from unlabeled video. Advances in neural information processing systems (2016)."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ROBOT.2008.4543572"},{"key":"e_1_3_2_1_3_1","volume-title":"What do different evaluation metrics tell us about saliency models?IEEE transactions on pattern analysis and machine intelligence","author":"Bylinskii Zoya","year":"2018","unstructured":"Zoya Bylinskii , Tilke Judd , Aude Oliva , Antonio Torralba , and Fr\u00e9do Durand . 2018. What do different evaluation metrics tell us about saliency models?IEEE transactions on pattern analysis and machine intelligence ( 2018 ). Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and Fr\u00e9do Durand. 2018. What do different evaluation metrics tell us about saliency models?IEEE transactions on pattern analysis and machine intelligence (2018)."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_2_1_5_1","volume-title":"Audiovisual saliency prediction via deep learning. Neurocomputing","author":"Chen Jiazhong","year":"2021","unstructured":"Jiazhong Chen , Qingqing Li , Hefei Ling , Dakai Ren , and Ping Duan . 2021. Audiovisual saliency prediction via deep learning. Neurocomputing ( 2021 ). Jiazhong Chen, Qingqing Li, Hefei Ling, Dakai Ren, and Ping Duan. 2021. Audiovisual saliency prediction via deep learning. Neurocomputing (2021)."},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2014.2329380"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.21437\/Interspeech.2018-1929"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.23919\/MVA51890.2021.9511406"},{"key":"e_1_3_2_1_9_1","volume-title":"How saliency, faces, and sound influence gaze in dynamic social scenes. Journal of vision","author":"Coutrot Antoine","year":"2014","unstructured":"Antoine Coutrot and Nathalie Guyader . 2014. How saliency, faces, and sound influence gaze in dynamic social scenes. Journal of vision ( 2014 ). Antoine Coutrot and Nathalie Guyader. 2014. How saliency, faces, and sound influence gaze in dynamic social scenes. Journal of vision (2014)."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/EUSIPCO.2015.7362640"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"crossref","unstructured":"Antoine Coutrot Nathalie Guyader Gelu Ionescu and Alice Caplier. 2012. Influence of soundtrack on eye movements during video exploration. Journal of Eye Movement Research(2012).  Antoine Coutrot Nathalie Guyader Gelu Ionescu and Alice Caplier. 2012. Influence of soundtrack on eye movements during video exploration. Journal of Eye Movement Research(2012).","DOI":"10.16910\/jemr.5.4.2"},{"key":"e_1_3_2_1_12_1","volume-title":"Video viewing: do auditory salient events capture visual attention?annals of telecommunications-annales des t\u00e9l\u00e9communications","author":"Coutrot Antoine","year":"2014","unstructured":"Antoine Coutrot , Nathalie Guyader , Gelu Ionescu , and Alice Caplier . 2014. Video viewing: do auditory salient events capture visual attention?annals of telecommunications-annales des t\u00e9l\u00e9communications ( 2014 ). Antoine Coutrot, Nathalie Guyader, Gelu Ionescu, and Alice Caplier. 2014. Video viewing: do auditory salient events capture visual attention?annals of telecommunications-annales des t\u00e9l\u00e9communications (2014)."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58558-7_25"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/TAMD.2014.2303072"},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10584-0_33"},{"key":"e_1_3_2_1_16_1","volume-title":"Saliency-aware video compression","author":"Hadizadeh Hadi","year":"2013","unstructured":"Hadi Hadizadeh and Ivan\u00a0 V Baji\u0107 . 2013. Saliency-aware video compression . IEEE Transactions on Image Processing( 2013 ). Hadi Hadizadeh and Ivan\u00a0V Baji\u0107. 2013. Saliency-aware video compression. IEEE Transactions on Image Processing(2013)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.730558"},{"key":"e_1_3_2_1_18_1","volume-title":"2021 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS).","author":"Jain Samyak","year":"2020","unstructured":"Samyak Jain , Pradeep Yarlagadda , Shreyank Jyoti , Shyamgopal Karthik , Ramanathan Subramanian , and Vineet Gandhi . 2020 . Vinet: Pushing the limits of visual modality for audio-visual saliency prediction . In 2021 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS). Samyak Jain, Pradeep Yarlagadda, Shreyank Jyoti, Shyamgopal Karthik, Ramanathan Subramanian, and Vineet Gandhi. 2020. Vinet: Pushing the limits of visual modality for audio-visual saliency prediction. In 2021 IEEE\/RSJ International Conference on Intelligent Robots and Systems (IROS)."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15552-9_46"},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"crossref","unstructured":"Petros Koutras and Petros Maragos. 2015. A perceptually based spatio-temporal computational framework for visual saliency estimation. Signal Processing: Image Communication(2015).  Petros Koutras and Petros Maragos. 2015. A perceptually based spatio-temporal computational framework for visual saliency estimation. Signal Processing: Image Communication(2015).","DOI":"10.1016\/j.image.2015.08.004"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW.2019.00109"},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58565-5_25"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Matei Mancas Vincent\u00a0P Ferrera Nicolas Riche and John\u00a0G Taylor. 2016. From Human Attention to Computational Attention.  Matei Mancas Vincent\u00a0P Ferrera Nicolas Riche and John\u00a0G Taylor. 2016. From Human Attention to Computational Attention.","DOI":"10.1007\/978-1-4939-3435-5"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2017.8296592"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206557"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2017.327"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00248"},{"key":"e_1_3_2_1_28_1","volume-title":"2014 Sixth International Workshop on Quality of Multimedia Experience (QoMEX).","author":"Min Xiongkuo","year":"2014","unstructured":"Xiongkuo Min , Guangtao Zhai , Zhongpai Gao , Chunjia Hu , and Xiaokang Yang . 2014 . Sound influences visual attention discriminately in videos . In 2014 Sixth International Workshop on Quality of Multimedia Experience (QoMEX). Xiongkuo Min, Guangtao Zhai, Zhongpai Gao, Chunjia Hu, and Xiaokang Yang. 2014. Sound influences visual attention discriminately in videos. In 2014 Sixth International Workshop on Quality of Multimedia Experience (QoMEX)."},{"key":"e_1_3_2_1_29_1","volume-title":"Fixation prediction through multimodal analysis. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)","author":"Min Xiongkuo","year":"2016","unstructured":"Xiongkuo Min , Guangtao Zhai , Ke Gu , and Xiaokang Yang . 2016. Fixation prediction through multimodal analysis. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) ( 2016 ). Xiongkuo Min, Guangtao Zhai, Ke Gu, and Xiaokang Yang. 2016. Fixation prediction through multimodal analysis. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) (2016)."},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.2966082"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"crossref","unstructured":"Parag\u00a0K Mital Tim\u00a0J Smith Robin\u00a0L Hill and John\u00a0M Henderson. 2011. Clustering of gaze during dynamic scene viewing is predicted by motion. Cognitive computation(2011).  Parag\u00a0K Mital Tim\u00a0J Smith Robin\u00a0L Hill and John\u00a0M Henderson. 2011. Clustering of gaze during dynamic scene viewing is predicted by motion. Cognitive computation(2011).","DOI":"10.1007\/s12559-010-9074-z"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/3313831.3376544"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01229"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2502081.2502128"},{"key":"e_1_3_2_1_35_1","volume-title":"Sounds can boost the awareness of visual events through attention without cross-modal integration. Scientific reports","author":"P\u00e1pai M\u00e1rta\u00a0Szabina","year":"2017","unstructured":"M\u00e1rta\u00a0Szabina P\u00e1pai and Salvador Soto-Faraco . 2017. Sounds can boost the awareness of visual events through attention without cross-modal integration. Scientific reports ( 2017 ). M\u00e1rta\u00a0Szabina P\u00e1pai and Salvador Soto-Faraco. 2017. Sounds can boost the awareness of visual events through attention without cross-modal integration. Scientific reports (2017)."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"crossref","unstructured":"David\u00a0R Perrott Kourosh Saberi Kathleen Brown and Thomas\u00a0Z Strybel. 1990. Auditory psychomotor coordination and visual search performance. Perception & psychophysics(1990).  David\u00a0R Perrott Kourosh Saberi Kathleen Brown and Thomas\u00a0Z Strybel. 1990. Auditory psychomotor coordination and visual search performance. Perception & psychophysics(1990).","DOI":"10.3758\/BF03211521"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/WACV51458.2022.00024"},{"key":"e_1_3_2_1_38_1","unstructured":"Minglang Qiao Yufan Liu Mai Xu Xin Deng Bing Li Weiming Hu and Ali Borji. 2021. Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos. arXiv preprint arXiv:2111.08567(2021).  Minglang Qiao Yufan Liu Mai Xu Xin Deng Bing Li Weiming Hu and Ali Borji. 2021. Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos. arXiv preprint arXiv:2111.08567(2021)."},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2008.4587727"},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-24574-4_28"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"crossref","unstructured":"Guido Schillaci Sa\u0161a Bodiro\u017ea and Verena\u00a0Vanessa Hafner. 2013. Evaluating the effect of saliency detection and attention manipulation in human-robot interaction. International Journal of Social Robotics(2013).  Guido Schillaci Sa\u0161a Bodiro\u017ea and Verena\u00a0Vanessa Hafner. 2013. Evaluating the effect of saliency detection and attention manipulation in human-robot interaction. International Journal of Social Robotics(2013).","DOI":"10.1007\/s12369-012-0174-7"},{"key":"e_1_3_2_1_42_1","volume-title":"Saliency in VR: How do people explore virtual environments?IEEE transactions on visualization and computer graphics","author":"Sitzmann Vincent","year":"2018","unstructured":"Vincent Sitzmann , Ana Serrano , Amy Pavel , Maneesh Agrawala , Diego Gutierrez , Belen Masia , and Gordon Wetzstein . 2018. Saliency in VR: How do people explore virtual environments?IEEE transactions on visualization and computer graphics ( 2018 ). Vincent Sitzmann, Ana Serrano, Amy Pavel, Maneesh Agrawala, Diego Gutierrez, Belen Masia, and Gordon Wetzstein. 2018. Saliency in VR: How do people explore virtual environments?IEEE transactions on visualization and computer graphics (2018)."},{"key":"e_1_3_2_1_43_1","volume-title":"2011 19th European Signal Processing Conference.","author":"Song Guanghan","year":"2011","unstructured":"Guanghan Song , Denis Pellerin , and Lionel Granjon . 2011 . Sound effect on visual gaze when looking at videos . In 2011 19th European Signal Processing Conference. Guanghan Song, Denis Pellerin, and Lionel Granjon. 2011. Sound effect on visual gaze when looking at videos. In 2011 19th European Signal Processing Conference."},{"key":"e_1_3_2_1_44_1","doi-asserted-by":"crossref","unstructured":"Guanghan Song Denis Pellerin and Lionel Granjon. 2013. Different types of sounds influence gaze differently in videos. Journal of Eye Movement Research(2013).  Guanghan Song Denis Pellerin and Lionel Granjon. 2013. Different types of sounds influence gaze differently in videos. Journal of Eye Movement Research(2013).","DOI":"10.16910\/jemr.6.4.1"},{"key":"e_1_3_2_1_45_1","unstructured":"Nitish Srivastava Geoffrey Hinton Alex Krizhevsky Ilya Sutskever and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research(2014).  Nitish Srivastava Geoffrey Hinton Alex Krizhevsky Ilya Sutskever and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research(2014)."},{"key":"e_1_3_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475587"},{"key":"e_1_3_2_1_47_1","volume-title":"Dave: A deep audio-visual embedding for dynamic saliency prediction. arXiv preprint arXiv:1905.10693(2019).","author":"Tavakoli R","year":"2019","unstructured":"Hamed\u00a0 R Tavakoli , Ali Borji , Esa Rahtu , and Juho Kannala . 2019 . Dave: A deep audio-visual embedding for dynamic saliency prediction. arXiv preprint arXiv:1905.10693(2019). Hamed\u00a0R Tavakoli, Ali Borji, Esa Rahtu, and Juho Kannala. 2019. Dave: A deep audio-visual embedding for dynamic saliency prediction. arXiv preprint arXiv:1905.10693(2019)."},{"key":"e_1_3_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP.2016.7472197"},{"key":"e_1_3_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00482"},{"key":"e_1_3_2_1_50_1","volume-title":"Sound enhances visual perception: cross-modal effects of auditory organization on vision.Journal of experimental psychology: Human perception and performance","author":"Vroomen Jean","year":"2000","unstructured":"Jean Vroomen and Beatrice\u00a0de Gelder . 2000. Sound enhances visual perception: cross-modal effects of auditory organization on vision.Journal of experimental psychology: Human perception and performance ( 2000 ). Jean Vroomen and Beatrice\u00a0de Gelder. 2000. Sound enhances visual perception: cross-modal effects of auditory organization on vision.Journal of experimental psychology: Human perception and performance (2000)."},{"key":"e_1_3_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00514"},{"key":"e_1_3_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6927"},{"key":"e_1_3_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01267-0_19"},{"key":"e_1_3_2_1_54_1","volume-title":"Hubert Konik, and Alain Tr\u00e9meau.","author":"Yubing Tong","year":"2011","unstructured":"Tong Yubing , Faouzi\u00a0Alaya Cheikh , Fahad Fazal\u00a0Elahi Guraya , Hubert Konik, and Alain Tr\u00e9meau. 2011 . A spatiotemporal saliency model for video surveillance. Cognitive Computation( 2011). Tong Yubing, Faouzi\u00a0Alaya Cheikh, Fahad Fazal\u00a0Elahi Guraya, Hubert Konik, and Alain Tr\u00e9meau. 2011. A spatiotemporal saliency model for video surveillance. Cognitive Computation(2011)."},{"key":"e_1_3_2_1_55_1","volume-title":"Lavs: A Lightweight Audio-Visual Saliency Prediction Model. In 2021 IEEE International Conference on Multimedia and Expo (ICME).","author":"Zhu Dandan","year":"2021","unstructured":"Dandan Zhu , Defang Zhao , Xiongkuo Min , Tian Han , Qiangqiang Zhou , Shaobo Yu , Yongqing Chen , Guangtao Zhai , and Xiaokang Yang . 2021 . Lavs: A Lightweight Audio-Visual Saliency Prediction Model. In 2021 IEEE International Conference on Multimedia and Expo (ICME). Dandan Zhu, Defang Zhao, Xiongkuo Min, Tian Han, Qiangqiang Zhou, Shaobo Yu, Yongqing Chen, Guangtao Zhai, and Xiaokang Yang. 2021. Lavs: A Lightweight Audio-Visual Saliency Prediction Model. In 2021 IEEE International Conference on Multimedia and Expo (ICME)."}],"event":{"name":"ICMI '22: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION","sponsor":["SIGCHI ACM Special Interest Group on Computer-Human Interaction"],"location":"Bengaluru India","acronym":"ICMI '22"},"container-title":["Proceedings of the 2022 International Conference on Multimodal Interaction"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3536221.3556625","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3536221.3556625","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:48:53Z","timestamp":1750182533000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3536221.3556625"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,11,7]]},"references-count":55,"alternative-id":["10.1145\/3536221.3556625","10.1145\/3536221"],"URL":"https:\/\/doi.org\/10.1145\/3536221.3556625","relation":{},"subject":[],"published":{"date-parts":[[2022,11,7]]},"assertion":[{"value":"2022-11-07","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}