{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,28]],"date-time":"2026-03-28T06:24:28Z","timestamp":1774679068810,"version":"3.50.1"},"reference-count":53,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2016,10,25]],"date-time":"2016-10-25T00:00:00Z","timestamp":1477353600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National High-Tech R8D Program of China","award":["2015AA015905"],"award-info":[{"award-number":["2015AA015905"]}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61422112, 61371146, 61521062 and 61527804"],"award-info":[{"award-number":["61422112, 61371146, 61521062 and 61527804"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2017,2,28]]},"abstract":"<jats:p>In this article, we propose to predict human eye fixation through incorporating both audio and visual cues. Traditional visual attention models generally make the utmost of stimuli\u2019s visual features, yet they bypass all audio information. In the real world, however, we not only direct our gaze according to visual saliency, but also are attracted by salient audio cues. Psychological experiments show that audio has an influence on visual attention, and subjects tend to be attracted by the sound sources. Therefore, we propose fusing both audio and visual information to predict eye fixation. In our proposed framework, we first localize the moving--sound-generating objects through multimodal analysis and generate an audio attention map. 
Then, we calculate the spatial and temporal attention maps using the visual modality. Finally, the audio, spatial, and temporal attention maps are fused to generate the final audiovisual saliency map. The proposed method is applicable to scenes containing moving, sound-generating objects. We gather a set of video sequences and collect eye-tracking data under an audiovisual test condition. Experimental results show that we achieve better eye fixation prediction performance when taking both audio and visual cues into consideration, especially in typical scenes in which object motion and audio are highly correlated.<\/jats:p>","DOI":"10.1145\/2996463","type":"journal-article","created":{"date-parts":[[2016,10,26]],"date-time":"2016-10-26T13:20:01Z","timestamp":1477488001000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":98,"title":["Fixation Prediction through Multimodal Analysis"],"prefix":"10.1145","volume":"13","author":[{"given":"Xiongkuo","family":"Min","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"given":"Guangtao","family":"Zhai","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"given":"Ke","family":"Gu","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"given":"Xiaokang","family":"Yang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, 
China"}]}],"member":"320","published-online":{"date-parts":[[2016,10,25]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206596"},{"key":"e_1_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.28"},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/1814433.1814468"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2007.383344"},{"key":"e_1_2_1_5_1","unstructured":"Ali Borji Ming-Ming Cheng Huaizu Jiang and Jia Li. 2014. Salient object detection: A survey. ArXiv Preprint. Ali Borji Ming-Ming Cheng Huaizu Jiang and Jia Li. 2014. Salient object detection: A survey. ArXiv Preprint."},{"key":"e_1_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.89"},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2012.2210727"},{"key":"e_1_2_1_8_1","volume-title":"Retrieved","author":"Bylinskii Zoya","year":"2012"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1167\/9.12.10"},{"key":"e_1_2_1_10_1","unstructured":"Moran Cerf Jonathan Harel Wolfgang Einh\u00e4user and Christof Koch. 2008. Predicting human gaze using low-level saliency combined with face detection. In Advances in Neural Information Processing Systems. 241--248. Moran Cerf Jonathan Harel Wolfgang Einh\u00e4user and Christof Koch. 2008. Predicting human gaze using low-level saliency combined with face detection. In Advances in Neural Information Processing Systems. 
241--248."},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2010.5651381"},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2014.2329380"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2011.5995344"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.414"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/WIAMIS.2013.6616164"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1167\/14.8.5"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1167\/13.4.11"},{"key":"e_1_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2013.2267205"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2015.2413944"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2014.2372392"},{"key":"e_1_2_1_21_1","volume-title":"Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition. 1--8.","author":"Guo Chenlei","year":"2008"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1162\/0899766042321814"},{"key":"e_1_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Jonathan Harel Christof Koch and Pietro Perona. 2006. Graph-based visual saliency. In Advances in Neural Information Processing Systems. 545--552. Jonathan Harel Christof Koch and Pietro Perona. 2006. Graph-based visual saliency. In Advances in Neural Information Processing Systems. 545--552.","DOI":"10.7551\/mitpress\/7503.003.0073"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2007.383267"},{"key":"e_1_2_1_25_1","unstructured":"Xiaodi Hou and Liqing Zhang. 2009. Dynamic visual attention: Searching for coding length increments. In Advances in Neural Information Processing Systems. 681--688. Xiaodi Hou and Liqing Zhang. 2009. Dynamic visual attention: Searching for coding length increments. In Advances in Neural Information Processing Systems. 
681--688."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2004.834657"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.730558"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2012.2228476"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1037\/h0061495"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2009.5459462"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2005.274"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/TSP.2006.888095"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2015.2425544"},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSTSP.2011.2165199"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.147"},{"key":"e_1_2_1_36_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654936"},{"key":"e_1_2_1_37_1","unstructured":"Ce Liu. 2009. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. Ph.D. Dissertation. Citeseer. Ce Liu. 2009. Beyond Pixels: Exploring New Representations and Applications for Motion Analysis. Ph.D. Dissertation. Citeseer."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2005.854410"},{"key":"e_1_2_1_39_1","volume-title":"Proceedings of IEEE International Workshop on Quality of Multimedia Experience. 
153--158","author":"Min Xiongkuo","year":"2014"},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2014.2305632"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-74048-3"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2007.4379119"},{"key":"e_1_2_1_43_1","volume-title":"Strybel","author":"Perrott David R.","year":"1990"},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.visres.2005.03.019"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1167\/9.12.15"},{"key":"e_1_2_1_46_1","unstructured":"G. W. Snecdecor and W. G. Cochran. 1989. Statistical Methods (8th ed.). Iowa State University Press Iowa City IA. G. W. Snecdecor and W. G. Cochran. 1989. Statistical Methods (8th ed.). Iowa State University Press Iowa City IA."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.16910\/jemr.6.4.1"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1037\/0096-1523.26.5.1583"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33783-3_45"},{"key":"e_1_2_1_50_1","doi-asserted-by":"publisher","DOI":"10.1037\/0096-1523.16.1.121"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/1180639.1180824"},{"key":"e_1_2_1_52_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.26"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","DOI":"10.1167\/8.7.32"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and 
Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2996463","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/2996463","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:23:11Z","timestamp":1750220591000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/2996463"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,10,25]]},"references-count":53,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2017,2,28]]}},"alternative-id":["10.1145\/2996463"],"URL":"https:\/\/doi.org\/10.1145\/2996463","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,10,25]]},"assertion":[{"value":"2015-11-01","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-08-01","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2016-10-25","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}