{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,5]],"date-time":"2026-05-05T09:08:46Z","timestamp":1777972126163,"version":"3.51.4"},"reference-count":89,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2023,10,18]],"date-time":"2023-10-18T00:00:00Z","timestamp":1697587200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,10,18]],"date-time":"2023-10-18T00:00:00Z","timestamp":1697587200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["R01MH114999"],"award-info":[{"award-number":["R01MH114999"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Comput Vis"],"published-print":{"date-parts":[[2024,3]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Predicting human\u2019s gaze from egocentric videos serves as a critical role for human intention understanding in daily activities. In this paper, we present the first transformer-based model to address the challenging problem of egocentric gaze estimation. We observe that the connection between the global scene context and local visual information is vital for localizing the gaze fixation from egocentric video frames. To this end, we design the transformer encoder to embed the global context as one additional visual token and further propose a novel global\u2013local correlation module to explicitly model the correlation of the global token and each local token. We validate our model on two egocentric video datasets \u2013 EGTEA Gaze + and Ego4D. Our detailed ablation studies demonstrate the benefits of our method. In addition, our approach exceeds the previous state-of-the-art model by a large margin. We also apply our model to a novel gaze saccade\/fixation prediction task and the traditional action recognition problem. The consistent gains suggest the strong generalization capability of our model. We also provide additional visualizations to support our claim that global\u2013local correlation serves a key representation for predicting gaze fixation from egocentric videos. More details can be found in our website (<jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/bolinlai.github.io\/GLC-EgoGazeEst\">https:\/\/bolinlai.github.io\/GLC-EgoGazeEst<\/jats:ext-link>).<\/jats:p>","DOI":"10.1007\/s11263-023-01879-7","type":"journal-article","created":{"date-parts":[[2023,10,18]],"date-time":"2023-10-18T08:02:39Z","timestamp":1697616159000},"page":"854-871","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":30,"title":["In the Eye of Transformer: Global\u2013Local Correlation for Egocentric Gaze Estimation and Beyond"],"prefix":"10.1007","volume":"132","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7578-7336","authenticated-orcid":false,"given":"Bolin","family":"Lai","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Miao","family":"Liu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fiona","family":"Ryan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"James M.","family":"Rehg","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2023,10,18]]},"reference":[{"key":"1879_CR1","doi-asserted-by":"crossref","unstructured":"Al-Naser, M., Siddiqui, S.A., Ohashi, H., Ahmed, S., Katsuyki, N., Takuto, S., & Dengel, A. (2019). Ogaze: Gaze prediction in egocentric videos for attentional object selection. 2019 digital image computing: Techniques and applications (dicta) (pp. 1\u20138).","DOI":"10.1109\/DICTA47822.2019.8945893"},{"key":"1879_CR2","doi-asserted-by":"crossref","unstructured":"Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lu\u010di\u0107, M., & Schmid, C. (2021). Vivit: A video vision transformer. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 6836\u20136846).","DOI":"10.1109\/ICCV48922.2021.00676"},{"issue":"12","key":"1879_CR3","doi-asserted-by":"publisher","first-page":"3216","DOI":"10.1007\/s11263-021-01519-y","volume":"129","author":"G Bellitto","year":"2021","unstructured":"Bellitto, G., Proietto Salanitri, F., Palazzo, S., Rundo, F., Giordano, D., & Spampinato, C. (2021). Hierarchical domain-adapted feature learning for video saliency prediction. International Journal of Computer Vision, 129(12), 3216\u20133232.","journal-title":"International Journal of Computer Vision"},{"key":"1879_CR4","unstructured":"Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding?. In International Conference on Machine Learning."},{"key":"1879_CR5","first-page":"1877","volume":"33","author":"T Brown","year":"2020","unstructured":"Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., & Dhariwal, P. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877\u20131901.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"1879_CR6","doi-asserted-by":"crossref","unstructured":"Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European Conference on Computer Vision (pp. 213\u2013229).","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"1879_CR7","doi-asserted-by":"publisher","first-page":"2287","DOI":"10.1109\/TIP.2019.2945857","volume":"29","author":"Z Che","year":"2019","unstructured":"Che, Z., Borji, A., Zhai, G., Min, X., Guo, G., & Le Callet, P. (2019). How is gaze influenced by image transformations? dataset and model. IEEE Transactions on Image Processing, 29, 2287\u20132300.","journal-title":"IEEE Transactions on Image Processing"},{"key":"1879_CR8","doi-asserted-by":"crossref","unstructured":"Chen, B., Li, P., Li, C., Li, B., Bai, L., Lin, C., & Ouyang, W. (2021). Glit: Neural architecture search for global and local image transformer. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 12\u201321).","DOI":"10.1109\/ICCV48922.2021.00008"},{"key":"1879_CR9","doi-asserted-by":"publisher","first-page":"59","DOI":"10.1016\/j.neucom.2021.07.088","volume":"462","author":"J Chen","year":"2021","unstructured":"Chen, J., Li, Z., Jin, Y., Ren, D., & Ling, H. (2021). Video saliency prediction via spatio-temporal reasoning. Neurocomputing, 462, 59\u201368.","journal-title":"Neurocomputing"},{"key":"1879_CR10","doi-asserted-by":"crossref","unstructured":"Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., & Girdhar, R. (2022). Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 1290\u20131299).","DOI":"10.1109\/CVPR52688.2022.00135"},{"key":"1879_CR11","doi-asserted-by":"crossref","unstructured":"Chong, E., Ruiz, N., Wang, Y., Zhang, Y., Rozga, A., & Rehg, J.M. (2018). Connecting gaze, scene, and attention: Generalized attention estimation via joint modeling of gaze and scene saliency. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 383\u2013398).","DOI":"10.1007\/978-3-030-01228-1_24"},{"key":"1879_CR12","doi-asserted-by":"crossref","unstructured":"Chong, E., Wang, Y., Ruiz, N., & Rehg, J.M. (2020). Detecting attended visual targets in video. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 5396\u20135406).","DOI":"10.1109\/CVPR42600.2020.00544"},{"key":"1879_CR13","doi-asserted-by":"crossref","unstructured":"Dai, X., Chen, Y., Yang, J., Zhang, P., Yuan, L., & Zhang, L. (2021). Dynamic detr: End-to-end object detection with dynamic attention. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 2988\u20132997).","DOI":"10.1109\/ICCV48922.2021.00298"},{"key":"1879_CR14","doi-asserted-by":"crossref","unstructured":"Dai, Z., Cai, B., Lin, Y., & Chen, J. (2021). Up-detr: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 1601\u20131610).","DOI":"10.1109\/CVPR46437.2021.00165"},{"key":"1879_CR15","first-page":"3965","volume":"34","author":"Z Dai","year":"2021","unstructured":"Dai, Z., Liu, H., Le, Q. V., & Tan, M. (2021). Coatnet: Marrying convolution and attention for all data sizes. Advances in Neural Information Processing Systems, 34, 3965\u20133977.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"1879_CR16","unstructured":"Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805"},{"key":"1879_CR17","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., et al. (2022). An image is worth 16x16 words: Transformers for image recognition at scale. Iclr."},{"key":"1879_CR18","doi-asserted-by":"crossref","unstructured":"Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., & Feichtenhofer, C. (2021). Multiscale vision transformers. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 6824\u20136835).","DOI":"10.1109\/ICCV48922.2021.00675"},{"key":"1879_CR19","first-page":"26183","volume":"34","author":"Y Fang","year":"2021","unstructured":"Fang, Y., Liao, B., Wang, X., Fang, J., Qi, J., Wu, R., & Liu, W. (2021). You only look at one sequence: Rethinking transformer in vision through object detection. Advances in Neural Information Processing Systems, 34, 26183\u201397.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"1879_CR20","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). Slowfast networks for video recognition. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 6202\u20136211).","DOI":"10.1109\/ICCV.2019.00630"},{"key":"1879_CR21","unstructured":"Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249\u2013256)."},{"key":"1879_CR22","unstructured":"Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., et al. (2022). Ego4d: Around the world in 3000 hours of egocentric video. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 18995\u201319012)."},{"key":"1879_CR23","doi-asserted-by":"crossref","unstructured":"Hao, Y., Zhang, H., Ngo, C.-W., & He, X. (2022). Group contextualization for video recognition. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 928\u2013938).","DOI":"10.1109\/CVPR52688.2022.00100"},{"key":"1879_CR24","doi-asserted-by":"crossref","unstructured":"Harel, J., Koch, C. & Perona, P. (2006). Graph-based visual saliency. Advances in neural information processing systems. 19.","DOI":"10.7551\/mitpress\/7503.003.0073"},{"issue":"4","key":"1879_CR25","doi-asserted-by":"publisher","first-page":"188","DOI":"10.1016\/j.tics.2005.02.009","volume":"9","author":"M Hayhoe","year":"2005","unstructured":"Hayhoe, M., & Ballard, D. (2005). Eye movements in natural behavior. Trends in Cognitive Sciences, 9(4), 188\u2013194.","journal-title":"Trends in Cognitive Sciences"},{"key":"1879_CR26","doi-asserted-by":"publisher","first-page":"7795","DOI":"10.1109\/TIP.2020.3007841","volume":"29","author":"Y Huang","year":"2020","unstructured":"Huang, Y., Cai, M., Li, Z., Lu, F., & Sato, Y. (2020). Mutual context network for jointly estimating egocentric gaze and action. IEEE Transactions on Image Processing, 29, 7795\u20137806.","journal-title":"IEEE Transactions on Image Processing"},{"key":"1879_CR27","doi-asserted-by":"crossref","unstructured":"Huang, Y., Cai, M., Li, Z. & Sato, Y. (2018). Predicting gaze in egocentric video by learning task-dependent attention transition. In Proceedings of the European conference on computer vision (eccv) (pp. 754\u2013769).","DOI":"10.1007\/978-3-030-01225-0_46"},{"issue":"4","key":"1879_CR28","doi-asserted-by":"publisher","first-page":"306","DOI":"10.1109\/THMS.2020.2965429","volume":"50","author":"Y Huang","year":"2020","unstructured":"Huang, Y., Cai, M., & Sato, Y. (2020). An ego-vision system for discovering human joint attention. IEEE Transactions on Human-Machine Systems, 50(4), 306\u2013316.","journal-title":"IEEE Transactions on Human-Machine Systems"},{"key":"1879_CR29","doi-asserted-by":"crossref","unstructured":"Hussain, T., Anwar, A., Anwar, S., Petersson, L., & Baik, S.W. (2022). Pyramidal attention for saliency detection. In 2022 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 2877\u20132887).","DOI":"10.1109\/CVPRW56347.2022.00325"},{"key":"1879_CR30","doi-asserted-by":"publisher","first-page":"103887","DOI":"10.1016\/j.imavis.2020.103887","volume":"95","author":"S Jia","year":"2020","unstructured":"Jia, S., & Bruce, N. D. (2020). Eml-net: An expandable multi-layer network for saliency prediction. Image and Vision Computing, 95, 103887.","journal-title":"Image and Vision Computing"},{"key":"1879_CR31","doi-asserted-by":"crossref","unstructured":"Jia, W., Liu, M. & Rehg, J.M. (2022). Generative adversarial network for future hand segmentation from egocentric video. In Proceedings of the European Conference on Computer Vision (ECCV).","DOI":"10.1007\/978-3-031-19778-9_37"},{"key":"1879_CR32","doi-asserted-by":"crossref","unstructured":"Jiang, L., Li, Y., Li, S., Xu, M., Lei, S., Guo, Y. & Huang, B. (2022). Does text attract attention on e-commerce images: A novel saliency prediction dataset and method. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 2088\u20132097).","DOI":"10.1109\/CVPR52688.2022.00213"},{"key":"1879_CR33","unstructured":"Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., et al. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950"},{"key":"1879_CR34","doi-asserted-by":"crossref","unstructured":"Kellnhofer, P., Recasens, A., Stent, S., Matusik, W. & Torralba, A. (2019). Gaze360: Physically unconstrained gaze estimation in the wild. In IEEE International Conference on Computer Vision (ICCV).","DOI":"10.1109\/ICCV.2019.00701"},{"key":"1879_CR35","doi-asserted-by":"crossref","unstructured":"Khattar, A., Hegde, S. & Hebbalaguppe, R. (2021). Cross-domain multi-task learning for object detection and saliency estimation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 3639\u20133648).","DOI":"10.1109\/CVPRW53098.2021.00403"},{"key":"1879_CR36","doi-asserted-by":"crossref","unstructured":"Krafka, K., Khosla, A., Kellnhofer, P., Kannan, H., Bhandarkar, S., Matusik, W., & Torralba, A. (2016). Eye tracking for everyone. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2176\u20132184).","DOI":"10.1109\/CVPR.2016.239"},{"key":"1879_CR37","doi-asserted-by":"publisher","first-page":"261","DOI":"10.1016\/j.neunet.2020.05.004","volume":"129","author":"A Kroner","year":"2020","unstructured":"Kroner, A., Senden, M., Driessens, K., & Goebel, R. (2020). Contextual encoder-decoder network for visual saliency prediction. Neural Networks, 129, 261\u2013270.","journal-title":"Neural Networks"},{"issue":"9","key":"1879_CR38","doi-asserted-by":"publisher","first-page":"4446","DOI":"10.1109\/TIP.2017.2710620","volume":"26","author":"SS Kruthiventi","year":"2017","unstructured":"Kruthiventi, S. S., Ayush, K., & Babu, R. V. (2017). Deepfix: A fully convolutional neural network for predicting human eye fixations. IEEE Transactions on Image Processing, 26(9), 4446\u20134456.","journal-title":"IEEE Transactions on Image Processing"},{"key":"1879_CR39","doi-asserted-by":"crossref","unstructured":"Lai, B., Liu, M., Ryan, F., & Rehg, J. (2022). In the eye of transformer: Global-local correlation for egocentric gaze estimation. In British Machine Vision Conference.","DOI":"10.1007\/s11263-023-01879-7"},{"key":"1879_CR40","doi-asserted-by":"crossref","unstructured":"Lee, Y., Kim, J., Willette, J., & Hwang, S.J. (2022). Mpvit: Multi-path vision transformer for dense prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.","DOI":"10.1109\/CVPR52688.2022.00714"},{"key":"1879_CR41","doi-asserted-by":"crossref","unstructured":"Li, Y., Fathi, A., & Rehg, J.M. (2013). Learning to predict gaze in egocentric video. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3216\u20133223).","DOI":"10.1109\/ICCV.2013.399"},{"key":"1879_CR42","unstructured":"Li, Y., Liu, M., & Rehg, J. (2021). In the eye of the beholder: Gaze and actions in first person video. In IEEE Transactions on Pattern Analysis and Machine Intelligence."},{"key":"1879_CR43","doi-asserted-by":"crossref","unstructured":"Li, Y., Liu, M., & Rehg, J.M. (2018). In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 619\u2013635).","DOI":"10.1007\/978-3-030-01228-1_38"},{"key":"1879_CR44","doi-asserted-by":"crossref","unstructured":"Li, Y., Wu, C.-Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., & Feichtenhofer, C. (2022). Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 4804\u20134814).","DOI":"10.1109\/CVPR52688.2022.00476"},{"key":"1879_CR45","doi-asserted-by":"crossref","unstructured":"Lin, S., Xie, H., Wang, B., Yu, K., Chang, X., Liang, X., & Wang, G. (2022). Knowledge distillation via the target-aware transformer. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 10915\u201310924).","DOI":"10.1109\/CVPR52688.2022.01064"},{"key":"1879_CR46","doi-asserted-by":"crossref","unstructured":"Liu, M., Ma, L., Somasundaram, K., Li, Y., Grauman, K., Rehg, J.M., & Li, C. (2022). Egocentric activity recognition and localization on a 3d map. In Proceedings of the European Conference on Computer Vision (ECCV).","DOI":"10.1007\/978-3-031-19778-9_36"},{"key":"1879_CR47","doi-asserted-by":"crossref","unstructured":"Liu, M., Tang, S., Li, Y., & Rehg, J.M. (2020). Forecasting human-object interaction: joint prediction of motor attention and actions in first person video. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 704\u2013721).","DOI":"10.1007\/978-3-030-58452-8_41"},{"key":"1879_CR48","doi-asserted-by":"crossref","unstructured":"Liu, N., Han, J., & Yang, M.-H. (2018). Picanet: Learning pixel-wise contextual attention for saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3089\u20133098).","DOI":"10.1109\/CVPR.2018.00326"},{"key":"1879_CR49","doi-asserted-by":"crossref","unstructured":"Liu, N., Nan, K., Zhao, W., Yao, X., & Han, J. (2023). Learning complementary spatial\u2013temporal transformer for video salient object detection. IEEE Transactions on Neural Networks and Learning Systems.","DOI":"10.1109\/TNNLS.2023.3243246"},{"key":"1879_CR50","doi-asserted-by":"crossref","unstructured":"Liu, N., Zhang, N., Wan, K., Shao, L., & Han, J. (2021). Visual saliency transformer. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 4722\u20134732).","DOI":"10.1109\/ICCV48922.2021.00468"},{"key":"1879_CR51","unstructured":"Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692"},{"key":"1879_CR52","doi-asserted-by":"crossref","unstructured":"Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 10012\u201310022).","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"1879_CR53","doi-asserted-by":"crossref","unstructured":"Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., & Hu, H. (2022). Video swin transformer. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 3202\u20133211).","DOI":"10.1109\/CVPR52688.2022.00320"},{"key":"1879_CR54","unstructured":"Loshchilov, I., & Hutter, F. (xxxx). Decoupled weight decay regularization. In International Conference on Learning Representations."},{"key":"1879_CR55","unstructured":"Loshchilov, I., & Hutter, F. (2016). Sgdr: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations."},{"key":"1879_CR56","unstructured":"Lou, J., Lin, H., Marshall, D., Saupe, D., & Liu, H. (2021). Transalnet: Visual saliency prediction using transformers. arXiv preprint arXiv:2110.03593"},{"key":"1879_CR57","doi-asserted-by":"crossref","unstructured":"Ma, C., Sun, H., Rao, Y., Zhou, J., & Lu, J. (2022). Video saliency forecasting transformer. In IEEE Transactions on Circuits and Systems for Video Technology.","DOI":"10.1109\/TCSVT.2022.3172971"},{"key":"1879_CR58","doi-asserted-by":"crossref","unstructured":"MacInnes, J.J., Iqbal, S., Pearson, J., & Johnson, E.N. (2018). Wearable eye-tracking for research: Automated dynamic gaze mapping and accuracy\/precision comparisons across devices. BioRxiv. 299925","DOI":"10.1101\/299925"},{"key":"1879_CR59","doi-asserted-by":"crossref","unstructured":"Naas, S.-A., Jiang, X., Sigg, S., & Ji, Y. (2020). Functional gaze prediction in egocentric video. In Proceedings of the 18th International Conference on Advances in Mobile Computing & Multimedia (pp. 40\u201347).","DOI":"10.1145\/3428690.3429174"},{"key":"1879_CR60","doi-asserted-by":"crossref","unstructured":"Neimark, D., Bar, O., Zohar, M., & Asselmann, D. (2021). Video transformer network. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 3163\u20133172).","DOI":"10.1109\/ICCVW54120.2021.00355"},{"key":"1879_CR61","doi-asserted-by":"crossref","unstructured":"Nonaka, S., Nobuhara, S., & Nishino, K. (2022). Dynamic 3d gaze from afar: Deep gaze estimation from temporal eye-head-body coordination. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (cvpr) (p.\u00a02192-2201).","DOI":"10.1109\/CVPR52688.2022.00223"},{"key":"1879_CR62","unstructured":"Pan, J., Ferrer, C.C., McGuinness, K., O\u2019Connor, N.E., Torres, J., Sayrol, E., & Giro-i Nieto, X. (2017). Salgan: Visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081"},{"key":"1879_CR63","first-page":"12493","volume":"34","author":"M Patrick","year":"2021","unstructured":"Patrick, M., Campbell, D., Asano, Y., Misra, I., Metze, F., Feichtenhofer, C., Henriques, J. F., et al. (2021). Keeping your eye on the ball: Trajectory attention in video transformers. Advances in Neural Information Processing Systems, 34, 12493\u201312506.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"1879_CR64","doi-asserted-by":"crossref","unstructured":"Ren, S., Zhou, D., He, S., Feng, J., & Wang, X. (2022). Shunted self-attention via multi-scale token aggregation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition.","DOI":"10.1109\/CVPR52688.2022.01058"},{"key":"1879_CR65","doi-asserted-by":"crossref","unstructured":"Soo\u00a0Park, H., & Shi, J. (2015). Social saliency prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4777\u20134785).","DOI":"10.1109\/CVPR.2015.7299110"},{"key":"1879_CR66","doi-asserted-by":"crossref","unstructured":"Strudel, R., Garcia, R., Laptev, I., & Schmid, C. (2021). Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 7262\u20137272).","DOI":"10.1109\/ICCV48922.2021.00717"},{"issue":"1","key":"1879_CR67","doi-asserted-by":"publisher","first-page":"131","DOI":"10.1007\/s00530-021-00796-4","volume":"28","author":"Y Sun","year":"2022","unstructured":"Sun, Y., Zhao, M., Hu, K., & Fan, S. (2022). Visual saliency prediction using multi-scale attention gated network. Multimedia Systems, 28(1), 131\u2013139.","journal-title":"Multimedia Systems"},{"key":"1879_CR68","doi-asserted-by":"crossref","unstructured":"Tavakoli, H.R., Rahtu, E., Kannala, J., & Borji, A. (2019). Digging deeper into egocentric gaze prediction. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 273\u2013282).","DOI":"10.1109\/WACV.2019.00035"},{"key":"1879_CR69","doi-asserted-by":"crossref","unstructured":"Thakur, S.K., Beyan, C., Morerio, P., & Del\u00a0Bue, A. (2021). Predicting gaze from egocentric social interaction videos and imu data. In Proceedings of the 2021 International Conference on Multimodal Interaction (pp. 717\u2013722).","DOI":"10.1145\/3462244.3479954"},{"key":"1879_CR70","doi-asserted-by":"crossref","unstructured":"Tsiami, A., Koutras, P., & Maragos, P. (2020). Stavis: Spatio-temporal audiovisual saliency network. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 4766\u20134776).","DOI":"10.1109\/CVPR42600.2020.00482"},{"key":"1879_CR71","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems. 30"},{"key":"1879_CR72","doi-asserted-by":"crossref","unstructured":"Wang, H., Zhu, Y., Adam, H., Yuille, A., & Chen, L.-C. (2021). Max-deeplab: End-to-end panoptic segmentation with mask transformers. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 5463\u20135474).","DOI":"10.1109\/CVPR46437.2021.00542"},{"key":"1879_CR73","doi-asserted-by":"crossref","unstructured":"Wang, J., & Torresani, L. (2022). Deformable video transformer. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 14053\u201314062).","DOI":"10.1109\/CVPR52688.2022.01366"},{"key":"1879_CR74","doi-asserted-by":"crossref","unstructured":"Wang, L., Lu, H., Ruan, X., & Yang, M.-H. (2015). Deep networks for saliency detection via local estimation and global search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3183\u20133192).","DOI":"10.1109\/CVPR.2015.7298938"},{"issue":"1","key":"1879_CR75","doi-asserted-by":"publisher","first-page":"38","DOI":"10.1109\/TIP.2017.2754941","volume":"27","author":"W Wang","year":"2017","unstructured":"Wang, W., Shen, J., & Shao, L. (2017). Video salient object detection via fully convolutional networks. IEEE Transactions on Image Processing, 27(1), 38\u201349.","journal-title":"IEEE Transactions on Image Processing"},{"key":"1879_CR76","doi-asserted-by":"crossref","unstructured":"Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., & Shao, L. (2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (pp. 568\u2013578).","DOI":"10.1109\/ICCV48922.2021.00061"},{"key":"1879_CR77","doi-asserted-by":"crossref","unstructured":"Wang, X., Wu, Y., Zhu, L., & Yang, Y. (2020). Symbiotic attention with privileged information for egocentric action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 12249\u201312256).","DOI":"10.1609\/aaai.v34i07.6907"},{"key":"1879_CR78","unstructured":"Wang, Z., Liu, Z., Li, G., Wang, Y., Zhang, T., Xu, L., & Wang, J. (2021). Spatio-temporal self-attention network for video saliency prediction. IEEE Transactions on Multimedia."},{"key":"1879_CR79","doi-asserted-by":"crossref","unstructured":"Wu, X., Wu, Z., Zhang, J., Ju, L., & Wang, S. (2020). Salsac: A video saliency prediction model with shuffled attentions and correlation-based convlstm. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, pp. 12410\u201312417).","DOI":"10.1609\/aaai.v34i07.6927"},{"key":"1879_CR80","unstructured":"Yang, J., Li, C., Zhang, P., Dai, X., Xiao, B., Yuan, L., & Gao, J. (2021). Focal self-attention for local-global interactions in vision transformers. arXiv preprint arXiv:2107.00641"},{"issue":"8","key":"1879_CR81","doi-asserted-by":"publisher","first-page":"2163","DOI":"10.1109\/TMM.2019.2947352","volume":"22","author":"S Yang","year":"2019","unstructured":"Yang, S., Lin, G., Jiang, Q., & Lin, W. (2019). A dilated inception network for visual saliency prediction. IEEE Transactions on Multimedia, 22(8), 2163\u20132176.","journal-title":"IEEE Transactions on Multimedia"},{"key":"1879_CR82","volume-title":"Eye Movements and Vision","author":"AL Yarbus","year":"2013","unstructured":"Yarbus, A. L. (2013). Eye Movements and Vision. Springer."},{"key":"1879_CR83","doi-asserted-by":"crossref","unstructured":"Ye, Z., Li, Y., Fathi, A., Han, Y., Rozga, A., Abowd, G.D., & Rehg, J.M. (2012). Detecting eye contact using wearable eye-tracking glasses. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing (pp. 699\u2013704).","DOI":"10.1145\/2370216.2370368"},{"key":"1879_CR84","doi-asserted-by":"crossref","unstructured":"Zhang, H., Cisse, M., Dauphin, Y.N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. In International Conference on Learning Representations.","DOI":"10.1007\/978-1-4899-7687-1_79"},{"issue":"8","key":"1879_CR85","doi-asserted-by":"publisher","first-page":"1783","DOI":"10.1109\/TPAMI.2018.2871688","volume":"41","author":"M Zhang","year":"2018","unstructured":"Zhang, M., Ma, K. T., Lim, J. H., Zhao, Q., & Feng, J. (2018). Anticipating where people will look using adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 1783\u20131796.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1879_CR86","doi-asserted-by":"crossref","unstructured":"Zhang, M., Teck\u00a0Ma, K., Hwee\u00a0Lim, J., Zhao, Q., & Feng, J. (2017). Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4372\u20134381).","DOI":"10.1109\/CVPR.2017.377"},{"key":"1879_CR87","doi-asserted-by":"crossref","unstructured":"Zhang, W., Huang, Z., Luo, G., Chen, T., Wang, X., Liu, W., & Shen, C. (2022). Topformer: Token pyramid transformer for mobile semantic segmentation. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (pp. 12083\u201312093).","DOI":"10.1109\/CVPR52688.2022.01177"},{"key":"1879_CR88","doi-asserted-by":"crossref","unstructured":"Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., et al. (2021). Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the ieee\/cvf conference on computer vision and pattern recognition (pp. 6881\u20136890).","DOI":"10.1109\/CVPR46437.2021.00681"},{"key":"1879_CR89","doi-asserted-by":"crossref","unstructured":"Zhuge, M., Fan, D.-P., Liu, N., Zhang, D., Xu, D., & Shao, L. (2022). Salient object detection via integrity learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 3738\u201352.","DOI":"10.1109\/TPAMI.2022.3179526"}],"container-title":["International Journal of Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-023-01879-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11263-023-01879-7\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11263-023-01879-7.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,19]],"date-time":"2024-02-19T21:16:21Z","timestamp":1708377381000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11263-023-01879-7"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,10,18]]},"references-count":89,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,3]]}},"alternative-id":["1879"],"URL":"https:\/\/doi.org\/10.1007\/s11263-023-01879-7","relation":{},"ISSN":["0920-5691","1573-1405"],"issn-type":[{"value":"0920-5691","type":"print"},{"value":"1573-1405","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,10,18]]},"assertion":[{"value":"1 April 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 August 2023","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 October 2023","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}