{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,20]],"date-time":"2026-07-20T13:03:27Z","timestamp":1784552607065,"version":"3.55.0"},"reference-count":238,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2026,2,1]],"date-time":"2026-02-01T00:00:00Z","timestamp":1769904000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T00:00:00Z","timestamp":1769990400000},"content-version":"vor","delay-in-days":1,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Mach. Intell. Res."],"published-print":{"date-parts":[[2026,2]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>With the rapid development of artificial intelligence technologies and wearable devices, egocentric vision understanding has emerged as a new and challenging research direction, gradually attracting widespread attention from both academia and industry. Egocentric vision captures visual and multimodal data through cameras or sensors worn on the human body, offering a unique perspective that simulates human visual experiences. This paper provides a comprehensive survey of the research on egocentric vision understanding, systematically analyzing the components of egocentric scenes and categorizing the tasks into four main areas: subject understanding, object understanding, environment understanding, and hybrid understanding. We explore in detail the sub-tasks within each category. We also summarize the main challenges and trends currently existing in the field. Furthermore, this paper presents an overview of high-quality egocentric vision datasets, offering valuable resources for future research. 
By summarizing the latest advancements, we anticipate the broad applications of egocentric vision technologies in fields such as augmented reality, virtual reality, and embodied intelligence, and propose future research directions based on the latest developments in the field.<\/jats:p>","DOI":"10.1007\/s11633-025-1599-4","type":"journal-article","created":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T14:11:45Z","timestamp":1770041505000},"page":"1-33","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Challenges and Trends in Egocentric Vision: A Survey"],"prefix":"10.1007","volume":"23","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-1196-2276","authenticated-orcid":false,"given":"Xiang","family":"Li","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0963-0311","authenticated-orcid":false,"given":"Heqian","family":"Qiu","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3745-0262","authenticated-orcid":false,"given":"Lanxiao","family":"Wang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Hanwen","family":"Zhang","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Chenghao","family":"Qi","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Linfeng","family":"Han","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Huiyu","family":"Xiong","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7481-095X","authenticated-orcid":false,"given":"Hongliang","family":"Li","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","rol
e":"author"}]}],"member":"297","published-online":{"date-parts":[[2026,2,2]]},"reference":[{"issue":"5","key":"1599_CR1","doi-asserted-by":"publisher","first-page":"744","DOI":"10.1109\/TCSVT.2015.2409731","volume":"25","author":"A Betancourt","year":"2015","unstructured":"A. Betancourt, P. Morerio, C. S. Regazzoni, M. Rauterberg. The evolution of first person vision methods: A survey. IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 5, pp. 744\u2013760, 2015. DOI: https:\/\/doi.org\/10.1109\/TCSVT.2015.2409731.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"issue":"1","key":"1599_CR2","doi-asserted-by":"publisher","first-page":"65","DOI":"10.1109\/THMS.2016.2623480","volume":"47","author":"A G del Molino","year":"2017","unstructured":"A. G. del Molino, C. Tan, J. H. Lim, A. H. Tan. Summarization of egocentric videos: A comprehensive survey. IEEE Transactions on Human-Machine Systems, vol. 47, no. 1, pp. 65\u201376, 2017. DOI: https:\/\/doi.org\/10.1109\/THMS.2016.2623480.","journal-title":"IEEE Transactions on Human-Machine Systems"},{"key":"1599_CR3","doi-asserted-by":"publisher","unstructured":"I. Rodin, A. Furnari, D. Mavroeidis, G. M. Farinella. Predicting the future from first person (egocentric) vision: A survey. Computer Vision and Image Understanding, vol. 211, Article number 103252, 2021. DOI: https:\/\/doi.org\/10.1016\/j.cviu.2021.103252.","DOI":"10.1016\/j.cviu.2021.103252"},{"key":"1599_CR4","doi-asserted-by":"publisher","first-page":"175","DOI":"10.1016\/j.neucom.2021.11.081","volume":"472","author":"A N\u00fa\u00f1ez-Marcos","year":"2022","unstructured":"A. N\u00fa\u00f1ez-Marcos, G. Azkune, I. Arganda-Carreras. Egocentric vision-based action recognition: A survey. Neurocomputing, vol. 472, pp. 175\u2013197, 2022. 
DOI: https:\/\/doi.org\/10.1016\/j.neucom.2021.11.081.","journal-title":"Neurocomputing"},{"issue":"6","key":"1599_CR5","doi-asserted-by":"publisher","first-page":"6846","DOI":"10.1109\/TPAMI.2020.2986648","volume":"45","author":"A Bandini","year":"2023","unstructured":"A. Bandini, J. Zariffa. Analysis of the hands in egocentric vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 6846\u20136866, 2023. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2020.2986648.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1599_CR6","doi-asserted-by":"publisher","first-page":"1643","DOI":"10.1109\/CVPRW63382.2024.00171","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"M Azam","year":"2024","unstructured":"M. Azam, K. Desai. A survey on 3D egocentric human pose estimation. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 1643\u20131654, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPRW63382.2024.00171."},{"key":"1599_CR7","doi-asserted-by":"publisher","first-page":"428","DOI":"10.1007\/978-3-031-72698-9_25","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"Z Fan","year":"2025","unstructured":"Z. Fan, T. Ohkawa, L. Yang, N. Lin, Z. Zhou, S. Zhou, J. Liang, Z. Gao, X. Zhang, X. Zhang, F. Li, Z. Liu, F. Lu, K. A. Zeid, B. Leibe, J. On, S. Baek, A. Prakash, S. Gupta, K. He, Y. Sato, O. Hilliges, H. J. Chang, A. Yao. Benchmarks and challenges in pose estimation for egocentric hand interactions with objects. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 428\u2013448, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72698-9_25."},{"issue":"11","key":"1599_CR8","doi-asserted-by":"publisher","first-page":"4880","DOI":"10.1007\/s11263-024-02095-7","volume":"132","author":"C Plizzari","year":"2024","unstructured":"C. 
Plizzari, G. Goletto, A. Furnari, S. Bansal, F. Ragusa, G. M. Farinella, D. Damen, T. Tommasi. An outlook into the future of egocentric vision. International Journal of Computer Vision, vol. 132, no. 11, pp. 4880\u20134936, 2024. DOI: https:\/\/doi.org\/10.1007\/s11263-024-02095-7.","journal-title":"International Journal of Computer Vision"},{"key":"1599_CR9","volume-title":"Scaling pre-training to one hundred billion data for vision language models","author":"X Wang","year":"2025","unstructured":"X. Wang, I. Alabdulmohsin, D. Salz, Z. Li, K. Rong, X. Zhai. Scaling pre-training to one hundred billion data for vision language models, [Online], Available: https:\/\/arxiv.org\/abs\/2502.07617, 2025."},{"issue":"7","key":"1599_CR10","doi-asserted-by":"publisher","first-page":"4177","DOI":"10.1007\/s11263-025-02349-y","volume":"133","author":"W Wang","year":"2025","unstructured":"W. Wang, H. Yang, Z. Tuo, H. He, J. Zhu, J. Fu, J. Liu. Swap attention in spatio-temporal diffusions for text-to-video generation. International Journal of Computer Vision, vol. 133, no. 7, pp. 4177\u20134195, 2025. DOI: https:\/\/doi.org\/10.1007\/s11263-025-02349-y.","journal-title":"International Journal of Computer Vision"},{"key":"1599_CR11","doi-asserted-by":"publisher","first-page":"18973","DOI":"10.1109\/CVPR52688.2022.01842","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"K Grauman","year":"2022","unstructured":"K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, M. Martin, T. Nagarajan, I. Radosavovic, S. K. Ramakrishnan, F. Ryan, J. Sharma, M. Wray, M. Xu, E. Z. Xu, C. Zhao, S. Bansal, D. Batra, V. Cartillier, S. Crane, T. Do, M. Doulaty, A. Erapalli, C. Feichtenhofer, A. Fragomeni, Q. Fu, A. Gebreselasie, C. Gonz\u00e1lez, J. Hillis, X. Huang, Y. Huang, W. Jia, W. Khoo, J. Kol\u00e1\u0159, S. Kottur, A. Kumar, F. Landini, C. Li, Y. Li, Z. Li, K. Mangalam, R. 
Modhugu, J. Munro, T. Murrell, T. Nishiyasu, W. Price, P. R. Puentes, M. Ramazanova, L. Sari, K. Somasundaram, A. Southerland, Y. Sugano, R. Tao, M. Vo, Y. Wang, X. Wu, T. Yagi, Z. Zhao, Y. Zhu, P. Arbel\u00e1ez, D. Crandall, D. Damen, G. M. Farinella, C. Fuegen, B. Ghanem, V. K. Ithapu, C. V. Jawahar, H. Joo, K. Kitani, H. Li, R. Newcombe, A. Oliva, H. S. Park, J. M. Rehg, Y. Sato, J. Shi, M. Z. Shou, A. Torralba, L. Torresani, M. Yan, J. Malik. Ego4D: Around the world in 3 000 hours of egocentric video. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 18973\u201318990, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.01842."},{"key":"1599_CR12","doi-asserted-by":"publisher","first-page":"256","DOI":"10.1007\/978-3-031-72691-0_15","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"H Yun","year":"2025","unstructured":"H. Yun, R. Gao, I. Ananthabhotla, A. Kumar, J. Donley, C. Li, G. Kim, V. K. Ithapu, C. Murdock. Spherical world-locking for audio-visual localization in egocentric videos. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 256\u2013274, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72691-0_15."},{"key":"1599_CR13","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"V Tschernezki","year":"2023","unstructured":"V. Tschernezki, A. Darkhalil, Z. Zhu, D. Fouhey, I. Laina, D. Larlus, D. Damen, A. Vedaldi. EPIC fields: Marrying 3D geometry and video understanding. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1152, 2023."},{"key":"1599_CR14","doi-asserted-by":"publisher","first-page":"312","DOI":"10.1007\/978-3-031-72649-1_18","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"M Zhang","year":"2025","unstructured":"M. Zhang, Y. Huang, R. 
Liu, Y. Sato. Masked video and body-worn IMU autoencoder for egocentric action recognition. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 312\u2013330, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72649-1_18."},{"key":"1599_CR15","doi-asserted-by":"publisher","first-page":"6324","DOI":"10.1609\/aaai.v38i6.28451","volume-title":"Proceedings of the 38th AAAI Conference on Artificial Intelligence","author":"L Xu","year":"2024","unstructured":"L. Xu, Y. Gao, W. Song, A. Hao. Weakly supervised multimodal affordance grounding for egocentric images. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 6324\u20136332, 2024. DOI: https:\/\/doi.org\/10.1609\/aaai.v38i6.28451."},{"key":"1599_CR16","doi-asserted-by":"publisher","first-page":"16477","DOI":"10.1109\/CVPR52733.2024.01559","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Z Zhao","year":"2024","unstructured":"Z. Zhao, Y. Wang, C. Wang. Fusing personal and environmental cues for identification and segmentation of first-person camera wearers in third-person views. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 16477\u201316487, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.01559."},{"key":"1599_CR17","doi-asserted-by":"publisher","first-page":"21933","DOI":"10.1109\/CVPR52733.2024.02071","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Y Zhao","year":"2024","unstructured":"Y. Zhao, H. Ma, S. Kong, C. Fowlkes. Instance tracking in 3D scenes from egocentric videos. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 21933\u201321944, 2024. 
DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.02071."},{"key":"1599_CR18","doi-asserted-by":"publisher","first-page":"4793","DOI":"10.1109\/CVPRW63382.2024.00482","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops","author":"Q Zhang","year":"2024","unstructured":"Q. Zhang, T. Xiao, H. Habeeb, L. Laich, S. Bouaziz, P. Snape, W. Zhang, M. Cioffi, P. Zhang, P. Pidlypenskyi, W. Lin, L. Ma, M. Wang, K. Li, C. Long, S. Song, M. Prazak, A. Sjoholm, A. Deogade, J. Lee, J. D. Mangas, A. Aubel. REFA: Real-time egocentric facial animations for virtual reality. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, USA, pp. 4793\u20134802, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPRW63382.2024.00482."},{"key":"1599_CR19","doi-asserted-by":"publisher","first-page":"401","DOI":"10.1007\/978-3-031-73001-6_23","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"C Yang","year":"2025","unstructured":"C. Yang, A. Tkach, S. Hampali, L. Zhang, E. J. Crowley, C. Keskin. EgoPoseFormer: A simple baseline for stereo egocentric 3D human pose estimation. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 401\u2013407, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-73001-6_23."},{"key":"1599_CR20","doi-asserted-by":"publisher","first-page":"777","DOI":"10.1109\/CVPR52733.2024.00080","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"J Wang","year":"2024","unstructured":"J. Wang, Z. Cao, D. Luvizon, L. Liu, K. Sarkar, D. Tang, T. Beeler, C. Theobalt. Egocentric whole-body motion capture with FisheyeViT and diffusion-based motion refinement. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 777\u2013787, 2024. 
DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.00080."},{"key":"1599_CR21","doi-asserted-by":"publisher","first-page":"14510","DOI":"10.1109\/CVPR52733.2024.01375","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Y Shen","year":"2024","unstructured":"Y. Shen, H. Wang, X. Yang, M. Feiszli, E. Elhamifar, L. Torresani, E. Mavroudi. Learning to segment referred objects from narrated egocentric videos. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 14510\u201314520, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.01375."},{"key":"1599_CR22","doi-asserted-by":"publisher","first-page":"26386","DOI":"10.1109\/CVPR52733.2024.02493","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"W Jia","year":"2024","unstructured":"W. Jia, M. Liu, H. Jiang, I. Ananthabhotla, J. M. Rehg, V. K. Ithapu, R. Gao. The audio-visual conversational graph: From an egocentric-exocentric perspective. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 26386\u201326395, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.02493."},{"key":"1599_CR23","doi-asserted-by":"publisher","first-page":"18622","DOI":"10.1109\/CVPR52733.2024.01762","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"I Rodin","year":"2024","unstructured":"I. Rodin, A. Furnari, K. Min, S. Tripathi, G. M. Farinella. Action scene graphs for long-form understanding of egocentric videos. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 18622\u201318632, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.01762."},{"key":"1599_CR24","volume-title":"Proceedings of the 33rd British Machine Vision Conference","author":"B Lai","year":"2022","unstructured":"B. Lai, M. Liu, F. Ryan, J. M. Rehg. 
In the eye of transformer: Global-local correlation for egocentric gaze estimation. In Proceedings of the 33rd British Machine Vision Conference, London, UK, Article number 227, 2022."},{"key":"1599_CR25","doi-asserted-by":"publisher","first-page":"216","DOI":"10.1007\/978-3-031-72684-2_13","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"L Ouyang","year":"2025","unstructured":"L. Ouyang, R. Liu, Y. Huang, R. Furuta, Y. Sato. ActionVOS: Actions as prompts for video object segmentation. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 216\u2013235, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72684-2_13."},{"key":"1599_CR26","doi-asserted-by":"publisher","first-page":"167","DOI":"10.1007\/978-3-031-73337-6_10","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"L Mur-Labadia","year":"2025","unstructured":"L. Mur-Labadia, R. Martinez-Cantin, J. J. Guerrero, G. M. Farinella, A. Furnari. AFF-ttention! Affordances and attention models for short-term object interaction anticipation. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 167\u2013184, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-73337-6_10."},{"key":"1599_CR27","volume-title":"Proceedings of the 12th International Conference on Learning Representations","author":"Q Zhao","year":"2024","unstructured":"Q. Zhao, S. Wang, C. Zhang, C. Fu, M. Q. Do, N. Agarwal, K. Lee, C. Sun. AntGPT: Can large language models help long-term action anticipation from videos? In Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, 2024."},{"key":"1599_CR28","doi-asserted-by":"publisher","first-page":"18186","DOI":"10.1109\/CVPR52733.2024.01722","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Y Shen","year":"2024","unstructured":"Y. Shen, E. Elhamifar. 
Progress-aware online action segmentation for egocentric procedural task videos. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 18186\u201318197, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.01722."},{"key":"1599_CR29","doi-asserted-by":"publisher","first-page":"18483","DOI":"10.1109\/CVPR52733.2024.01749","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"A Flaborea","year":"2024","unstructured":"A. Flaborea, G. M. D. Di Melendugno, L. Plini, L. Scofano, E. De Matteis, A. Furnari, G. M. Farinella, F. Galasso. PREGO: Online mistake detection in procedural EGOcentric videos. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 18483\u201318492, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.01749."},{"issue":"12","key":"1599_CR30","doi-asserted-by":"publisher","first-page":"7509","DOI":"10.1109\/TPAMI.2024.3393571","volume":"46","author":"Y Cheng","year":"2024","unstructured":"Y. Cheng, H. Wang, Y. Bao, F. Lu. Appearance-based gaze estimation with deep learning: A review and benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 7509\u20137528, 2024. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2024.3393571.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1599_CR31","doi-asserted-by":"publisher","unstructured":"D. Cazzato, M. Leo, C. Distante, H. Voos. When I look into your eyes: A survey on computer vision contributions for human gaze estimation and tracking. Sensors, vol. 20, no. 13, Article number 3739, 2020. DOI: https:\/\/doi.org\/10.3390\/s20133739.","DOI":"10.3390\/s20133739"},{"key":"1599_CR32","doi-asserted-by":"publisher","first-page":"192","DOI":"10.1007\/978-3-031-72673-6_11","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"B Lai","year":"2025","unstructured":"B. 
Lai, F. Ryan, W. Jia, M. Liu, J. M. Rehg. Listen to look into the future: Audio-visual egocentric gaze anticipation. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 192\u2013210, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72673-6_11."},{"issue":"6","key":"1599_CR33","doi-asserted-by":"publisher","first-page":"6731","DOI":"10.1109\/TPAMI.2021.3051319","volume":"45","author":"Y Li","year":"2023","unstructured":"Y. Li, M. Liu, J. M. Rehg. In the eye of the beholder: Gaze and actions in first person video. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 6731\u20136747, 2023. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2021.3051319.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1599_CR34","doi-asserted-by":"publisher","first-page":"123","DOI":"10.1109\/MCSoC60832.2023.00026","volume-title":"Proceedings of the 16th International Symposium on Embedded Multicore\/Many-Core Systems-on-Chip","author":"Y Li","year":"2023","unstructured":"Y. Li, X. Wang, Z. Ma, Y. Wang, M. C. Meyer. SwinGaze: Egocentric gaze estimation with video swin transformer. In Proceedings of the 16th International Symposium on Embedded Multicore\/Many-Core Systems-on-Chip, Singapore, pp. 123\u2013127, 2023. DOI: https:\/\/doi.org\/10.1109\/MCSoC60832.2023.00026."},{"key":"1599_CR35","volume-title":"Aria everyday activities dataset","author":"Z Lv","year":"2024","unstructured":"Z. Lv, N. Charron, P. Moulon, A. Gamino, C. Peng, C. Sweeney, E. Miller, H. Tang, J. Meissner, J. Dong, K. Somasundaram, L. Pesqueira, M. Schwesinger, O. Parkhi, Q. Gu, R. De Nardi, S. Cheng, S. Saarinen, V. Baiyya, Y. Zou, R. Newcombe, J. J. Engel, X. Pan, C. Ren. 
Aria everyday activities dataset, [Online], Available: https:\/\/arxiv.org\/abs\/2402.13349, 2024."},{"key":"1599_CR36","doi-asserted-by":"publisher","first-page":"7795","DOI":"10.1109\/TIP.2020.3007841","volume":"29","author":"Y Huang","year":"2020","unstructured":"Y. Huang, M. Cai, Z. Li, F. Lu, Y. Sato. Mutual context network for jointly estimating egocentric gaze and action. IEEE Transactions on Image Processing, vol. 29, pp. 7795\u20137806, 2020. DOI: https:\/\/doi.org\/10.1109\/TIP.2020.3007841.","journal-title":"IEEE Transactions on Image Processing"},{"key":"1599_CR37","doi-asserted-by":"publisher","first-page":"717","DOI":"10.1145\/3462244.3479954","volume-title":"Proceedings of International Conference on Multimodal Interaction","author":"S K Thakur","year":"2021","unstructured":"S. K. Thakur, C. Beyan, P. Morerio, A. Del Bue. Predicting gaze from egocentric social interaction videos and IMU data. In Proceedings of International Conference on Multimodal Interaction, Montreal, Canada, pp. 717\u2013722, 2021. DOI: https:\/\/doi.org\/10.1145\/3462244.3479954."},{"key":"1599_CR38","first-page":"6000","volume-title":"Proceedings of the 31st International Conference on Neural Information Processing Systems","author":"A Vaswani","year":"2017","unstructured":"A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, USA, pp. 6000\u20136010, 2017."},{"key":"1599_CR39","doi-asserted-by":"publisher","first-page":"9992","DOI":"10.1109\/ICCV48922.2021.00986","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"Z Liu","year":"2021","unstructured":"Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. 
In Proceedings of IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 9992\u201310002, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00986."},{"key":"1599_CR40","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.cviu.2016.09.002","volume":"152","author":"N Sarafianos","year":"2016","unstructured":"N. Sarafianos, B. Boteanu, B. Ionescu, I. A. Kakadiaris. 3D human pose estimation: A review of the literature and analysis of covariates. Computer Vision and Image Understanding, vol. 152, pp. 1\u201320, 2016. DOI: https:\/\/doi.org\/10.1016\/j.cviu.2016.09.002.","journal-title":"Computer Vision and Image Understanding"},{"key":"1599_CR41","doi-asserted-by":"publisher","unstructured":"M. Ben Gamra, M. A. Akhloufi. A review of deep learning techniques for 2D and 3D human pose estimation. Image and Vision Computing, vol. 114, Article number 104282, 2021. DOI: https:\/\/doi.org\/10.1016\/j.imavis.2021.104282.","DOI":"10.1016\/j.imavis.2021.104282"},{"key":"1599_CR42","doi-asserted-by":"publisher","unstructured":"C. Zheng, W. Wu, C. Chen, T. Yang, S. Zhu, J. Shen, N. Kehtarnavaz, M. Shah. Deep learning-based human pose estimation: A survey. ACM Computing Surveys, vol. 56, no. 1, Article number 11, 2024. DOI: https:\/\/doi.org\/10.1145\/3603618.","DOI":"10.1145\/3603618"},{"key":"1599_CR43","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20068-7_1","volume-title":"Proceedings of the 17th European Conference on Computer Vision","author":"H Akada","year":"2022","unstructured":"H. Akada, J. Wang, S. Shimada, M. Takahashi, C. Theobalt, V. Golyanik. UnrealEgo: A new dataset for robust egocentric 3D human motion capture. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 2022. 
DOI: https:\/\/doi.org\/10.1007\/978-3-031-20068-7_1"},{"key":"1599_CR44","doi-asserted-by":"publisher","first-page":"13031","DOI":"10.1109\/CVPR52729.2023.01252","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"J Wang","year":"2023","unstructured":"J. Wang, D. Luvizon, W. Xu, L. Liu, K. Sarkar, C. Theobalt. Scene-aware egocentric 3D human pose estimation. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 13031\u201313040, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.01252."},{"key":"1599_CR45","doi-asserted-by":"publisher","DOI":"10.1145\/3610548.3618147","volume-title":"Proceedings of SIGGRAPH Asia Conference Papers","author":"T Kang","year":"2023","unstructured":"T. Kang, K. Lee, J. Zhang, Y. Lee. Ego3DPose: Capturing 3D cues from binocular egocentric views. In Proceedings of SIGGRAPH Asia Conference Papers, Sydney, Australia, Article number 82, 2023. DOI: https:\/\/doi.org\/10.1145\/3610548.3618147."},{"key":"1599_CR46","doi-asserted-by":"publisher","first-page":"842","DOI":"10.1109\/CVPR52733.2024.00086","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"T Kang","year":"2024","unstructured":"T. Kang, Y. Lee. Attention-propagation network for egocentric heatmap to 3D pose lifting. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 842\u2013851, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.00086."},{"key":"1599_CR47","doi-asserted-by":"publisher","first-page":"767","DOI":"10.1109\/CVPR52733.2024.00079","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"H Akada","year":"2024","unstructured":"H. Akada, J. Wang, V. Golyanik, C. Theobalt. 3D human pose perception from egocentric stereo videos. 
In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 767\u2013776, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.00079."},{"key":"1599_CR48","doi-asserted-by":"publisher","first-page":"5441","DOI":"10.1109\/ICCV.2019.00554","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"N Mahmood","year":"2019","unstructured":"N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, M. J. Black. AMASS: Archive of motion capture as surface shapes. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Republic of Korea, pp. 5441\u20135450, 2019. DOI: https:\/\/doi.org\/10.1109\/ICCV.2019.00554."},{"key":"1599_CR49","doi-asserted-by":"publisher","first-page":"4316","DOI":"10.1109\/CVPR46437.2021.00430","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"V Guzov","year":"2021","unstructured":"V. Guzov, A. Mir, T. Sattler, G. Pons-Moll. Human POSEitioning system (HPS): 3D human pose estimation and self-localization in large scenes from body-mounted sensors. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, pp. 4316\u20134327, 2021. DOI: https:\/\/doi.org\/10.1109\/CVPR46437.2021.00430."},{"key":"1599_CR50","doi-asserted-by":"publisher","first-page":"443","DOI":"10.1007\/978-3-031-20065-6_26","volume-title":"Proceedings of the 17th European Conference on Computer Vision","author":"J Jiang","year":"2022","unstructured":"J. Jiang, P. Streli, H. Qiu, A. Fender, L. Laich, P. Snape, C. Holz. AvatarPoser: Articulated full-body pose tracking from sparse motion sensing. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, pp. 443\u2013460, 2022. 
DOI: https:\/\/doi.org\/10.1007\/978-3-031-20065-6_26."},{"key":"1599_CR51","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01290","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"S Aliakbarian","year":"2022","unstructured":"S. Aliakbarian, P. Cameron, F. Bogo, A. Fitzgibbon, T. J. Cashman. FLAG: Flow-based 3D avatar generation from sparse observations. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 13243\u201313252, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.01290."},{"key":"1599_CR52","doi-asserted-by":"publisher","first-page":"481","DOI":"10.1109\/CVPR52729.2023.00054","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Y Du","year":"2023","unstructured":"Y. Du, R. Kips, A. Pumarola, S. Starke, A. Thabet, A. Sanakoyeu. Avatars grow legs: Generating smooth human motion from sparse tracking inputs with diffusion model. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 481\u2013490, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.00054."},{"key":"1599_CR53","doi-asserted-by":"publisher","first-page":"277","DOI":"10.1007\/978-3-031-72627-9_16","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"J Jiang","year":"2025","unstructured":"J. Jiang, P. Streli, M. Meier, C. Holz. EgoPoser: Robust real-time egocentric pose estimation from sparse and intermittent observations everywhere. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 277\u2013294, 2025. 
DOI: https:\/\/doi.org\/10.1007\/978-3-031-72627-9_16."},{"key":"1599_CR54","doi-asserted-by":"publisher","first-page":"12999","DOI":"10.1109\/CVPR52729.2023.01249","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"T Ohkawa","year":"2023","unstructured":"T. Ohkawa, K. He, F. Sener, T. Hodan, L. Tran, C. Keskin. AssemblyHands: Towards egocentric activity understanding via 3D hand pose estimation. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 12999\u201313008, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.01249."},{"key":"1599_CR55","doi-asserted-by":"publisher","first-page":"12943","DOI":"10.1109\/CVPR52729.2023.01244","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Z Fan","year":"2023","unstructured":"Z. Fan, O. Taheri, D. Tzionas, M. Kocabas, M. Kaufmann, M. J. Black, O. Hilliges. ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 12943\u201312954, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.01244."},{"key":"1599_CR56","volume-title":"1st place solution of egocentric 3D hand pose estimation challenge 2023 technical report: A concise pipeline for egocentric hand pose reconstruction","author":"Z Zhou","year":"2023","unstructured":"Z. Zhou, Z. Lv, S. Zhou, M. Zou, T. Wu, M. Yu, Y. Tang, J. Liang. 1st place solution of egocentric 3D hand pose estimation challenge 2023 technical report: A concise pipeline for egocentric hand pose reconstruction, [Online], Available: https:\/\/arxiv.org\/abs\/2310.04769, 2023."},{"key":"1599_CR57","doi-asserted-by":"publisher","first-page":"677","DOI":"10.1109\/CVPR52733.2024.00071","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"R Liu","year":"2024","unstructured":"R. 
Liu, T. Ohkawa, M. Zhang, Y. Sato. Single-to-dual-view adaptation for egocentric 3D hand pose estimation. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 677\u2013686, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.00071."},{"key":"1599_CR58","doi-asserted-by":"publisher","first-page":"183","DOI":"10.1007\/978-3-031-73229-4_11","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"A Prakash","year":"2025","unstructured":"A. Prakash, R. Tu, M. Chang, S. Gupta. 3D hand pose estimation in everyday egocentric images. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 183\u2013202, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-73229-4_11."},{"key":"1599_CR59","doi-asserted-by":"publisher","first-page":"7727","DOI":"10.1109\/ICCV.2019.00782","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"D Tome","year":"2019","unstructured":"D. Tome, P. Peluse, L. Agapito, H. Badino. xR-EgoPose: Egocentric 3D human pose from an HMD camera. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea, pp. 7727\u20137737, 2019. DOI: https:\/\/doi.org\/10.1109\/ICCV.2019.00782."},{"issue":"5","key":"1599_CR60","doi-asserted-by":"publisher","first-page":"2093","DOI":"10.1109\/TVCG.2019.2898650","volume":"25","author":"W Xu","year":"2019","unstructured":"W. Xu, A. Chatterjee, M. Zollh\u00f6fer, H. Rhodin, P. Fua, H. P. Seidel, C. Theobalt. Mo\u00b2Cap\u00b2: Real-time mobile 3D motion capture with a cap-mounted fisheye camera. IEEE Transactions on Visualization and Computer Graphics, vol. 25, no. 5, pp. 2093\u20132101, 2019. 
DOI: https:\/\/doi.org\/10.1109\/TVCG.2019.2898650.","journal-title":"IEEE Transactions on Visualization and Computer Graphics"},{"key":"1599_CR61","doi-asserted-by":"publisher","first-page":"54957","DOI":"10.1109\/ACCESS.2022.3177623","volume":"10","author":"R Hori","year":"2022","unstructured":"R. Hori, R. Hachiuma, M. Isogawa, D. Mikami, H. Saito. Silhouette-based 3D human pose estimation using a single wrist-mounted 360\u00b0 camera. IEEE Access, vol. 10, pp. 54957\u201354968, 2022. DOI: https:\/\/doi.org\/10.1109\/ACCESS.2022.3177623.","journal-title":"IEEE Access"},{"key":"1599_CR62","doi-asserted-by":"publisher","first-page":"9887","DOI":"10.1109\/CVPR42600.2020.00991","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"E Ng","year":"2020","unstructured":"E. Ng, D. Xiang, H. Joo, K. Grauman. You2Me: Inferring body pose in egocentric video via first and second person interactions. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 9887\u20139897, 2020. DOI: https:\/\/doi.org\/10.1109\/CVPR42600.2020.00991."},{"key":"1599_CR63","doi-asserted-by":"publisher","first-page":"11480","DOI":"10.1109\/ICCV48922.2021.01130","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"J Wang","year":"2021","unstructured":"J. Wang, L. Liu, W. Xu, K. Sarkar, C. Theobalt. Estimating egocentric 3D human pose in global space. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 11480\u201311489, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.01130."},{"key":"1599_CR64","doi-asserted-by":"publisher","first-page":"32","DOI":"10.1109\/3DV53792.2021.00014","volume-title":"Proceedings of International Conference on 3D Vision","author":"D Zhao","year":"2021","unstructured":"D. Zhao, Z. Wei, J. Mahmud, J. M. Frahm. EgoGlass: Egocentric-view human pose estimation from an eyeglass frame. 
In Proceedings of International Conference on 3D Vision, London, UK, pp. 32\u201341, 2021. DOI: https:\/\/doi.org\/10.1109\/3DV53792.2021.00014."},{"key":"1599_CR65","doi-asserted-by":"publisher","first-page":"375","DOI":"10.1007\/978-3-031-72986-7_22","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"A Zhao","year":"2025","unstructured":"A. Zhao, C. Tang, L. Wang, Y. Li, M. Dave, L. Tao, C. D. Twigg, R. Y. Wang. EgoBody3M: Egocentric body tracking on a VR headset using a diverse dataset. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 375\u2013392, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72986-7_22."},{"key":"1599_CR66","doi-asserted-by":"publisher","first-page":"1186","DOI":"10.1109\/CVPR52733.2024.00119","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"C Millerdurai","year":"2024","unstructured":"C. Millerdurai, H. Akada, J. Wang, D. Luvizon, C. Theobalt, V. Golyanik. EventEgo3D: 3D human motion capture from egocentric event streams. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 1186\u20131195, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.00119."},{"key":"1599_CR67","doi-asserted-by":"publisher","unstructured":"M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, M. J. Black. SMPL: A skinned multi-person linear model. Seminal Graphics Papers: Pushing the Boundaries, vol. 2, Article number 88, 2023. DOI: https:\/\/doi.org\/10.1145\/3596711.3596800.","DOI":"10.1145\/3596711.3596800"},{"key":"1599_CR68","doi-asserted-by":"publisher","first-page":"10967","DOI":"10.1109\/CVPR.2019.01123","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"G Pavlakos","year":"2019","unstructured":"G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, M. J. Black. 
Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, pp. 10967\u201310977, 2019. DOI: https:\/\/doi.org\/10.1109\/CVPR.2019.01123."},{"key":"1599_CR69","doi-asserted-by":"publisher","first-page":"11667","DOI":"10.1109\/ICCV48922.2021.01148","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"A Dittadi","year":"2021","unstructured":"A. Dittadi, S. Dziadzio, D. Cosker, B. Lundell, T. J. Cashman, J. Shotton. Full-body motion from a single head-mounted device: Generating SMPL poses from partial observations. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 11667\u201311677, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.01148."},{"key":"1599_CR70","doi-asserted-by":"publisher","first-page":"10118","DOI":"10.1109\/ICCV48922.2021.00998","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"T Kwon","year":"2021","unstructured":"T. Kwon, B. Tekin, J. St\u00fchmer, F. Bogo, M. Pollefeys. H2O: Two hands manipulating objects for first person interaction recognition. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 10118\u201310128, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00998."},{"key":"1599_CR71","doi-asserted-by":"publisher","first-page":"9826","DOI":"10.1109\/CVPR52733.2024.00938","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"G Pavlakos","year":"2024","unstructured":"G. Pavlakos, D. Shan, I. Radosavovic, A. Kanazawa, D. Fouhey, J. Malik. Reconstructing hands in 3D with transformers. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 9826\u20139836, 2024. 
DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.00938."},{"issue":"5","key":"1599_CR72","doi-asserted-by":"publisher","first-page":"1366","DOI":"10.1007\/s11263-022-01594-9","volume":"130","author":"Y Kong","year":"2022","unstructured":"Y. Kong, Y. Fu. Human action recognition and prediction: A survey. International Journal of Computer Vision, vol. 130, no. 5, pp. 1366\u20131401, 2022. DOI: https:\/\/doi.org\/10.1007\/s11263-022-01594-9.","journal-title":"International Journal of Computer Vision"},{"key":"1599_CR73","doi-asserted-by":"publisher","first-page":"395","DOI":"10.1016\/j.neucom.2022.03.069","volume":"491","author":"X Hu","year":"2022","unstructured":"X. Hu, J. Dai, M. Li, C. Peng, Y. Li, S. Du. Online human action detection and anticipation in videos: A survey. Neurocomputing, vol. 491, pp. 395\u2013413, 2022. DOI: https:\/\/doi.org\/10.1016\/j.neucom.2022.03.069.","journal-title":"Neurocomputing"},{"key":"1599_CR74","volume-title":"Human action anticipation: A survey","author":"B Lai","year":"2024","unstructured":"B. Lai, S. Toyer, T. Nagarajan, R. Girdhar, S. Zha, J. M. Rehg, K. Kitani, K. Grauman, R. Desai, M. Liu. Human action anticipation: A survey, [Online], Available: https:\/\/arxiv.org\/abs\/2410.14045, 2024."},{"issue":"2","key":"1599_CR75","doi-asserted-by":"publisher","first-page":"1011","DOI":"10.1109\/TPAMI.2023.3327284","volume":"46","author":"G Ding","year":"2024","unstructured":"G. Ding, F. Sener, A. Yao. Temporal action segmentation: An analysis of modern techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 2, pp. 1011\u20131030, 2024. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2023.3327284.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"issue":"1","key":"1599_CR76","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1007\/s11263-021-01531-2","volume":"130","author":"D Damen","year":"2022","unstructured":"D. Damen, H. Doughty, G. M. Farinella, A. Furnari, E. 
Kazakos, J. Ma, D. Moltisanti, J. Munro, T. Perrett, W. Price, M. Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision, vol. 130, no. 1, pp. 33\u201355, 2022. DOI: https:\/\/doi.org\/10.1007\/s11263-021-01531-2.","journal-title":"International Journal of Computer Vision"},{"key":"1599_CR77","doi-asserted-by":"publisher","first-page":"3138","DOI":"10.1109\/CVPR52688.2022.00315","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"R Herzig","year":"2022","unstructured":"R. Herzig, E. Ben-Avraham, K. Mangalam, A. Bar, G. Chechik, A. Rohrbach, T. Darrell, A. Globerson. Object-region video transformers. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 3138\u20133149, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.00315."},{"key":"1599_CR78","doi-asserted-by":"publisher","first-page":"3284","DOI":"10.1109\/ICCV51070.2023.00306","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"Q Wang","year":"2023","unstructured":"Q. Wang, L. Zhao, L. Yuan, T. Liu, X. Peng. Learning from semantic alignment between unpaired multiviews for egocentric video recognition. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 3284\u20133294, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.00306."},{"key":"1599_CR79","doi-asserted-by":"publisher","first-page":"6527","DOI":"10.1109\/WACV57701.2024.00641","volume-title":"Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"T Shiota","year":"2024","unstructured":"T. Shiota, M. Takagi, K. Kumagai, H. Seshimo, Y. Aono. Egocentric action recognition by capturing hand-object contact and object state. In Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 6527\u20136537, 2024. 
DOI: https:\/\/doi.org\/10.1109\/WACV57701.2024.00641."},{"key":"1599_CR80","volume-title":"Proceedings of the 38th International Conference on Neural Information Processing Systems","author":"N B Gundavarapu","year":"2025","unstructured":"N. B. Gundavarapu, L. Friedman, R. Goyal, C. Hegde, E. Agustsson, S. Waghmare, M. Sirotenko, M. H. Yang, T. Weyand, B. Gong, L. Sigal. Extending video masked autoencoders to 128 frames. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 3856, 2025."},{"key":"1599_CR81","doi-asserted-by":"publisher","first-page":"26354","DOI":"10.1109\/CVPR52733.2024.02490","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"A Kukleva","year":"2024","unstructured":"A. Kukleva, F. Sener, E. Remelli, B. Tekin, E. Sauser, B. Schiele, S. Ma. X-MIC: Cross-modal instance conditioning for egocentric action generalization. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 26354\u201326363, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.02490."},{"key":"1599_CR82","doi-asserted-by":"publisher","first-page":"492","DOI":"10.1007\/978-3-031-19772-7_29","volume-title":"Proceedings of the 17th European Conference on Computer Vision","author":"C L Zhang","year":"2022","unstructured":"C. L. Zhang, J. Wu, Y. Li. ActionFormer: Localizing moments of actions with transformers. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, pp. 492\u2013510, 2022. DOI: https:\/\/doi.org\/10.1007\/978-3-031-19772-7_29."},{"key":"1599_CR83","doi-asserted-by":"publisher","first-page":"18857","DOI":"10.1109\/CVPR52729.2023.01808","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"D Shi","year":"2023","unstructured":"D. Shi, Y. Zhong, Q. Cao, L. Ma, J. Li, D. Tao. 
TriDet: Temporal action detection with relative boundary modeling. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 18857\u201318866, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.01808."},{"key":"1599_CR84","doi-asserted-by":"publisher","first-page":"5227","DOI":"10.1109\/ICCV51070.2023.00484","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"H Wang","year":"2023","unstructured":"H. Wang, M. K. Singh, L. Torresani. Ego-only: Egocentric action detection without exocentric transferring. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 5227\u20135238, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.00484."},{"key":"1599_CR85","doi-asserted-by":"publisher","first-page":"18591","DOI":"10.1109\/CVPR52733.2024.01759","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"S Liu","year":"2024","unstructured":"S. Liu, C. L. Zhang, C. Zhao, B. Ghanem. End-to-end temporal action detection with 1B parameters across 1 000 frames. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 18591\u201318601, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.01759."},{"key":"1599_CR86","doi-asserted-by":"publisher","first-page":"18644","DOI":"10.1109\/CVPR52733.2024.01764","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"H Guo","year":"2024","unstructured":"H. Guo, N. Agarwal, S. Y. Lo, K. Lee, Q. Ji. Uncertainty-aware action decoupling transformer for action anticipation. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 18644\u201318654, 2024. 
DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.01764."},{"key":"1599_CR87","doi-asserted-by":"publisher","first-page":"6726","DOI":"10.1109\/WACV57701.2024.00660","volume-title":"Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"D Roy","year":"2024","unstructured":"D. Roy, R. Rajendiran, B. Fernando. Interaction region visual transformer for egocentric action anticipation. In Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 6726\u20136736, 2024. DOI: https:\/\/doi.org\/10.1109\/WACV57701.2024.00660."},{"key":"1599_CR88","doi-asserted-by":"publisher","first-page":"448","DOI":"10.1007\/978-3-031-73390-1_26","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"A Diko","year":"2025","unstructured":"A. Diko, D. Avola, B. Prenkaj, F. Fontana, L. Cinque. Semantically guided representation learning for action anticipation. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 448\u2013466, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-73390-1_26."},{"key":"1599_CR89","doi-asserted-by":"publisher","first-page":"8642","DOI":"10.1109\/WACV57701.2024.00846","volume-title":"Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"S Thakur","year":"2024","unstructured":"S. Thakur, C. Beyan, P. Morerio, V. Murino, A. Del Bue. Leveraging next-active objects for context-aware anticipation in egocentric videos. In Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 8642\u20138651, 2024. DOI: https:\/\/doi.org\/10.1109\/WACV57701.2024.00846."},{"key":"1599_CR90","doi-asserted-by":"publisher","first-page":"140","DOI":"10.1007\/978-3-031-73007-8_9","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"S Kim","year":"2025","unstructured":"S. Kim, D. Huang, Y. Xian, O. Hilliges, L. Van Gool, X. Wang. 
PALM: Predicting actions through language models. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 140\u2013158, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-73007-8_9."},{"key":"1599_CR91","doi-asserted-by":"publisher","first-page":"18580","DOI":"10.1109\/CVPR52733.2024.01758","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"H Mittal","year":"2024","unstructured":"H. Mittal, N. Agarwal, S. Y. Lo, K. Lee. Can\u2019t make an omelette without breaking some eggs: Plausible action anticipation using large video-language models. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 18580\u201318590, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.01758."},{"key":"1599_CR92","doi-asserted-by":"publisher","first-page":"5491","DOI":"10.1109\/ICCV.2019.00559","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"E Kazakos","year":"2019","unstructured":"E. Kazakos, A. Nagrani, A. Zisserman, D. Damen. EPIC-fusion: Audio-visual temporal binding for egocentric action recognition. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea, pp. 5491\u20135500, 2019. DOI: https:\/\/doi.org\/10.1109\/ICCV.2019.00559."},{"key":"1599_CR93","doi-asserted-by":"publisher","first-page":"1068","DOI":"10.1109\/WACV48630.2021.00111","volume-title":"Proceedings of IEEE Winter Conference on Applications of Computer Vision","author":"K Min","year":"2021","unstructured":"K. Min, J. J. Corso. Integrating human gaze into attention for egocentric activity recognition. In Proceedings of IEEE Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 1068\u20131077, 2021. 
DOI: https:\/\/doi.org\/10.1109\/WACV48630.2021.00111."},{"key":"1599_CR94","doi-asserted-by":"publisher","first-page":"399","DOI":"10.1007\/978-3-030-58520-4_24","volume-title":"Proceedings of the 16th European Conference on Computer Vision","author":"D Thapar","year":"2020","unstructured":"D. Thapar, C. Arora, A. Nigam. Is sharing of egocentric video giving away your biometric signature? In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, pp. 399\u2013416, 2020. DOI: https:\/\/doi.org\/10.1007\/978-3-030-58520-4_24."},{"key":"1599_CR95","doi-asserted-by":"publisher","first-page":"19903","DOI":"10.1109\/CVPR52688.2022.01931","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"C Plizzari","year":"2022","unstructured":"C. Plizzari, M. Planamente, G. Goletto, M. Cannici, E. Gusso, M. Matteucci, B. Caputo. E\u00b2(GO)MOTION: Motion augmented event stream for egocentric action recognition. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 19903\u201319915, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.01931."},{"key":"1599_CR96","doi-asserted-by":"publisher","first-page":"6939","DOI":"10.1109\/CVPR46437.2021.00687","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Y Li","year":"2021","unstructured":"Y. Li, T. Nagarajan, B. Xiong, K. Grauman. Ego-exo: Transferring visual representations from third-person to first-person videos. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, pp. 6939\u20136949, 2021. DOI: https:\/\/doi.org\/10.1109\/CVPR46437.2021.00687."},{"key":"1599_CR97","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"S Tan","year":"2023","unstructured":"S. Tan, T. Nagarajan, K. Grauman. 
EgoDistill: Egocentric head motion distillation for efficient video understanding. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1455, 2023."},{"key":"1599_CR98","doi-asserted-by":"publisher","first-page":"6481","DOI":"10.1109\/CVPR52729.2023.00627","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"X Gong","year":"2023","unstructured":"X. Gong, S. Mohan, N. Dhingra, J. C. Bazin, Y. Li, Z. Wang, R. Ranjan. MMG-Ego4D: Multi-modal generalization in egocentric action recognition. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 6481\u20136491, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.00627."},{"key":"1599_CR99","doi-asserted-by":"publisher","first-page":"13577","DOI":"10.1109\/CVPR52688.2022.01322","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"C Y Wu","year":"2022","unstructured":"C. Y. Wu, Y. Li, K. Mangalam, H. Fan, B. Xiong, J. Malik, C. Feichtenhofer. MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 13577\u201313587, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.01322."},{"key":"1599_CR100","doi-asserted-by":"publisher","first-page":"3323","DOI":"10.1109\/CVPR52688.2022.00333","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"S Yan","year":"2022","unstructured":"S. Yan, X. Xiong, A. Arnab, Z. Lu, M. Zhang, C. Sun, C. Schmid. Multiview transformers for video recognition. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 3323\u20133333, 2022. 
DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.00333."},{"key":"1599_CR101","doi-asserted-by":"publisher","unstructured":"X. Lu, Y. Hao, L. Cheng, S. Zhao, Y. Liu, M. Song. Mixed attention and channel shift transformer for efficient action recognition. ACM Transactions on Multimedia Computing, Communications and Applications, vol. 21, no. 3, Article number 93, 2025. DOI: https:\/\/doi.org\/10.1145\/3712594.","DOI":"10.1145\/3712594"},{"key":"1599_CR102","doi-asserted-by":"publisher","first-page":"2043","DOI":"10.1609\/aaai.v39i2.32201","volume-title":"Proceedings of the 39th AAAI Conference on Artificial Intelligence","author":"H Chen","year":"2025","unstructured":"H. Chen, Y. Yang, Y. Lyu. Skeleton-based action recognition with non-linear dependency modeling and Hilbert-Schmidt independence criterion. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, USA, pp. 2043\u20132051, 2025. DOI: https:\/\/doi.org\/10.1609\/aaai.v39i2.32201."},{"key":"1599_CR103","doi-asserted-by":"publisher","first-page":"11003","DOI":"10.1109\/TMM.2024.3428330","volume":"26","author":"P Geng","year":"2024","unstructured":"P. Geng, X. Lu, W. Li, L. Lyu. Hierarchical aggregated graph neural network for skeleton-based action recognition. IEEE Transactions on Multimedia, vol. 26, pp. 11003\u201311017, 2024. DOI: https:\/\/doi.org\/10.1109\/TMM.2024.3428330.","journal-title":"IEEE Transactions on Multimedia"},{"key":"1599_CR104","doi-asserted-by":"publisher","first-page":"9135","DOI":"10.1109\/TMM.2024.3386339","volume":"26","author":"Y Liu","year":"2024","unstructured":"Y. Liu, F. Liu, L. Jiao, Q. Bao, L. Li, Y. Guo, P. Chen. A knowledge-based hierarchical causal inference network for video action recognition. IEEE Transactions on Multimedia, vol. 26, pp. 9135\u20139149, 2024. DOI: https:\/\/doi.org\/10.1109\/TMM.2024.3386339.","journal-title":"IEEE Transactions on Multimedia"},{"key":"1599_CR105","doi-asserted-by":"publisher","unstructured":"Y. Liu, F. 
Liu, L. Jiao, Q. Bao, S. Li, L. Li, X. Liu. Knowledge-driven compositional action recognition. Pattern Recognition, vol. 163, Article number 111452, 2025. DOI: https:\/\/doi.org\/10.1016\/j.patcog.2025.111452.","DOI":"10.1016\/j.patcog.2025.111452"},{"issue":"4","key":"1599_CR106","doi-asserted-by":"publisher","first-page":"5921","DOI":"10.1109\/TNNLS.2024.3401711","volume":"36","author":"L Jiao","year":"2025","unstructured":"L. Jiao, M. Ma, P. He, X. Geng, X. Liu, F. Liu, W. Ma, S. Yang, B. Hou, X. Tang. Brain-inspired learning, perception, and cognition: A comprehensive review. IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 4, pp. 5921\u20135941, 2025. DOI: https:\/\/doi.org\/10.1109\/TNNLS.2024.3401711.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"1599_CR107","doi-asserted-by":"publisher","first-page":"13610","DOI":"10.1109\/ICCV51070.2023.01256","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"C Plizzari","year":"2023","unstructured":"C. Plizzari, T. Perrett, B. Caputo, D. Damen. What can a cook in Italy teach a mechanic in India? Action recognition generalisation over scenarios and locations. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 13610\u201313620, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.01256."},{"key":"1599_CR108","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1007\/978-3-031-73202-7_3","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"S Kundu","year":"2025","unstructured":"S. Kundu, S. Trehan, S. N. Aakur. Discovering novel actions from open world egocentric videos with object-grounded visual commonsense reasoning. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 39\u201356, 2025. 
DOI: https:\/\/doi.org\/10.1007\/978-3-031-73202-7_3."},{"key":"1599_CR109","doi-asserted-by":"publisher","first-page":"182","DOI":"10.1007\/978-3-031-73414-4_11","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"M Hatano","year":"2025","unstructured":"M. Hatano, R. Hachiuma, R. Fujii, H. Saito. Multimodal cross-domain few-shot learning for egocentric action recognition. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 182\u2013199, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-73414-4_11."},{"key":"1599_CR110","doi-asserted-by":"publisher","first-page":"14021","DOI":"10.1109\/CVPR42600.2020.01404","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Y Huang","year":"2020","unstructured":"Y. Huang, Y. Sugano, Y. Sato. Improving action segmentation via graph-based temporal reasoning. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 14021\u201314031, 2020. DOI: https:\/\/doi.org\/10.1109\/CVPR42600.2020.01404."},{"key":"1599_CR111","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"K Q Lin","year":"2022","unstructured":"K. Q. Lin, A. J. Wang, M. Soldan, M. Wray, R. Yan, E. Z. Xu, D. Gao, R. C. Tu, W. Zhao, W. Kong, C. Cai, H. Wang, D. Damen, B. Ghanem, W. Liu, M. Z. Shou. Egocentric video-language pretraining. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 550, 2022."},{"key":"1599_CR112","doi-asserted-by":"publisher","first-page":"253","DOI":"10.1007\/978-3-031-73220-1_15","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"C Quattrocchi","year":"2025","unstructured":"C. Quattrocchi, A. Furnari, D. Di Mauro, M. V. Giuffrida, G. M. Farinella. 
Synchronization is all you need: Exocentric-to-egocentric transfer for temporal action segmentation with unlabeled synchronized video pairs. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 253\u2013270, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-73220-1_15."},{"key":"1599_CR113","doi-asserted-by":"publisher","first-page":"18655","DOI":"10.1109\/CVPR52733.2024.01765","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"S P Lee","year":"2024","unstructured":"S. P. Lee, Z. Lu, Z. Zhang, M. Hoai, E. Elhamifar. Error detection in egocentric procedural task videos. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 18655\u201318666, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.01765."},{"key":"1599_CR114","doi-asserted-by":"publisher","first-page":"205","DOI":"10.1007\/978-3-031-72664-4_12","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"S Reza","year":"2025","unstructured":"S. Reza, Y. Zhang, M. Moghaddam, O. Camps. HAT: History-augmented anchor transformer for online temporal action localization. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 205\u2013222, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72664-4_12."},{"key":"1599_CR115","volume-title":"Fr\u00e9chet audio distance: A metric for evaluating music enhancement algorithms","author":"K Kilgour","year":"2018","unstructured":"K. Kilgour, M. Zuluaga, D. Roblek, M. Sharifi. Fr\u00e9chet audio distance: A metric for evaluating music enhancement algorithms, [Online], Available: https:\/\/arxiv.org\/abs\/1812.08466, 2018."},{"key":"1599_CR116","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49357.2023.10095969","volume-title":"Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing","author":"Y Wu","year":"2023","unstructured":"Y. Wu, K. Chen, T. 
Zhang, Y. Hui, T. Berg-Kirkpatrick, S. Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023. DOI: https:\/\/doi.org\/10.1109\/ICASSP49357.2023.10095969."},{"key":"1599_CR117","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"S Luo","year":"2023","unstructured":"S. Luo, C. Yan, C. Hu, H. Zhao. DIFF-FOLEY: Synchronized video-to-audio synthesis with latent diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 2121, 2023."},{"key":"1599_CR118","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP49357.2023.10096198","volume-title":"Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing","author":"J Huh","year":"2023","unstructured":"J. Huh, J. Chalk, E. Kazakos, D. Damen, A. Zisserman. Epic-sounds: A large-scale dataset of actions that sound. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 2023. DOI: https:\/\/doi.org\/10.1109\/ICASSP49357.2023.10096198."},{"key":"1599_CR119","doi-asserted-by":"publisher","first-page":"7300","DOI":"10.1109\/ICASSP48485.2024.10448486","volume-title":"Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing","author":"A M Oncescu","year":"2024","unstructured":"A. M. Oncescu, J. F. Henriques, A. Zisserman, S. Albanie, A. S. Koepke. A sound approach: Using large language models to generate audio descriptions for egocentric text-audio retrieval. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, pp. 7300\u20137304, 2024. 
DOI: https:\/\/doi.org\/10.1109\/ICASSP48485.2024.10448486."},{"key":"1599_CR120","doi-asserted-by":"publisher","first-page":"277","DOI":"10.1007\/978-3-031-72897-6_16","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"C Chen","year":"2025","unstructured":"C. Chen, P. Peng, A. Baid, S. Xue, W. N. Hsu, D. Harwath, K. Grauman. Action2Sound: Ambient-aware generation of action sounds from egocentric videos. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 277\u2013295, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72897-6_16."},{"key":"1599_CR121","doi-asserted-by":"publisher","first-page":"27242","DOI":"10.1109\/CVPR52733.2024.02573","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"C Chen","year":"2024","unstructured":"C. Chen, K. Ashutosh, R. Girdhar, D. Harwath, K. Grauman. Soundingactions: Learning how actions sound from narrated egocentric videos. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 27242\u201327252, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.02573."},{"key":"1599_CR122","doi-asserted-by":"publisher","first-page":"704","DOI":"10.1007\/978-3-030-58452-8_41","volume-title":"Proceedings of the 16th European Conference on Computer Vision","author":"M Liu","year":"2020","unstructured":"M. Liu, S. Tang, Y. Li, J. M. Rehg. Forecasting human-object interaction: Joint prediction of motor attention and actions in first person video. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, pp. 704\u2013721, 2020. DOI: https:\/\/doi.org\/10.1007\/978-3-030-58452-8_41."},{"key":"1599_CR123","doi-asserted-by":"publisher","first-page":"639","DOI":"10.1007\/978-3-031-19778-9_37","volume-title":"Proceedings of the 17th European Conference on Computer Vision","author":"W Jia","year":"2022","unstructured":"W. Jia, M. Liu, J. M. Rehg. 
Generative adversarial network for future hand segmentation from egocentric video. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, pp. 639\u2013656, 2022. DOI: https:\/\/doi.org\/10.1007\/978-3-031-19778-9_37."},{"key":"1599_CR124","doi-asserted-by":"publisher","first-page":"13656","DOI":"10.1109\/ICCV51070.2023.01260","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"W Bao","year":"2023","unstructured":"W. Bao, L. Chen, L. Zeng, Z. Li, Y. Xu, J. Yuan, Y. Kong. Uncertainty-aware state space transformer for egocentric 3D hand trajectory forecasting. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 13656\u201313665, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.01260."},{"key":"1599_CR125","doi-asserted-by":"publisher","first-page":"262","DOI":"10.5220\/0012306300003660","volume-title":"Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications","author":"S Abilkassov","year":"2024","unstructured":"S. Abilkassov, M. Gentner, M. Popa. Augmenting human-robot collaboration task by human hand position forecasting. In Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Rome, Italy, pp. 262\u2013269, 2024. DOI: https:\/\/doi.org\/10.5220\/0012306300003660."},{"issue":"10","key":"1599_CR126","doi-asserted-by":"publisher","first-page":"12581","DOI":"10.1109\/TPAMI.2023.3282631","volume":"45","author":"K Li","year":"2023","unstructured":"K. Li, Y. Wang, J. Zhang, P. Gao, G. Song, Y. Liu, H. Li, Y. Qiao. UniFormer: Unifying convolution and self-attention for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 12581\u201312600, 2023. 
DOI: https:\/\/doi.org\/10.1109\/TPAMI.2023.3282631.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1599_CR127","doi-asserted-by":"publisher","first-page":"169","DOI":"10.1007\/978-3-031-72667-5_10","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"B Tang","year":"2025","unstructured":"B. Tang, K. Zhang, W. Luo, W. Liu, H. Li. Prompting future driven diffusion model for hand motion prediction. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 169\u2013186, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72667-5_10."},{"key":"1599_CR128","doi-asserted-by":"publisher","first-page":"409","DOI":"10.1109\/CVPR.2018.00050","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"G Garcia-Hernando","year":"2018","unstructured":"G. Garcia-Hernando, S. Yuan, S. Baek, T. K. Kim. First-person hand action benchmark with RGB-D videos and 3D hand pose annotations. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, pp. 409\u2013419, 2018. DOI: https:\/\/doi.org\/10.1109\/CVPR.2018.00050."},{"key":"1599_CR129","doi-asserted-by":"publisher","first-page":"1143","DOI":"10.1109\/TIP.2020.3040521","volume":"30","author":"Y Wu","year":"2021","unstructured":"Y. Wu, L. Zhu, X. Wang, Y. Yang, F. Wu. Learning to anticipate egocentric actions by imagination. IEEE Transactions on Image Processing, vol. 30, pp. 1143\u20131152, 2021. DOI: https:\/\/doi.org\/10.1109\/TIP.2020.3040521.","journal-title":"IEEE Transactions on Image Processing"},{"key":"1599_CR130","doi-asserted-by":"publisher","first-page":"808","DOI":"10.1109\/WACV51458.2022.00088","volume-title":"Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"D Roy","year":"2022","unstructured":"D. Roy, B. Fernando. Action anticipation using latent goal learning. 
In Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 808\u2013816, 2022. DOI: https:\/\/doi.org\/10.1109\/WACV51458.2022.00088."},{"key":"1599_CR131","doi-asserted-by":"publisher","first-page":"558","DOI":"10.1007\/978-3-031-19830-4_32","volume-title":"Proceedings of the 17th European Conference on Computer Vision","author":"M Nawhal","year":"2022","unstructured":"M. Nawhal, A. A. Jyothi, G. Mori. Rethinking learning approaches for long-term action anticipation. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, pp. 558\u2013576, 2022. DOI: https:\/\/doi.org\/10.1007\/978-3-031-19830-4_32."},{"key":"1599_CR132","doi-asserted-by":"publisher","first-page":"13485","DOI":"10.1109\/ICCV48922.2021.01325","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"R Girdhar","year":"2021","unstructured":"R. Girdhar, K. Grauman. Anticipative video transformer. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 13485\u201313495, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.01325."},{"key":"1599_CR133","doi-asserted-by":"publisher","first-page":"23066","DOI":"10.1109\/CVPR52729.2023.02209","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"K Ashutosh","year":"2023","unstructured":"K. Ashutosh, R. Girdhar, L. Torresani, K. Grauman. HierVL: Learning hierarchical video-language embeddings. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 23066\u201323078, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.02209."},{"key":"1599_CR134","doi-asserted-by":"publisher","first-page":"781","DOI":"10.1007\/978-3-030-58526-6_46","volume-title":"Proceedings of the 16th European Conference on Computer Vision","author":"H Zhao","year":"2020","unstructured":"H. Zhao, R. P. Wildes. 
On diverse asynchronous activity anticipation. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, pp. 781\u2013799, 2020. DOI: https:\/\/doi.org\/10.1007\/978-3-030-58526-6_46."},{"key":"1599_CR135","doi-asserted-by":"publisher","first-page":"2976","DOI":"10.1109\/ICCV51070.2023.00279","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"M A Abdelsalam","year":"2023","unstructured":"M. A. Abdelsalam, S. B. Rangrej, I. Hadji, N. Dvornik, K. G. Derpanis, A. Fazly. GePSAn: Generative procedure step anticipation in cooking videos. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 2976\u20132985, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.00279."},{"key":"1599_CR136","volume-title":"Proceedings of the 34th International Conference on Neural Information Processing Systems","author":"J Ho","year":"2020","unstructured":"J. Ho, A. Jain, P. Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, Canada, Article number 574, 2020."},{"key":"1599_CR137","doi-asserted-by":"publisher","first-page":"135","DOI":"10.1007\/978-3-031-72673-6_8","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"B Lai","year":"2025","unstructured":"B. Lai, X. Dai, L. Chen, G. Pang, J. M. Rehg, M. Liu. LEGO: Learning EGOcentric action frame generation via visual instruction tuning. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 135\u2013155, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72673-6_8."},{"key":"1599_CR138","doi-asserted-by":"publisher","first-page":"10674","DOI":"10.1109\/CVPR52688.2022.01042","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"R Rombach","year":"2022","unstructured":"R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer. 
High-resolution image synthesis with latent diffusion models. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 10674\u201310685, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.01042."},{"key":"1599_CR139","volume-title":"EasyCom: An augmented reality dataset to support algorithms for easy communication in noisy environments","author":"J Donley","year":"2021","unstructured":"J. Donley, V. Tourbabin, J. S. Lee, M. Broyles, H. Jiang, J. Shen, M. Pantic, V. K. Ithapu, R. Mehra. EasyCom: An augmented reality dataset to support algorithms for easy communication in noisy environments, [Online], Available: https:\/\/arxiv.org\/abs\/2107.04174, 2021."},{"key":"1599_CR140","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72989-8_1","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"M Tran","year":"2025","unstructured":"M. Tran, Y. Kim, C. C. Su, C. H. Kuo, M. Sun, M. Soleymani. Ex2Eg-MAE: A framework for adaptation of exocentric video masked autoencoders for egocentric social role understanding. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72989-8_1."},{"key":"1599_CR141","doi-asserted-by":"publisher","first-page":"8250","DOI":"10.1109\/ICASSP48485.2024.10447323","volume-title":"Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing","author":"D Kong","year":"2024","unstructured":"D. Kong, F. Khan, X. Zhang, P. Singhal, Y. N. Wu. Long-term social interaction context: The key to egocentric addressee detection. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, pp. 8250\u20138254, 2024. 
DOI: https:\/\/doi.org\/10.1109\/ICASSP48485.2024.10447323."},{"issue":"6","key":"1599_CR142","doi-asserted-by":"publisher","first-page":"6783","DOI":"10.1109\/TPAMI.2020.3025105","volume":"45","author":"C G Northcutt","year":"2023","unstructured":"C. G. Northcutt, S. Zha, S. Lovegrove, R. Newcombe. EgoCom: A multi-person multi-modal egocentric communications dataset. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 6783\u20136793, 2023. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2020.3025105.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1599_CR143","doi-asserted-by":"publisher","first-page":"27048","DOI":"10.1109\/CVPR52733.2024.02555","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"S Majumder","year":"2024","unstructured":"S. Majumder, Z. Al-Halah, K. Grauman. Learning spatial features from audio-visual correspondence in egocentric videos. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 27048\u201327058, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.02555."},{"key":"1599_CR144","volume-title":"PCIE_LAM solution for Ego4D looking at me challenge","author":"K Lertniphonphan","year":"2024","unstructured":"K. Lertniphonphan, J. Xie, Y. Meng, S. Wang, F. Chen, Z. Wang. PCIE_LAM solution for Ego4D looking at me challenge, [Online], Available: https:\/\/arxiv.org\/abs\/2406.12211, 2024."},{"issue":"4","key":"1599_CR145","doi-asserted-by":"publisher","first-page":"4132","DOI":"10.1109\/LRA.2018.2861569","volume":"3","author":"N F Duarte","year":"2018","unstructured":"N. F. Duarte, M. Rakovi\u0107, J. Tasevski, M. I. Coco, A. Billard, J. Santos-Victor. Action anticipation: Reading the intentions of humans and robots. IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4132\u20134139, 2018. 
DOI: https:\/\/doi.org\/10.1109\/LRA.2018.2861569.","journal-title":"IEEE Robotics and Automation Letters"},{"key":"1599_CR146","doi-asserted-by":"publisher","first-page":"7924","DOI":"10.1109\/CVPR.2019.00812","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"H Li","year":"2019","unstructured":"H. Li, Y. Cai, W. S. Zheng. Deep dual relation modeling for egocentric interaction recognition. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, pp. 7924\u20137933, 2019. DOI: https:\/\/doi.org\/10.1109\/CVPR.2019.00812."},{"key":"1599_CR147","doi-asserted-by":"publisher","first-page":"6570","DOI":"10.18653\/v1\/2023.findings-acl.411","volume-title":"Proceedings of Findings of the Association for Computational Linguistics","author":"B Lai","year":"2023","unstructured":"B. Lai, H. Zhang, M. Liu, A. Pariani, F. Ryan, W. Jia, S. A. Hayati, J. Rehg, D. Yang. Werewolf among us: Multimodal resources for modeling persuasion behaviors in social deduction games. In Proceedings of Findings of the Association for Computational Linguistics, Toronto, Canada, pp. 6570\u20136588, 2023. DOI: https:\/\/doi.org\/10.18653\/v1\/2023.findings-acl.411."},{"key":"1599_CR148","doi-asserted-by":"publisher","first-page":"1774","DOI":"10.1109\/RO-MAN60168.2024.10731376","volume-title":"Proceedings of the 33rd IEEE International Conference on Robot and Human Interactive Communication","author":"C Grimaldi","year":"2024","unstructured":"C. Grimaldi, A. Rossi, S. Rossi. I am part of the robot\u2019s group: Evaluating engagement and group membership from egocentric views. In Proceedings of the 33rd IEEE International Conference on Robot and Human Interactive Communication, Pasadena, USA, pp. 1774\u20131779, 2024. DOI: https:\/\/doi.org\/10.1109\/RO-MAN60168.2024.10731376."},{"key":"1599_CR149","doi-asserted-by":"publisher","unstructured":"H. Lu, W. O. Brimijoin. 
Sound source selection based on head movements in natural group conversation. Trends in Hearing, vol. 26, Article number 23312165221097789, 2022. DOI: https:\/\/doi.org\/10.1177\/23312165221097789.","DOI":"10.1177\/23312165221097789"},{"key":"1599_CR150","doi-asserted-by":"publisher","first-page":"5460","DOI":"10.1109\/ICASSP48485.2024.10446324","volume-title":"Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing","author":"Y Yin","year":"2024","unstructured":"Y. Yin, I. Ananthabhotla, V. K. Ithapu, S. Petridis, Y. H. Wu, C. Miller. Hearing loss detection from facial expressions in one-on-one conversations. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, Republic of Korea, pp. 5460\u20135464, 2024. DOI: https:\/\/doi.org\/10.1109\/ICASSP48485.2024.10446324."},{"key":"1599_CR151","doi-asserted-by":"publisher","first-page":"10534","DOI":"10.1109\/CVPR52688.2022.01029","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"H Jiang","year":"2022","unstructured":"H. Jiang, C. Murdock, V. K. Ithapu. Egocentric deep multi-channel audio-visual active speaker localization. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 10534\u201310542, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.01029."},{"key":"1599_CR152","doi-asserted-by":"publisher","unstructured":"E. Chong, E. Clark-Whitney, A. Southerland, E. Stubbs, C. Miller, E. L. Ajodan, M. R. Silverman, C. Lord, A. Rozga, R. M. Jones, J. M. Rehg. Detection of eye contact with deep neural networks is as accurate as human experts. Nature Communications, vol. 11, no. 1, Article number 6386, 2020. 
DOI: https:\/\/doi.org\/10.1038\/s41467-020-19712-x.","DOI":"10.1038\/s41467-020-19712-x"},{"key":"1599_CR153","doi-asserted-by":"publisher","first-page":"2310","DOI":"10.1109\/CVPR52729.2023.00229","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Z Xue","year":"2023","unstructured":"Z. Xue, Y. Song, K. Grauman, L. Torresani. Egocentric video task translation. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 2310\u20132320, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.00229."},{"key":"1599_CR154","doi-asserted-by":"publisher","first-page":"81","DOI":"10.1007\/978-981-16-1103-2_8","volume-title":"Proceedings of the 5th International Conference on Computer Vision and Image Processing","author":"A Choudhary","year":"2021","unstructured":"A. Choudhary, D. Mishra, A. Karmakar. Domain adaptive egocentric person re-identification. In Proceedings of the 5th International Conference on Computer Vision and Image Processing, Prayagraj, India, pp. 81\u201392, 2021. DOI: https:\/\/doi.org\/10.1007\/978-981-16-1103-2_8."},{"key":"1599_CR155","doi-asserted-by":"publisher","first-page":"7593","DOI":"10.1109\/CVPR.2018.00792","volume-title":"Proceedings of IEEE Conference on Computer Vision and Pattern Recognition","author":"T Yagi","year":"2018","unstructured":"T. Yagi, K. Mangalam, R. Yonetani, Y. Sato. Future person localization in first-person videos. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, pp. 7593\u20137602, 2018. DOI: https:\/\/doi.org\/10.1109\/CVPR.2018.00792."},{"key":"1599_CR156","doi-asserted-by":"publisher","unstructured":"K. Chen, H. Zhu, D. Tang, K. Zheng. Future pedestrian location prediction in first-person videos for autonomous vehicles and social robots. Image and Vision Computing, vol. 134, Article number 104671, 2023. 
DOI: https:\/\/doi.org\/10.1016\/j.imavis.2023.104671.","DOI":"10.1016\/j.imavis.2023.104671"},{"key":"1599_CR157","doi-asserted-by":"publisher","first-page":"20053","DOI":"10.1109\/ICCV51070.2023.01840","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"C Zhu","year":"2023","unstructured":"C. Zhu, F. Xiao, A. Alvarado, Y. Babaei, J. Hu, H. El-Mohri, S. C. Culatana, R. Sumbaly, Z. Yan. EgoObjects: A large-scale egocentric dataset for fine-grained object understanding. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 20053\u201320063, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.01840."},{"key":"1599_CR158","doi-asserted-by":"publisher","first-page":"5205","DOI":"10.1109\/ICCV51070.2023.00482","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"P Akiva","year":"2023","unstructured":"P. Akiva, J. Huang, K. J. Liang, R. Kovvuri, X. Chen, M. Feiszli, K. Dana, T. Hassner. Self-supervised object detection from egocentric videos. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 5205\u20135214, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.00482."},{"key":"1599_CR159","doi-asserted-by":"publisher","first-page":"19189","DOI":"10.1109\/ICCV51070.2023.01763","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"J Z Wu","year":"2023","unstructured":"J. Z. Wu, D. J. Zhang, W. Hsu, M. Zhang, M. Z. Shou. Label-efficient online continual object detection in streaming video. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 19189\u201319198, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.01763."},{"key":"1599_CR160","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"A Darkhalil","year":"2022","unstructured":"A. Darkhalil, D. Shan, B. Zhu, J. 
Ma, A. Kar, R. Higgins, S. Fidler, D. Fouhey, D. Damen. Epic-kitchens visor benchmark video segmentations and object relations. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 999, 2022."},{"key":"1599_CR161","doi-asserted-by":"publisher","first-page":"22836","DOI":"10.1109\/CVPR52729.2023.02187","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"P Tokmakov","year":"2023","unstructured":"P. Tokmakov, J. Li, A. Gaidon. Breaking the \u201cobject\u201d in video object segmentation. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 22836\u201322845, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.02187."},{"key":"1599_CR162","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"H Jiang","year":"2023","unstructured":"H. Jiang, S. K. Ramakrishnan, K. Grauman. Single-stage visual query localization in egocentric videos. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 1049, 2023."},{"key":"1599_CR163","doi-asserted-by":"publisher","first-page":"2593","DOI":"10.1109\/CVPR52729.2023.00255","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"M Xu","year":"2023","unstructured":"M. Xu, Y. Li, C. Y. Fu, B. Ghanem, T. Xiang, J. M. P\u00e9rez-R\u00faa. Where is my wallet? Modeling object proposal sets for egocentric visual query localization. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 2593\u20132603, 2023. 
DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.00255."},{"key":"1599_CR164","doi-asserted-by":"publisher","first-page":"3697","DOI":"10.1109\/CVPR52734.2025.00350","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"S Khosla","year":"2025","unstructured":"S. Khosla, S. T. V, A. Schwing, D. Hoiem. Relocate: A simple training-free baseline for visual query localization using region-based representations. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, pp. 3697\u20133706, 2025. DOI: https:\/\/doi.org\/10.1109\/CVPR52734.2025.00350."},{"key":"1599_CR165","doi-asserted-by":"publisher","first-page":"45","DOI":"10.1109\/ICCV51070.2023.00011","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"J Mai","year":"2023","unstructured":"J. Mai, A. Hamdi, S. Giancola, C. Zhao, B. Ghanem. EgoLoc: Revisiting 3D object localization from egocentric videos with visual queries. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 45\u201357, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.00011."},{"key":"1599_CR166","doi-asserted-by":"publisher","first-page":"22910","DOI":"10.1109\/CVPR52729.2023.02194","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"C Huang","year":"2023","unstructured":"C. Huang, Y. Tian, A. Kumar, C. Xu. Egocentric audio-visual object localization. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 22910\u201322921, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.02194."},{"key":"1599_CR167","doi-asserted-by":"publisher","first-page":"209","DOI":"10.1109\/TMM.2024.3521746","volume":"27","author":"Z Shi","year":"2025","unstructured":"Z. Shi, Q. Wu, F. Meng, L. Xu, H. Li. Cross-modal cognitive consensus guided audio-visual segmentation. 
IEEE Transactions on Multimedia, vol. 27, pp. 209\u2013223, 2025. DOI: https:\/\/doi.org\/10.1109\/TMM.2024.3521746.","journal-title":"IEEE Transactions on Multimedia"},{"key":"1599_CR168","doi-asserted-by":"publisher","first-page":"13855","DOI":"10.1109\/ICCV51070.2023.01278","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"C Zhang","year":"2023","unstructured":"C. Zhang, A. Gupta, A. Zisserman. Helping hands: An object-aware ego-centric video recognition model. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 13855\u201313866, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.01278."},{"key":"1599_CR169","doi-asserted-by":"publisher","first-page":"20382","DOI":"10.1109\/ICCV51070.2023.01869","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"J Yu","year":"2023","unstructured":"J. Yu, X. Li, X. Zhao, H. Zhang, Y. X. Wang. Video state-changing object segmentation. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 20382\u201320391, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.01869."},{"key":"1599_CR170","doi-asserted-by":"publisher","first-page":"910","DOI":"10.1109\/3DV53792.2021.00099","volume-title":"Proceedings of International Conference on 3D Vision","author":"V Tschernezki","year":"2021","unstructured":"V. Tschernezki, D. Larlus, A. Vedaldi. NeuralDiff: Segmenting 3D objects that move in egocentric videos. In Proceedings of International Conference on 3D Vision, London, UK, pp. 910\u2013919, 2021. DOI: https:\/\/doi.org\/10.1109\/3DV53792.2021.00099."},{"key":"1599_CR171","doi-asserted-by":"publisher","first-page":"1461","DOI":"10.1109\/CVPR52729.2023.00147","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"M Huang","year":"2023","unstructured":"M. Huang, X. Li, J. Hu, H. Peng, S. Lyu. 
Tracking multiple deformable objects in egocentric videos. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 1461\u20131471, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.00147."},{"key":"1599_CR172","doi-asserted-by":"publisher","first-page":"382","DOI":"10.1007\/978-3-031-72775-7_22","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"Q Gu","year":"2025","unstructured":"Q. Gu, Z. Lv, D. Frost, S. Green, J. Straub, C. Sweeney. EgoLifter: Open-world 3D segmentation for egocentric perception. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 382\u2013400, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72775-7_22."},{"key":"1599_CR173","doi-asserted-by":"publisher","first-page":"3992","DOI":"10.1109\/ICCV51070.2023.00371","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"A Kirillov","year":"2023","unstructured":"A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W. Y. Lo, P. Doll\u00e1r, R. Girshick. Segment anything. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 3992\u20134003, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.00371."},{"key":"1599_CR174","doi-asserted-by":"publisher","unstructured":"P. H\u00fcbner, K. Clintworth, Q. Liu, M. Weinmann, S. Wursthorn. Evaluation of hololens tracking and depth sensing for indoor mapping applications. Sensors, vol. 20, no. 4, Article number 1021, 2020. DOI: https:\/\/doi.org\/10.3390\/s20041021.","DOI":"10.3390\/s20041021"},{"key":"1599_CR175","doi-asserted-by":"publisher","unstructured":"X. Yi, Y. Zhou, M. Habermann, V. Golyanik, S. Pan, C. Theobalt, F. Xu. EgoLocate: Real-time motion capture, localization, and mapping with sparse body-mounted sensors. ACM Transactions on Graphics, vol. 42, no. 4, Article number 76, 2023. 
DOI: https:\/\/doi.org\/10.1145\/3592099.","DOI":"10.1145\/3592099"},{"key":"1599_CR176","doi-asserted-by":"publisher","first-page":"3437","DOI":"10.1109\/IROS55552.2023.10341922","volume-title":"Proceedings of IEEE\/RSJ International Conference on Intelligent Robots and Systems","author":"A Rosinol","year":"2023","unstructured":"A. Rosinol, J. J. Leonard, L. Carlone. NeRF-SLAM: Real-time dense monocular SLAM with neural radiance fields. In Proceedings of IEEE\/RSJ International Conference on Intelligent Robots and Systems, Detroit, USA, pp. 3437\u20133444, 2023. DOI: https:\/\/doi.org\/10.1109\/IROS55552.2023.10341922."},{"issue":"1","key":"1599_CR177","doi-asserted-by":"publisher","first-page":"99","DOI":"10.1145\/3503250","volume":"65","author":"B Mildenhall","year":"2022","unstructured":"B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, R. Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, vol. 65, no. 1, pp. 99\u2013106, 2022. DOI: https:\/\/doi.org\/10.1145\/3503250.","journal-title":"Communications of the ACM"},{"key":"1599_CR178","doi-asserted-by":"publisher","unstructured":"H. Yin, B. Liu, M. Kaufmann, J. He, S. Christen, J. Song, P. Hui. EgoHDM: A real-time egocentric-inertial human motion capture, localization, and dense mapping system. ACM Transactions on Graphics, vol. 43, no. 6, Article number 236, 2024. DOI: https:\/\/doi.org\/10.1145\/3687907.","DOI":"10.1145\/3687907"},{"key":"1599_CR179","doi-asserted-by":"publisher","first-page":"14.1","DOI":"10.5244\/C.31.14","volume-title":"Proceedings of British Machine Vision Conference","author":"M Trumble","year":"2017","unstructured":"M. Trumble, A. Gilbert, C. Malleson, A. Hilton, J. Collomosse. Total capture: 3D human pose estimation fusing video and inertial sensors. In Proceedings of British Machine Vision Conference, London, UK, pp. 14.1\u201314.13, 2017. 
DOI: https:\/\/doi.org\/10.5244\/C.31.14."},{"key":"1599_CR180","doi-asserted-by":"publisher","first-page":"4868","DOI":"10.1109\/CVPR52688.2022.00483","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"G Berton","year":"2022","unstructured":"G. Berton, C. Masone, B. Caputo. Rethinking visual geo-localization for large-scale applications. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 4868\u20134878, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.00483."},{"key":"1599_CR181","doi-asserted-by":"publisher","first-page":"2997","DOI":"10.1109\/WACV56688.2023.00301","volume-title":"Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"A Ali-Bey","year":"2023","unstructured":"A. Ali-Bey, B. Chaib-Draa, P. Gigu\u00e9re. MixVPR: Feature mixing for visual place recognition. In Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 2997\u20133006, 2023. DOI: https:\/\/doi.org\/10.1109\/WACV56688.2023.00301."},{"key":"1599_CR182","doi-asserted-by":"publisher","unstructured":"T. Suveges, S. McKenna. Unsupervised mapping and semantic user localisation from first-person monocular video. Pattern Recognition, vol. 158, Article number 110923, 2025. DOI: https:\/\/doi.org\/10.1016\/j.patcog.2024.110923.","DOI":"10.1016\/j.patcog.2024.110923"},{"key":"1599_CR183","doi-asserted-by":"publisher","first-page":"3685","DOI":"10.1109\/CVPRW63382.2024.00372","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Y Huang","year":"2024","unstructured":"Y. Huang, M. A. Hassan, J. He, J. Higgins, M. McCrory, H. Eicher-Miller, J. G. Thomas, E. Sazonov, F. Zhu. Automatic recognition of food ingestion environment from the AIM-2 wearable sensor. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 
3685\u20133694, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPRW63382.2024.00372."},{"key":"1599_CR184","doi-asserted-by":"publisher","first-page":"170","DOI":"10.1109\/CVPRW50498.2020.00027","volume-title":"Proceedings of 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops","author":"H Blanton","year":"2020","unstructured":"H. Blanton, C. Greenwell, S. Workman, N. Jacobs. Extending absolute pose regression to multiple scenes. In Proceedings of 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, USA, pp. 170\u2013178, 2020. DOI: https:\/\/doi.org\/10.1109\/CVPRW50498.2020.00027."},{"key":"1599_CR185","doi-asserted-by":"publisher","first-page":"2713","DOI":"10.1109\/ICCV48922.2021.00273","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"Y Shavit","year":"2021","unstructured":"Y. Shavit, R. Ferens, Y. Keller. Learning multi-scene absolute pose regression with transformers. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 2713\u20132722, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00273."},{"key":"1599_CR186","doi-asserted-by":"publisher","first-page":"11122","DOI":"10.1109\/CVPR52688.2022.01085","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"T Do","year":"2022","unstructured":"T. Do, O. Miksik, J. DeGol, H. S. Park, S. N. Sinha. Learning to detect scene landmarks for camera localization. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 11122\u201311132, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.01085."},{"key":"1599_CR187","doi-asserted-by":"publisher","first-page":"589","DOI":"10.1007\/978-3-031-20047-2_34","volume-title":"Proceedings of the 17th European Conference on Computer Vision","author":"V Panek","year":"2022","unstructured":"V. Panek, Z. Kukelova, T. Sattler. 
MeshLoc: Mesh-based visual localization. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, pp. 589\u2013609, 2022. DOI: https:\/\/doi.org\/10.1007\/978-3-031-20047-2_34."},{"key":"1599_CR188","doi-asserted-by":"publisher","first-page":"19370","DOI":"10.1109\/CVPR52729.2023.01856","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"S Zhu","year":"2023","unstructured":"S. Zhu, L. Yang, C. Chen, M. Shah, X. Shen, H. Wang. R2Former: Unified retrieval and reranking transformer for place recognition. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 19370\u201319380, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.01856."},{"key":"1599_CR189","doi-asserted-by":"publisher","first-page":"506","DOI":"10.1145\/3664647.3681628","volume-title":"Proceedings of the 32nd ACM International Conference on Multimedia","author":"H Lin","year":"2024","unstructured":"H. Lin, C. Long, Y. Fei, Q. Xia, E. Yin, B. Yin, X. Yang. Exploring matching rates: From keypoint selection to camera relocalization. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, pp. 506\u2013514, 2024. DOI: https:\/\/doi.org\/10.1145\/3664647.3681628."},{"key":"1599_CR190","doi-asserted-by":"publisher","unstructured":"H. Xiong, L. Wang, H. Qiu, T. Zhao, B. Qiu, H. Li. Adaptively forget with crossmodal and textual distillation for class-incremental video captioning. Neurocomputing, vol. 624, Article number 129388, 2025. DOI: https:\/\/doi.org\/10.1016\/j.neucom.2025.129388.","DOI":"10.1016\/j.neucom.2025.129388"},{"issue":"6","key":"1599_CR191","doi-asserted-by":"publisher","first-page":"6832","DOI":"10.1109\/TPAMI.2021.3118077","volume":"45","author":"P Nagar","year":"2023","unstructured":"P. Nagar, A. Rathore, C. V. Jawahar, C. Arora. Generating personalized summaries of day long egocentric videos. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 6832\u20136845, 2023. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2021.3118077.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1599_CR192","doi-asserted-by":"publisher","first-page":"185","DOI":"10.1109\/WACV51458.2022.00026","volume-title":"Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"M Elfeki","year":"2022","unstructured":"M. Elfeki, L. Wang, A. Borji. Multi-stream dynamic video summarization. In Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 185\u2013195, 2022. DOI: https:\/\/doi.org\/10.1109\/WACV51458.2022.00026."},{"key":"1599_CR193","first-page":"2504","volume-title":"Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops","author":"V S Furlan","year":"2018","unstructured":"V. S. Furlan, R. Bajcsy, E. R. Nascimento. Fast forwarding egocentric videos by listening and watching. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, USA, pp. 2504\u20132507, 2018."},{"key":"1599_CR194","doi-asserted-by":"publisher","first-page":"3260","DOI":"10.1109\/WACV45572.2020.9093330","volume-title":"Proceedings of IEEE Winter Conference on Applications of Computer Vision","author":"W L S Ramos","year":"2020","unstructured":"W. L. S. Ramos, M. M. Silva, E. R. Araujo, A. C. Neves, E. R. Nascimento. Personalizing fast-forward videos based on visual and textual features from social network. In Proceedings of IEEE Winter Conference on Applications of Computer Vision, Snowmass, USA, pp. 3260\u20133269, 2020. 
DOI: https:\/\/doi.org\/10.1109\/WACV45572.2020.9093330."},{"key":"1599_CR195","doi-asserted-by":"publisher","first-page":"10493","DOI":"10.1109\/CVPR52688.2022.01025","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"G Wu","year":"2022","unstructured":"G. Wu, J. Lin, C. T. Silva. IntentVizor: Towards generic query guided interactive video summarization. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 10493\u201310502, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.01025."},{"key":"1599_CR196","doi-asserted-by":"publisher","unstructured":"A. Sahu, A. S. Chowdhury. Egocentric video co-summarization using transfer learning and refined random walk on a constrained graph. Pattern Recognition, vol. 134, Article number 109128, 2023. DOI: https:\/\/doi.org\/10.1016\/j.patcog.2022.109128.","DOI":"10.1016\/j.patcog.2022.109128"},{"key":"1599_CR197","doi-asserted-by":"publisher","first-page":"50","DOI":"10.1016\/j.patrec.2024.03.012","volume":"181","author":"Z Dai","year":"2024","unstructured":"Z. Dai, V. Tran, A. Markham, N. Trigoni, M. A. Rahman, L. N. S. Wijayasingha, J. Stankovic, C. Li. EgoCap and EgoFormer: First-person image captioning with context fusion. Pattern Recognition Letters, vol. 181, pp. 50\u201356, 2024. DOI: https:\/\/doi.org\/10.1016\/j.patrec.2024.03.012.","journal-title":"Pattern Recognition Letters"},{"issue":"2","key":"1599_CR198","doi-asserted-by":"publisher","first-page":"679","DOI":"10.1109\/TCYB.2023.3243999","volume":"54","author":"J Qiu","year":"2024","unstructured":"J. Qiu, F. P. W. Lo, X. Gu, M. L. Jobarteh, W. Jia, T. Baranowski, M. Steiner-Asiedu, A. K. Anderson, M. A. McCrory, E. Sazonov, M. Sun, G. Frost, B. Lo. Egocentric image captioning for privacy-preserved passive dietary intake monitoring. IEEE Transactions on Cybernetics, vol. 54, no. 2, pp. 679\u2013692, 2024. 
DOI: https:\/\/doi.org\/10.1109\/TCYB.2023.3243999.","journal-title":"IEEE Transactions on Cybernetics"},{"key":"1599_CR199","doi-asserted-by":"publisher","first-page":"40","DOI":"10.1145\/3675095.3676611","volume-title":"Proceedings of ACM International Symposium on Wearable Computers","author":"V Parikh","year":"2024","unstructured":"V. Parikh, S. Mahmud, D. Agarwal, K. Li, F. Guimbreti\u00e8re, C. Zhang. EchoGuide: Active acoustic guidance for LLM-based eating event analysis from egocentric videos. In Proceedings of ACM International Symposium on Wearable Computers, Melbourne, Australia, pp. 40\u201347, 2024. DOI: https:\/\/doi.org\/10.1145\/3675095.3676611."},{"key":"1599_CR200","doi-asserted-by":"publisher","first-page":"14867","DOI":"10.1109\/CVPR52729.2023.01428","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"B He","year":"2023","unstructured":"B. He, J. Wang, J. Qiu, T. Bui, A. Shrivastava, Z. Wang. Align and attend: Multimodal summarization with dual contrastive losses. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 14867\u201314878, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.01428."},{"key":"1599_CR201","doi-asserted-by":"publisher","first-page":"13525","DOI":"10.1109\/CVPR52733.2024.01284","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"J Xu","year":"2024","unstructured":"J. Xu, Y. Huang, J. Hou, G. Chen, Y. Zhang, R. Feng, W. Xie. Retrieval-augmented egocentric video captioning. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 13525\u201313536, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.01284."},{"issue":"1","key":"1599_CR202","doi-asserted-by":"publisher","first-page":"118","DOI":"10.1007\/s11263-023-01857-z","volume":"132","author":"R Han","year":"2024","unstructured":"R. Han, W. Feng, F. Wang, Z. Qian, H. 
Yan, S. Wang. Benchmarking the complementary-view multi-human association and tracking. International Journal of Computer Vision, vol. 132, no. 1, pp. 118\u2013136, 2024. DOI: https:\/\/doi.org\/10.1007\/s11263-023-01857-z.","journal-title":"International Journal of Computer Vision"},{"issue":"1","key":"1599_CR203","doi-asserted-by":"publisher","first-page":"445","DOI":"10.1109\/TCSVT.2024.3453277","volume":"35","author":"F Yang","year":"2025","unstructured":"F. Yang, S. Yamao, I. Kusajima, A. Moteki, S. Masui, S. Jiang. YOWO: You only walk once to jointly map an indoor scene and register ceiling-mounted cameras. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 445\u2013460, 2025. DOI: https:\/\/doi.org\/10.1109\/TCSVT.2024.3453277.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"1599_CR204","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"Z Xue","year":"2023","unstructured":"Z. Xue, K. Grauman. Learning fine-grained view-invariant representations from unpaired ego-exo videos via temporal alignment. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 2336, 2023."},{"key":"1599_CR205","doi-asserted-by":"publisher","first-page":"18","DOI":"10.1007\/978-3-031-73039-9_2","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"J H Jang","year":"2025","unstructured":"J. H. Jang, H. Seo, S. Y. Chun. INTRA: Interaction relationship-aware weakly supervised affordance grounding. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 18\u201334, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-73039-9_2."},{"key":"1599_CR206","doi-asserted-by":"publisher","unstructured":"T. D. Truong, K. Luu. Cross-view action recognition understanding from exocentric to egocentric perspective. 
Neurocomputing, vol. 614, Article number 128731, 2025. DOI: https:\/\/doi.org\/10.1016\/j.neucom.2024.128731.","DOI":"10.1016\/j.neucom.2024.128731"},{"key":"1599_CR207","doi-asserted-by":"publisher","first-page":"7396","DOI":"10.1109\/CVPR.2018.00772","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"G A Sigurdsson","year":"2018","unstructured":"G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, K. Alahari. Actor and observer: Joint modeling of first and third-person videos. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, USA, pp. 7396\u20137404, 2018. DOI: https:\/\/doi.org\/10.1109\/CVPR.2018.00772."},{"key":"1599_CR208","doi-asserted-by":"publisher","first-page":"409","DOI":"10.1007\/978-3-031-72691-0_23","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"F Cheng","year":"2025","unstructured":"F. Cheng, M. Luo, H. Wang, A. Dimakis, L. Torresani, G. Bertasius, K. Grauman. 4DIFF: 3D-aware diffusion model for third-to-first viewpoint translation. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 409\u2013427, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72691-0_23."},{"key":"1599_CR209","doi-asserted-by":"publisher","first-page":"407","DOI":"10.1007\/978-3-031-72920-1_23","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"M Luo","year":"2025","unstructured":"M. Luo, Z. Xue, A. Dimakis, K. Grauman. Put myself in your shoes: Lifting the egocentric perspective from exocentric videos. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 407\u2013425, 2025. 
DOI: https:\/\/doi.org\/10.1007\/978-3-031-72920-1_23."},{"key":"1599_CR210","doi-asserted-by":"publisher","first-page":"4359","DOI":"10.1109\/ICCVW.2019.00536","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision Workshops","author":"C Fan","year":"2019","unstructured":"C. Fan. EgoVQA - An egocentric video question answering benchmark dataset. In Proceedings of IEEE\/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, pp. 4359\u20134366, 2019. DOI: https:\/\/doi.org\/10.1109\/ICCVW.2019.00536."},{"key":"1599_CR211","doi-asserted-by":"publisher","first-page":"1655","DOI":"10.1109\/ICCV48922.2021.00170","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"D Gao","year":"2021","unstructured":"D. Gao, R. Wang, Z. Bai, X. Chen. Env-QA: A video question answering benchmark for comprehensive understanding of dynamic environments. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Montreal, Canada, pp. 1655\u20131665, 2021. DOI: https:\/\/doi.org\/10.1109\/ICCV48922.2021.00170."},{"key":"1599_CR212","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"B Jia","year":"2022","unstructured":"B. Jia, T. Lei, S. C. Zhu, S. Huang. EgoTaskQA: Understanding human tasks in egocentric videos. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 242, 2022."},{"key":"1599_CR213","doi-asserted-by":"publisher","first-page":"14773","DOI":"10.1109\/CVPR52729.2023.01419","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"D Gao","year":"2023","unstructured":"D. Gao, L. Zhou, L. Ji, L. Zhu, Y. Yang, M. Z. Shou. MIST: Multi-modal iterative spatial-temporal transformer for long-form video question answering. 
In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 14773\u201314783, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.01419."},{"key":"1599_CR214","doi-asserted-by":"publisher","first-page":"92","DOI":"10.1007\/978-3-031-72624-8_6","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"G Goletto","year":"2025","unstructured":"G. Goletto, T. Nagarajan, G. Averta, D. Damen. AMEGO: Active memory from long EGOcentric videos. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 92\u2013110, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72624-8_6."},{"key":"1599_CR215","doi-asserted-by":"publisher","first-page":"6302","DOI":"10.1109\/CVPR.2019.00647","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"L Yu","year":"2019","unstructured":"L. Yu, X. Chen, G. Gkioxari, M. Bansal, T. L. Berg, D. Batra. Multi-target embodied question answering. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, USA, pp. 6302\u20136311, 2019. DOI: https:\/\/doi.org\/10.1109\/CVPR.2019.00647."},{"key":"1599_CR216","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"X Ma","year":"2023","unstructured":"X. Ma, S. Yong, Z. Zheng, Q. Li, Y. Liang, S. C. Zhu, S. Huang. SQA3D: Situated question answering in 3D scenes. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 2023."},{"key":"1599_CR217","doi-asserted-by":"publisher","first-page":"14931","DOI":"10.1109\/CVPR52729.2023.01434","volume-title":"Proceedings of 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"H Zhu","year":"2023","unstructured":"H. Zhu, R. Kapoor, S. Y. Min, W. Han, J. Li, K. Geng, G. Neubig, Y. Bisk, A. Kembhavi, L. Weihs. EXCALIBUR: Encouraging and evaluating embodied exploration. 
In Proceedings of 2023 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, Canada, pp. 14931\u201314942, 2023. DOI: https:\/\/doi.org\/10.1109\/CVPR52729.2023.01434."},{"key":"1599_CR218","doi-asserted-by":"publisher","first-page":"485","DOI":"10.1007\/978-3-031-20059-5_28","volume-title":"Proceedings of the 17th European Conference on Computer Vision","author":"B Wong","year":"2022","unstructured":"B. Wong, J. Chen, Y. Wu, S. W. Lei, D. Mao, D. Gao, M. Z. Shou. AssistQ: Affordance-centric question-driven task completion for egocentric assistant. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, pp. 485\u2013501, 2022. DOI: https:\/\/doi.org\/10.1007\/978-3-031-20059-5_28."},{"key":"1599_CR219","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-72913-3_1","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"T Hummel","year":"2025","unstructured":"T. Hummel, S. Karthik, M. I. Georgescu, Z. Akata. EgoCVR: An egocentric benchmark for fine-grained composed video retrieval. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72913-3_1."},{"key":"1599_CR220","doi-asserted-by":"publisher","first-page":"14291","DOI":"10.1109\/CVPR52733.2024.01355","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"S Cheng","year":"2024","unstructured":"S. Cheng, Z. Guo, J. Wu, K. Fang, P. Li, H. Liu, Y. Liu. EgoThink: Evaluating first-person perspective thinking capability of vision-language models. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 14291\u201314302, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.01355."},{"key":"1599_CR221","first-page":"8748","volume-title":"Proceedings of the 38th International Conference on Machine Learning","author":"A Radford","year":"2021","unstructured":"A. 
Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pp. 8748\u20138763, 2021."},{"key":"1599_CR222","volume-title":"Proceedings of the 36th International Conference on Neural Information Processing Systems","author":"Z Tong","year":"2022","unstructured":"Z. Tong, Y. Song, J. Wang, L. Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 732, 2022."},{"key":"1599_CR223","doi-asserted-by":"publisher","first-page":"5262","DOI":"10.1109\/ICCV51070.2023.00487","volume-title":"Proceedings of IEEE\/CVF International Conference on Computer Vision","author":"S Pramanick","year":"2023","unstructured":"S. Pramanick, Y. Song, S. Nag, K. Q. Lin, H. Shah, M. Z. Shou, R. Chellappa, P. Zhang. EgoVLPv2: Egocentric video-language pre-training with fusion in the backbone. In Proceedings of IEEE\/CVF International Conference on Computer Vision, Paris, France, pp. 5262\u20135274, 2023. DOI: https:\/\/doi.org\/10.1109\/ICCV51070.2023.00487."},{"key":"1599_CR224","doi-asserted-by":"publisher","first-page":"2847","DOI":"10.1109\/CVPR.2012.6248010","volume-title":"Proceedings of IEEE Conference on Computer Vision and Pattern Recognition","author":"H Pirsiavash","year":"2012","unstructured":"H. Pirsiavash, D. Ramanan. Detecting activities of daily living in first-person camera views. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Providence, USA, pp. 2847\u20132854, 2012. 
DOI: https:\/\/doi.org\/10.1109\/CVPR.2012.6248010."},{"key":"1599_CR225","doi-asserted-by":"publisher","first-page":"20981","DOI":"10.1109\/CVPR52688.2022.02034","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Y Liu","year":"2022","unstructured":"Y. Liu, Y. Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, L. Yi. HOI4D: A 4D egocentric dataset for category-level human-object interaction. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 20981\u201320990, 2022. DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.02034."},{"key":"1599_CR226","doi-asserted-by":"publisher","first-page":"377","DOI":"10.1007\/978-3-031-72913-3_21","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"A Bar","year":"2025","unstructured":"A. Bar, A. Bakhtiar, D. Tran, A. Loquercio, J. Rajasegaran, Y. LeCun, A. Globerson, T. Darrell. EgoPet: Egomotion and interaction data from an animal\u2019s perspective. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 377\u2013394, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72913-3_21."},{"key":"1599_CR227","doi-asserted-by":"publisher","unstructured":"M. Bock, H. Kuehne, K. Van Laerhoven, M. Moeller. WEAR: An outdoor sports dataset for wearable and egocentric activity recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, vol. 8, no. 4, Article number 175, 2024. DOI: https:\/\/doi.org\/10.1145\/3699776.","DOI":"10.1145\/3699776"},{"key":"1599_CR228","doi-asserted-by":"publisher","first-page":"19383","DOI":"10.1109\/CVPR52733.2024.01834","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"K Grauman","year":"2024","unstructured":"K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V. Baiyya, S. Bansal, B. Boote, E. Byrne, Z. 
Chavis, J. Chen, F. Cheng, F. J. Chu, S. Crane, A. Dasgupta, J. Dong, M. Escobar, C. Forigua, A. Gebreselasie, S. Haresh, J. Huang, M. Islam, S. Jain, R. Khirodkar, D. Kukreja, K. J. Liang, J. W. Liu, S. Majumder, Y. Mao, M. Martin, E. Mavroudi, T. Nagarajan, F. Ragusa, S. K. Ramakrishnan, L. Seminara, A. Somayazulu, Y. Song, S. Su, Z. Xue, E. Zhang, J. Zhang, A. Castillo, C. Chen, X. Fu, R. Furuta, C. Gonz\u00e1lez, P. Gupta, J. Hu, Y. Huang, Y. Huang, W. Khoo, A. Kumar, R. Kuo, S. Lakhavani, M. Liu, M. Luo, Z. Luo, B. Meredith, A. Miller, O. Oguntola, X. Pan, P. Peng, S. Pramanick, M. Ramazanova, F. Ryan, W. Shan, K. Somasundaram, C. Song, A. Southerland, M. Tateno, H. Wang, Y. Wang, T. Yagi, M. Yan, X. Yang, Z. Yu, S. C. Zha, C. Zhao, Z. Zhao, Z. Zhu, J. Zhuo, P. Arbel\u00e1ez, G. Bertasius, D. Damen, J. Engel, G. M. Farinella, A. Furnari, B. Ghanem, J. Hoffman, C. V. Jawahar, R. Newcombe, H. S. Park, J. M. Rehg, Y. Sato, M. Savva, J. Shi, M. Z. Shou, M. Wray. Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 19383\u201319400, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.01834."},{"key":"1599_CR229","doi-asserted-by":"publisher","first-page":"22072","DOI":"10.1109\/CVPR52733.2024.02084","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Y Huang","year":"2024","unstructured":"Y. Huang, G. Chen, J. Xu, M. Zhang, L. Yang, B. Pei, H. Zhang, L. Dong, Y. Wang, L. Wang, Y. Qiao. EgoExoLearn: A dataset for bridging asynchronous ego- and exo-centric view of procedural activities in real world. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 22072\u201322086, 2024. 
DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.02084."},{"key":"1599_CR230","doi-asserted-by":"publisher","first-page":"363","DOI":"10.1007\/978-3-031-72661-3_21","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"Y M Li","year":"2025","unstructured":"Y. M. Li, W. J. Huang, A. L. Wang, L. A. Zeng, J. K. Meng, W. S. Zheng. EgoExo-fitness: Towards egocentric and exocentric full-body action understanding. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 363\u2013382, 2025. DOI: https:\/\/doi.org\/10.1007\/978-3-031-72661-3_21."},{"key":"1599_CR231","doi-asserted-by":"publisher","first-page":"21740","DOI":"10.1109\/CVPR52733.2024.02054","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Y Liu","year":"2024","unstructured":"Y. Liu, H. Yang, X. Si, L. Liu, Z. Li, Y. Zhang, Y. Liu, L. Yi. TACO: Benchmarking generalizable bimanual tool-action-object understanding. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, USA, pp. 21740\u201321751, 2024. DOI: https:\/\/doi.org\/10.1109\/CVPR52733.2024.02054."},{"key":"1599_CR232","doi-asserted-by":"publisher","first-page":"445","DOI":"10.1007\/978-3-031-72691-0_25","volume-title":"Proceedings of the 18th European Conference on Computer Vision","author":"L Ma","year":"2025","unstructured":"L. Ma, Y. Ye, F. Hong, V. Guzov, Y. Jiang, R. Postyeni, L. Pesqueira, A. Gamino, V. Baiyya, H. J. Kim, K. Bailey, D. S. Fosas, C. K. Liu, Z. Liu, J. Engel, R. De Nardi, R. Newcombe. Nymeria: A massive collection of multimodal egocentric daily motion in the wild. In Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, pp. 445\u2013465, 2025. 
DOI: https:\/\/doi.org\/10.1007\/978-3-031-72691-0_25."},{"key":"1599_CR233","doi-asserted-by":"publisher","first-page":"4353","DOI":"10.1109\/WACV57701.2024.00431","volume-title":"Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"T J Schoonbeek","year":"2024","unstructured":"T. J. Schoonbeek, T. Houben, H. Onvlee, P. H. N. de With, F. Van der Sommen. IndustReal: A dataset for procedure step recognition handling execution errors in egocentric videos in an industrial-like setting. In Proceedings of IEEE\/CVF Winter Conference on Applications of Computer Vision, Waikoloa, USA, pp. 4353\u20134362, 2024. DOI: https:\/\/doi.org\/10.1109\/WACV57701.2024.00431."},{"key":"1599_CR234","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"H Tang","year":"2023","unstructured":"H. Tang, K. J. Liang, K. Grauman, M. Feiszli, W. Wang. EgoTracks: A long-term egocentric visual object tracking dataset. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, USA, Article number 3309, 2023."},{"key":"1599_CR235","volume-title":"EgoMe: A new dataset and challenge for following me via egocentric view in real world","author":"H Qiu","year":"2025","unstructured":"H. Qiu, Z. Shi, L. Wang, H. Xiong, X. Li, H. Li. EgoMe: A new dataset and challenge for following me via egocentric view in real world, [Online], Available: https:\/\/arxiv.org\/abs\/2501.19061, 2025."},{"issue":"6","key":"1599_CR236","doi-asserted-by":"publisher","first-page":"1945","DOI":"10.1007\/s11263-023-01962-z","volume":"132","author":"H Luo","year":"2024","unstructured":"H. Luo, W. Zhai, J. Zhang, Y. Cao, D. Tao. Grounded affordance from exocentric view. International Journal of Computer Vision, vol. 132, no. 6, pp. 1945\u20131969, 2024. 
DOI: https:\/\/doi.org\/10.1007\/s11263-023-01962-z.","journal-title":"International Journal of Computer Vision"},{"issue":"12","key":"1599_CR237","doi-asserted-by":"publisher","first-page":"11236","DOI":"10.1109\/TPAMI.2024.3457229","volume":"46","author":"Y Dai","year":"2024","unstructured":"Y. Dai, Z. Wang, X. Lin, C. Wen, L. Xu, S. Shen, Y. Ma, C. Wang. HiSC4D: Human-centered interaction and 4D scene capture in large-scale space using wearable IMUs and LiDAR. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11236\u201311253, 2024. DOI: https:\/\/doi.org\/10.1109\/TPAMI.2024.3457229.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"1599_CR238","doi-asserted-by":"publisher","first-page":"15979","DOI":"10.1109\/CVPR52688.2022.01553","volume-title":"Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"K He","year":"2022","unstructured":"K. He, X. Chen, S. Xie, Y. Li, P. Doll\u00e1r, R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of IEEE\/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, USA, pp. 15979\u201315988, 2022. 
DOI: https:\/\/doi.org\/10.1109\/CVPR52688.2022.01553."}],"container-title":["Machine Intelligence Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-025-1599-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s11633-025-1599-4","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s11633-025-1599-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,2]],"date-time":"2026-02-02T15:03:39Z","timestamp":1770044619000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s11633-025-1599-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2]]},"references-count":238,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,2]]}},"alternative-id":["1599"],"URL":"https:\/\/doi.org\/10.1007\/s11633-025-1599-4","relation":{},"ISSN":["2731-538X","2731-5398"],"issn-type":[{"value":"2731-538X","type":"print"},{"value":"2731-5398","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2]]},"assertion":[{"value":"9 April 2025","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 September 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"2 February 2026","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}