{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:59:40Z","timestamp":1750309180675,"version":"3.41.0"},"reference-count":80,"publisher":"Association for Computing Machinery (ACM)","issue":"5","license":[{"start":{"date-parts":[[2024,1,11]],"date-time":"2024-01-11T00:00:00Z","timestamp":1704931200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"National Key Research and Development Project of New Generation Artificial Intelligence of China","award":["2018AAA0102500"],"award-info":[{"award-number":["2018AAA0102500"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,5,31]]},"abstract":"<jats:p>Predicting the unknown from the first-person perspective is expected as a necessary step toward machine intelligence, which is essential for practical applications including autonomous driving and robotics. As a human-level task, egocentric action anticipation aims at predicting an unknown action seconds before it is performed from the first-person viewpoint. Egocentric actions are usually provided as verb-noun pairs; however, predicting the unknown action may be trapped in insufficient training data for all possible combinations. Therefore, it is crucial for intelligent systems to use limited known verb-noun pairs to predict new combinations of actions that have never appeared, which is known as compositional generalization. In this article, we are the first to explore the egocentric compositional action anticipation problem, which is more in line with real-world settings but neglected by existing studies. 
Whereas prediction results are prone to suffer from semantic bias considering the distinct difference between training and test distributions, we further introduce a general and flexible adaptive semantic debiasing framework that is compatible with different deep neural networks. To capture and mitigate semantic bias, we can imagine one counterfactual situation where no visual representations have been observed and only semantic patterns of observation are used to predict the next action. Instead of the traditional counterfactual analysis scheme that reduces semantic bias in a mindless way, we devise a novel counterfactual analysis scheme to adaptively amplify or penalize the effect of semantic experience by considering the discrepancy both among categories and among examples. We also demonstrate that the traditional counterfactual analysis scheme is a special case of the devised adaptive counterfactual analysis scheme. We conduct experiments on three large-scale egocentric video datasets. 
Experimental results verify the superiority and effectiveness of our proposed solution.<\/jats:p>","DOI":"10.1145\/3633333","type":"journal-article","created":{"date-parts":[[2023,12,4]],"date-time":"2023-12-04T11:48:40Z","timestamp":1701690520000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Toward Egocentric Compositional Action Anticipation with Adaptive Semantic Debiasing"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5096-3548","authenticated-orcid":false,"given":"Tianyu","family":"Zhang","sequence":"first","affiliation":[{"name":"Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6668-9208","authenticated-orcid":false,"given":"Weiqing","family":"Min","sequence":"additional","affiliation":[{"name":"Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0680-5179","authenticated-orcid":false,"given":"Tao","family":"Liu","sequence":"additional","affiliation":[{"name":"Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1596-4326","authenticated-orcid":false,"given":"Shuqiang","family":"Jiang","sequence":"additional","affiliation":[{"name":"Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences and University of Chinese Academy of Sciences, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9142-5914","authenticated-orcid":false,"given":"Yong","family":"Rui","sequence":"additional","affiliation":[{"name":"Lenovo Group, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,1,11]]},"reference":[{"issue":"5","key":"e_1_3_1_2_2","doi-asserted-by":"crossref","first-page":"744","DOI":"10.1109\/TCSVT.2015.2409731","article-title":"The evolution of first person vision methods: A survey","volume":"25","author":"Betancourt Alejandro","year":"2015","unstructured":"Alejandro Betancourt, Pietro Morerio, Carlo S. Regazzoni, and Matthias Rauterberg. 2015. The evolution of first person vision methods: A survey. IEEE Transactions on Circuits and Systems for Video Technology 25, 5 (2015), 744\u2013760.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"key":"e_1_3_1_3_2","first-page":"3312","volume-title":"Proceedings of the International Conference on Pattern Recognition","author":"Camporese Guglielmo","year":"2021","unstructured":"Guglielmo Camporese, Pasquale Coscia, Antonino Furnari, Giovanni Maria Farinella, and Lamberto Ballan. 2021. Knowledge distillation for action anticipation via label smoothing. In Proceedings of the International Conference on Pattern Recognition. 3312\u20133319."},{"key":"e_1_3_1_4_2","first-page":"9824","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Chen Guangyi","year":"2021","unstructured":"Guangyi Chen, Junlong Li, Jiwen Lu, and Jie Zhou. 2021. Human trajectory prediction via counterfactual analysis. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 
9824\u20139833."},{"issue":"6","key":"e_1_3_1_5_2","doi-asserted-by":"crossref","first-page":"2205","DOI":"10.3982\/ECTA10582","article-title":"Inference on counterfactual distributions","volume":"81","author":"Chernozhukov Victor","year":"2013","unstructured":"Victor Chernozhukov, Iv\u00e1n Fern\u00e1ndez-Val, and Blaise Melly. 2013. Inference on counterfactual distributions. Econometrica 81, 6 (2013), 2205\u20132268.","journal-title":"Econometrica"},{"issue":"1","key":"e_1_3_1_6_2","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1007\/s11263-021-01531-2","article-title":"Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100","volume":"130","author":"Damen Dima","year":"2022","unstructured":"Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et\u00a0al. 2022. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision 130, 1 (2022), 33\u201355.","journal-title":"International Journal of Computer Vision"},{"key":"e_1_3_1_7_2","first-page":"720","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Damen Dima","year":"2018","unstructured":"Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et\u00a0al. 2018. Scaling egocentric vision: The EPIC-KITCHENS dataset. In Proceedings of the European Conference on Computer Vision. 720\u2013736."},{"key":"e_1_3_1_8_2","first-page":"1549","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Geest Roeland De","year":"2018","unstructured":"Roeland De Geest and Tinne Tuytelaars. 2018. Modeling temporal structure with LSTM for online action detection. 
In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 1549\u20131557."},{"key":"e_1_3_1_9_2","article-title":"Forecasting action through contact representations from first person video","author":"Dessalene Eadom","year":"2021","unstructured":"Eadom Dessalene, Chinmaya Devaraj, Michael Maynord, Cornelia Fermuller, and Yiannis Aloimonos. 2021. Forecasting action through contact representations from first person video. IEEE Transactions on Pattern Analysis and Machine Intelligence. Early Access, January 28, 2021.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence."},{"key":"e_1_3_1_10_2","first-page":"1","volume-title":"Proceedings of the European Conference on Computer Vision Workshops","author":"Furnari Antonino","year":"2018","unstructured":"Antonino Furnari, Sebastiano Battiato, and Giovanni Maria Farinella. 2018. Leveraging uncertainty to rethink loss functions and evaluation measures for egocentric action anticipation. In Proceedings of the European Conference on Computer Vision Workshops. 1\u201317."},{"key":"e_1_3_1_11_2","first-page":"6252","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Furnari Antonino","year":"2019","unstructured":"Antonino Furnari and Giovanni Maria Farinella. 2019. What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 6252\u20136261."},{"issue":"11","key":"e_1_3_1_12_2","doi-asserted-by":"crossref","first-page":"4021","DOI":"10.1109\/TPAMI.2020.2992889","article-title":"Rolling-unrolling LSTMs for action anticipation from first-person video","volume":"43","author":"Furnari Antonino","year":"2021","unstructured":"Antonino Furnari and Giovanni Maria Farinella. 2021. Rolling-unrolling LSTMs for action anticipation from first-person video. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 11 (2021), 4021\u20134036.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_13_2","first-page":"1250","volume-title":"Proceedings of the International Conference on Pattern Recognition","author":"Furnari Antonino","year":"2022","unstructured":"Antonino Furnari and Giovanni Maria Farinella. 2022. Towards streaming egocentric action anticipation. In Proceedings of the International Conference on Pattern Recognition. 1250\u20131257."},{"key":"e_1_3_1_14_2","first-page":"1","article-title":"RED: Reinforced encoder-decoder networks for action anticipation","author":"Gao Jiyang","year":"2017","unstructured":"Jiyang Gao, Zhenheng Yang, and Ram Nevatia. 2017. RED: Reinforced encoder-decoder networks for action anticipation. In Proceedings of the British Machine Vision Conference. 1\u201311.","journal-title":"Proceedings of the British Machine Vision Conference"},{"key":"e_1_3_1_15_2","first-page":"13505","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Girdhar Rohit","year":"2021","unstructured":"Rohit Girdhar and Kristen Grauman. 2021. Anticipative video transformer. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 13505\u201313515."},{"issue":"8","key":"e_1_3_1_16_2","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. 
Neural Computation 9, 8 (1997), 1735\u20131780.","journal-title":"Neural Computation"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2022.03.069"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1145\/3409332"},{"key":"e_1_3_1_19_2","first-page":"245","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Huang Yi","year":"2021","unstructured":"Yi Huang, Xiaoshan Yang, and Changsheng Xu. 2021. Multimodal global relation knowledge distillation for egocentric action anticipation. In Proceedings of the ACM International Conference on Multimedia. 245\u2013254."},{"key":"e_1_3_1_20_2","doi-asserted-by":"crossref","first-page":"134611","DOI":"10.1109\/ACCESS.2021.3115476","article-title":"Video action understanding: A tutorial","author":"Hutchinson Matthew S.","year":"2021","unstructured":"Matthew S. Hutchinson and Vijay N. Gadepally. 2021. Video action understanding: A tutorial. IEEE Access 9 (2021), 134611\u2013134637.","journal-title":"IEEE Access"},{"key":"e_1_3_1_21_2","first-page":"3118","volume-title":"Proceedings of the IEEE International Conference on Robotics and Automation","author":"Jain Ashesh","year":"2016","unstructured":"Ashesh Jain, Avi Singh, Hema S Koppula, Shane Soh, and Ashutosh Saxena. 2016. Recurrent neural networks for driver activity anticipation via sensory-fusion architecture. In Proceedings of the IEEE International Conference on Robotics and Automation. 3118\u20133125."},{"key":"e_1_3_1_22_2","volume-title":"Thinking, Fast and Slow","author":"Kahneman Daniel","year":"2011","unstructured":"Daniel Kahneman. 2011. Thinking, Fast and Slow. Macmillan."},{"issue":"3","key":"e_1_3_1_23_2","doi-asserted-by":"crossref","first-page":"395","DOI":"10.2189\/asqu.53.3.395","article-title":"A political mediation model of corporate response to social movement activism","volume":"53","author":"King Brayden G.","year":"2008","unstructured":"Brayden G. King. 2008. 
A political mediation model of corporate response to social movement activism. Administrative Science Quarterly 53, 3 (2008), 395\u2013421.","journal-title":"Administrative Science Quarterly"},{"issue":"5","key":"e_1_3_1_24_2","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1080\/00029890.1992.11995869","article-title":"Two notes on notation","volume":"99","author":"Knuth Donald E.","year":"1992","unstructured":"Donald E. Knuth. 1992. Two notes on notation. American Mathematical Monthly 99, 5 (1992), 403\u2013422.","journal-title":"American Mathematical Monthly"},{"key":"e_1_3_1_25_2","volume-title":"Towards More Human-Like Concept Learning in Machines: Compositionality, Causality, and Learning-to-Learn","author":"Lake Brenden M.","year":"2014","unstructured":"Brenden M. Lake. 2014. Towards More Human-Like Concept Learning in Machines: Compositionality, Causality, and Learning-to-Learn. Ph.D. Dissertation. Massachusetts Institute of Technology, Cambridge, MA."},{"key":"e_1_3_1_26_2","first-page":"1","article-title":"Building machines that learn and think like people","volume":"40","author":"Lake Brenden M.","year":"2017","unstructured":"Brenden M. Lake, Tomer Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. 2017. Building machines that learn and think like people. Behavioral and Brain Sciences 40 (2017), 1\u2013101.","journal-title":"Behavioral and Brain Sciences"},{"key":"e_1_3_1_27_2","first-page":"3216","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Li Yin","year":"2013","unstructured":"Yin Li, Alireza Fathi, and James M. Rehg. 2013. Learning to predict gaze in egocentric video. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 3216\u20133223."},{"key":"e_1_3_1_28_2","first-page":"619","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Li Yin","year":"2018","unstructured":"Yin Li, Miao Liu, and James M. Rehg. 2018. 
In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European Conference on Computer Vision. 619\u2013635."},{"key":"e_1_3_1_29_2","first-page":"7083","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Lin Ji","year":"2019","unstructured":"Ji Lin, Chuang Gan, and Song Han. 2019. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 7083\u20137093."},{"key":"e_1_3_1_30_2","first-page":"704","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Liu Miao","year":"2020","unstructured":"Miao Liu, Siyu Tang, Yin Li, and James M. Rehg. 2020. Forecasting human-object interaction: Joint prediction of motor attention and actions in first person video. In Proceedings of the European Conference on Computer Vision. 704\u2013721."},{"key":"e_1_3_1_31_2","first-page":"3282","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Liu Shaowei","year":"2022","unstructured":"Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, and Xiaolong Wang. 2022. Joint hand motion and interaction hotspots prediction from egocentric videos. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 3282\u20133292."},{"key":"e_1_3_1_32_2","first-page":"13904","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Liu Tianshan","year":"2022","unstructured":"Tianshan Liu and Kin-Man Lam. 2022. A hybrid egocentric activity anticipation framework via memory-augmented recurrent and one-shot representation forecasting. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
13904\u201313913."},{"key":"e_1_3_1_33_2","first-page":"687","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Liu Xiaohao","year":"2022","unstructured":"Xiaohao Liu, Zhulin Tao, Jiahong Shao, Lifang Yang, and Xianglin Huang. 2022. EliMRec: Eliminating single-modal bias in multimedia recommendation. In Proceedings of the ACM International Conference on Multimedia. 687\u2013695."},{"key":"e_1_3_1_34_2","first-page":"559","volume-title":"Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics","author":"Luo Zhekun","year":"2022","unstructured":"Zhekun Luo, Shalini Ghosh, Devin Guillory, Keizo Kato, Trevor Darrell, and Huijuan Xu. 2022. Disentangled action recognition with knowledge bases. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics. 559\u2013572."},{"key":"e_1_3_1_35_2","article-title":"Motion stimulation for compositional action recognition","author":"Ma Lei","year":"2022","unstructured":"Lei Ma, Yuhui Zheng, Zhao Zhang, Yazhou Yao, Xijian Fan, and Qiaolin Ye. 2022. Motion stimulation for compositional action recognition. IEEE Transactions on Circuits and Systems for Video Technology. Early Access, November 14, 2022.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology."},{"key":"e_1_3_1_36_2","first-page":"1942","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Ma Shugao","year":"2016","unstructured":"Shugao Ma, Leonid Sigal, and Stan Sclaroff. 2016. Learning activity progression in LSTMs for activity detection and early detection. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
1942\u20131950."},{"key":"e_1_3_1_37_2","first-page":"1049","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Materzynska Joanna","year":"2020","unstructured":"Joanna Materzynska, Tete Xiao, Roei Herzig, Huijuan Xu, Xiaolong Wang, and Trevor Darrell. 2020. Something-else: Compositional action recognition with spatial-temporal interaction networks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 1049\u20131059."},{"key":"e_1_3_1_38_2","first-page":"1","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Mikolov Tomas","year":"2013","unstructured":"Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems. 1\u20139."},{"key":"e_1_3_1_39_2","first-page":"163","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Nagarajan Tushar","year":"2020","unstructured":"Tushar Nagarajan, Yanghao Li, Christoph Feichtenhofer, and Kristen Grauman. 2020. Ego-TOPO: Environment affordances from egocentric video. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 163\u2013172."},{"key":"e_1_3_1_40_2","first-page":"4220","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Nakamura Katsuyuki","year":"2021","unstructured":"Katsuyuki Nakamura, Hiroki Ohashi, and Mitsuhiro Okada. 2021. Sensor-augmented egocentric-video captioning with dynamic modal attention. In Proceedings of the ACM International Conference on Multimedia. 
4220\u20134229."},{"key":"e_1_3_1_41_2","first-page":"12700","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Niu Yulei","year":"2021","unstructured":"Yulei Niu, Kaihua Tang, Hanwang Zhang, Zhiwu Lu, Xian-Sheng Hua, and Ji-Rong Wen. 2021. Counterfactual VQA: A cause-effect look at language bias. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 12700\u201312710."},{"key":"e_1_3_1_42_2","first-page":"16292","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Niu Yulei","year":"2021","unstructured":"Yulei Niu and Hanwang Zhang. 2021. Introspective distillation for robust question answering. In Proceedings of the Advances in Neural Information Processing Systems. 16292\u201316304."},{"key":"e_1_3_1_43_2","first-page":"174","volume-title":"Proceedings of the International Conference on Image Analysis and Recognition","author":"N\u00fa\u00f1ez-Marcos Adri\u00e1n","year":"2020","unstructured":"Adri\u00e1n N\u00fa\u00f1ez-Marcos, Gorka Azkune, Eneko Agirre, Diego L\u00f3pez-de Ipi\u00f1a, and Ignacio Arganda-Carreras. 2020. Using external knowledge to improve zero-shot action recognition in egocentric videos. In Proceedings of the International Conference on Image Analysis and Recognition. 174\u2013185."},{"key":"e_1_3_1_44_2","first-page":"3437","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops","author":"Osman Nada","year":"2021","unstructured":"Nada Osman, Guglielmo Camporese, Pasquale Coscia, and Lamberto Ballan. 2021. SlowFast rolling-unrolling LSTMs for action anticipation in egocentric videos. In Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops. 
3437\u20133445."},{"issue":"3","key":"e_1_3_1_45_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3487042","article-title":"Causal inference with knowledge distilling and curriculum learning for unbiased VQA","volume":"18","author":"Pan Yonghua","year":"2022","unstructured":"Yonghua Pan, Zechao Li, Liyan Zhang, and Jinhui Tang. 2022. Causal inference with knowledge distilling and curriculum learning for unbiased VQA. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 3 (2022), 1\u201323.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_1_46_2","volume-title":"Causal Inference in Statistics: A Primer","author":"Pearl Judea","year":"2016","unstructured":"Judea Pearl, Madelyn Glymour, and Nicholas P. Jewell. 2016. Causal Inference in Statistics: A Primer. John Wiley & Sons."},{"key":"e_1_3_1_47_2","volume-title":"The Book of Why: The New Science of Cause and Effect","author":"Pearl Judea","year":"2018","unstructured":"Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books."},{"key":"e_1_3_1_48_2","article-title":"Self-regulated learning for egocentric video activity anticipation","author":"Qi Zhaobo","year":"2021","unstructured":"Zhaobo Qi, Shuhui Wang, Chi Su, Li Su, Qingming Huang, and Qi Tian. 2021. Self-regulated learning for egocentric video activity anticipation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Early Access, February 17, 2021.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence."},{"key":"e_1_3_1_49_2","first-page":"5434","volume-title":"Proceedings of the Annual Meeting of the Association for Computational Linguistics","author":"Qian Chen","year":"2021","unstructured":"Chen Qian, Fuli Feng, Lijie Wen, Chunping Ma, and Pengjun Xie. 2021. Counterfactual inference for text classification debiasing. 
In Proceedings of the Annual Meeting of the Association for Computational Linguistics. 5434\u20135445."},{"key":"e_1_3_1_50_2","first-page":"1","volume-title":"Proceedings of the British Machine Vision Conference","author":"Radevski Gorjan","year":"2021","unstructured":"Gorjan Radevski, Marie-Francine Moens, and Tinne Tuytelaars. 2021. Revisiting spatio-temporal layouts for compositional action recognition. In Proceedings of the British Machine Vision Conference. 1\u201316."},{"key":"e_1_3_1_51_2","first-page":"1025","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Rao Yongming","year":"2021","unstructured":"Yongming Rao, Guangyi Chen, Jiwen Lu, and Jie Zhou. 2021. Counterfactual attention learning for fine-grained visual categorization and re-identification. In Proceedings of the IEEE\/CVF International Conference on Computer Vision. 1025\u20131034."},{"key":"e_1_3_1_52_2","first-page":"91","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Ren Shaoqing","year":"2015","unstructured":"Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems. 91\u201399."},{"issue":"5","key":"e_1_3_1_53_2","doi-asserted-by":"crossref","first-page":"1511","DOI":"10.1093\/ije\/dyt127","article-title":"Mediation analysis in epidemiology: Methods, interpretation and bias","volume":"42","author":"Richiardi Lorenzo","year":"2013","unstructured":"Lorenzo Richiardi, Rino Bellocco, and Daniela Zugna. 2013. Mediation analysis in epidemiology: Methods, interpretation and bias. 
International Journal of Epidemiology 42, 5 (2013), 1511\u20131519.","journal-title":"International Journal of Epidemiology"},{"key":"e_1_3_1_54_2","doi-asserted-by":"crossref","first-page":"103252","DOI":"10.1016\/j.cviu.2021.103252","article-title":"Predicting the future from first person (egocentric) vision: A survey","volume":"211","author":"Rodin Ivan","year":"2021","unstructured":"Ivan Rodin, Antonino Furnari, Dimitrios Mavroeidis, and Giovanni Maria Farinella. 2021. Predicting the future from first person (egocentric) vision: A survey. Computer Vision and Image Understanding 211 (2021), 103252\u201310370.","journal-title":"Computer Vision and Image Understanding"},{"key":"e_1_3_1_55_2","first-page":"337","volume-title":"Proceedings of the International Conference on Image Analysis and Processing","author":"Rodin Ivan","year":"2022","unstructured":"Ivan Rodin, Antonino Furnari, Dimitrios Mavroeidis, and Giovanni Maria Farinella. 2022. Untrimmed action anticipation. In Proceedings of the International Conference on Image Analysis and Processing. 337\u2013348."},{"key":"e_1_3_1_56_2","doi-asserted-by":"crossref","first-page":"8116","DOI":"10.1109\/TIP.2021.3113114","article-title":"Action anticipation using pairwise human-object interactions and transformers","volume":"30","author":"Roy Debaditya","year":"2021","unstructured":"Debaditya Roy and Basura Fernando. 2021. Action anticipation using pairwise human-object interactions and transformers. IEEE Transactions on Image Processing 30 (2021), 8116\u20138129.","journal-title":"IEEE Transactions on Image Processing"},{"key":"e_1_3_1_57_2","doi-asserted-by":"crossref","first-page":"4330","DOI":"10.1109\/TIP.2021.3070732","article-title":"Together recognizing, localizing and summarizing actions in egocentric videos","volume":"30","author":"Sahu Abhimanyu","year":"2021","unstructured":"Abhimanyu Sahu and Ananda S. Chowdhury. 2021. Together recognizing, localizing and summarizing actions in egocentric videos. 
IEEE Transactions on Image Processing 30 (2021), 4330\u20134340.","journal-title":"IEEE Transactions on Image Processing"},{"key":"e_1_3_1_58_2","first-page":"154","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Sener Fadime","year":"2020","unstructured":"Fadime Sener, Dipika Singhania, and Angela Yao. 2020. Temporal aggregate representations for long-range video understanding. In Proceedings of the European Conference on Computer Vision. 154\u2013171."},{"key":"e_1_3_1_59_2","first-page":"568","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems. 568\u2013576."},{"key":"e_1_3_1_60_2","article-title":"Learning to recognize actions on objects in egocentric video with attention dictionaries","author":"Sudhakaran Swathikiran","year":"2021","unstructured":"Swathikiran Sudhakaran, Sergio Escalera, and Oswald Lanz. 2021. Learning to recognize actions on objects in egocentric video with attention dictionaries. IEEE Transactions on Pattern Analysis and Machine Intelligence. Early Access, February 11, 2021.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence."},{"key":"e_1_3_1_61_2","first-page":"3220","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Sun Pengzhan","year":"2021","unstructured":"Pengzhan Sun, Bo Wu, Xunsong Li, Wen Li, Lixin Duan, and Chuang Gan. 2021. Counterfactual debiasing inference for compositional action recognition. In Proceedings of the ACM International Conference on Multimedia. 
3220\u20133228."},{"key":"e_1_3_1_62_2","first-page":"15","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Sun Teng","year":"2022","unstructured":"Teng Sun, Wenjie Wang, Liqiang Jing, Yiran Cui, Xuemeng Song, and Liqiang Nie. 2022. Counterfactual reasoning for out-of-distribution multimodal sentiment analysis. In Proceedings of the ACM International Conference on Multimedia. 15\u201323."},{"key":"e_1_3_1_63_2","first-page":"1513","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Tang Kaihua","year":"2020","unstructured":"Kaihua Tang, Jianqiang Huang, and Hanwang Zhang. 2020. Long-tailed classification by keeping the good and removing the bad momentum causal effect. In Proceedings of the Advances in Neural Information Processing Systems. 1513\u20131524."},{"key":"e_1_3_1_64_2","first-page":"3716","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Tang Kaihua","year":"2020","unstructured":"Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. 2020. Unbiased scene graph generation from biased training. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 3716\u20133725."},{"key":"e_1_3_1_65_2","first-page":"2095","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Thapar Daksh","year":"2020","unstructured":"Daksh Thapar, Aditya Nigam, and Chetan Arora. 2020. Recognizing camera wearer from hand gestures in egocentric videos. In Proceedings of the ACM International Conference on Multimedia. 2095\u20132103."},{"key":"e_1_3_1_66_2","first-page":"11376","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Tian Bing","year":"2022","unstructured":"Bing Tian, Yixin Cao, Yong Zhang, and Chunxiao Xing. 2022. Debiasing NLU models via causal intervention and counterfactual reasoning. 
In Proceedings of the AAAI Conference on Artificial Intelligence. 11376\u201311384."},{"key":"e_1_3_1_67_2","first-page":"1","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems. 1\u201311."},{"key":"e_1_3_1_68_2","first-page":"98","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Vondrick Carl","year":"2016","unstructured":"Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 98\u2013106."},{"key":"e_1_3_1_69_2","first-page":"20","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Wang Limin","year":"2016","unstructured":"Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision. 20\u201336."},{"key":"e_1_3_1_70_2","first-page":"10627","volume-title":"Proceedings of the International Conference on Intelligent Robots and Systems","author":"Wang Shunli","year":"2022","unstructured":"Shunli Wang, Shuaibing Wang, Bo Jiao, Dingkang Yang, Liuzhen Su, Peng Zhai, Chixiao Chen, and Lihua Zhang. 2022. CA-SpaceNet: Counterfactual analysis for 6D pose estimation in space. In Proceedings of the International Conference on Intelligent Robots and Systems. 
10627\u201310634."},{"key":"e_1_3_1_71_2","article-title":"Symbiotic attention for egocentric action recognition with object-centric alignment","author":"Wang Xiaohan","year":"2020","unstructured":"Xiaohan Wang, Linchao Zhu, Yu Wu, and Yi Yang. 2020. Symbiotic attention for egocentric action recognition with object-centric alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence. Early Access, August 11, 2020.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence."},{"key":"e_1_3_1_72_2","first-page":"2308","volume-title":"Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Wu Junfei","year":"2022","unstructured":"Junfei Wu, Qiang Liu, Weizhi Xu, and Shu Wu. 2022. Bias mitigation for evidence-aware fake news detection by causal intervention. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. 2308\u20132313."},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2020.3040521"},{"key":"e_1_3_1_74_2","first-page":"12734","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Xu Xinyu","year":"2022","unstructured":"Xinyu Xu, Yong-Lu Li, and Cewu Lu. 2022. Learning to anticipate future with dynamic context removal. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 12734\u201312744."},{"key":"e_1_3_1_75_2","doi-asserted-by":"crossref","first-page":"3666","DOI":"10.1145\/3503161.3547862","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Yan Rui","year":"2022","unstructured":"Rui Yan, Peng Huang, Xiangbo Shu, Junhao Zhang, Yonghua Pan, and Jinhui Tang. 2022. Look less think more: Rethinking compositional action recognition. In Proceedings of the ACM International Conference on Multimedia. 
3666\u20133675."},{"key":"e_1_3_1_76_2","first-page":"2249","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops","author":"Zatsarynna Olga","year":"2021","unstructured":"Olga Zatsarynna, Yazan Abu Farha, and Juergen Gall. 2021. Multi-modal temporal convolutional network for anticipating actions in egocentric videos. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2249\u20132258."},{"key":"e_1_3_1_77_2","first-page":"1316","volume-title":"Proceedings of the International Joint Conference on Artificial Intelligence","author":"Zhang Tianyu","year":"2021","unstructured":"Tianyu Zhang, Weiqing Min, Jiahao Yang, Tao Liu, Shuqiang Jiang, and Yong Rui. 2021. What if we could not see? Counterfactual analysis for egocentric action anticipation. In Proceedings of the International Joint Conference on Artificial Intelligence. 1316\u20131322."},{"key":"e_1_3_1_78_2","first-page":"402","volume-title":"Proceedings of the ACM International Conference on Multimedia","author":"Zhang Tianyu","year":"2020","unstructured":"Tianyu Zhang, Weiqing Min, Ying Zhu, Yong Rui, and Shuqiang Jiang. 2020. An egocentric action anticipation framework via fusing intuition and analysis. In Proceedings of the ACM International Conference on Multimedia. 402\u2013410."},{"key":"e_1_3_1_79_2","first-page":"121","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Zhang Yun C.","year":"2017","unstructured":"Yun C. Zhang, Yin Li, and James M. Rehg. 2017. First-person action decomposition and zero-shot learning. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 121\u2013129."},{"key":"e_1_3_1_80_2","article-title":"Egocentric early action prediction via adversarial knowledge distillation","author":"Zheng Na","year":"2022","unstructured":"Na Zheng, Xuemeng Song, Tianyu Su, Weifeng Liu, Yan Yan, and Liqiang Nie. 
2022. Egocentric early action prediction via adversarial knowledge distillation. ACM Transactions on Multimedia Computing, Communications, and Applications. Early Access, June 16, 2022.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications."},{"key":"e_1_3_1_81_2","first-page":"6068","volume-title":"Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision","author":"Zhong Zeyun","year":"2023","unstructured":"Zeyun Zhong, David Schneider, Michael Voit, Rainer Stiefelhagen, and J\u00fcrgen Beyerer. 2023. Anticipative feature fusion transformer for multi-modal action anticipation. In Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision. 6068\u20136077."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3633333","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3633333","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T22:54:01Z","timestamp":1750287241000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3633333"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,1,11]]},"references-count":80,"journal-issue":{"issue":"5","published-print":{"date-parts":[[2024,5,31]]}},"alternative-id":["10.1145\/3633333"],"URL":"https:\/\/doi.org\/10.1145\/3633333","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2024,1,11]]},"assertion":[{"value":"2023-01-19","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2023-10-21","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-01-11","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}