{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T17:04:47Z","timestamp":1777655087318,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":58,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3547793","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:01Z","timestamp":1665416581000},"page":"4546-4554","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":17,"title":["Distance Matters in Human-Object Interaction Detection"],"prefix":"10.1145","author":[{"given":"Guangzhi","family":"Wang","sequence":"first","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yangyang","family":"Guo","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yongkang","family":"Wong","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mohan","family":"Kankanhalli","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, 
Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2007.70825"},{"key":"e_1_3_2_2_2_1","volume-title":"VQA: Visual question answering. In ICCV.","author":"Antol Stanislaw","year":"2015","unstructured":"Stanislaw Antol , Aishwarya Agrawal , Jiasen Lu , Margaret Mitchell , Dhruv Batra , C Lawrence Zitnick , and Devi Parikh . 2015 . VQA: Visual question answering. In ICCV. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In ICCV."},{"key":"e_1_3_2_2_3_1","doi-asserted-by":"crossref","unstructured":"Nicolas Carion Francisco Massa Gabriel Synnaeve Nicolas Usunier Alexander Kirillov and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV.  Nicolas Carion Francisco Massa Gabriel Synnaeve Nicolas Usunier Alexander Kirillov and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_2_2_4_1","doi-asserted-by":"crossref","unstructured":"Yu-Wei Chao Yunfan Liu Xieyang Liu Huayi Zeng and Jia Deng. 2018. Learning to detect human-object interactions. In WACV.  Yu-Wei Chao Yunfan Liu Xieyang Liu Huayi Zeng and Jia Deng. 2018. Learning to detect human-object interactions. In WACV.","DOI":"10.1109\/WACV.2018.00048"},{"key":"e_1_3_2_2_5_1","volume-title":"Regionvit: Regional-to-local attention for vision transformers. In ICLR.","author":"Chen Chun-Fu","year":"2022","unstructured":"Chun-Fu Chen , Rameswar Panda , and Quanfu Fan . 2022 . Regionvit: Regional-to-local attention for vision transformers. In ICLR. Chun-Fu Chen, Rameswar Panda, and Quanfu Fan. 2022. Regionvit: Regional-to-local attention for vision transformers. 
In ICLR."},{"key":"e_1_3_2_2_6_1","doi-asserted-by":"crossref","unstructured":"Mingfei Chen Yue Liao Si Liu Zhiyuan Chen Fei Wang and Chen Qian. 2021. Reformulating hoi detection as adaptive set prediction. In CVPR.  Mingfei Chen Yue Liao Si Liu Zhiyuan Chen Fei Wang and Chen Qian. 2021. Reformulating hoi detection as adaptive set prediction. In CVPR.","DOI":"10.1109\/CVPR46437.2021.00889"},{"key":"e_1_3_2_2_7_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR.  Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR."},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2022.3161735"},{"key":"e_1_3_2_2_9_1","volume-title":"Kankanhalli","author":"Fan Hehe","year":"2021","unstructured":"Hehe Fan , Yi Yang , and Mohan S . Kankanhalli . 2021 . Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos. In CVPR. Hehe Fan, Yi Yang, and Mohan S. Kankanhalli. 2021. Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos. In CVPR."},{"key":"e_1_3_2_2_10_1","volume-title":"DRG: Dual relation graph for human-object interaction detection. In ECCV.","author":"Gao Chen","year":"2020","unstructured":"Chen Gao , Jiarui Xu , Yuliang Zou , and Jia-Bin Huang . 2020 . DRG: Dual relation graph for human-object interaction detection. In ECCV. Chen Gao, Jiarui Xu, Yuliang Zou, and Jia-Bin Huang. 2020. DRG: Dual relation graph for human-object interaction detection. 
In ECCV."},{"key":"e_1_3_2_2_11_1","unstructured":"Chen Gao Yuliang Zou and Jia-Bin Huang. 2018. iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection. In BMVC.  Chen Gao Yuliang Zou and Jia-Bin Huang. 2018. iCAN: Instance-Centric Attention Network for Human-Object Interaction Detection. In BMVC."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"crossref","unstructured":"Georgia Gkioxari Ross Girshick Piotr Doll\u00e1r and Kaiming He. 2018. Detecting and recognizing human-object interactions. In CVPR.  Georgia Gkioxari Ross Girshick Piotr Doll\u00e1r and Kaiming He. 2018. Detecting and recognizing human-object interactions. In CVPR.","DOI":"10.1109\/CVPR.2018.00872"},{"key":"e_1_3_2_2_13_1","unstructured":"Yangyang Guo Zhiyong Cheng Liqiang Nie Yibing Liu Yinglong Wang and Mohan Kankanhalli. 2019a. Quantifying and alleviating the language prior problem in visual question answering. In SIGIR.  Yangyang Guo Zhiyong Cheng Liqiang Nie Yibing Liu Yinglong Wang and Mohan Kankanhalli. 2019a. Quantifying and alleviating the language prior problem in visual question answering. In SIGIR."},{"key":"e_1_3_2_2_14_1","volume-title":"Attentive long short-term preference modeling for personalized product search. ACM TOIS","author":"Guo Yangyang","year":"2019","unstructured":"Yangyang Guo , Zhiyong Cheng , Liqiang Nie , Yinglong Wang , Jun Ma , and Mohan Kankanhalli . 2019b. Attentive long short-term preference modeling for personalized product search. ACM TOIS ( 2019 ). Yangyang Guo, Zhiyong Cheng, Liqiang Nie, Yinglong Wang, Jun Ma, and Mohan Kankanhalli. 2019b. Attentive long short-term preference modeling for personalized product search. ACM TOIS (2019)."},{"key":"e_1_3_2_2_15_1","volume-title":"Visual semantic role labeling. arXiv preprint arXiv:1505.04474","author":"Gupta Saurabh","year":"2015","unstructured":"Saurabh Gupta and Jitendra Malik . 2015. Visual semantic role labeling. arXiv preprint arXiv:1505.04474 ( 2015 ). Saurabh Gupta and Jitendra Malik. 
2015. Visual semantic role labeling. arXiv preprint arXiv:1505.04474 (2015)."},{"key":"e_1_3_2_2_16_1","doi-asserted-by":"crossref","unstructured":"Tanmay Gupta Alexander Schwing and Derek Hoiem. 2019. No-frills human-object interaction detection: Factorization layout encodings and training techniques. In ICCV.  Tanmay Gupta Alexander Schwing and Derek Hoiem. 2019. No-frills human-object interaction detection: Factorization layout encodings and training techniques. In ICCV.","DOI":"10.1109\/ICCV.2019.00977"},{"key":"e_1_3_2_2_17_1","unstructured":"Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.  Kaiming He Xiangyu Zhang Shaoqing Ren and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR."},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"crossref","unstructured":"Zhi Hou Xiaojiang Peng Yu Qiao and Dacheng Tao. 2020. Visual compositional learning for human-object interaction detection. In ECCV.  Zhi Hou Xiaojiang Peng Yu Qiao and Dacheng Tao. 2020. Visual compositional learning for human-object interaction detection. In ECCV.","DOI":"10.1007\/978-3-030-58555-6_35"},{"key":"e_1_3_2_2_19_1","doi-asserted-by":"crossref","unstructured":"Zhi Hou Baosheng Yu Yu Qiao Xiaojiang Peng and Dacheng Tao. 2021a. Affordance Transfer Learning for Human-Object Interaction Detection. In CVPR.  Zhi Hou Baosheng Yu Yu Qiao Xiaojiang Peng and Dacheng Tao. 2021a. Affordance Transfer Learning for Human-Object Interaction Detection. In CVPR.","DOI":"10.1109\/CVPR46437.2021.00056"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"crossref","unstructured":"Zhi Hou Baosheng Yu Yu Qiao Xiaojiang Peng and Dacheng Tao. 2021b. Detecting human-object interaction via fabricated compositional learning. In CVPR.  Zhi Hou Baosheng Yu Yu Qiao Xiaojiang Peng and Dacheng Tao. 2021b. Detecting human-object interaction via fabricated compositional learning. 
In CVPR.","DOI":"10.1109\/CVPR46437.2021.01441"},{"key":"e_1_3_2_2_21_1","volume":"202","author":"Kim Bumsoo","unstructured":"Bumsoo Kim , Taeho Choi , Jaewoo Kang , and Hyunwoo J Kim. 2020. UnionDet: Union-level detector towards real-time human-object interaction detection. In ECCV. Bumsoo Kim, Taeho Choi, Jaewoo Kang, and Hyunwoo J Kim. 2020. UnionDet: Union-level detector towards real-time human-object interaction detection. In ECCV.","journal-title":"Hyunwoo J Kim."},{"key":"e_1_3_2_2_22_1","volume":"2021","author":"Kim Bumsoo","unstructured":"Bumsoo Kim , Junhyun Lee , Jaewoo Kang , Eun-Sol Kim , and Hyunwoo J Kim. 2021a. HOTR: End-to-End Human-Object Interaction Detection with Transformers. In CVPR. Bumsoo Kim, Junhyun Lee, Jaewoo Kang, Eun-Sol Kim, and Hyunwoo J Kim. 2021a. HOTR: End-to-End Human-Object Interaction Detection with Transformers. In CVPR.","journal-title":"Hyunwoo J Kim."},{"key":"e_1_3_2_2_23_1","volume-title":"Acp: Action Co-occurrence Priors for Human-Object Interaction Detection","author":"Kim Dong-Jin","year":"2021","unstructured":"Dong-Jin Kim , Xiao Sun , Jinsoo Choi , Stephen Lin , and In So Kweon . 2021b. Acp: Action Co-occurrence Priors for Human-Object Interaction Detection . IEEE TIP ( 2021). Dong-Jin Kim, Xiao Sun, Jinsoo Choi, Stephen Lin, and In So Kweon. 2021b. Acp: Action Co-occurrence Priors for Human-Object Interaction Detection. IEEE TIP (2021)."},{"key":"e_1_3_2_2_24_1","first-page":"2d","volume":"2020","author":"Li Yong-Lu","unstructured":"Yong-Lu Li , Xinpeng Liu , Han Lu , Shiyi Wang , Junqi Liu , Jiefeng Li , and Cewu Lu. 2020a. Detailed 2d-3d joint representation for human-object interaction. In CVPR. Yong-Lu Li, Xinpeng Liu, Han Lu, Shiyi Wang, Junqi Liu, Jiefeng Li, and Cewu Lu. 2020a. Detailed 2d-3d joint representation for human-object interaction. In CVPR.","journal-title":"Cewu Lu."},{"key":"e_1_3_2_2_25_1","unstructured":"Yong-Lu Li Xinpeng Liu Xiaoqian Wu Yizhuo Li and Cewu Lu. 2020b. 
HOI Analysis: Integrating and Decomposing Human-Object Interaction. In NeurIPS.  Yong-Lu Li Xinpeng Liu Xiaoqian Wu Yizhuo Li and Cewu Lu. 2020b. HOI Analysis: Integrating and Decomposing Human-Object Interaction. In NeurIPS."},{"key":"e_1_3_2_2_26_1","unstructured":"Yong-Lu Li Siyuan Zhou Xijie Huang Liang Xu Ze Ma Hao-Shu Fang Yanfeng Wang and Cewu Lu. 2019. Transferable interactiveness knowledge for human-object interaction detection. In CVPR.  Yong-Lu Li Siyuan Zhou Xijie Huang Liang Xu Ze Ma Hao-Shu Fang Yanfeng Wang and Cewu Lu. 2019. Transferable interactiveness knowledge for human-object interaction detection. In CVPR."},{"key":"e_1_3_2_2_27_1","volume-title":"PPDM: Parallel point detection and matching for real-time human-object interaction detection. In CVPR.","author":"Liao Yue","year":"2020","unstructured":"Yue Liao , Si Liu , Fei Wang , Yanjie Chen , Chen Qian , and Jiashi Feng . 2020 . PPDM: Parallel point detection and matching for real-time human-object interaction detection. In CVPR. Yue Liao, Si Liu, Fei Wang, Yanjie Chen, Chen Qian, and Jiashi Feng. 2020. PPDM: Parallel point detection and matching for real-time human-object interaction detection. In CVPR."},{"key":"e_1_3_2_2_28_1","unstructured":"Tsung-Yi Lin Priya Goyal Ross B. Girshick Kaiming He and Piotr Doll\u00e1r. 2017. Focal Loss for Dense Object Detection. ICCV.  Tsung-Yi Lin Priya Goyal Ross B. Girshick Kaiming He and Piotr Doll\u00e1r. 2017. Focal Loss for Dense Object Detection. ICCV."},{"key":"e_1_3_2_2_29_1","unstructured":"Tsung-Yi Lin Michael Maire Serge Belongie James Hays Pietro Perona Deva Ramanan Piotr Doll\u00e1r and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV.  Tsung-Yi Lin Michael Maire Serge Belongie James Hays Pietro Perona Deva Ramanan Piotr Doll\u00e1r and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. 
In ECCV."},{"key":"e_1_3_2_2_30_1","unstructured":"Chunxiao Liu Zhendong Mao An-An Liu Tianzhu Zhang Bin Wang and Yongdong Zhang. 2019. Focus your attention: A bidirectional focal attention network for image-text matching. In ACM MM.  Chunxiao Liu Zhendong Mao An-An Liu Tianzhu Zhang Bin Wang and Yongdong Zhang. 2019. Focus your attention: A bidirectional focal attention network for image-text matching. In ACM MM."},{"key":"e_1_3_2_2_31_1","unstructured":"Fenglin Liu Xian Wu Shen Ge Xiaoyu Zhang Wei Fan and Yuexian Zou. 2020a. Bridging the gap between vision and language domains for improved image captioning. In ACM MM.  Fenglin Liu Xian Wu Shen Ge Xiaoyu Zhang Wei Fan and Yuexian Zou. 2020a. Bridging the gap between vision and language domains for improved image captioning. In ACM MM."},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"crossref","unstructured":"Ye Liu Junsong Yuan and Chang Wen Chen. 2020b. ConsNet: Learning consistency graph for zero-shot human-object interaction detection. In ACM MM.  Ye Liu Junsong Yuan and Chang Wen Chen. 2020b. ConsNet: Learning consistency graph for zero-shot human-object interaction detection. In ACM MM.","DOI":"10.1145\/3394171.3413600"},{"key":"e_1_3_2_2_33_1","volume-title":"Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In ICCV.","author":"Liu Ze","year":"2021","unstructured":"Ze Liu , Yutong Lin , Yue Cao , Han Hu , Yixuan Wei , Zheng Zhang , Stephen Lin , and Baining Guo . 2021 . Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In ICCV. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In ICCV."},{"key":"e_1_3_2_2_34_1","unstructured":"Ilya Loshchilov and Frank Hutter. 2018. Decoupled Weight Decay Regularization. In ICLR.  Ilya Loshchilov and Frank Hutter. 2018. Decoupled Weight Decay Regularization. 
In ICLR."},{"key":"e_1_3_2_2_35_1","volume-title":"Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, and Jo\u00e3o F. Henriques.","author":"Patrick Mandela","year":"2021","unstructured":"Mandela Patrick , Dylan Campbell , Yuki Markus Asano , Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, and Jo\u00e3o F. Henriques. 2021 . Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. In NeurIPS. Mandela Patrick, Dylan Campbell, Yuki Markus Asano, Ishan Misra, Florian Metze, Christoph Feichtenhofer, Andrea Vedaldi, and Jo\u00e3o F. Henriques. 2021. Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. In NeurIPS."},{"key":"e_1_3_2_2_36_1","unstructured":"Siyuan Qi Wenguan Wang Baoxiong Jia Jianbing Shen and Song-Chun Zhu. 2018. Learning human-object interactions by graph parsing neural networks. In ECCV.  Siyuan Qi Wenguan Wang Baoxiong Jia Jianbing Shen and Song-Chun Zhu. 2018. Learning human-object interactions by graph parsing neural networks. In ECCV."},{"key":"e_1_3_2_2_37_1","unstructured":"Zhen Qin Weixuan Sun Hui Deng Dongxu Li Yunshen Wei Baohong Lv Junjie Yan Lingpeng Kong and Yiran Zhong. 2022. cosFormer: Rethinking Softmax in Attention. In ICLR.  Zhen Qin Weixuan Sun Hui Deng Dongxu Li Yunshen Wei Baohong Lv Junjie Yan Lingpeng Kong and Yiran Zhong. 2022. cosFormer: Rethinking Softmax in Attention. In ICLR."},{"key":"e_1_3_2_2_38_1","unstructured":"Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.  Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS."},{"key":"e_1_3_2_2_39_1","doi-asserted-by":"crossref","unstructured":"Peter Shaw Jakob Uszkoreit and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In NAACL (short).  Peter Shaw Jakob Uszkoreit and Ashish Vaswani. 2018. 
Self-Attention with Relative Position Representations. In NAACL (short).","DOI":"10.18653\/v1\/N18-2074"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"crossref","unstructured":"Liyue Shen Serena Yeung Judy Hoffman Greg Mori and Li Fei-Fei. 2018. Scaling human-object interaction recognition through zero-shot learning. In WACV.  Liyue Shen Serena Yeung Judy Hoffman Greg Mori and Li Fei-Fei. 2018. Scaling human-object interaction recognition through zero-shot learning. In WACV.","DOI":"10.1109\/WACV.2018.00181"},{"key":"e_1_3_2_2_41_1","volume-title":"QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information. In CVPR.","author":"Tamura Masato","year":"2021","unstructured":"Masato Tamura , Hiroki Ohashi , and Tomoaki Yoshinaga . 2021 . QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information. In CVPR. Masato Tamura, Hiroki Ohashi, and Tomoaki Yoshinaga. 2021. QPIC: Query-Based Pairwise Human-Object Interaction Detection with Image-Wide Contextual Information. In CVPR."},{"key":"e_1_3_2_2_42_1","volume-title":"Xinyuan Qian, Mike Zheng Shou, and Haizhou Li.","author":"Tao Ruijie","year":"2021","unstructured":"Ruijie Tao , Zexu Pan , Rohan Kumar Das , Xinyuan Qian, Mike Zheng Shou, and Haizhou Li. 2021 . Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection. In ACM MM. Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, and Haizhou Li. 2021. Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection. In ACM MM."},{"key":"e_1_3_2_2_43_1","doi-asserted-by":"crossref","unstructured":"Oytun Ulutan ASM Iftekhar and Bangalore S Manjunath. 2020. VSGNet: Spatial attention network for detecting human object interactions using graph convolutions. In CVPR.  Oytun Ulutan ASM Iftekhar and Bangalore S Manjunath. 2020. 
VSGNet: Spatial attention network for detecting human object interactions using graph convolutions. In CVPR.","DOI":"10.1109\/CVPR42600.2020.01363"},{"key":"e_1_3_2_2_44_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS.  Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is All you Need. In NeurIPS."},{"key":"e_1_3_2_2_45_1","volume-title":"Show and tell: Lessons learned from the 2015 mscoco image captioning challenge","author":"Vinyals Oriol","year":"2016","unstructured":"Oriol Vinyals , Alexander Toshev , Samy Bengio , and Dumitru Erhan . 2016. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge . IEEE TPAMI ( 2016 ). Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2016. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE TPAMI (2016)."},{"key":"e_1_3_2_2_46_1","doi-asserted-by":"crossref","unstructured":"Bo Wan Desen Zhou Yongfei Liu Rongjie Li and Xuming He. 2019. Pose-aware multi-level feature network for human object interaction detection. In ICCV.  Bo Wan Desen Zhou Yongfei Liu Rongjie Li and Xuming He. 2019. Pose-aware multi-level feature network for human object interaction detection. In ICCV.","DOI":"10.1109\/ICCV.2019.00956"},{"key":"e_1_3_2_2_47_1","volume-title":"Xiangyu Zhang, and Jian Sun.","author":"Wang Tiancai","year":"2020","unstructured":"Tiancai Wang , Tong Yang , Martin Danelljan , Fahad Shahbaz Khan , Xiangyu Zhang, and Jian Sun. 2020 . Learning human-object interaction detection using interaction points. In CVPR. Tiancai Wang, Tong Yang, Martin Danelljan, Fahad Shahbaz Khan, Xiangyu Zhang, and Jian Sun. 2020. Learning human-object interaction detection using interaction points. 
In CVPR."},{"key":"e_1_3_2_2_48_1","volume-title":"Crossformer: A versatile vision transformer based on cross-scale attention. In ICLR.","author":"Wang Wenxiao","year":"2022","unstructured":"Wenxiao Wang , Lu Yao , Long Chen , Deng Cai , Xiaofei He , and Wei Liu . 2022 . Crossformer: A versatile vision transformer based on cross-scale attention. In ICLR. Wenxiao Wang, Lu Yao, Long Chen, Deng Cai, Xiaofei He, and Wei Liu. 2022. Crossformer: A versatile vision transformer based on cross-scale attention. In ICLR."},{"key":"e_1_3_2_2_49_1","unstructured":"Yiling Wu Shuhui Wang Guoli Song and Qingming Huang. 2019. Learning fragment self-attention embeddings for image-text matching. In ACM MM.  Yiling Wu Shuhui Wang Guoli Song and Qingming Huang. 2019. Learning fragment self-attention embeddings for image-text matching. In ACM MM."},{"key":"e_1_3_2_2_50_1","volume-title":"Interact as you intend: Intention-driven human-object interaction detection","author":"Xu Bingjie","year":"2019","unstructured":"Bingjie Xu , Junnan Li , Yongkang Wong , Qi Zhao , and Mohan S Kankanhalli . 2019a. Interact as you intend: Intention-driven human-object interaction detection . IEEE TMM ( 2019 ). Bingjie Xu, Junnan Li, Yongkang Wong, Qi Zhao, and Mohan S Kankanhalli. 2019a. Interact as you intend: Intention-driven human-object interaction detection. IEEE TMM (2019)."},{"key":"e_1_3_2_2_51_1","unstructured":"Bingjie Xu Yongkang Wong Junnan Li Qi Zhao and Mohan S Kankanhalli. 2019b. Learning to detect human-object interactions with knowledge. In CVPR.  Bingjie Xu Yongkang Wong Junnan Li Qi Zhao and Mohan S Kankanhalli. 2019b. Learning to detect human-object interactions with knowledge. In CVPR."},{"key":"e_1_3_2_2_52_1","volume-title":"Relation-aware Compositional Zero-shot Learning for Attribute-Object Pair Recognition","author":"Xu Ziwei","year":"2021","unstructured":"Ziwei Xu , Guangzhi Wang , Yongkang Wong , and Mohan S Kankanhalli . 2021. 
Relation-aware Compositional Zero-shot Learning for Attribute-Object Pair Recognition . IEEE TMM ( 2021 ). Ziwei Xu, Guangzhi Wang, Yongkang Wong, and Mohan S Kankanhalli. 2021. Relation-aware Compositional Zero-shot Learning for Attribute-Object Pair Recognition. IEEE TMM (2021)."},{"key":"e_1_3_2_2_53_1","doi-asserted-by":"crossref","unstructured":"Hangjie Yuan Mang Wang Dong Ni and Liangpeng Xu. 2022. Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics. In AAAI.  Hangjie Yuan Mang Wang Dong Ni and Liangpeng Xu. 2022. Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics. In AAAI.","DOI":"10.1609\/aaai.v36i3.20229"},{"key":"e_1_3_2_2_54_1","unstructured":"Aixi Zhang Yue Liao Si Liu Miao Lu Yongliang Wang Chen Gao and Xiaobo Li. 2021b. Mining the Benefits of Two-stage and One-stage HOI Detection. In NeurIPS.  Aixi Zhang Yue Liao Si Liu Miao Lu Yongliang Wang Chen Gao and Xiaobo Li. 2021b. Mining the Benefits of Two-stage and One-stage HOI Detection. In NeurIPS."},{"key":"e_1_3_2_2_55_1","doi-asserted-by":"crossref","unstructured":"Frederic Z Zhang Dylan Campbell and Stephen Gould. 2021a. Spatially conditioned graphs for detecting human-object interactions. In ICCV.  Frederic Z Zhang Dylan Campbell and Stephen Gould. 2021a. Spatially conditioned graphs for detecting human-object interactions. In ICCV.","DOI":"10.1109\/ICCV48922.2021.01307"},{"key":"e_1_3_2_2_56_1","unstructured":"Frederic Z. Zhang Dylan Campbell and Stephen Gould. 2022. Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer. In CVPR.  Frederic Z. Zhang Dylan Campbell and Stephen Gould. 2022. Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer. In CVPR."},{"key":"e_1_3_2_2_57_1","doi-asserted-by":"crossref","unstructured":"Xubin Zhong Xian Qu Changxing Ding and Dacheng Tao. 2021. 
Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection. In CVPR.  Xubin Zhong Xian Qu Changxing Ding and Dacheng Tao. 2021. Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection. In CVPR.","DOI":"10.1109\/CVPR46437.2021.01303"},{"key":"e_1_3_2_2_58_1","doi-asserted-by":"crossref","unstructured":"Cheng Zou Bohan Wang Yue Hu Junqi Liu Qian Wu Yu Zhao Boxun Li Chenguang Zhang Chi Zhang Yichen Wei et al. 2021. End-to-end human object interaction detection with hoi transformer. In CVPR.  Cheng Zou Bohan Wang Yue Hu Junqi Liu Qian Wu Yu Zhao Boxun Li Chenguang Zhang Chi Zhang Yichen Wei et al. 2021. End-to-end human object interaction detection with hoi transformer. In CVPR.","DOI":"10.1109\/CVPR46437.2021.01165"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on 
Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547793","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3547793","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:34Z","timestamp":1750186954000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547793"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":58,"alternative-id":["10.1145\/3503161.3547793","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3547793","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}