{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,11]],"date-time":"2025-09-11T22:20:39Z","timestamp":1757629239442,"version":"3.44.0"},"reference-count":61,"publisher":"Association for Computing Machinery (ACM)","issue":"9","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62222213, U22B2059, 62072423"],"award-info":[{"award-number":["62222213, U22B2059, 62072423"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Asian Low-Resour. Lang. Inf. Process."],"published-print":{"date-parts":[[2025,9,30]]},"abstract":"<jats:p>\n            Scene graph has long been treated as the basic tool to summarize the visual semantics from a structural perspective, which always confronts the challenge of capturing open entity and relation classes. Recently, some efforts have been made to enhance the open-vocabulary scene graph generation (OV-SGG) task via prompt tuning techniques. However, as only one prompt is utilized for all the classes, they may confuse the modeling for relations with similar spatial positions or semantics. To address these challenges, in this article, we propose a novel one-stage framework, named\n            <jats:bold>R<\/jats:bold>\n            elation-decoupled\n            <jats:bold>T<\/jats:bold>\n            ransformer framework with adaptive\n            <jats:bold>H<\/jats:bold>\n            ierarchical\n            <jats:bold>P<\/jats:bold>\n            rompt (\n            <jats:bold>RTHP<\/jats:bold>\n            ), based on the vision-language model. 
Specifically, we first develop the dual entities\/relations deformable attention module in the relation-decoupled transformer, which decouples relations into subject and object relations, and deploys on entity\/relation queries separately. Along this line, we further design the Adaptive Hierarchical Prompt (AHPro) module to model the inherent hierarchical structure of relation classes, which enables categories situated in varying positions to have different prompts. In this way, confusing categories can be easily distinguished by their respective positions in the hierarchical structure. Extensive experimental results demonstrate that our RTHP framework achieves competitive performance for OV-SGG, validating its effectiveness on base classes and generalization capability on novel classes.\n          <\/jats:p>","DOI":"10.1145\/3748318","type":"journal-article","created":{"date-parts":[[2025,7,28]],"date-time":"2025-07-28T11:25:29Z","timestamp":1753701929000},"page":"1-20","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Adaptive Hierarchical Prompt for Open-Vocabulary Scene Graph Generation"],"prefix":"10.1145","volume":"24","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-3516-474X","authenticated-orcid":false,"given":"Changkai","family":"Feng","sequence":"first","affiliation":[{"name":"School of Data Science, University of Science and Technology of China","place":["Hefei, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4246-5386","authenticated-orcid":false,"given":"Tong","family":"Xu","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China","place":["Hefei, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3206-6827","authenticated-orcid":false,"given":"Shiwei","family":"Wu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, University of Science and Technology of China","place":["Hefei, 
China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3971-9907","authenticated-orcid":false,"given":"Derong","family":"Xu","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China","place":["Hefei, China"]}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4835-4102","authenticated-orcid":false,"given":"Enhong","family":"Chen","sequence":"additional","affiliation":[{"name":"University of Science and Technology of China","place":["Hefei, China"]}]}],"member":"320","published-online":{"date-parts":[[2025,9,10]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"crossref","first-page":"213","DOI":"10.1007\/978-3-030-58452-8_13","volume-title":"Proceedings of the Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part I 16","author":"Carion Nicolas","year":"2020","unstructured":"Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Proceedings of the Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part I 16. Springer, 213\u2013229."},{"key":"e_1_3_2_3_2","first-page":"9004","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Chen Mingfei","year":"2021","unstructured":"Mingfei Chen, Yue Liao, Si Liu, Zhiyuan Chen, Fei Wang, and Chen Qian. 2021. Reformulating hoi detection as adaptive set prediction. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 9004\u20139013."},{"key":"e_1_3_2_4_2","first-page":"9962","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Chen Shizhe","year":"2020","unstructured":"Shizhe Chen, Qin Jin, Peng Wang, and Qi Wu. 2020. Say as you wish: Fine-grained control of image caption generation with abstract scene graphs. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 9962\u20139971."},{"key":"e_1_3_2_5_2","first-page":"6163","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Chen Tianshui","year":"2019","unstructured":"Tianshui Chen, Weihao Yu, Riquan Chen, and Liang Lin. 2019. Knowledge-embedded routing network for scene graph generation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 6163\u20136171."},{"key":"e_1_3_2_6_2","doi-asserted-by":"crossref","first-page":"508","DOI":"10.1109\/ICME.2019.00094","volume-title":"Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME)","author":"Chen Yunian","year":"2019","unstructured":"Yunian Chen, Yanjie Wang, Yang Zhang, and Yanwen Guo. 2019. Panet: A context based predicate association network for scene graph generation. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 508\u2013513."},{"key":"e_1_3_2_7_2","doi-asserted-by":"crossref","first-page":"1581","DOI":"10.1145\/3474085.3475297","volume-title":"Proceedings of the 29th ACM International Conference on Multimedia","author":"Chiou Meng-Jiun","year":"2021","unstructured":"Meng-Jiun Chiou, Henghui Ding, Hanshu Yan, Changhu Wang, Roger Zimmermann, and Jiashi Feng. 2021. Recovering the unbiased scene graphs from the biased ones. In Proceedings of the 29th ACM International Conference on Multimedia. 1581\u20131590."},{"key":"e_1_3_2_8_2","doi-asserted-by":"crossref","unstructured":"Yuren Cong Michael Ying Yang and Bodo Rosenhahn. 2023. Reltr: Relation transformer for scene graph generation. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 45 9 (2023) 11169\u201311183.","DOI":"10.1109\/TPAMI.2023.3268066"},{"key":"e_1_3_2_9_2","doi-asserted-by":"crossref","first-page":"248","DOI":"10.1109\/CVPR.2009.5206848","volume-title":"Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition","author":"Deng Jia","year":"2009","unstructured":"Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248\u2013255."},{"key":"e_1_3_2_10_2","first-page":"19538","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Dong Leizhen","year":"2022","unstructured":"Leizhen Dong, Zhimin Li, Kunlun Xu, Zhijun Zhang, Luxin Yan, Sheng Zhong, and Xu Zou. 2022. Category-aware transformer network for better human-object interaction detection. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 19538\u201319547."},{"key":"e_1_3_2_11_2","first-page":"19427","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Dong Xingning","year":"2022","unstructured":"Xingning Dong, Tian Gan, Xuemeng Song, Jianlong Wu, Yuan Cheng, and Liqiang Nie. 2022. Stacked hybrid-attention and group collaborative learning for unbiased scene graph generation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 19427\u201319436."},{"key":"e_1_3_2_12_2","first-page":"14084","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Du Yu","year":"2022","unstructured":"Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, and Guoqi Li. 2022. Learning to prompt for open-vocabulary object detection with vision-language model. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 14084\u201314093."},{"issue":"4","key":"e_1_3_2_13_2","doi-asserted-by":"crossref","first-page":"184314","DOI":"10.1007\/s11704-023-2444-y","article-title":"LMR-CBT: Learning modality-fused representations with CB-transformer for multimodal emotion recognition from unaligned multimodal sequences","volume":"18","author":"Fu Ziwang","year":"2024","unstructured":"Ziwang Fu, Feng Liu, Qing Xu, Xiangling Fu, and Jiayin Qi. 2024. LMR-CBT: Learning modality-fused representations with CB-transformer for multimodal emotion recognition from unaligned multimodal sequences. Frontiers of Computer Science 18, 4 (2024), 184314.","journal-title":"Frontiers of Computer Science"},{"key":"e_1_3_2_14_2","unstructured":"Kaifeng Gao Long Chen Hanwang Zhang Jun Xiao and Qianru Sun. 2023. Compositional prompt tuning with motion cues for open-vocabulary video relation detection. In The Eleventh International Conference on Learning Representations ICLR 2023 Kigali Rwanda May 1-5 2023. OpenReview.net. Retrieved from https:\/\/openreview.net\/forum?id=mE91GkXYipg"},{"key":"e_1_3_2_15_2","first-page":"0","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops","author":"Gkanatsios Nikolaos","year":"2019","unstructured":"Nikolaos Gkanatsios, Vassilis Pitsikalis, Petros Koutras, and Petros Maragos. 2019. Attention-translation-relation network for scalable scene graph generation. In Proceedings of the IEEE\/CVF International Conference on Computer Vision Workshops. 0\u20130."},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","unstructured":"Sunan He Taian Guo Tao Dai Ruizhi Qiao Xiujun Shu Bo Ren and Shu-Tao Xia. 2023. Open-vocabulary multi-label classification via multi-modal knowledge transfer. 
In Thirty-Seventh AAAI Conference on Artificial Intelligence AAAI 2023 Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence IAAI 2023 Thirteenth Symposium on Educational Advances in Artificial Intelligence EAAI 2023 Washington DC USA February 7-14 2023 Brian Williams Yiling Chen and Jennifer Neville (Eds.). AAAI Press 808\u2013816. DOI:10.1609\/AAAI.V37I1.25159","DOI":"10.1609\/AAAI.V37I1.25159"},{"key":"e_1_3_2_17_2","first-page":"56","volume-title":"Proceedings of the Computer Vision\u2013ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23\u201327, 2022, Proceedings, Part XXVIII","author":"He Tao","year":"2022","unstructured":"Tao He, Lianli Gao, Jingkuan Song, and Yuan-Fang Li. 2022. Towards open-vocabulary scene graph generation with prompt-based finetuning. In Proceedings of the Computer Vision\u2013ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23\u201327, 2022, Proceedings, Part XXVIII. Springer, 56\u201373."},{"key":"e_1_3_2_18_2","first-page":"74","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Kim Bumsoo","year":"2021","unstructured":"Bumsoo Kim, Junhyun Lee, Jaewoo Kang, Eun-Sol Kim, and Hyunwoo J. Kim. 2021. Hotr: End-to-end human-object interaction detection with transformers. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 74\u201383."},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","unstructured":"Ranjay Krishna Yuke Zhu Oliver Groth Justin Johnson Kenji Hata Joshua Kravitz Stephanie Chen Yannis Kalantidis Li-Jia Li David A. Shamma Michael S. Bernstein and Li Fei-Fei. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 123 1 (2017) 32\u201373. 
DOI:10.1007\/S11263-016-0981-7","DOI":"10.1007\/S11263-016-0981-7"},{"issue":"7","key":"e_1_3_2_20_2","doi-asserted-by":"crossref","first-page":"1956","DOI":"10.1007\/s11263-020-01316-z","article-title":"The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale","volume":"128","author":"Kuznetsova Alina","year":"2020","unstructured":"Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. 2020. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision 128, 7 (2020), 1956\u20131981.","journal-title":"International Journal of Computer Vision"},{"key":"e_1_3_2_21_2","first-page":"19486","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Li Rongjie","year":"2022","unstructured":"Rongjie Li, Songyang Zhang, and Xuming He. 2022. Sgtr: End-to-end scene graph generation with transformer. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 19486\u201319496."},{"key":"e_1_3_2_22_2","first-page":"11109","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Li Rongjie","year":"2021","unstructured":"Rongjie Li, Songyang Zhang, Bo Wan, and Xuming He. 2021. Bipartite graph network with adaptive message passing for unbiased scene graph generation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 11109\u201311119."},{"key":"e_1_3_2_23_2","first-page":"19447","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Li Wei","year":"2022","unstructured":"Wei Li, Haiwei Zhang, Qijie Bai, Guoqing Zhao, Ning Jiang, and Xiaojie Yuan. 2022. 
Ppdl: Predicate probability distribution based loss for unbiased scene graph generation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 19447\u201319456."},{"key":"e_1_3_2_24_2","first-page":"13250","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Li Xinhang","year":"2024","unstructured":"Xinhang Li, Jingbo Zhou, Wei Chen, Derong Xu, Tong Xu, and Enhong Chen. 2024. Visualization recommendation with prompt-based reprogramming of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 13250\u201313262."},{"key":"e_1_3_2_25_2","first-page":"2980","volume-title":"Proceedings of the IEEE international conference on computer vision","author":"Lin Tsung-Yi","year":"2017","unstructured":"Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll\u00e1r. 2017. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision. 2980\u20132988."},{"key":"e_1_3_2_26_2","first-page":"3746","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Lin Xin","year":"2020","unstructured":"Xin Lin, Changxing Ding, Jinquan Zeng, and Dacheng Tao. 2020. Gps-net: Graph property sensing network for scene graph generation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 3746\u20133753."},{"issue":"12","key":"e_1_3_2_27_2","first-page":"7655","article-title":"Toward region-aware attention learning for scene graph generation","volume":"33","author":"Liu An-An","year":"2021","unstructured":"An-An Liu, Hongshuo Tian, Ning Xu, Weizhi Nie, Yongdong Zhang, and Mohan Kankanhalli. 2021. Toward region-aware attention learning for scene graph generation. 
IEEE Transactions on Neural Networks and Learning Systems 33, 12 (2021), 7655\u20137666.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"e_1_3_2_28_2","first-page":"14074","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Ma Zongyang","year":"2022","unstructured":"Zongyang Ma, Guan Luo, Jin Gao, Liang Li, Yuxin Chen, Shaoru Wang, Congxuan Zhang, and Weiming Hu. 2022. Open-vocabulary one-stage detection with hierarchical visual-language knowledge distillation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 14074\u201314083."},{"key":"e_1_3_2_29_2","doi-asserted-by":"crossref","first-page":"2609","DOI":"10.1109\/ICME55011.2023.00444","volume-title":"Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME)","author":"Peng Wenjun","year":"2023","unstructured":"Wenjun Peng, Weidong He, Derong Xu, Tong Xu, Chen Zhu, and Enhong Chen. 2023. Social context-aware GCN for video character search via scene-prior enhancement. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2609\u20132614."},{"key":"e_1_3_2_30_2","first-page":"3957","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Qi Mengshi","year":"2019","unstructured":"Mengshi Qi, Weijian Li, Zhengyuan Yang, Yunhong Wang, and Jiebo Luo. 2019. Attentive relational networks for mapping images to scene graphs. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 3957\u20133966."},{"key":"e_1_3_2_31_2","doi-asserted-by":"crossref","unstructured":"Tianwen Qian Jingjing Chen Shaoxiang Chen Bo Wu and Yu-Gang Jiang. 2022. Scene graph refinement network for visual question answering. 
IEEE Transactions on Multimedia 25 (2022) 3950\u20133961.","DOI":"10.1109\/TMM.2022.3169065"},{"key":"e_1_3_2_32_2","first-page":"19558","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Qu Xian","year":"2022","unstructured":"Xian Qu, Changxing Ding, Xingao Li, Xubin Zhong, and Dacheng Tao. 2022. Distillation using oracle queries for transformer-based human-object interaction detection. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 19558\u201319567."},{"key":"e_1_3_2_33_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 8748\u20138763."},{"issue":"2","key":"e_1_3_2_34_2","first-page":"909","article-title":"Scene graph generation with hierarchical context","volume":"32","author":"Ren Guanghui","year":"2020","unstructured":"Guanghui Ren, Lejian Ren, Yue Liao, Si Liu, Bo Li, Jizhong Han, and Shuicheng Yan. 2020. Scene graph generation with hierarchical context. IEEE Transactions on Neural Networks and Learning Systems 32, 2 (2020), 909\u2013915.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"e_1_3_2_35_2","first-page":"8376","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Shi Jiaxin","year":"2019","unstructured":"Jiaxin Shi, Hanwang Zhang, and Juanzi Li. 2019. Explainable and explicit visual reasoning over scene graphs. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
8376\u20138384."},{"key":"e_1_3_2_36_2","doi-asserted-by":"crossref","first-page":"422","DOI":"10.1007\/978-3-031-19836-6_24","volume-title":"Proceedings of the Computer Vision\u2013ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23\u201327, 2022, Proceedings, Part XXXVII","author":"Shit Suprosanna","year":"2022","unstructured":"Suprosanna Shit, Rajat Koner, Bastian Wittmann, Johannes Paetzold, Ivan Ezhov, Hongwei Li, Jiazhen Pan, Sahand Sharifzadeh, Georgios Kaissis, Volker Tresp, and Bjoern Menze. 2022. Relationformer: A unified framework for image-to-graph generation. In Proceedings of the Computer Vision\u2013ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23\u201327, 2022, Proceedings, Part XXXVII. Springer, 422\u2013439."},{"key":"e_1_3_2_37_2","first-page":"13936","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Suhail Mohammed","year":"2021","unstructured":"Mohammed Suhail, Abhay Mittal, Behjat Siddiquie, Chris Broaddus, Jayan Eledath, Gerard Medioni, and Leonid Sigal. 2021. Energy-based learning for scene graph generation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 13936\u201313945."},{"key":"e_1_3_2_38_2","first-page":"3716","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Tang Kaihua","year":"2020","unstructured":"Kaihua Tang, Yulei Niu, Jianqiang Huang, Jiaxin Shi, and Hanwang Zhang. 2020. Unbiased scene graph generation from biased training. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 3716\u20133725."},{"key":"e_1_3_2_39_2","first-page":"6619","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Tang Kaihua","year":"2019","unstructured":"Kaihua Tang, Hanwang Zhang, Baoyuan Wu, Wenhan Luo, and Wei Liu. 2019. 
Learning to compose dynamic tree structures for visual contexts. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 6619\u20136628."},{"key":"e_1_3_2_40_2","first-page":"19437","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Teng Yao","year":"2022","unstructured":"Yao Teng and Limin Wang. 2022. Structured sparse r-cnn for direct scene graph generation. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 19437\u201319446."},{"key":"e_1_3_2_41_2","doi-asserted-by":"crossref","first-page":"3155","DOI":"10.1145\/3394171.3413501","volume-title":"Proceedings of the 28th ACM International Conference on Multimedia","author":"Tian Hongshuo","year":"2020","unstructured":"Hongshuo Tian, Ning Xu, An-An Liu, and Yongdong Zhang. 2020. Part-aware interactive learning for scene graph generation. In Proceedings of the 28th ACM International Conference on Multimedia. 3155\u20133163."},{"key":"e_1_3_2_42_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) 5998\u20136008."},{"key":"e_1_3_2_43_2","doi-asserted-by":"crossref","first-page":"117","DOI":"10.1145\/3591156.3591173","volume-title":"Proceedings of the 2023 5th International Conference on Image, Video and Signal Processing","author":"Wang Xiaomeng","year":"2023","unstructured":"Xiaomeng Wang, Tong Xu, and Shiwei Wu. 2023. SGAT: Scene graph attention network for video recommendation. In Proceedings of the 2023 5th International Conference on Image, Video and Signal Processing. 117\u2013125."},{"key":"e_1_3_2_44_2","doi-asserted-by":"crossref","unstructured":"Sangmin Woo Junhyug Noh and Kangil Kim. 2022. Tackling the challenges in scene graph generation with local-to-global interactions. 
IEEE Transactions on Neural Networks and Learning Systems 34 12 (2022) 9713\u20139726.","DOI":"10.1109\/TNNLS.2022.3159990"},{"key":"e_1_3_2_45_2","first-page":"5410","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Xu Danfei","year":"2017","unstructured":"Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. 2017. Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5410\u20135419."},{"issue":"3","key":"e_1_3_2_46_2","first-page":"1031","article-title":"Scene graph inference via multi-scale context modeling","volume":"31","author":"Xu Ning","year":"2020","unstructured":"Ning Xu, An-An Liu, Yongkang Wong, Weizhi Nie, Yuting Su, and Mohan Kankanhalli. 2020. Scene graph inference via multi-scale context modeling. IEEE Transactions on Circuits and Systems for Video Technology 31, 3 (2020), 1031\u20131041.","journal-title":"IEEE Transactions on Circuits and Systems for Video Technology"},{"issue":"1","key":"e_1_3_2_47_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3416493","article-title":"Socializing the videos: A multimodal approach for social relation recognition","volume":"17","author":"Xu Tong","year":"2021","unstructured":"Tong Xu, Peilun Zhou, Linkang Hu, Xiangnan He, Yao Hu, and Enhong Chen. 2021. Socializing the videos: A multimodal approach for social relation recognition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 17, 1 (2021), 1\u201323.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)"},{"key":"e_1_3_2_48_2","first-page":"265","volume-title":"Proceedings of the 28th ACM International Conference on Multimedia","author":"Yan Shaotian","year":"2020","unstructured":"Shaotian Yan, Chen Shen, Zhongming Jin, Jianqiang Huang, Rongxin Jiang, Yaowu Chen, and Xian-Sheng Hua. 2020. 
Pcpl: Predicate-correlation perception learning for unbiased scene graph generation. In Proceedings of the 28th ACM International Conference on Multimedia. 265\u2013273."},{"issue":"1","key":"e_1_3_2_49_2","doi-asserted-by":"crossref","first-page":"181335","DOI":"10.1007\/s11704-023-3186-6","article-title":"Alignment efficient image-sentence retrieval considering transferable cross-modal representation learning","volume":"18","author":"Yang Yang","year":"2024","unstructured":"Yang Yang, Jinyi Guo, Guangyu Li, Lanyu Li, Wenjie Li, and Jian Yang. 2024. Alignment efficient image-sentence retrieval considering transferable cross-modal representation learning. Frontiers of Computer Science 18, 1 (2024), 181335.","journal-title":"Frontiers of Computer Science"},{"key":"e_1_3_2_50_2","doi-asserted-by":"crossref","unstructured":"Shukang Yin Sirui Zhao Hao Wang Tong Xu and Enhong Chen. 2024. Exploiting instance-level relationships in weakly supervised text-to-video retrieval. ACM Transactions on Multimedia Computing Communications and Applications 20 10 (2024) 1\u201321.","DOI":"10.1145\/3663571"},{"key":"e_1_3_2_51_2","first-page":"14393","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zareian Alireza","year":"2021","unstructured":"Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. 2021. Open-vocabulary object detection using captions. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 14393\u201314402."},{"key":"e_1_3_2_52_2","first-page":"5831","volume-title":"Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition","author":"Zellers Rowan","year":"2018","unstructured":"Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. 2018. Neural motifs: Scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 
5831\u20135840."},{"key":"e_1_3_2_53_2","doi-asserted-by":"crossref","first-page":"2519","DOI":"10.1145\/3511808.3557382","volume-title":"Proceedings of the 31st ACM International Conference on Information & Knowledge Management","author":"Zhang Chunhui","year":"2022","unstructured":"Chunhui Zhang, Chao Huang, Youhuan Li, Xiangliang Zhang, Yanfang Ye, and Chuxu Zhang. 2022. Look twice as much as you say: Scene graph contrastive learning for self-supervised image caption generation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2519\u20132528."},{"key":"e_1_3_2_54_2","first-page":"20104","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhang Frederic Z","year":"2022","unstructured":"Frederic Z Zhang, Dylan Campbell, and Stephen Gould. 2022. Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 20104\u201320112."},{"key":"e_1_3_2_55_2","first-page":"1","volume-title":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","author":"Zhang Xiaoyi","year":"2021","unstructured":"Xiaoyi Zhang, Zheng Wang, Xing Xu, Jiwei Wei, and Yang Yang. 2021. Scene graph generation via multi-relation classification and cross-modal attention coordinator. In Proceedings of the 2nd ACM International Conference on Multimedia in Asia. 1\u20137."},{"key":"e_1_3_2_56_2","first-page":"1356","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhang Yifeng","year":"2021","unstructured":"Yifeng Zhang, Ming Jiang, and Qi Zhao. 2021. Explicit knowledge incorporation for visual reasoning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
1356\u20131365."},{"key":"e_1_3_2_57_2","first-page":"19548","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhang Yong","year":"2022","unstructured":"Yong Zhang, Yingwei Pan, Ting Yao, Rui Huang, Tao Mei, and Chang-Wen Chen. 2022. Exploring structure-aware transformer over interaction proposals for human-object interaction detection. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 19548\u201319557."},{"key":"e_1_3_2_58_2","first-page":"19568","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhou Desen","year":"2022","unstructured":"Desen Zhou, Zhichao Liu, Jian Wang, Leshan Wang, Tao Hu, Errui Ding, and Jingdong Wang. 2022. Human-object interaction detection via disentangled transformer. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 19568\u201319577."},{"issue":"9","key":"e_1_3_2_59_2","doi-asserted-by":"crossref","first-page":"2337","DOI":"10.1007\/s11263-022-01653-1","article-title":"Learning to prompt for vision-language models","volume":"130","author":"Zhou Kaiyang","year":"2022","unstructured":"Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022. Learning to prompt for vision-language models. International Journal of Computer Vision 130, 9 (2022), 2337\u20132348.","journal-title":"International Journal of Computer Vision"},{"key":"e_1_3_2_60_2","unstructured":"Xizhou Zhu Weijie Su Lewei Lu Bin Li Xiaogang Wang and Jifeng Dai. 2021. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In 9th International Conference on Learning Representations ICLR 2021 Virtual Event Austria May 3-7 2021 OpenReview.net. 
Retrieved from https:\/\/openreview.net\/forum?id=gZ9hCDWe6ke"},{"key":"e_1_3_2_61_2","doi-asserted-by":"crossref","first-page":"1863","DOI":"10.18653\/v1\/2020.coling-main.169","volume-title":"Proceedings of the 28th International Conference on Computational Linguistics","author":"Ziaeefard Maryam","year":"2020","unstructured":"Maryam Ziaeefard and Freddy Lecue. 2020. Towards knowledge-augmented visual question answering. In Proceedings of the 28th International Conference on Computational Linguistics. 1863\u20131873."},{"key":"e_1_3_2_62_2","first-page":"11825","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zou Cheng","year":"2021","unstructured":"Cheng Zou, Bohan Wang, Yue Hu, Junqi Liu, Qian Wu, Yu Zhao, Boxun Li, Chenguang Zhang, Chi Zhang, Yichen Wei, and Jian Sun. 2021. End-to-end human object interaction detection with hoi transformer. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 
11825\u201311834."}],"container-title":["ACM Transactions on Asian and Low-Resource Language Information Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3748318","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,10]],"date-time":"2025-09-10T13:28:40Z","timestamp":1757510920000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3748318"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,10]]},"references-count":61,"journal-issue":{"issue":"9","published-print":{"date-parts":[[2025,9,30]]}},"alternative-id":["10.1145\/3748318"],"URL":"https:\/\/doi.org\/10.1145\/3748318","relation":{},"ISSN":["2375-4699","2375-4702"],"issn-type":[{"type":"print","value":"2375-4699"},{"type":"electronic","value":"2375-4702"}],"subject":[],"published":{"date-parts":[[2025,9,10]]},"assertion":[{"value":"2024-02-28","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-12-11","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-10","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}