{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T00:50:01Z","timestamp":1773103801526,"version":"3.50.1"},"reference-count":50,"publisher":"Association for Computing Machinery (ACM)","issue":"8","license":[{"start":{"date-parts":[[2024,8,16]],"date-time":"2024-08-16T00:00:00Z","timestamp":1723766400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2023YFC3304902"],"award-info":[{"award-number":["2023YFC3304902"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,8,31]]},"abstract":"<jats:p>\n            Image-text retrieval, a fundamental cross-modal task, performs similarity reasoning for images and texts. The primary challenge for image-text retrieval is cross-modal semantic heterogeneity, where the semantic features of visual and textual modalities are rich but distinct. Scene graph is an effective representation for images and texts as it explicitly models objects and their relations. Existing scene graph based methods have not fully taken the features regarding various granularities implicit in scene graph into consideration (e.g., triplets), the inadequate feature matching incurs the absence of non-trivial semantic information (e.g., inner relations among triplets). 
Therefore, we propose a\n            <jats:bold>S<\/jats:bold>\n            emantic-Consistency\n            <jats:bold>E<\/jats:bold>\n            nhanced\n            <jats:bold>M<\/jats:bold>\n            ulti-Level\n            <jats:bold>Scene<\/jats:bold>\n            Graph Matching (SEMScene) network, which exploits the semantic relevance between visual and textual scene graphs from fine-grained to coarse-grained. Firstly, under the scene graph representation, we perform feature matching including low-level node matching, mid-level semantic triplet matching, and high-level holistic scene graph matching. Secondly, to enhance the semantic-consistency for object-fused triplets carrying key correlation information, we propose a dual-step constraint mechanism in mid-level matching. Thirdly, to guide the model to learn the semantic-consistency of matched image-text pairs, we devise effective loss functions for each stage of the dual-step constraint. Comprehensive experiments on Flickr30K and MS-COCO datasets demonstrate that SEMScene achieves state-of-the-art performances with significant improvements.\n          <\/jats:p>","DOI":"10.1145\/3664816","type":"journal-article","created":{"date-parts":[[2024,5,11]],"date-time":"2024-05-11T11:34:43Z","timestamp":1715427283000},"page":"1-28","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["SEMScene: Semantic-Consistency Enhanced Multi-Level Scene Graph Matching for Image-Text Retrieval"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0009-0009-3600-1586","authenticated-orcid":false,"given":"Yuankun","family":"Liu","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, North University of China, Taiyuan, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0003-4377-8031","authenticated-orcid":false,"given":"Xiang","family":"Yuan","sequence":"additional","affiliation":[{"name":"School of Software and Microelectronics, Peking University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3197-4073","authenticated-orcid":false,"given":"Haochen","family":"Li","sequence":"additional","affiliation":[{"name":"School of Software and Microelectronics, Peking University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4934-1001","authenticated-orcid":false,"given":"Zhijie","family":"Tan","sequence":"additional","affiliation":[{"name":"School of Software and Microelectronics, Peking University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4337-7003","authenticated-orcid":false,"given":"Jinsong","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Software and Microelectronics, Peking University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2027-0369","authenticated-orcid":false,"given":"Jingjie","family":"Xiao","sequence":"additional","affiliation":[{"name":"School of Software and Microelectronics, Peking University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2958-3097","authenticated-orcid":false,"given":"Weiping","family":"Li","sequence":"additional","affiliation":[{"name":"School of Software and Microelectronics, Peking University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3564-4610","authenticated-orcid":false,"given":"Tong","family":"Mo","sequence":"additional","affiliation":[{"name":"School of Software and Microelectronics, 
Peking University, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,8,16]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46454-1_24"},{"key":"e_1_3_3_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_3_4_2","unstructured":"Dzmitry Bahdanau Kyunghyun Cho and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations."},{"key":"e_1_3_3_5_2","doi-asserted-by":"crossref","unstructured":"Fei-Long Chen Du-Zhen Zhang Ming-Lun Han Xiu-Yi Chen Jing Shi Shuang Xu and Bo Xu. 2023. Vlp: A survey on vision-language pre-training. Machine Intelligence Research 20 1 (2023) 38\u201356.","DOI":"10.1007\/s11633-022-1369-5"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01267"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01553"},{"key":"e_1_3_3_8_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_3_9_2","doi-asserted-by":"crossref","unstructured":"Yuhao Cheng Xiaoguang Zhu Jiuchao Qian Fei Wen and Peilin Liu. 2022. Cross-modal graph matching network for image-text retrieval. ACM Transactions on Multimedia Computing Communications and Applications 18 4 (2022) 1\u201323.","DOI":"10.1145\/3499027"},{"key":"e_1_3_3_10_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. 
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies Volume 1 (Long and Short Papers) 4171\u20134186."},{"key":"e_1_3_3_11_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i2.16209"},{"key":"e_1_3_3_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3460426.3463650"},{"key":"e_1_3_3_13_2","unstructured":"Fartash Faghri David J. Fleet Jamie Ryan Kiros and Sanja Fidler. 2018. VSE++: Improving visual-semantic embeddings with hard negatives. In Proceedings of the British Machine Vision Conference 12."},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1145\/3512527.3531368"},{"key":"e_1_3_3_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_3_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00645"},{"key":"e_1_3_3_17_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2021\/106"},{"key":"e_1_3_3_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298990"},{"key":"e_1_3_3_19_2","doi-asserted-by":"crossref","unstructured":"Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing 1746\u20131751.","DOI":"10.3115\/v1\/D14-1181"},{"key":"e_1_3_3_20_2","first-page":"1","volume-title":"Proceedings of the 5th International Conference on Learning Representations","author":"Kipf Thomas N.","year":"2017","unstructured":"Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations. 
1\u201314."},{"key":"e_1_3_3_21_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"e_1_3_3_22_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6795"},{"key":"e_1_3_3_23_2","first-page":"19730","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning. 19730\u201319742."},{"key":"e_1_3_3_24_2","doi-asserted-by":"crossref","unstructured":"Kun Li Jiaxiu Li Dan Guo Xun Yang and Meng Wang. 2023. Transformer-based visual grounding with cross-modality interaction. ACM Transactions on Multimedia Computing Communications and Applications 19 6 (2023) 1\u201319.","DOI":"10.1145\/3587251"},{"key":"e_1_3_3_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00475"},{"key":"e_1_3_3_26_2","doi-asserted-by":"crossref","unstructured":"Kunpeng Li Yulun Zhang Kai Li Yuanyuan Li and Yun Fu. 2022. Image-text embedding learning via visual and textual semantic reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence 45 1 (2022) 641\u2013656.","DOI":"10.1109\/TPAMI.2022.3148470"},{"key":"e_1_3_3_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01593"},{"key":"e_1_3_3_28_2","doi-asserted-by":"crossref","unstructured":"Wenhui Li Song Yang Qiang Li Xuanya Li and An-An Liu. 2023. Commonsense-guided semantic and relational consistencies for image-text retrieval. 
IEEE Transactions on Multimedia 26 (2023) 1867\u20131880.","DOI":"10.1109\/TMM.2023.3289753"},{"key":"e_1_3_3_29_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_3_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01093"},{"key":"e_1_3_3_31_2","unstructured":"Haotian Liu Chunyuan Li Qingyang Wu and Yong Jae Lee. 2023. Visual instruction tuning. Advances in Neural Information Processing Systems 36 (2023) 34892\u201334916."},{"key":"e_1_3_3_32_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548132"},{"key":"e_1_3_3_33_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6833"},{"key":"e_1_3_3_34_2","unstructured":"Jiasen Lu Dhruv Batra Devi Parikh and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems 32 (2019) 13\u201323."},{"key":"e_1_3_3_35_2","doi-asserted-by":"crossref","unstructured":"Manh-Duy Nguyen Binh T. Nguyen and Cathal Gurrin. 2021. A deep local and global scene-graph matching for image-text retrieval. In Proceedings of the 20th International Conference on New Trends in Intelligent Software Methodologies Tools and Techniques 510\u2013523.","DOI":"10.3233\/FAIA210049"},{"key":"e_1_3_3_36_2","doi-asserted-by":"crossref","unstructured":"Jiaming Pei Kaiyang Zhong Zhi Yu Lukun Wang and Kuruva Lakshmanna. 2023. Scene graph semantic inference for image and text matching. ACM Transactions on Asian and Low-Resource Language Information Processing 22 5 (2023) 1\u201323.","DOI":"10.1145\/3563390"},{"key":"e_1_3_3_37_2","doi-asserted-by":"crossref","unstructured":"Liang Peng Yang Yang Zheng Wang Zi Huang and Heng Tao Shen. 2020. Mra-net: Improving vqa via multi-modal relation attention network. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 44 1 (2020) 318\u2013329.","DOI":"10.1109\/TPAMI.2020.3004830"},{"key":"e_1_3_3_38_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.827"},{"key":"e_1_3_3_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3477495.3531715"},{"key":"e_1_3_3_40_2","doi-asserted-by":"crossref","unstructured":"Zhangxiang Shi Tianzhu Zhang Xi Wei Feng Wu and Yongdong Zhang. 2022. Decoupled cross-modal phrase-attention network for image-sentence matching. IEEE Transactions on Image Processing 33 (2022) 1326\u20131337.","DOI":"10.1109\/TIP.2022.3197972"},{"key":"e_1_3_3_41_2","first-page":"6105","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Tan Mingxing","year":"2019","unstructured":"Mingxing Tan and Quoc Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning. 6105\u20136114."},{"key":"e_1_3_3_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00377"},{"key":"e_1_3_3_43_2","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017) 5998\u20136008."},{"key":"e_1_3_3_44_2","doi-asserted-by":"publisher","DOI":"10.1109\/WACV45572.2020.9093614"},{"key":"e_1_3_3_45_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV51070.2023.01973"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00183"},{"key":"e_1_3_3_47_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00677"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01094"},{"key":"e_1_3_3_49_2","doi-asserted-by":"crossref","unstructured":"Peter Young Alice Lai Micah Hodosh and Julia Hockenmaier. 2014. 
From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2 (2014) 67\u201378.","DOI":"10.1162\/tacl_a_00166"},{"key":"e_1_3_3_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00611"},{"key":"e_1_3_3_51_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475380"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3664816","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3664816","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:17:29Z","timestamp":1750295849000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3664816"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,16]]},"references-count":50,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2024,8,31]]}},"alternative-id":["10.1145\/3664816"],"URL":"https:\/\/doi.org\/10.1145\/3664816","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,8,16]]},"assertion":[{"value":"2023-11-28","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-04-26","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-08-16","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}