{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T16:38:01Z","timestamp":1777653481365,"version":"3.51.4"},"reference-count":82,"publisher":"Association for Computing Machinery (ACM)","issue":"10","license":[{"start":{"date-parts":[[2024,9,12]],"date-time":"2024-09-12T00:00:00Z","timestamp":1726099200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62222213, U22B2059, U23A20319, 62072423, 61727809"],"award-info":[{"award-number":["62222213, U22B2059, U23A20319, 62072423, 61727809"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Young Scientists Fund of the Natural Science Foundation of Sichuan Province","award":["2023NSFSC1402"],"award-info":[{"award-number":["2023NSFSC1402"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,10,31]]},"abstract":"<jats:p>\n            Text-to-Video Retrieval is a typical cross-modal retrieval task that has been studied extensively under a conventional supervised setting. Recently, some works have sought to extend the problem to a weakly supervised formulation, which can be more consistent with real-life scenarios and more efficient in annotation cost. In this context, a new task called Partially Relevant Video Retrieval (PRVR) is proposed, which aims to retrieve videos that are partially relevant to a given textual query, i.e., the videos containing at least one semantically relevant moment. 
Formulating the task as a Multiple Instance Learning (MIL) ranking problem, prior work relies on heuristic algorithms such as a simple greedy search strategy and deals with each query independently. Although these early explorations have achieved decent performance, they may not fully utilize the bag-level label and only consider the local optimum, which could result in suboptimal solutions and inferior final retrieval performance. To address this problem, in this paper, we propose to exploit the relationships between instances to boost retrieval performance. Based on this idea, we put forward: (1) a new matching scheme for pairing queries and their related moments in the video; and (2) a new loss function to facilitate cross-modal alignment between two views of an instance. Extensive validations on three publicly available datasets have demonstrated the effectiveness of our solution and verified our hypothesis that modeling instance-level relationships is beneficial in the MIL ranking setting. 
Our code will be publicly available at\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/xjtupanda\/BGM-Net\">https:\/\/github.com\/xjtupanda\/BGM-Net<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3663571","type":"journal-article","created":{"date-parts":[[2024,5,3]],"date-time":"2024-05-03T11:56:37Z","timestamp":1714737397000},"page":"1-21","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5356-1800","authenticated-orcid":false,"given":"Shukang","family":"Yin","sequence":"first","affiliation":[{"name":"School of Data Science, University of Science and Technology of China, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8103-0321","authenticated-orcid":false,"given":"Sirui","family":"Zhao","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, University of Science and Technology of China, Hefei, China and School of Computer Science and Technology, Southwest University of Science and Technology, Mianyang, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9921-2078","authenticated-orcid":false,"given":"Hao","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Data Science, University of Science and Technology of China, Hefei, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4246-5386","authenticated-orcid":false,"given":"Tong","family":"Xu","sequence":"additional","affiliation":[{"name":"School of Data Science, University of Science and Technology of China, Hefei, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4835-4102","authenticated-orcid":false,"given":"Enhong","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Data Science, University of Science and Technology of China, Hefei China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2024,9,12]]},"reference":[{"key":"e_1_3_1_2_2","volume-title":"ICML","author":"Amar Robert A.","year":"2001","unstructured":"Robert A. Amar, Daniel R. Dooly, Sally A. Goldman, and Qi Zhang. 2001. Multiple-instance learning of real-valued data. In ICML."},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.618"},{"key":"e_1_3_1_4_2","article-title":"Neural machine translation by jointly learning to align and translate","author":"Bahdanau Dzmitry","year":"2014","unstructured":"Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).","journal-title":"arXiv preprint arXiv:1409.0473"},{"key":"e_1_3_1_5_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00175"},{"key":"e_1_3_1_6_2","article-title":"Fast bundle algorithm for multiple-instance learning","author":"Bergeron Charles","year":"2011","unstructured":"Charles Bergeron, Gregory Moore, Jed Zaretzki, Curt M. Breneman, and Kristin P. Bennett. 2011. Fast bundle algorithm for multiple-instance learning. 
IEEE Transactions on Pattern Analysis and Machine Intelligence (2011).","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_7_2","doi-asserted-by":"publisher","DOI":"10.1007\/BF02186476"},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2017.10.009"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.502"},{"key":"e_1_3_1_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01065"},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1145\/3654674"},{"key":"e_1_3_1_12_2","doi-asserted-by":"publisher","DOI":"10.1145\/3487403"},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00831"},{"key":"e_1_3_1_14_2","first-page":"4171","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171\u20134186."},{"key":"e_1_3_1_15_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0004-3702(96)00034-3"},{"key":"e_1_3_1_16_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547976"},{"key":"e_1_3_1_17_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2018.2832602"},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00957"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3059295"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3150959"},{"key":"e_1_3_1_21_2","unstructured":"Fartash Faghri David J. 
Fleet Jamie Ryan Kiros and Sanja Fidler. 2018. VSE++: Improving visual-semantic embeddings with hard negatives. (2018)."},{"key":"e_1_3_1_22_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.10890"},{"key":"e_1_3_1_23_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58548-8_13"},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.563"},{"key":"e_1_3_1_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00155"},{"key":"e_1_3_1_26_2","doi-asserted-by":"publisher","DOI":"10.1038\/s42256-020-00257-z"},{"key":"e_1_3_1_27_2","doi-asserted-by":"publisher","DOI":"10.1145\/3538490"},{"key":"e_1_3_1_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3579825"},{"key":"e_1_3_1_29_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_30_2","volume-title":"2008 IEEE Conference on Computer Vision and Pattern Recognition","author":"Hu Yang","year":"2008","unstructured":"Yang Hu, Mingjing Li, and Nenghai Yu. 2008. Multiple-instance ranking: Learning to rank images for image retrieval. In 2008 IEEE Conference on Computer Vision and Pattern Recognition."},{"key":"e_1_3_1_31_2","volume-title":"International Conference on Machine Learning","author":"Ilse Maximilian","year":"2018","unstructured":"Maximilian Ilse, Jakub Tomczak, and Max Welling. 2018. Attention-based deep multiple instance learning. In International Conference on Machine Learning."},{"key":"e_1_3_1_32_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.compmedimag.2014.11.010"},{"key":"e_1_3_1_33_2","volume-title":"International Conference on Learning Representations (ICLR \u201915)","author":"Kingma Diederik","year":"2015","unstructured":"Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. 
In International Conference on Learning Representations (ICLR \u201915)."},{"key":"e_1_3_1_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.83"},{"key":"e_1_3_1_35_2","doi-asserted-by":"publisher","DOI":"10.1002\/nav.3800020109"},{"key":"e_1_3_1_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/LWC.2018.2843359"},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00725"},{"key":"e_1_3_1_38_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58589-1_27"},{"key":"e_1_3_1_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.01409"},{"key":"e_1_3_1_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3538533"},{"key":"e_1_3_1_41_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.161"},{"key":"e_1_3_1_42_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350906"},{"key":"e_1_3_1_43_2","article-title":"GLAN: A graph-based linear assignment network","author":"Liu He","year":"2022","unstructured":"He Liu, Tao Wang, Congyan Lang, Songhe Feng, Yi Jin, and Yidong Li. 2022. GLAN: A graph-based linear assignment network. arXiv preprint arXiv:2201.02057 (2022).","journal-title":"arXiv preprint arXiv:2201.02057"},{"key":"e_1_3_1_44_2","article-title":"Use what you have: Video retrieval using representations from collaborative experts","author":"Liu Yang","year":"2019","unstructured":"Yang Liu, Samuel Albanie, Arsha Nagrani, and Andrew Zisserman. 2019. Use what you have: Video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487 (2019).","journal-title":"arXiv preprint arXiv:1907.13487"},{"key":"e_1_3_1_45_2","article-title":"RoBERTa: A robustly optimized BERT pretraining approach","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. 
arXiv preprint arXiv:1907.11692 (2019).","journal-title":"arXiv preprint arXiv:1907.11692"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2022.07.028"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58604-1_10"},{"key":"e_1_3_1_48_2","article-title":"A framework for multiple-instance learning","author":"Maron Oded","year":"1997","unstructured":"Oded Maron and Tom\u00e1s Lozano-P\u00e9rez. 1997. A framework for multiple-instance learning. Advances in Neural Information Processing Systems (1997).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1145\/3377875"},{"key":"e_1_3_1_50_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00272"},{"key":"e_1_3_1_51_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01186"},{"key":"e_1_3_1_52_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01082"},{"key":"e_1_3_1_53_2","article-title":"Representation learning with contrastive predictive coding","author":"Oord Aaron van den","year":"2018","unstructured":"Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).","journal-title":"arXiv preprint arXiv:1807.03748"},{"key":"e_1_3_1_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298668"},{"key":"e_1_3_1_55_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298780"},{"key":"e_1_3_1_56_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413961"},{"key":"e_1_3_1_57_2","volume-title":"International Conference on Machine Learning","author":"Ray Soumya","year":"2001","unstructured":"Soumya Ray and David Page. 2001. Multiple instance regression. 
In International Conference on Machine Learning."},{"key":"e_1_3_1_58_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"e_1_3_1_59_2","doi-asserted-by":"publisher","DOI":"10.2307\/2331554"},{"key":"e_1_3_1_60_2","article-title":"On mutual information maximization for representation learning","author":"Tschannen Michael","year":"2019","unstructured":"Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, and Mario Lucic. 2019. On mutual information maximization for representation learning. ICLR.","journal-title":"ICLR"},{"key":"e_1_3_1_61_2","article-title":"Multiple instance learning with graph neural networks","author":"Tu Ming","year":"2019","unstructured":"Ming Tu, Jing Huang, Xiaodong He, and Bowen Zhou. 2019. Multiple instance learning with graph neural networks. ICML 2019 Workshop on Learning and Reasoning with Graph-Structured Data.","journal-title":"ICML 2019 Workshop on Learning and Reasoning with Graph-Structured Data"},{"key":"e_1_3_1_62_2","article-title":"Attention is all you need","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. 
Advances in Neural Information Processing Systems 30 (2017).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_63_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2014.180"},{"key":"e_1_3_1_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475515"},{"key":"e_1_3_1_65_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.634"},{"key":"e_1_3_1_66_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.10231041"},{"key":"e_1_3_1_67_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.571"},{"key":"e_1_3_1_68_2","doi-asserted-by":"publisher","DOI":"10.1145\/3572844"},{"key":"e_1_3_1_69_2","article-title":"Predicting MHC-II binding affinity using multiple instance regression","author":"Yasser EL-Manzalawy","year":"2010","unstructured":"EL-Manzalawy Yasser, Drena Dobbs, and Vasant Honavar. 2010. Predicting MHC-II binding affinity using multiple instance regression. IEEE\/ACM Transactions on Computational Biology and Bioinformatics (2010).","journal-title":"IEEE\/ACM Transactions on Computational Biology and Bioinformatics"},{"key":"e_1_3_1_70_2","article-title":"A survey on multimodal large language models","author":"Yin Shukang","year":"2023","unstructured":"Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2023. A survey on multimodal large language models. arXiv preprint arXiv:2306.13549 (2023).","journal-title":"arXiv preprint arXiv:2306.13549"},{"key":"e_1_3_1_71_2","article-title":"Woodpecker: Hallucination correction for multimodal large language models","author":"Yin Shukang","year":"2023","unstructured":"Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, and Enhong Chen. 2023. Woodpecker: Hallucination correction for multimodal large language models. 
arXiv preprint arXiv:2310.16045 (2023).","journal-title":"arXiv preprint arXiv:2310.16045"},{"key":"e_1_3_1_72_2","doi-asserted-by":"publisher","DOI":"10.1145\/3478025"},{"key":"e_1_3_1_73_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00225"},{"key":"e_1_3_1_74_2","article-title":"A hierarchical multi-modal encoder for moment localization in video corpus","author":"Zhang Bowen","year":"2020","unstructured":"Bowen Zhang, Hexiang Hu, Joonseok Lee, Ming Zhao, Sheide Chammas, Vihan Jain, Eugene Ie, and Fei Sha. 2020. A hierarchical multi-modal encoder for moment localization in video corpus. arXiv preprint arXiv:2011.09046 (2020).","journal-title":"arXiv preprint arXiv:2011.09046"},{"key":"e_1_3_1_75_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547963"},{"key":"e_1_3_1_76_2","article-title":"Multiple instance learning on structured data","volume":"24","author":"Zhang Dan","year":"2011","unstructured":"Dan Zhang, Yan Liu, Luo Si, Jian Zhang, and Richard Lawrence. 2011. Multiple instance learning on structured data. Advances in Neural Information Processing Systems 24 (2011).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_77_2","article-title":"Maximum margin multiple instance clustering with applications to image and text clustering","author":"Zhang Dan","year":"2011","unstructured":"Dan Zhang, Fei Wang, Luo Si, and Tao Li. 2011. Maximum margin multiple instance clustering with applications to image and text clustering. IEEE Transactions on Neural Networks (2011).","journal-title":"IEEE Transactions on Neural Networks"},{"key":"e_1_3_1_78_2","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3462874"},{"key":"e_1_3_1_79_2","article-title":"EM-DD: An improved multiple-instance learning technique","volume":"14","author":"Zhang Qi","year":"2001","unstructured":"Qi Zhang and Sally Goldman. 2001. EM-DD: An improved multiple-instance learning technique. 
Advances in Neural Information Processing Systems 14 (2001).","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_80_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6984"},{"key":"e_1_3_1_81_2","article-title":"Progressive localization networks for language-based moment localization","author":"Zheng Qi","year":"2023","unstructured":"Qi Zheng, Jianfeng Dong, Xiaoye Qu, Xun Yang, Yabing Wang, Pan Zhou, Baolong Liu, and Xun Wang. 2023. Progressive localization networks for language-based moment localization. ACM Transactions on Multimedia Computing, Communications and Applications (2023).","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_1_82_2","doi-asserted-by":"publisher","DOI":"10.1145\/1553374.1553534"},{"key":"e_1_3_1_83_2","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-006-0029-3"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3663571","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3663571","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:57:59Z","timestamp":1750294679000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3663571"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,12]]},"references-count":82,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2024,10,31]]}},"alternative-id":["10.1145\/3663571"],"URL":"https:\/\/doi.org\/10.1145\/3663571","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,9,12]]},"assertion":[{
"value":"2023-10-05","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-04-26","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-09-12","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}