{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,7]],"date-time":"2026-03-07T17:46:47Z","timestamp":1772905607746,"version":"3.50.1"},"reference-count":61,"publisher":"Association for Computing Machinery (ACM)","issue":"6","license":[{"start":{"date-parts":[[2024,8,19]],"date-time":"2024-08-19T00:00:00Z","timestamp":1724025600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62376140, 62376137, and U23A20315"],"award-info":[{"award-number":["62376140, 62376137, and U23A20315"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100007129","name":"Shandong Provincial Natural Science Foundation","doi-asserted-by":"crossref","award":["ZR2022YQ59"],"award-info":[{"award-number":["ZR2022YQ59"]}],"id":[{"id":"10.13039\/501100007129","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Science and Technology Innovation Program for Distinguished Young Scholars of Shandong Province Higher Education Institutions","award":["2023KJ128"],"award-info":[{"award-number":["2023KJ128"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Inf. Syst."],"published-print":{"date-parts":[[2024,11,30]]},"abstract":"<jats:p>Unleashing the power of image-text matching in real-world applications is hampered by noisy correspondence. Manually curating high-quality datasets is expensive and time-consuming, and datasets generated using diffusion models are not adequately well-aligned. The most promising way is to collect image-text pairs from the Internet, but it will inevitably introduce noisy correspondence. 
To reduce the negative impact of noisy correspondence, we propose a novel model that first transforms the noisy correspondence filtering problem into a similarity distribution modeling problem by exploiting the powerful capabilities of pre-trained models. Specifically, we use the Gaussian Mixture model to model the similarity obtained by CLIP as clean distribution and noisy distribution, to filter out most of the noisy correspondence in the dataset. Afterward, we used relatively clean data to fine-tune the model. To further reduce the negative impact of unfiltered noisy correspondence, i.e., a minimal part where two distributions intersect during the fine-tuning process, we propose a distribution-sensitive dynamic margin ranking loss, further increasing the distance between the two distributions. Through continuous iteration, the noisy correspondence gradually decreases and the model performance gradually improves. Our extensive experiments demonstrate the effectiveness and robustness of our model even under high noise rates.<\/jats:p>","DOI":"10.1145\/3662732","type":"journal-article","created":{"date-parts":[[2024,4,29]],"date-time":"2024-04-29T16:47:27Z","timestamp":1714409247000},"page":"1-26","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Breaking Through the Noisy Correspondence: A Robust Model for Image-Text Matching"],"prefix":"10.1145","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-0028-1281","authenticated-orcid":false,"given":"Haitao","family":"Shi","sequence":"first","affiliation":[{"name":"Shandong University, Jinan, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-6093-6752","authenticated-orcid":false,"given":"Meng","family":"Liu","sequence":"additional","affiliation":[{"name":"Shandong Jianzhu University, Jinan, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0004-0048-0232","authenticated-orcid":false,"given":"Xiaoxuan","family":"Mu","sequence":"additional","affiliation":[{"name":"Shandong University, Jinan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5274-4197","authenticated-orcid":false,"given":"Xuemeng","family":"Song","sequence":"additional","affiliation":[{"name":"Shandong University, Qingdao, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5653-8286","authenticated-orcid":false,"given":"Yupeng","family":"Hu","sequence":"additional","affiliation":[{"name":"Shandong University, Jinan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1476-0273","authenticated-orcid":false,"given":"Liqiang","family":"Nie","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology (Shenzhen), Shenzhen, China"}]}],"member":"320","published-online":{"date-parts":[[2024,8,19]]},"reference":[{"key":"e_1_3_2_2_2","first-page":"312","volume-title":"Proceedings of the International Conference on Machine Learning","volume":"97","author":"Arazo Eric","year":"2019","unstructured":"Eric Arazo, Diego Ortego, Paul Albert, Noel E. O\u2019Connor, and Kevin McGuinness. 2019. Unsupervised Label Noise Modeling and Loss Correction. In Proceedings of the International Conference on Machine Learning, Vol. 97. 312\u2013321."},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1145\/3460426.3463615"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01267"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1145\/3499027"},{"key":"e_1_3_2_6_2","unstructured":"Junyoung Chung, Caglar G\u00fclcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555. 
Retrieved from https:\/\/arxiv.org\/abs\/1412.3555"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i2.16209"},{"key":"e_1_3_2_8_2","first-page":"8469","volume-title":"Proceedings of the International Conference on Machine Learning","volume":"202","author":"Driess Danny","year":"2023","unstructured":"Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. 2023. PaLM-E: An Embodied Multimodal Language Model. In Proceedings of the International Conference on Machine Learning, Vol. 202. 8469\u20138488."},{"issue":"5","key":"e_1_3_2_9_2","first-page":"162:1","article-title":"MKVSE: Multimodal Knowledge Enhanced Visual-Semantic Embedding for Image-Text Retrieval","volume":"19","author":"Feng Duoduo","year":"2023","unstructured":"Duoduo Feng, Xiangteng He, and Yuxin Peng. 2023. MKVSE: Multimodal Knowledge Enhanced Visual-Semantic Embedding for Image-Text Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 5 (2023), 162:1\u2013162:21.","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00750"},{"key":"e_1_3_2_11_2","first-page":"8536","volume-title":"Proceedings of the Neural Information Processing Systems Conference","author":"Han Bo","year":"2018","unstructured":"Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. In Proceedings of the Neural Information Processing Systems Conference. 
8536\u20138546."},{"key":"e_1_3_2_12_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00726"},{"key":"e_1_3_2_13_2","first-page":"10477","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Hendrycks Dan","year":"2018","unstructured":"Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. 2018. Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise. In Proceedings of the Advances in Neural Information Processing Systems. 10477\u201310486."},{"key":"e_1_3_2_14_2","first-page":"11135","volume-title":"Proceedings of the Neural Information Processing Systems Conference","author":"Herdade Simao","year":"2019","unstructured":"Simao Herdade, Armin Kappeler, Kofi Boakye, and Joao Soares. 2019. Image Captioning: Transforming Objects into Words. In Proceedings of the Neural Information Processing Systems Conference. 11135\u201311145."},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2023.3247939"},{"key":"e_1_3_2_16_2","first-page":"29406","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"34","author":"Huang Zhenyu","year":"2021","unstructured":"Zhenyu Huang, Guocheng Niu, Xiao Liu, Wenbiao Ding, Xinyan Xiao, Hua Wu, and Xi Peng. 2021. Learning with Noisy Correspondence for Cross-Modal Matching. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 34. 29406\u201329419."},{"key":"e_1_3_2_17_2","unstructured":"Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. arXiv:2004.00849. 
Retrieved from https:\/\/arxiv.org\/abs\/2004.00849"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2598339"},{"key":"e_1_3_2_19_2","first-page":"1889","volume-title":"Proceedings of the Neural Information Processing Systems Conference","author":"Karpathy Andrej","year":"2014","unstructured":"Andrej Karpathy, Armand Joulin, and Li Fei-Fei. 2014. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping. In Proceedings of the Neural Information Processing Systems Conference. 1889\u20131897."},{"key":"e_1_3_2_20_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Kingma Diederik P.","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations. Retrieved from https:\/\/dblp.org\/db\/conf\/iclr\/iclr2015.html#KingmaB14"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6795"},{"key":"e_1_3_2_23_2","first-page":"19730","volume-title":"Proceedings of the International Conference on Machine Learning","volume":"202","author":"Li Junnan","year":"2023","unstructured":"Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the International Conference on Machine Learning, Vol. 202. 19730\u201319742."},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00475"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01830"},{"key":"e_1_3_2_26_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.acl-long.202"},{"key":"e_1_3_2_27_2","unstructured":"Xuelong Li. 2022. Positive-Incentive Noise. IEEE Transactions on Neural Networks and Learning Systems (2022). 
1\u20137. Retrieved from https:\/\/ieeexplore.ieee.org\/document\/10003114\/metrics#metrics"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/3343031.3350869"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.442"},{"key":"e_1_3_2_31_2","first-page":"1490","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Lyu Yueming","year":"2020","unstructured":"Yueming Lyu and Ivor W. Tsang. 2020. Curriculum Loss: Robust Learning and Generalization against Label Corruption. In Proceedings of the International Conference on Learning Representations. 1490\u20131500."},{"key":"e_1_3_2_32_2","first-page":"3361","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Ma Xingjun","year":"2018","unstructured":"Xingjun Ma, Yisen Wang, Michael E. Houle, Shuo Zhou, Sarah M. Erfani, Shu-Tao Xia, Sudanthi N. R. Wijewickrema, and James Bailey. 2018. Dimensionality-Driven Learning with Noisy Labels. In Proceedings of the International Conference on Machine Learning. 3361\u20133370."},{"key":"e_1_3_2_33_2","first-page":"15","volume-title":"Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Liu Meng","year":"2018","unstructured":"Meng Liu, Xiang Wang, Liqiang Nie, Xiangnan He, Baoquan Chen, and Tat-Seng Chua. 2018. Attentive Moment Retrieval in Videos. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. 15\u201324."},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00637"},{"key":"e_1_3_2_35_2","unstructured":"OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774. 
Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.01098"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3547922"},{"key":"e_1_3_2_38_2","first-page":"24829","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","author":"Qin Yang","year":"2023","unstructured":"Yang Qin, Yuan Sun, Dezhong Peng, Joey Tianyi Zhou, Xi Peng, and Peng Hu. 2023. Cross-Modal Active Complementary Learning with Self-refining Correspondence. In Proceedings of the Advances in Neural Information Processing Systems. 24829\u201324840."},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1145\/3394171.3413961"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/3404835.3462829"},{"key":"e_1_3_2_41_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning. 8748\u20138763."},{"key":"e_1_3_2_42_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2577031"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-1238"},{"key":"e_1_3_2_44_2","first-page":"5907","volume-title":"Proceedings of the International Conference on Machine Learning","volume":"97","author":"Song Hwanjun","year":"2019","unstructured":"Hwanjun Song, Minseok Kim, and Jae-Gil Lee. 2019a. SELFIE: Refurbishing Unclean Samples for Robust Deep Learning. In Proceedings of the International Conference on Machine Learning, Vol. 97. 
5907\u20135915."},{"key":"e_1_3_2_45_2","first-page":"5907","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Song Hwanjun","year":"2019","unstructured":"Hwanjun Song, Minseok Kim, and Jae-Gil Lee. 2019b. SELFIE: Refurbishing Unclean Samples for Robust Deep Learning. In Proceedings of the International Conference on Machine Learning. 5907\u20135915."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467222"},{"key":"e_1_3_2_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467222"},{"key":"e_1_3_2_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3209978.3209996"},{"key":"e_1_3_2_49_2","unstructured":"Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, YaGuang Li, Hongrae Lee, Huaixiu Steven Zheng, Amin Ghafouri, Marcelo Menegali, Yanping Huang, Maxim Krikun, Dmitry Lepikhin, James Qin, Dehao Chen, Yuanzhong Xu, Zhifeng Chen, Adam Roberts, Maarten Bosma, Yanqi Zhou, Chung-Ching Chang, Igor Krivokon, Will Rusch, Marc Pickett, Kathleen S. Meier-Hellstern, Meredith Ringel Morris, Tulsee Doshi, Renelito Delos Santos, Toju Duke, Johnny Soraker, Ben Zevenbergen, Vinodkumar Prabhakaran, Mark Diaz, Ben Hutchinson, Kristen Olson, Alejandra Molina, Erin Hoffman-John, Josh Lee, Lora Aroyo, Ravi Rajakumar, Alena Butryna, Matthew Lamm, Viktoriya Kuzmina, Joe Fenton, Aaron Cohen, Rachel Bernstein, Ray Kurzweil, Blaise Ag\u00fcera y Arcas, Claire Cui, Marian Croak, Ed H. Chi, and Quoc Le. 2022. LaMDA: Language Models for Dialog Applications. arXiv:2201.08239. Retrieved from https:\/\/arxiv.org\/abs\/2201.08239"},{"key":"e_1_3_2_50_2","first-page":"9534","volume-title":"Proceedings of the Neural Information Processing Systems Conference","author":"Wang Zhenyu","year":"2021","unstructured":"Zhenyu Wang, Ya-Li Li, Ye Guo, and Shengjin Wang. 2021. 
Combating Noise: Semi-supervised Learning by Region Uncertainty Quantification. In Proceedings of the Neural Information Processing Systems Conference. 9534\u20139545."},{"key":"e_1_3_2_51_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.287"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2022.01.042"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.424"},{"key":"e_1_3_2_54_2","first-page":"6835","volume-title":"Proceedings of the Neural Information Processing Systems Conference","author":"Xia Xiaobo","year":"2019","unstructured":"Xiaobo Xia, Tongliang Liu, Nannan Wang, Bo Han, Chen Gong, Gang Niu, and Masashi Sugiyama. 2019. Are Anchor Points Really Indispensable in Label-Noise Learning? In Proceedings of the Neural Information Processing Systems Conference. 6835\u20136846."},{"key":"e_1_3_2_55_2","doi-asserted-by":"publisher","DOI":"10.1145\/3572833"},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.01904"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00166"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3548455"},{"key":"e_1_3_2_59_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Zhang Chiyuan","year":"2020","unstructured":"Chiyuan Zhang, Samy Bengio, Moritz Hardt, Michael C. Mozer, and Yoram Singer. 2020. Identity Crisis: Memorization and Generalization Under Extreme Overparameterization. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01521"},{"key":"e_1_3_2_61_2","unstructured":"Yivan Zhang and Masashi Sugiyama. 2021. Approximating Instance-Dependent Noise via Instance-Confidence Embedding. arXiv:2103.13569. 
Retrieved from https:\/\/arxiv.org\/abs\/2103.13569"},{"key":"e_1_3_2_62_2","unstructured":"Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv:2304.10592. Retrieved from https:\/\/arxiv.org\/abs\/2304.10592"}],"container-title":["ACM Transactions on Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3662732","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3662732","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T23:57:11Z","timestamp":1750291031000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3662732"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,8,19]]},"references-count":61,"journal-issue":{"issue":"6","published-print":{"date-parts":[[2024,11,30]]}},"alternative-id":["10.1145\/3662732"],"URL":"https:\/\/doi.org\/10.1145\/3662732","relation":{},"ISSN":["1046-8188","1558-2868"],"issn-type":[{"value":"1046-8188","type":"print"},{"value":"1558-2868","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,8,19]]},"assertion":[{"value":"2023-09-25","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-04-15","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-08-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}