{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T19:20:28Z","timestamp":1765308028201,"version":"3.46.0"},"publisher-location":"New York, NY, USA","reference-count":40,"publisher":"ACM","funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62272198, 62276277"],"award-info":[{"award-number":["62272198, 62276277"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Guangdong Provincial Science and Technology Plan Project","award":["2021B1111600001"],"award-info":[{"award-number":["2021B1111600001"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,10,27]]},"DOI":"10.1145\/3746027.3755028","type":"proceedings-article","created":{"date-parts":[[2025,10,25]],"date-time":"2025-10-25T05:47:42Z","timestamp":1761371262000},"page":"1210-1219","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Modal Symbiosis: Variational Alignment Unveils New Horizons in Multimodal Representation Learning"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6091-8062","authenticated-orcid":false,"given":"Zeyan","family":"Li","sequence":"first","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-7914-5269","authenticated-orcid":false,"given":"Cankun","family":"Guo","sequence":"additional","affiliation":[{"name":"Nanchang University, Nanchang, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7693-7543","authenticated-orcid":false,"given":"Yin","family":"Tang","sequence":"additional","affiliation":[{"name":"Jinan University, Guangzhou, China"}]}],"member":"320","published-online":{"date-parts":[[2025,10,27]]},"reference":[{"doi-asserted-by":"publisher","key":"e_1_3_2_2_1_1","DOI":"10.1007\/978-3-030-01219-9_9"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_2_1","DOI":"10.1109\/ICCV.2015.279"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_3_1","DOI":"10.1109\/TPAMI.2018.2798607"},{"key":"e_1_3_2_2_4_1","volume-title":"AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022","author":"Chen Shoufa","year":"2022","unstructured":"Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. 2022. AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (Eds.). Curran Associates Inc., New Orleans, LA, USA. http:\/\/papers.nips.cc\/paper_files\/paper\/2022\/hash\/69e2f49ab0837b71b0e0cb7c555990f8-Abstract-Conference.html"},{"key":"e_1_3_2_2_5_1","volume-title":"Proceedings of the 37th International Conference on Machine Learning, ICML 2020","volume":"1607","author":"Chen Ting","year":"2020","unstructured":"Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 2020a. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020 (Proceedings of Machine Learning Research, Vol. 119). PMLR, Virtual Event, 1597-1607. http:\/\/proceedings.mlr.press\/v119\/chen20j.html"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_6_1","DOI":"10.1007\/978-3-030-58577-8_7"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_7_1","DOI":"10.1109\/CVPR46437.2021.00831"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_8_1","DOI":"10.18653\/V1\/N19-1423"},{"key":"e_1_3_2_2_9_1","first-page":"4188","volume-title":"Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI","author":"Hodosh Micah","year":"2015","unstructured":"Micah Hodosh, Peter Young, and Julia Hockenmaier. 2015. Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract). In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Qiang Yang and Michael J. Wooldridge (Eds.). AAAI Press, Buenos Aires, Argentina, 4188-4192. http:\/\/ijcai.org\/Abstract\/15\/593"},{"key":"e_1_3_2_2_10_1","volume-title":"Proceedings of the 36th International Conference on Machine Learning, ICML 2019","volume":"2799","author":"Houlsby Neil","year":"2019","unstructured":"Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-Efficient Transfer Learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019 (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, Long Beach, California, USA, 2790-2799. http:\/\/proceedings.mlr.press\/v97\/houlsby19a.html"},{"key":"e_1_3_2_2_11_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning, ICML 2021","volume":"4916","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021a. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021 (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, Virtual Event, 4904-4916. http:\/\/proceedings.mlr.press\/v139\/jia21b.html"},{"key":"e_1_3_2_2_12_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning, ICML 2021","volume":"4916","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021b. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021 (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, Virtual Event, 4904-4916. http:\/\/proceedings.mlr.press\/v139\/jia21b.html"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_13_1","DOI":"10.1073\/pnas.1611835114"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_14_1","DOI":"10.1007\/S11263-016-0981-7"},{"key":"e_1_3_2_2_15_1","volume-title":"Tiny imagenet visual recognition challenge. CS 231N","author":"Le Yann","year":"2015","unstructured":"Yann Le and Xuan Yang. 2015. Tiny imagenet visual recognition challenge. CS 231N, Vol. 7, 7 (2015), 3."},{"key":"e_1_3_2_2_16_1","first-page":"9694","volume-title":"Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021","author":"Li Junnan","year":"2021","unstructured":"Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Gotmare, Shafiq R. Joty, Caiming Xiong, and Steven Chu-Hong Hoi. 2021. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (Eds.). Curran Associates Inc., virtual, 9694-9705. https:\/\/proceedings.neurips.cc\/paper\/2021\/hash\/505259756244493872b7709a8a01b536-Abstract.html"},{"key":"e_1_3_2_2_17_1","first-page":"121","volume-title":"Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In European Conference on Computer Vision. Springer","author":"Li Xiujun","year":"2020","unstructured":"Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. In European Conference on Computer Vision. Springer, Springer, Glasgow, UK, 121-137. https:\/\/link.springer.com\/chapter\/10.1007\/978-3-030-58577-8_8"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_18_1","DOI":"10.1145\/3656580"},{"key":"e_1_3_2_2_19_1","volume-title":"Zou","author":"Liang Weixin","year":"2022","unstructured":"Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y. Zou. 2022. Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (Eds.). Curran Associates Inc., New Orleans, LA, USA. http:\/\/papers.nips.cc\/paper_files\/paper\/2022\/hash\/702f4db7543a7432431df588d57bc7c9-Abstract-Conference.html"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_20_1","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_2_21_1","volume-title":"Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc.","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, Article 2, 11 pages."},{"key":"e_1_3_2_2_22_1","first-page":"28823","volume-title":"On Contrastive Representations of Stochastic Processes. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021","author":"Mathieu Emile","year":"2021","unstructured":"Emile Mathieu, Adam Foster, and Yee Whye Teh. 2021. On Contrastive Representations of Stochastic Processes. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, Marc'Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (Eds.). Curran Associates Inc., virtual, 28823-28835. https:\/\/proceedings.neurips.cc\/paper\/2021\/hash\/f19c44d068fecac1d6d13a80df4f8e96-Abstract.html"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_23_1","DOI":"10.1109\/CVPR42600.2020.00990"},{"key":"e_1_3_2_2_24_1","first-page":"1143","volume-title":"Proceedings of the 25th International Conference on Neural Information Processing Systems","author":"Ordonez Vicente","year":"2011","unstructured":"Vicente Ordonez, Girish Kulkarni, and Tamara L Berg. 2011. Im2Text: describing images using 1 million captioned photographs. In Proceedings of the 25th International Conference on Neural Information Processing Systems (Granada, Spain) (NIPS'11). Curran Associates Inc., Red Hook, NY, USA, 1143-1151."},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_25_1","DOI":"10.18653\/V1\/2021.EACL-MAIN.39"},{"key":"e_1_3_2_2_26_1","volume-title":"Proceedings of the 38th International Conference on Machine Learning, ICML 2021","volume":"8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021 (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, Virtual Event, 8748-8763. http:\/\/proceedings.mlr.press\/v139\/radford21a.html"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_27_1","DOI":"10.1609\/AAAI.V33I01.33014822"},{"key":"e_1_3_2_2_28_1","volume-title":"Proceedings of the 35th International Conference on Machine Learning, ICML 2018 (Proceedings of Machine Learning Research","volume":"4544","author":"Schwarz Jonathan","year":"2018","unstructured":"Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. 2018. Progress & Compress: A scalable framework for continual learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018 (Proceedings of Machine Learning Research, Vol. 80), Jennifer G. Dy and Andreas Krause (Eds.). PMLR, Stockholmsm\u00e4ssan, Stockholm, Sweden, 4535-4544. http:\/\/proceedings.mlr.press\/v80\/schwarz18a.html"},{"key":"e_1_3_2_2_29_1","volume-title":"Proceedings of the 35th International Conference on Machine Learning, ICML 2018 (Proceedings of Machine Learning Research","volume":"4564","author":"Serr\u00e0 Joan","year":"2018","unstructured":"Joan Serr\u00e0, Didac Suris, Marius Miron, and Alexandros Karatzoglou. 2018. Overcoming Catastrophic Forgetting with Hard Attention to the Task. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018 (Proceedings of Machine Learning Research, Vol. 80), Jennifer G. Dy and Andreas Krause (Eds.). PMLR, Stockholmsm\u00e4ssan, Stockholm, Sweden, 4555-4564. http:\/\/proceedings.mlr.press\/v80\/serra18a.html"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_30_1","DOI":"10.18653\/V1\/P18-1238"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_31_1","DOI":"10.1109\/CVPR52688.2022.01519"},{"key":"e_1_3_2_2_32_1","volume-title":"VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In 8th International Conference on Learning Representations, ICLR","author":"Su Weijie","year":"2020","unstructured":"Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In 8th International Conference on Learning Representations, ICLR 2020. OpenReview.net, Addis Ababa, Ethiopia. https:\/\/openreview.net\/forum?id=SygXPaEYvH"},{"key":"e_1_3_2_2_33_1","volume-title":"AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020","author":"Sun Ximeng","year":"2020","unstructured":"Ximeng Sun, Rameswar Panda, Rog\u00e9rio Feris, and Kate Saenko. 2020. AdaShare: Learning What To Share For Efficient Deep Multi-Task Learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). Curran Associates Inc., virtual. https:\/\/proceedings.neurips.cc\/paper\/2020\/hash\/634841a6831464b64c072c8510c7f35c-Abstract.html"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_34_1","DOI":"10.5555\/3295222.3295349"},{"key":"e_1_3_2_2_35_1","volume-title":"Proceedings of the 37th International Conference on Machine Learning, ICML 2020","volume":"9939","author":"Wang Tongzhou","year":"2020","unstructured":"Tongzhou Wang and Phillip Isola. 2020. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020 (Proceedings of Machine Learning Research, Vol. 119). PMLR, Virtual Event, 9929-9939. http:\/\/proceedings.mlr.press\/v119\/wang20k.html"},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_36_1","DOI":"10.1109\/CVPR42600.2020.01271"},{"key":"e_1_3_2_2_37_1","volume-title":"Visual Entailment: A Novel Task for Fine-Grained Image Understanding. CoRR","author":"Xie Ning","year":"2019","unstructured":"Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2019. Visual Entailment: A Novel Task for Fine-Grained Image Understanding. CoRR, Vol. abs\/1901.06706 (2019). arXiv:1901.06706 http:\/\/arxiv.org\/abs\/1901.06706"},{"key":"e_1_3_2_2_38_1","volume-title":"Proceedings of the 11th USENIX Conference on Hot Topics in Cloud Computing","author":"Xing Jiarong","year":"2019","unstructured":"Jiarong Xing, Adam Morrison, and Ang Chen. 2019. NetWarden: mitigating network covert channels without performance loss. In Proceedings of the 11th USENIX Conference on Hot Topics in Cloud Computing (Renton, WA, USA) (HotCloud'19). USENIX Association, USA, 2."},{"doi-asserted-by":"publisher","key":"e_1_3_2_2_39_1","DOI":"10.1162\/TACL_A_00166"},{"key":"e_1_3_2_2_40_1","volume-title":"Proceedings of the 34th International Conference on Machine Learning, ICML 2017 (Proceedings of Machine Learning Research","volume":"3995","author":"Zenke Friedemann","year":"2017","unstructured":"Friedemann Zenke, Ben Poole, and Surya Ganguli. 2017. Continual Learning Through Synaptic Intelligence. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017 (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, Sydney, NSW, Australia, 3987-3995. http:\/\/proceedings.mlr.press\/v70\/zenke17a.html"}],"event":{"sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"acronym":"MM '25","name":"MM '25: The 33rd ACM International Conference on Multimedia","location":"Dublin Ireland"},"container-title":["Proceedings of the 33rd ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3746027.3755028","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,9]],"date-time":"2025-12-09T19:16:14Z","timestamp":1765307774000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3746027.3755028"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,27]]},"references-count":40,"alternative-id":["10.1145\/3746027.3755028","10.1145\/3746027"],"URL":"https:\/\/doi.org\/10.1145\/3746027.3755028","relation":{},"subject":[],"published":{"date-parts":[[2025,10,27]]},"assertion":[{"value":"2025-10-27","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}