{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T05:02:45Z","timestamp":1750309365457,"version":"3.41.0"},"reference-count":60,"publisher":"Association for Computing Machinery (ACM)","issue":"11","license":[{"start":{"date-parts":[[2024,11,13]],"date-time":"2024-11-13T00:00:00Z","timestamp":1731456000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"ZTE-PKU","award":["IA20230629009"],"award-info":[{"award-number":["IA20230629009"]}]},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["U20B2052"],"award-info":[{"award-number":["U20B2052"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100002858","name":"China Postdoctoral Science Foundation","doi-asserted-by":"crossref","award":["2023M730056"],"award-info":[{"award-number":["2023M730056"]}],"id":[{"id":"10.13039\/501100002858","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100001809","name":"Natural Science Foundation of China","doi-asserted-by":"crossref","award":["61936011"],"award-info":[{"award-number":["61936011"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,11,30]]},"abstract":"<jats:p>\n            Recent advances in pre-trained vision-language models have successfully boosted the performance of unsupervised image representation in many vision tasks. Most of existing works focus on learning global visual features with Transformers and neglect detailed local cues, leading to suboptimal performance in fine-grained vision tasks. In this article, we propose a text-guided patch token exploitation framework to enhance the discriminative power of unsupervised representation by exploiting more detailed local features. Our text-guided decoder extracts local features with the guidance of texts or learned prompts describing discriminative object parts. We hence introduce a local-global relation distillation loss to promote the joint optimization of local and global features. The proposed method allows to flexibly extract either global or global-local features as the image representation. It significantly outperforms previous methods in fine-grained image retrieval and base-to-new fine-grained classification tasks. For instance, our Recall@1 metric surpasses the recent unsupervised retrieval method STML by 6.0% on the SOP dataset. The code is publicly available at\n            <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/maosnhehe\/TPTE\">https:\/\/github.com\/maosnhehe\/TPTE<\/jats:ext-link>\n            .\n          <\/jats:p>","DOI":"10.1145\/3673657","type":"journal-article","created":{"date-parts":[[2024,8,9]],"date-time":"2024-08-09T16:47:30Z","timestamp":1723222050000},"page":"1-18","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["TPTE: Text-Guided Patch Token Exploitation for Unsupervised Fine-Grained Representation Learning"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0009-0005-1736-8316","authenticated-orcid":false,"given":"Shunan","family":"Mao","sequence":"first","affiliation":[{"name":"School of Computer Science, Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6853-3298","authenticated-orcid":false,"given":"Hao","family":"Chen","sequence":"additional","affiliation":[{"name":"School of Computer Science, Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6110-4036","authenticated-orcid":false,"given":"Yaowei","family":"Wang","sequence":"additional","affiliation":[{"name":"Peng Cheng Laboratory, Shenzhen, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-2870-1178","authenticated-orcid":false,"given":"Wei","family":"Zeng","sequence":"additional","affiliation":[{"name":"School of Computer Science, Peking University, Beijing, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9053-9314","authenticated-orcid":false,"given":"Shiliang","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer Science, Peking University, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2024,11,13]]},"reference":[{"key":"e_1_3_1_2_2","unstructured":"baidu. [n. d.]. Retrieved from https:\/\/cloud.baidu.com\/product\/wenxinworkshop"},{"key":"e_1_3_1_3_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10599-4_29"},{"key":"e_1_3_1_4_2","first-page":"1877","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 33, 1877\u20131901."},{"key":"e_1_3_1_5_2","article-title":"Semantic and correlation disentangled graph convolutions for multilabel image recognition","author":"Cai Shaofei","year":"2023","unstructured":"Shaofei Cai, Liang Li, Xinzhe Han, Shan Huang, Qi Tian, and Qingming Huang. 2023. Semantic and correlation disentangled graph convolutions for multilabel image recognition. IEEE Transactions on Neural Networks and Learning Systems (2023).","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"e_1_3_1_6_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_3_1_7_2","first-page":"14960","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Chen Hao","year":"2021","unstructured":"Hao Chen, Benoit Lagadec, and Fran\u00e7ois Bremond. 2021. ICE: Inter-instance contrastive encoding for unsupervised person re-identification. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 14960\u201314969."},{"key":"e_1_3_1_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2022.3229526"},{"key":"e_1_3_1_9_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i1.19909"},{"key":"e_1_3_1_10_2","first-page":"1597","volume-title":"Proceedings of the International Conference on Machine Learning.","author":"Chen Ting","year":"2020","unstructured":"Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning. PMLR, 1597\u20131607."},{"key":"e_1_3_1_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2023.3272741"},{"key":"e_1_3_1_12_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_1_13_2","doi-asserted-by":"publisher","DOI":"10.1145\/3633781"},{"key":"e_1_3_1_14_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-023-01891-x"},{"issue":"2","key":"e_1_3_1_15_2","first-page":"1","article-title":"Hierarchical multi-attention transfer for knowledge distillation","volume":"20","author":"Gou Jianping","year":"2023","unstructured":"Jianping Gou, Liyuan Sun, Baosheng Yu, Shaohua Wan, and Dacheng Tao. 2023. Hierarchical multi-attention transfer for knowledge distillation. ACM Transactions on Multimedia Computing, Communications and Applications (TOMM) 20, 2 (2023), 1\u201320.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications (TOMM)"},{"key":"e_1_3_1_16_2","first-page":"16000","volume-title":"Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference","author":"He Kaiming","year":"2022","unstructured":"Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll\u00e1r, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference, 16000\u201316009."},{"key":"e_1_3_1_17_2","first-page":"9729","volume-title":"Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference","author":"He Kaiming","year":"2020","unstructured":"Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference, 9729\u20139738."},{"key":"e_1_3_1_18_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_1_19_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-019-01176-2"},{"key":"e_1_3_1_20_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475561"},{"key":"e_1_3_1_21_2","doi-asserted-by":"publisher","DOI":"10.1145\/3446208"},{"key":"e_1_3_1_22_2","first-page":"4483","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Huynh Dat","year":"2020","unstructured":"Dat Huynh and Ehsan Elhamifar. 2020. Fine-grained generalized zero-shot learning via dense attribute-based attention. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 4483\u20134493."},{"key":"e_1_3_1_23_2","first-page":"4904","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 4904\u20134916."},{"key":"e_1_3_1_24_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-19827-4_41"},{"key":"e_1_3_1_25_2","first-page":"15670","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Kan Baoshuo","year":"2023","unstructured":"Baoshuo Kan, Teng Wang, Wenpeng Lu, Xiantong Zhen, Weili Guan, and Feng Zheng. 2023. Knowledge-aware prompt tuning for generalizable vision-language models. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 15670\u201315680."},{"key":"e_1_3_1_26_2","first-page":"13999","volume-title":"Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference","author":"Kan Shichao","year":"2021","unstructured":"Shichao Kan, Yigang Cen, Yang Li, Vladimir Mladenovic, and Zhihai He. 2021. Relative order analysis and optimization for unsupervised deep metric learning. In Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference, 13999\u201314008."},{"issue":"06","key":"e_1_3_1_27_2","doi-asserted-by":"crossref","first-page":"7220","DOI":"10.1109\/TPAMI.2022.3221486","article-title":"Contrastive Bayesian analysis for deep metric learning","volume":"45","author":"Kan Shichao","year":"2023","unstructured":"Shichao Kan, Zhiquan He, Yigang Cen, Yang Li, Vladimir Mladenovic, and Zhihai He. 2023. Contrastive Bayesian analysis for deep metric learning. IEEE Transactions on Pattern Analysis & Machine Intelligence 45, 06 (2023), 7220\u20137238.","journal-title":"IEEE Transactions on Pattern Analysis & Machine Intelligence"},{"key":"e_1_3_1_28_2","first-page":"19113","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Khattak Muhammad Uzair","year":"2023","unstructured":"Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. 2023. Maple: Multi-modal prompt learning. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 19113\u201319122."},{"key":"e_1_3_1_29_2","first-page":"7431","volume-title":"Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference","author":"Kim Sungyeon","year":"2022","unstructured":"Sungyeon Kim, Dongwon Kim, Minsu Cho, and Suha Kwak. 2022. Self-taught metric learning without labels. In Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference, 7431\u20137441."},{"key":"e_1_3_1_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCVW.2013.77"},{"key":"e_1_3_1_31_2","first-page":"243","volume-title":"Proceedings of the IEEE International Conference on Visual Communications and Image Processing (VCIP \u201920)","author":"Li Hao","year":"2020","unstructured":"Hao Li, Xiaopeng Zhang, Qi Tian, and Hongkai Xiong. 2020. Attribute mix: Semantic data augmentation for fine grained recognition. In Proceedings of the IEEE International Conference on Visual Communications and Image Processing (VCIP \u201920). IEEE, 243\u2013246."},{"key":"e_1_3_1_32_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Silvio Savarese, and Steven Hoi. 2022. Masked Unsupervised Self-training for Label-free Image Classification. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_1_33_2","doi-asserted-by":"publisher","DOI":"10.1145\/3468673"},{"issue":"3","key":"e_1_3_1_34_2","first-page":"3003","article-title":"Entity-enhanced adaptive reconstruction network for weakly supervised referring expression grounding","volume":"45","author":"Liu Xuejing","year":"2022","unstructured":"Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Zechao Li, Qi Tian, and Qingming Huang. 2022. Entity-enhanced adaptive reconstruction network for weakly supervised referring expression grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 3 (2022), 3003\u20133018.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_35_2","first-page":"4190","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Liu Xiao","year":"2017","unstructured":"Xiao Liu, Jiang Wang, Shilei Wen, Errui Ding, and Yuanqing Lin. 2017. Localizing by describing: Attribute-guided attention localization for fine-grained recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, 4190\u20134196."},{"key":"e_1_3_1_36_2","first-page":"11313","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Ma Kang","year":"2023","unstructured":"Kang Ma, Ying Fu, Dezhi Zheng, Yunjie Peng, Chunshui Cao, and Yongzhen Huang. 2023. Fine-grained unsupervised domain adaptation for gait recognition. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 11313\u201311322."},{"key":"e_1_3_1_37_2","doi-asserted-by":"publisher","DOI":"10.1145\/3451390"},{"issue":"4","key":"e_1_3_1_38_2","first-page":"1","article-title":"Complex scenario image retrieval via deep similarity-aware hashing","volume":"20","author":"Nie Xiushan","year":"2023","unstructured":"Xiushan Nie, Yang Shi, Ziyu Meng, Jin Huang, Weili Guan, and Yilong Yin. 2023. Complex scenario image retrieval via deep similarity-aware hashing. ACM Transactions on Multimedia Computing, Communications and Applications (TOMM) 20, 4 (2023), 1\u201324.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications (TOMM)"},{"key":"e_1_3_1_39_2","first-page":"4004","volume-title":"Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference","author":"Song Hyun Oh","year":"2016","unstructured":"Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. 2016. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference, 4004\u20134012."},{"key":"e_1_3_1_40_2","first-page":"3498","volume-title":"Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference","author":"Parkhi Omkar M.","year":"2012","unstructured":"Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman, and C. V. Jawahar. 2012. Cats and dogs. In Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference. IEEE, 3498\u20133505."},{"key":"e_1_3_1_41_2","first-page":"8026","volume-title":"Proceedings of the Advances in Neural Information Processing Systems","volume":"32","author":"Paszke Adam","year":"2019","unstructured":"Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas K\u00f6pf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 32, 8026\u20138037."},{"key":"e_1_3_1_42_2","first-page":"777","volume-title":"Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference","author":"Qin Danfeng","year":"2011","unstructured":"Danfeng Qin, Stephan Gammeter, Lukas Bossard, Till Quack, and Luc Van Gool. 2011. Hello neighbor: Accurate object retrieval with k-reciprocal nearest neighbors. In Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference. IEEE, 777\u2013784."},{"key":"e_1_3_1_43_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 8748\u20138763."},{"issue":"8","key":"e_1_3_1_44_2","first-page":"9","article-title":"Language models are unsupervised multitask learners","volume":"1","author":"Radford Alec","year":"2019","unstructured":"Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.","journal-title":"OpenAI blog"},{"key":"e_1_3_1_45_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46475-6_30"},{"key":"e_1_3_1_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503161.3548308"},{"key":"e_1_3_1_47_2","doi-asserted-by":"publisher","DOI":"10.1145\/3492221"},{"key":"e_1_3_1_48_2","doi-asserted-by":"crossref","first-page":"3213","DOI":"10.1109\/TPAMI.2023.3339628","article-title":"Context disentangling and prototype inheriting for robust visual grounding","volume":"46","author":"Tang Wei","year":"2023","unstructured":"Wei Tang, Liang Li, Xuejing Liu, Lu Jin, Jinhui Tang, and Zechao Li. 2023. Context disentangling and prototype inheriting for robust visual grounding. IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (2023), 3213\u20133229.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_1_49_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2024.3365104"},{"key":"e_1_3_1_50_2","unstructured":"Catherine Wah Steve Branson Peter Welinder Pietro Perona and Serge Belongie. 2011. The caltech-ucsd birds-200-2011 dataset. California Institute of Technology."},{"key":"e_1_3_1_51_2","first-page":"2513","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"36","author":"Wang Shijie","year":"2022","unstructured":"Shijie Wang, Zhihui Wang, Haojie Li, and Wanli Ouyang. 2022. Category-specific nuance exploration network for fine-grained object retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, 2513\u20132521."},{"key":"e_1_3_1_52_2","first-page":"3733","volume-title":"Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference","author":"Wu Zhirong","year":"2018","unstructured":"Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference, 3733\u20133742."},{"key":"e_1_3_1_53_2","doi-asserted-by":"crossref","first-page":"1188","DOI":"10.1109\/TMM.2023.3277758","article-title":"Model-guided generative adversarial networks for unsupervised fine-grained image generation","volume":"26","author":"Xiao Jian","year":"2023","unstructured":"Jian Xiao and Xiaojun Bi. 2023. Model-guided generative adversarial networks for unsupervised fine-grained image generation. IEEE Transactions on Multimedia 26 (2023), 1188\u20131199.","journal-title":"IEEE Transactions on Multimedia"},{"key":"e_1_3_1_54_2","first-page":"21969","article-title":"Attribute prototype network for zero-shot learning","volume":"33","author":"Xu Wenjia","year":"2020","unstructured":"Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, and Zeynep Akata. 2020. Attribute prototype network for zero-shot learning. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 33, 21969\u201321980.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems"},{"key":"e_1_3_1_55_2","first-page":"13838","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Xuan Shiyu","year":"2024","unstructured":"Shiyu Xuan, Qingpei Guo, Ming Yang, and Shiliang Zhang. 2024. Pink: Unveiling the power of referential comprehension for multi-modal llms. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 13838\u201313848."},{"key":"e_1_3_1_56_2","first-page":"5457","volume-title":"Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference","author":"Ye Mang","year":"2020","unstructured":"Mang Ye and Jianbing Shen. 2020. Probabilistic structural latent representation for unsupervised embedding. In Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference, 5457\u20135466."},{"key":"e_1_3_1_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2021.3114089"},{"key":"e_1_3_1_58_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Zhou Jinghao","year":"2022","unstructured":"Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. 2022. Image BERT pre-training with online tokenizer. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_1_59_2","first-page":"16816","volume-title":"Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference","author":"Zhou Kaiyang","year":"2022","unstructured":"Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022. Conditional prompt learning for vision-language models. In Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition Conference, 16816\u201316825."},{"key":"e_1_3_1_60_2","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-022-01653-1"},{"key":"e_1_3_1_61_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.displa.2023.102468"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3673657","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3673657","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T00:06:07Z","timestamp":1750291567000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3673657"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,13]]},"references-count":60,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2024,11,30]]}},"alternative-id":["10.1145\/3673657"],"URL":"https:\/\/doi.org\/10.1145\/3673657","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2024,11,13]]},"assertion":[{"value":"2024-01-08","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-06-04","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}