{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,18]],"date-time":"2026-03-18T13:27:48Z","timestamp":1773840468429,"version":"3.50.1"},"reference-count":64,"publisher":"Association for Computing Machinery (ACM)","issue":"11","license":[{"start":{"date-parts":[[2024,11,13]],"date-time":"2024-11-13T00:00:00Z","timestamp":1731456000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2024,11,30]]},"abstract":"<jats:p>\n            Compositional Zero-shot Learning (CZSL) attempts to recognise images of new compositions of states and objects when images of only a subset of state-object compositions are available as training data. An example of CZSL is to recognise images of\n            <jats:italic>peeled apple<\/jats:italic>\n            by a model when it is trained using images of\n            <jats:italic>peeled orange<\/jats:italic>\n            ,\n            <jats:italic>ripe apple<\/jats:italic>\n            and\n            <jats:italic>ripe orange<\/jats:italic>\n            . There are two major challenges in solving CZSL. First, the visual features of a state vary depending on the context of a state-object composition. For example state like\n            <jats:italic>ripe<\/jats:italic>\n            produces distinct visual properties in the compositions\n            <jats:italic>ripe orange<\/jats:italic>\n            and\n            <jats:italic>ripe banana<\/jats:italic>\n            . Hence, understanding the context dependency of state features is a necessary requirement to solve CZSL. Second, the extent of association between the features of a state and an object varies significantly in different images of same composition. 
For example, in different images of\n            <jats:italic>peeled oranges<\/jats:italic>\n            , the\n            <jats:italic>oranges<\/jats:italic>\n            may be\n            <jats:italic>peeled<\/jats:italic>\n            to different extents. As a consequence, the visual features of images of the class\n            <jats:italic>peeled orange<\/jats:italic>\n            may vary. Hence, there exists a significant amount of intra-class variability among the visual features of different images of a composition. Existing approaches merely look for the existence or absence of the features of a particular state or object in a composition. Our approach looks not only for the existence of particular state or object features but also for the extent of association between state and object features, to better tackle the intra-class variability in the visual features of compositional images. The proposed architecture is constructed using a novel\n            <jats:italic>Knowledge Guided Transformer<\/jats:italic>\n            . The transformer-based framework is utilised to model the larger context dependency between the state and the object. 
Extensive experiments on C-GQA, MIT-States and UT-Zappos50k datasets demonstrate the superiority of the proposed approach in comparison with the state-of-the-art in both open-world and closed-world CZSL settings.\n          <\/jats:p>","DOI":"10.1145\/3687129","type":"journal-article","created":{"date-parts":[[2024,8,9]],"date-time":"2024-08-09T16:47:30Z","timestamp":1723222050000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":2,"title":["Knowledge Guided Transformer Network for Compositional Zero-Shot Learning"],"prefix":"10.1145","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-5476-1839","authenticated-orcid":false,"given":"Aditya","family":"Panda","sequence":"first","affiliation":[{"name":"Indian Statistical Institute, Kolkata, India"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5553-2790","authenticated-orcid":false,"given":"Dipti Prasad","family":"Mukherjee","sequence":"additional","affiliation":[{"name":"Indian Statistical Institute, Kolkata, India"}]}],"member":"320","published-online":{"date-parts":[[2024,11,13]]},"reference":[{"key":"e_1_3_2_2_2","first-page":"29302","volume-title":"Proceedings of the National Academy of Sciences","volume":"117","author":"Allen Kelsey R.","year":"2020","unstructured":"Kelsey R. Allen, Kevin A. Smith, and Joshua B. Tenenbaum. 2020. Rapid trial-and-error learning with simulation supports flexible tool use and physical reasoning. Proceedings of the National Academy of Sciences 117, 47 (2020), 29302\u201329310."},{"key":"e_1_3_2_3_2","first-page":"1462","volume-title":"Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS \u201920)","volume":"33","author":"Atzmon Yuval","year":"2020","unstructured":"Yuval Atzmon, Felix Kreuk, Uri Shalit, and Gal Chechik. 2020. A causal view of compositional zero-shot recognition. 
In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS \u201920), Vol. 33, 1462\u20131473."},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_3_2_5_2","first-page":"52","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Chao Wei-Lun","year":"2016","unstructured":"Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. 2016. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In Proceedings of the European Conference on Computer Vision. Springer, 52\u201368."},{"key":"e_1_3_2_6_2","first-page":"7069","volume-title":"Proceedings of the Association for the Advancement of Artificial Intelligence Conference","volume":"37","author":"Chen Xi","year":"2023","unstructured":"Xi Chen, Cheng Ge, Ming Wang, and Jin Wang. 2023. Supervised contrastive few-shot learning for high-frequency time series. In Proceedings of the Association for the Advancement of Artificial Intelligence Conference, Vol. 37, 7069\u20137077."},{"key":"e_1_3_2_7_2","first-page":"7612","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Shiming Chen","year":"2022","unstructured":"Chen Shiming, Hong Ziming, Xie Guo-Sen, Yang Wenhan, Peng Qinmu, Wang Kai, Zhao Jian, and You Xinge. 2022. MSDN: Mutually semantic distillation network for zero-shot learning. In Proceedings of the IEEE Computer Vision and Pattern Recognition, 7612\u20137621."},{"issue":"4","key":"e_1_3_2_8_2","doi-asserted-by":"crossref","first-page":"4516","DOI":"10.1109\/TNNLS.2022.3155602","article-title":"GNDAN: Graph navigated dual attention network for zero-shot learning","volume":"35","author":"Shiming Chen","year":"2022","unstructured":"Chen Shiming, Hong Ziming, Xie Guosen, Peng Qinmu, You Xinge, Ding Weiping, and Shao Ling. 2022. GNDAN: Graph navigated dual attention network for zero-shot learning. 
IEEE Transactions on Neural Networks and Learning Systems 35, 4 (2022), 4516\u20134529.","journal-title":"IEEE Transactions on Neural Networks and Learning Systems"},{"key":"e_1_3_2_9_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly Jakob Uszkoreit and Neil Houlsby. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv: 2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_2_10_2","doi-asserted-by":"publisher","DOI":"10.1016\/0893-6080(89)90003-8"},{"key":"e_1_3_2_11_2","first-page":"12259","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Graham Benjamin","year":"2021","unstructured":"Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Herv\u00e9 J\u00e9gou, and Matthijs Douze. 2021. LeViT: A vision transformer in ConvNet\u2019s clothing for faster inference. In Proceedings of the IEEE International Conference on Computer Vision, 12259\u201312269."},{"key":"e_1_3_2_12_2","first-page":"15908","volume-title":"Proceedings of the Conference on Neural Information Processing Systems (NeurIPS \u201920)","volume":"34","author":"Han Kai","year":"2021","unstructured":"Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. 2021. Transformer in transformer. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS \u201920), Vol. 34, 15908\u201315919."},{"key":"e_1_3_2_13_2","first-page":"770","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"He Kaiming","year":"2016","unstructured":"Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. 
In Proceedings of the IEEE Computer Vision and Pattern Recognition, 770\u2013778."},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.1016\/0893-6080(89)90020-8"},{"key":"e_1_3_2_15_2","first-page":"1383","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Isola Phillip","year":"2015","unstructured":"Phillip Isola, Joseph J. Lim, and Edward H. Adelson. 2015. Discovering states and transformations in image collections. In Proceedings of the IEEE Computer Vision and Pattern Recognition, 1383\u20131391."},{"key":"e_1_3_2_16_2","first-page":"9336","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Karthik Shyamgopal","year":"2022","unstructured":"Shyamgopal Karthik, Massimiliano Mancini, and Zeynep Akata. 2022. KG-SP: Knowledge guided simple primitives for open world compositional zero-shot learning. In Proceedings of the IEEE Computer Vision and Pattern Recognition, 9336\u20139345."},{"key":"e_1_3_2_17_2","first-page":"8282","volume-title":"Proceedings of the Association for the Advancement of Artificial Intelligence","volume":"37","author":"Kim Seong-Woong","year":"2023","unstructured":"Seong-Woong Kim and Dong-Wan Choi. 2023. Better generalized few-shot learning even without base data. In Proceedings of the Association for the Advancement of Artificial Intelligence, Vol. 37. 8282\u20138290."},{"key":"e_1_3_2_18_2","volume-title":"Proceedings of the International Conference on Learning Representations (Poster)","author":"Kingma Diederik P.","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (Poster)."},{"key":"e_1_3_2_19_2","unstructured":"Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv: 1609.02907. 
Retrieved from https:\/\/arxiv.org\/abs\/1609.02907"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1017\/S0140525X16001837"},{"issue":"6","key":"e_1_3_2_21_2","first-page":"1","article-title":"Transformer-based visual grounding with cross-modality interaction","volume":"19","author":"Li Kun","year":"2023","unstructured":"Kun Li, Jiaxiu Li, Dan Guo, Xun Yang, and Meng Wang. 2023. Transformer-based visual grounding with cross-modality interaction. ACM Transactions on Multimedia Computing, Communications and Applications 19, 6 (2023), 1\u201319.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_2_22_2","unstructured":"Lin Li Guikun Chen Jun Xiao and Long Chen. 2023. Compositional zero-shot learning via progressive language-based observations. arXiv: 2311.14749. Retrieved from https:\/\/arxiv.org\/abs\/2311.14749"},{"key":"e_1_3_2_23_2","first-page":"9326","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Li Xiangyu","year":"2022","unstructured":"Xiangyu Li, Xu Yang, Kun Wei, Cheng Deng, and Muli Yang. 2022. Siamese contrastive embedding network for compositional zero-shot learning. In Proceedings of the IEEE Computer Vision and Pattern Recognition, 9326\u20139335."},{"key":"e_1_3_2_24_2","first-page":"1782","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Li Yun","year":"2023","unstructured":"Yun Li, Zhe Liu, Saurav Jha, and Lina Yao. 2023. Distilled reverse attention network for open-world compositional zero-shot learning. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 1782\u20131791."},{"key":"e_1_3_2_25_2","first-page":"11316","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Li Yong-Lu","year":"2020","unstructured":"Yong-Lu Li, Yue Xu, Xiaohan Mao, and Cewu Lu. 2020. Symmetry and group in attribute-object compositions. 
In Proceedings of the IEEE Computer Vision and Pattern Recognition, 11316\u201311325."},{"issue":"1","key":"e_1_3_2_26_2","first-page":"543","article-title":"Simple primitives with feasibility- and contextuality-dependence for open-world compositional zero-shot learning","volume":"46","author":"Liu Zhe","year":"2023","unstructured":"Zhe Liu, Yun Li, Lina Yao, Xiaojun Chang, Wei Fang, Xiaojun Wu, and Abdulmotaleb El Saddik. 2023. Simple primitives with feasibility- and contextuality-dependence for open-world compositional zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 46, 1 (2023), 543\u2013560.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_2_27_2","first-page":"23560","volume-title":"Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition","author":"Lu Xiaocheng","year":"2023","unstructured":"Xiaocheng Lu, Song Guo, Ziming Liu, and Jingcai Guo. 2023. Decomposed soft prompt guided fusion enhancing for compositional zero-shot learning. In Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition, 23560\u201323569."},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3503927"},{"key":"e_1_3_2_29_2","first-page":"5222","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Mancini Massimiliano","year":"2021","unstructured":"Massimiliano Mancini, Muhammad Ferjad Naeem, Yongqin Xian, and Zeynep Akata. 2021. Open world compositional zero-shot learning. In Proceedings of the IEEE Computer Vision and Pattern Recognition, 5222\u20135230."},{"key":"e_1_3_2_30_2","first-page":"12042","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Mao Xiaofeng","year":"2022","unstructured":"Xiaofeng Mao, Gege Qi, Yuefeng Chen, Xiaodan Li, Ranjie Duan, Shaokai Ye, Yuan He, and Hui Xue. 2022. Towards robust vision transformer. 
In Proceedings of the IEEE Computer Vision and Pattern Recognition, 12042\u201312051."},{"key":"e_1_3_2_31_2","unstructured":"Tomas Mikolov Kai Chen Greg Corrado and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv: 1301.3781. Retrieved from https:\/\/arxiv.org\/abs\/1301.3781"},{"key":"e_1_3_2_32_2","first-page":"1792","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Misra Ishan","year":"2017","unstructured":"Ishan Misra, Abhinav Gupta, and Martial Hebert. 2017. From red wine to red tomato: Composition with context. In Proceedings of the IEEE Computer Vision and Pattern Recognition, 1792\u20131801."},{"key":"e_1_3_2_33_2","first-page":"953","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Naeem Muhammad Ferjad","year":"2021","unstructured":"Muhammad Ferjad Naeem, Yongqin Xian, Federico Tombari, and Zeynep Akata. 2021. Learning graph embeddings for compositional zero-shot learning. In Proceedings of the IEEE Computer Vision and Pattern Recognition, 953\u2013962."},{"key":"e_1_3_2_34_2","first-page":"169","volume-title":"Proceedings of the European Conference on Computer Vision","author":"Nagarajan Tushar","year":"2018","unstructured":"Tushar Nagarajan and Kristen Grauman. 2018. Attributes as operators: Factorizing unseen attribute-object compositions. In Proceedings of the European Conference on Computer Vision, 169\u2013185."},{"key":"e_1_3_2_35_2","first-page":"8811","volume-title":"Proceedings of the Association for the Advancement of Artificial Intelligence","volume":"33","author":"Nan Zhixiong","year":"2019","unstructured":"Zhixiong Nan, Yang Liu, Nanning Zheng, and Song-Chun Zhu. 2019. Recognizing unseen attribute-object pair with generative model. In Proceedings of the Association for the Advancement of Artificial Intelligence, Vol. 
33, 8811\u20138818."},{"key":"e_1_3_2_36_2","volume-title":"Proceedings of the 11th International Conference on Learning Representations","author":"Nayak Nihal V.","year":"2022","unstructured":"Nihal V. Nayak, Peilin Yu, and Stephen Bach. 2022. Learning to compose soft prompts for compositional zero-shot learning. In Proceedings of the 11th International Conference on Learning Representations."},{"key":"e_1_3_2_37_2","unstructured":"Thao Nguyen Maithra Raghu and Simon Kornblith. 2020. Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth. arXiv: 2010.15327. Retrieved from https:\/\/arxiv.org\/abs\/2010.15327"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2023.10991"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP46576.2022.9897457"},{"issue":"5","key":"e_1_3_2_40_2","doi-asserted-by":"crossref","first-page":"1571","DOI":"10.1109\/TETCI.2022.3232816","article-title":"Isolating features of object and its state for compositional zero-shot learning","volume":"7","author":"Panda Aditya","year":"2023","unstructured":"Aditya Panda, Bikash Santra, and Dipti Prasad Mukherjee. 2023. Isolating features of object and its state for compositional zero-shot learning. IEEE Transactions on Emerging Topics in Computational Intelligence 7, 5 (2023), 1571\u20131583.","journal-title":"IEEE Transactions on Emerging Topics in Computational Intelligence"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"issue":"4","key":"e_1_3_2_42_2","first-page":"4051","article-title":"A review of generalized zero-shot learning methods","volume":"45","author":"Farhad Pourpanah","year":"2023","unstructured":"Pourpanah Farhad, Abdar Moloud, Luo Yuxuan, Zhou Xinlei, Wang Ran, Lim Chee Peng, Wang Xi-Zhao, and Wu Q. M. Jonathan. 2023. A review of generalized zero-shot learning methods. 
IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 4 (2023), 4051\u20134070.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_2_43_2","first-page":"3593","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Purushwalkam Senthil","year":"2019","unstructured":"Senthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, and Marc\u2019Aurelio Ranzato. 2019. Task-driven modular networks for zero-shot compositional learning. In Proceedings of the IEEE International Conference on Computer Vision, 3593\u20133602."},{"key":"e_1_3_2_44_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 8748\u20138763."},{"key":"e_1_3_2_45_2","first-page":"12116","volume-title":"Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS \u201921)","volume":"34","author":"Raghu Maithra","year":"2021","unstructured":"Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. 2021. Do vision transformers see like convolutional neural networks? In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS \u201921), Vol. 
34, 12116\u201312128."},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.1145\/3514247"},{"key":"e_1_3_2_47_2","first-page":"10641","volume-title":"Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS \u201921)","volume":"34","author":"Ruis Frank","year":"2021","unstructured":"Frank Ruis, Gertjan Burghouts, and Doina Bucur. 2021. Independent prototype propagation for zero-shot compositionality. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS \u201921), Vol. 34, 10641\u201310653."},{"key":"e_1_3_2_48_2","first-page":"13658","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Saini Nirat","year":"2022","unstructured":"Nirat Saini, Khoi Pham, and Abhinav Shrivastava. 2022. Disentangling visual embeddings for attributes and objects. In Proceedings of the IEEE Computer Vision and Pattern Recognition, 13658\u201313667."},{"key":"e_1_3_2_49_2","first-page":"1","article-title":"Markov chain Monte Carlo with people","volume":"20","author":"Sanborn Adam","year":"2007","unstructured":"Adam Sanborn and Thomas Griffiths. 2007. Markov chain Monte Carlo with people. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS \u201907), Vol. 20, 1\u20138.","journal-title":"Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS \u201907)"},{"issue":"1","key":"e_1_3_2_50_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3603147","article-title":"DiRaC-I: Identifying diverse and rare training classes for zero-shot learning","volume":"20","author":"Sarma Sandipan","year":"2023","unstructured":"Sandipan Sarma and Arijit Sur. 2023. DiRaC-I: Identifying diverse and rare training classes for zero-shot learning. 
ACM Transactions on Multimedia Computing, Communications and Applications 20, 1 (2023), 1\u201323.","journal-title":"ACM Transactions on Multimedia Computing, Communications and Applications"},{"key":"e_1_3_2_51_2","doi-asserted-by":"crossref","unstructured":"Robyn Speer Joshua Chin and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge, 4444\u20134451. Retrieved from http:\/\/aaai.org\/ocs\/index.php\/AAAI\/AAAI17\/paper\/view\/14972","DOI":"10.1609\/aaai.v31i1.11164"},{"issue":"6","key":"e_1_3_2_52_2","article-title":"Unifying dual-attention and siamese transformer network for full-reference image quality assessment","volume":"19","author":"Tang Zhenjun","year":"2023","unstructured":"Zhenjun Tang, Zhiyuan Chen, Zhixin Li, Bineng Zhong, Xianquan Zhang, and Xinpeng Zhang. 2023. Unifying dual-attention and siamese transformer network for full-reference image quality assessment. ACM Transactions on Multimedia Computing, Communications, and Applications, 19, 6 (November 2023).","journal-title":"ACM Transactions on Multimedia Computing, Communications, and Applications"},{"key":"e_1_3_2_53_2","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Tian Yonglong","year":"2019","unstructured":"Yonglong Tian, Dilip Krishnan, and Phillip Isola. 2019. Contrastive representation distillation. In Proceedings of the International Conference on Learning Representations."},{"key":"e_1_3_2_54_2","first-page":"1","volume-title":"Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS \u201917)","volume":"30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS \u201917), Vol. 
30, 1\u201311."},{"key":"e_1_3_2_55_2","first-page":"11197","volume-title":"Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition","author":"Wang Qingsheng","year":"2023","unstructured":"Qingsheng Wang, Lingqiao Liu, Chenchen Jing, Hao Chen, Guoqiang Liang, Peng Wang, and Chunhua Shen. 2023. Learning conditional attributes for compositional zero-shot learning. In Proceedings of the IEEE\/CVF Computer Vision and Pattern Recognition, 11197\u201311206."},{"key":"e_1_3_2_56_2","first-page":"3741","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Wei Kun","year":"2019","unstructured":"Kun Wei, Muli Yang, Hao Wang, Cheng Deng, and Xianglong Liu. 2019. Adversarial fine-grained composition learning for unseen attribute-object recognition. In Proceedings of the IEEE International Conference on Computer Vision, 3741\u20133749."},{"key":"e_1_3_2_57_2","first-page":"4582","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Xian Yongqin","year":"2017","unstructured":"Yongqin Xian, Bernt Schiele, and Zeynep Akata. 2017. Zero-shot learning-the good, the bad and the ugly. In Proceedings of the IEEE Computer Vision and Pattern Recognition, 4582\u20134591."},{"key":"e_1_3_2_58_2","first-page":"10248","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Yang Muli","year":"2020","unstructured":"Muli Yang, Cheng Deng, Junchi Yan, Xianglong Liu, and Dacheng Tao. 2020. Learning unseen concepts via hierarchical decomposition and composition. In Proceedings of the IEEE Computer Vision and Pattern Recognition, 10248\u201310256."},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3200578"},{"key":"e_1_3_2_60_2","first-page":"192","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Yu Aron","year":"2014","unstructured":"Aron Yu and Kristen Grauman. 2014. 
Fine-grained visual comparisons with local learning. In Proceedings of the IEEE Computer Vision and Pattern Recognition, 192\u2013199."},{"key":"e_1_3_2_61_2","first-page":"5570","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Yu Aron","year":"2017","unstructured":"Aron Yu and Kristen Grauman. 2017. Semantic jitter: Dense supervision for visual comparisons via synthetic images. In Proceedings of the IEEE International Conference on Computer Vision, 5570\u20135579."},{"key":"e_1_3_2_62_2","first-page":"579","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Yuan Kun","year":"2021","unstructured":"Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, and Wei Wu. 2021. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE International Conference on Computer Vision, 579\u2013588."},{"key":"e_1_3_2_63_2","first-page":"558","volume-title":"Proceedings of the IEEE International Conference on Computer Vision","author":"Yuan Li","year":"2021","unstructured":"Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis E. H. Tay, Jiashi Feng, and Shuicheng Yan. 2021. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In Proceedings of the IEEE International Conference on Computer Vision, 558\u2013567."},{"key":"e_1_3_2_64_2","first-page":"11304","volume-title":"Proceedings of the IEEE Computer Vision and Pattern Recognition","author":"Zhang Bowen","year":"2022","unstructured":"Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen, Yong Wang, and Baining Guo. 2022. StyleSwin: Transformer-based GAN for high-resolution image generation. 
In Proceedings of the IEEE Computer Vision and Pattern Recognition, 11304\u201311314."},{"key":"e_1_3_2_65_2","first-page":"11461","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","volume":"37","author":"Zhu Lin","year":"2023","unstructured":"Lin Zhu, Xinbing Wang, Chenghu Zhou, and Nanyang Ye. 2023. Bayesian cross-modal alignment learning for few-shot out-of-distribution generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 11461\u201311469."}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3687129","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3687129","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:10:21Z","timestamp":1750295421000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3687129"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,11,13]]},"references-count":64,"journal-issue":{"issue":"11","published-print":{"date-parts":[[2024,11,30]]}},"alternative-id":["10.1145\/3687129"],"URL":"https:\/\/doi.org\/10.1145\/3687129","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,11,13]]},"assertion":[{"value":"2024-01-26","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-07-29","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2024-11-13","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}