{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T01:58:29Z","timestamp":1760147909810,"version":"build-2065373602"},"reference-count":47,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2023,3,14]],"date-time":"2023-03-14T00:00:00Z","timestamp":1678752000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science foundation of China","doi-asserted-by":"publisher","award":["62003065","CSTB2022NSCQ-MSX1417","KJZD-K202200513"],"award-info":[{"award-number":["62003065","CSTB2022NSCQ-MSX1417","KJZD-K202200513"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Chongqing Natural Science Foundation of China","award":["62003065","CSTB2022NSCQ-MSX1417","KJZD-K202200513"],"award-info":[{"award-number":["62003065","CSTB2022NSCQ-MSX1417","KJZD-K202200513"]}]},{"name":"Science and Technology Project of Chongqing Education Commission","award":["62003065","CSTB2022NSCQ-MSX1417","KJZD-K202200513"],"award-info":[{"award-number":["62003065","CSTB2022NSCQ-MSX1417","KJZD-K202200513"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Zero-shot sketch-based image retrieval (ZS-SBIR) is an important computer vision problem. The image category in the test phase is a new category that was not visible in the training stage. Because sketches are extremely abstract, the commonly used backbone networks (such as VGG-16 and ResNet-50) cannot handle both sketches and photos. Semantic similarities between the same features in photos and sketches are difficult to reflect in deep models without textual assistance. To solve this problem, we propose a novel and effective feature embedding model called Attention Map Feature Fusion (AMFF). The AMFF model combines the excellent feature extraction capability of the ResNet-50 network with the excellent representation ability of the attention network. By processing the residuals of the ResNet-50 network, the attention map is finally obtained without introducing external semantic knowledge. Most previous approaches treat the ZS-SBIR problem as a classification problem, which ignores the huge domain gap between sketches and photos. This paper proposes an effective method to optimize the entire network, called domain-aware triplets (DAT). Domain feature discrimination and semantic feature embedding can be learned through DAT. In this paper, we also use the classification loss function to stabilize the training process to avoid getting trapped in a local optimum. Compared with the state-of-the-art methods, our method shows a superior performance. For example, on the Tu-berlin dataset, we achieved 61.2 + 1.2% Prec200. On the Sketchy_c100 dataset, we achieved 62.3 + 3.3% mAPall and 75.5 + 1.5% Prec100.<\/jats:p>","DOI":"10.3390\/e25030502","type":"journal-article","created":{"date-parts":[[2023,3,15]],"date-time":"2023-03-15T05:22:59Z","timestamp":1678857779000},"page":"502","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":5,"title":["Feature Fusion and Metric Learning Network for Zero-Shot Sketch-Based Image Retrieval"],"prefix":"10.3390","volume":"25","author":[{"given":"Honggang","family":"Zhao","sequence":"first","affiliation":[{"name":"School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mingyue","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mingyong","family":"Li","sequence":"additional","affiliation":[{"name":"School of Computer and Information Science, Chongqing Normal University, Chongqing 401331, China"},{"name":"Chongqing National Center for Applied Mathematics, Chongqing 401331, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2023,3,14]]},"reference":[{"key":"ref_1","unstructured":"Ribeiro, L.S.F., Bui, T., Collomosse, J., and Ponti, M. (2021, January 19\u201325). Scene designer: A unified model for scene search and synthesis from sketch. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Virtual."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"29561","DOI":"10.1007\/s11042-021-11045-1","article-title":"State of the art content based image retrieval techniques using deep learning: A survey","volume":"80","author":"Kapoor","year":"2021","journal-title":"Multimed. Tools Appl."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Yelamarthi, S.K., Reddy, S.K., Mishra, A., and Mittal, A. (2018, January 8\u201314). A zero-shot framework for sketch based image retrieval. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01225-0_19"},{"key":"ref_4","unstructured":"Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_5","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (July, January 26). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA."},{"key":"ref_6","unstructured":"Leal-Taix\u00e9, L., Canton-Ferrer, C., and Schindler, K. (July, January 26). Learning by tracking: Siamese CNN for robust target association. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Dey, S., Riba, P., Dutta, A., Llados, J., and Song, Y.Z. (2019, January 15\u201320). Doodle to search: Practical zero-shot sketch-based image retrieval. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00228"},{"key":"ref_8","unstructured":"Liu, Q., Xie, L., Wang, H., and Yuille, A.L. (November, January 27). Semantic-aware knowledge preservation for zero-shot sketch-based image retrieval. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Zhang, Y., Feng, R., Zhang, T., and Fan, W. (2020, January 7\u201312). Zero-shot sketch-based image retrieval via graph convolution network. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.","DOI":"10.1609\/aaai.v34i07.6993"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Zhu, J., Xu, X., Shen, F., Lee, R.K.W., Wang, Z., and Shen, H.T. (2020, January 6\u201310). Ocean: A dual learning approach for generalized zero-shot sketch-based image retrieval. Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK.","DOI":"10.1109\/ICME46284.2020.9102940"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"104003","DOI":"10.1016\/j.imavis.2020.104003","article-title":"CrossATNet-a novel cross-attention based framework for sketch-based image retrieval","volume":"104","author":"Chaudhuri","year":"2020","journal-title":"Image Vis. Comput."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"8892","DOI":"10.1109\/TIP.2020.3020383","article-title":"Progressive cross-modal semantic network for zero-shot sketch-based image retrieval","volume":"29","author":"Deng","year":"2020","journal-title":"IEEE Trans. Image Process."},{"key":"ref_13","unstructured":"Le, Q., and Mikolov, T. (2014, January 22\u201324). Distributed representations of sentences and documents. Proceedings of the International Conference on Machine Learning, Beijing, China."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Liu, L., Shen, F., Shen, Y., Liu, X., and Shao, L. (2017, January 21\u201326). Deep sketch hashing: Fast free-hand sketch-based image retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.247"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Shen, Y., Liu, L., Shen, F., and Shao, L. (2018, January 18\u201323). Zero-shot sketch-image hashing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00379"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Dutta, A., and Akata, Z. (2019, January 16\u201317). Semantically tied paired cycle consistency for zero-shot sketch-based image retrieval. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00523"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Wang, W., Shi, Y., Chen, S., Peng, Q., Zheng, F., and You, X. (2021, January 19\u201327). Norm-guided Adaptive Visual Embedding for Zero-Shot Sketch-Based Image Retrieval. Proceedings of the IJCAI, Montreal, QC, Canada.","DOI":"10.24963\/ijcai.2021\/153"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"108528","DOI":"10.1016\/j.patcog.2022.108528","article-title":"An efficient framework for zero-shot sketch-based image retrieval","volume":"126","author":"Tursun","year":"2022","journal-title":"Pattern Recognit."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"5711","DOI":"10.1007\/s11063-022-10881-y","article-title":"Energy-Guided Feature Fusion for Zero-Shot Sketch-Based Image Retrieval","volume":"54","author":"Ren","year":"2022","journal-title":"Neural Process. Lett."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Zhang, Z., Zhang, X., Peng, C., Xue, X., and Sun, J. (2018, January 8\u201314). Exfuse: Enhancing feature fusion for semantic segmentation. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01249-6_17"},{"key":"ref_21","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, \u0141., and Polosukhin, I. (2017). Advances in Neural Information Processing Systems, MIT Press."},{"key":"ref_22","unstructured":"Yang, L., Zhang, R.Y., Li, L., and Xie, X. (2021, January 18\u201324). Simam: A simple, parameter-free attention module for convolutional neural networks. Proceedings of the International Conference on Machine Learning, Virtual Event."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Hu, J., Shen, L., and Sun, G. (2018, January 18\u201323). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00745"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Woo, S., Park, J., Lee, J.Y., and Kweon, I.S. (2018, January 8\u201314). Cbam: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01234-2_1"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., and Hu, Q. (2020, January 13\u201319). Supplementary material for \u2018ECA-Net: Efficient channel attention for deep convolutional neural networks. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01155"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Deng, J., Guo, J., Xue, N., and Zafeiriou, S. (2019, January 15\u201320). Arcface: Additive angular margin loss for deep face recognition. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00482"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Luo, H., Gu, Y., Liao, X., Lai, S., and Jiang, W. (2019, January 15\u201320). Bag of tricks and a strong baseline for deep person re-identification. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA.","DOI":"10.1109\/CVPRW.2019.00190"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Sun, Y., Cheng, C., Zhang, Y., Zhang, C., Zheng, L., Wang, Z., and Wei, Y. (2020, January 13\u201319). Circle loss: A unified perspective of pair similarity optimization. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.00643"},{"key":"ref_29","unstructured":"Hermans, A., Beyer, L., and Leibe, B. (2017). In defense of the triplet loss for person re-identification. arXiv."},{"key":"ref_30","first-page":"18661","article-title":"Supervised contrastive learning","volume":"33","author":"Khosla","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Jaiswal, A., Babu, A.R., Zadeh, M.Z., Banerjee, D., and Makedon, F. (2020). A survey on contrastive self-supervised learning. Technologies, 9.","DOI":"10.3390\/technologies9010002"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1109\/CVPR.2006.100","article-title":"Dimensionality reduction by learning an invariant mapping","volume":"Volume 2","author":"Hadsell","year":"2006","journal-title":"Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR\u201906)"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Schroff, F., Kalenichenko, D., and Philbin, J. (2015, January 7\u201312). Facenet: A unified embedding for face recognition and clustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Oh Song, H., Xiang, Y., Jegelka, S., and Savarese, S. (2016, January 21\u201326). Deep metric learning via lifted structured feature embedding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2016.434"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Parkhi, O.M., Vedaldi, A., and Zisserman, A. (2015). Deep Face Recognition, University of Oxford.","DOI":"10.5244\/C.29.41"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"103412","DOI":"10.1016\/j.cviu.2022.103412","article-title":"Zero-shot sketch-based image retrieval with structure-aware asymmetric disentanglement","volume":"218","author":"Li","year":"2022","journal-title":"Comput. Vis. Image Underst."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Liu, R., Yu, Q., and Yu, S.X. (2020, January 23\u201328). Unsupervised sketch to photo synthesis. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58580-8_3"},{"key":"ref_38","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv."},{"key":"ref_39","unstructured":"Zhai, A., and Wu, H.Y. (2018). Classification is a strong baseline for deep metric learning. arXiv."},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Wang, Z., Wang, H., Yan, J., Wu, A., and Deng, C. (2021). Domain-smoothing network for zero-shot sketch-based image retrieval. arXiv.","DOI":"10.24963\/ijcai.2021\/158"},{"key":"ref_41","unstructured":"Huang, Z., Sun, Y., Han, C., Gao, C., and Sang, N. (2021). Modality-Aware Triplet Hard Mining for Zero-shot Sketch-Based Image Retrieval. arXiv."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/2897824.2925954","article-title":"The sketchy database: Learning to retrieve badly drawn bunnies","volume":"35","author":"Sangkloy","year":"2016","journal-title":"ACM Trans. Graph. (TOG)"},{"key":"ref_43","first-page":"1","article-title":"How do humans sketch objects?","volume":"31","author":"Eitz","year":"2012","journal-title":"ACM Trans. Graph. (TOG)"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Felix, R., Reid, I., and Carneiro, G. (2018, January 8\u201314). Multi-modal cycle-consistent generalized zero-shot learning. Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany.","DOI":"10.1007\/978-3-030-01231-1_2"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). Imagenet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Kodirov, E., Xiang, T., and Gong, S. (2017, January 21\u201326). Semantic autoencoder for zero-shot learning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.473"},{"key":"ref_47","doi-asserted-by":"crossref","first-page":"245","DOI":"10.1016\/j.neucom.2022.09.104","article-title":"BDA-SketRet: Bi-level domain adaptation for zero-shot SBIR","volume":"514","author":"Chaudhuri","year":"2022","journal-title":"Neurocomputing"}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/25\/3\/502\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T18:54:52Z","timestamp":1760122492000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/25\/3\/502"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,3,14]]},"references-count":47,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2023,3]]}},"alternative-id":["e25030502"],"URL":"https:\/\/doi.org\/10.3390\/e25030502","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2023,3,14]]}}}