{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,6]],"date-time":"2025-12-06T17:17:45Z","timestamp":1765041465934,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":39,"publisher":"ACM","license":[{"start":{"date-parts":[[2023,12,6]],"date-time":"2023-12-06T00:00:00Z","timestamp":1701820800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,12,6]]},"DOI":"10.1145\/3595916.3626394","type":"proceedings-article","created":{"date-parts":[[2024,1,1]],"date-time":"2024-01-01T16:34:41Z","timestamp":1704126881000},"page":"1-7","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["Adapting Hierarchical Transformer for Scene-Level Sketch-Based Image Retrieval"],"prefix":"10.1145","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8386-5012","authenticated-orcid":false,"given":"Jie","family":"Yang","sequence":"first","affiliation":[{"name":"Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, CN"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3638-7983","authenticated-orcid":false,"given":"Aihua","family":"Ke","sequence":"additional","affiliation":[{"name":"Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, CN"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5261-0191","authenticated-orcid":false,"given":"Bo","family":"Cai","sequence":"additional","affiliation":[{"name":"Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan 
University, CN"}]}],"member":"320","published-online":{"date-parts":[[2024,1]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Layer normalization. arXiv preprint arXiv:1607.06450","author":"Ba Jimmy\u00a0Lei","year":"2016","unstructured":"Jimmy\u00a0Lei Ba, Jamie\u00a0Ryan Kiros, and Geoffrey\u00a0E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016)."},{"key":"e_1_3_2_1_2_1","volume-title":"Speeded-up robust features (SURF). Computer vision and image understanding 110, 3","author":"Bay Herbert","year":"2008","unstructured":"Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van\u00a0Gool. 2008. Speeded-up robust features (SURF). Computer vision and image understanding 110, 3 (2008), 346\u2013359."},{"key":"e_1_3_2_1_3_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 999\u20131008","author":"Bhunia Ayan\u00a0Kumar","year":"2022","unstructured":"Ayan\u00a0Kumar Bhunia, Subhadeep Koley, Abdullah Faiz Ur\u00a0Rahman Khilji, Aneeshan Sain, Pinaki\u00a0Nath Chowdhury, Tao Xiang, and Yi-Zhe Song. 2022. Sketching without worrying: Noise-tolerant sketch-based image retrieval. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition.
999\u20131008."},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00132"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2011.5995460"},{"key":"e_1_3_2_1_6_1","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 2395\u20132405","author":"Chowdhury Pinaki\u00a0Nath","year":"2022","unstructured":"Pinaki\u00a0Nath Chowdhury, Ayan\u00a0Kumar Bhunia, Viswanatha\u00a0Reddy Gajjala, Aneeshan Sain, Tao Xiang, and Yi-Zhe Song. 2022. Partially does it: Towards scene-level fg-sbir with partial input. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition. 2395\u20132405."},{"key":"e_1_3_2_1_7_1","unstructured":"MMEngine Contributors. 2022. MMEngine: OpenMMLab Foundational Library for Training Deep Learning Models. https:\/\/github.com\/open-mmlab\/mmengine. (2022)."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"crossref","unstructured":"Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR\u201905) Vol.\u00a01.
IEEE 886\u2013893.","DOI":"10.1109\/CVPR.2005.177"},{"key":"e_1_3_2_1_9_1","volume-title":"An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_1_10_1","volume-title":"How do humans sketch objects? ACM Transactions on graphics (TOG) 31, 4","author":"Eitz Mathias","year":"2012","unstructured":"Mathias Eitz, James Hays, and Marc Alexa. 2012. How do humans sketch objects? ACM Transactions on graphics (TOG) 31, 4 (2012), 1\u201310."},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cag.2010.07.002"},{"key":"e_1_3_2_1_12_1","volume-title":"Sketch-based image retrieval: Benchmark and bag-of-features descriptors","author":"Eitz Mathias","year":"2010","unstructured":"Mathias Eitz, Kristian Hildebrand, Tamy Boubekeur, and Marc Alexa. 2010. Sketch-based image retrieval: Benchmark and bag-of-features descriptors.
IEEE transactions on visualization and computer graphics 17, 11 (2010), 1624\u20131636."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00522"},{"key":"e_1_3_2_1_14_1","volume-title":"A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477","author":"Ha David","year":"2017","unstructured":"David Ha and Douglas Eck. 2017. A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477 (2017)."},{"key":"e_1_3_2_1_15_1","volume-title":"Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415","author":"Hendrycks Dan","year":"2016","unstructured":"Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016)."},{"key":"e_1_3_2_1_16_1","volume-title":"International Conference on Machine Learning. PMLR, 2790\u20132799","author":"Houlsby Neil","year":"2019","unstructured":"Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De\u00a0Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning. PMLR, 2790\u20132799."},{"key":"e_1_3_2_1_17_1","volume-title":"Lora: Low-rank adaptation of large language models.
arXiv preprint arXiv:2106.09685","author":"Hu J","year":"2021","unstructured":"Edward\u00a0J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2013.02.005"},{"key":"e_1_3_2_1_19_1","volume-title":"2011 18th IEEE International Conference on Image Processing. IEEE, 3661\u20133664","author":"Hu Rui","year":"2011","unstructured":"Rui Hu, Tinghuai Wang, and John Collomosse. 2011. A bag-of-regions approach to sketch-based image retrieval. In 2011 18th IEEE International Conference on Image Processing. IEEE, 3661\u20133664."},{"key":"e_1_3_2_1_20_1","volume-title":"E-Learning and Games: 10th International Conference, Edutainment 2016","author":"Hu Shijie","year":"2016","unstructured":"Shijie Hu, Hongxin Zhang, Sanyuan Zhang, Zishuo Fang, and Qi Huang. 2016. Sketch-Based Retrieval in Large-Scale Image Database via Position-Aware Silhouette Matching. In E-Learning and Games: 10th International Conference, Edutainment 2016, Hangzhou, China, April 14-16, 2016, Revised Selected Papers 10.
Springer, 243\u2013256."},{"key":"e_1_3_2_1_21_1","volume-title":"How much position information do convolutional neural networks encode? arXiv preprint arXiv:2001.08248","author":"Islam Md\u00a0Amirul","year":"2020","unstructured":"Md\u00a0Amirul Islam, Sen Jia, and Neil\u00a0DB Bruce. 2020. How much position information do convolutional neural networks encode? arXiv preprint arXiv:2001.08248 (2020)."},{"key":"e_1_3_2_1_22_1","volume-title":"Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190","author":"Li Xiang\u00a0Lisa","year":"2021","unstructured":"Xiang\u00a0Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021)."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"crossref","unstructured":"Yi Li, Timothy\u00a0M Hospedales, Yi-Zhe Song, and Shaogang Gong. 2014. Fine-grained sketch-based image retrieval by matching deformable part models. (2014).","DOI":"10.5244\/C.28.115"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2022.3175403"},{"key":"e_1_3_2_1_25_1","volume-title":"Proceedings, Part XIX 16","author":"Liu Fang","year":"2020","unstructured":"Fang Liu, Changqing Zou, Xiaoming Deng, Ran Zuo, Yu-Kun Lai, Cuixia Ma, Yong-Jin Liu, and Hongan Wang. 2020. Scenesketcher: Fine-grained image retrieval with scene sketches. In Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part XIX 16.
Springer, 718\u2013734."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1023\/B:VISI.0000029664.99615.94"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"crossref","unstructured":"Bryan\u00a0James Prosser Wei-Shi Zheng Shaogang Gong Tao Xiang Q Mary 2010. Person re-identification by support vector ranking. In BMVC Vol.\u00a02. 6.","DOI":"10.5244\/C.24.21"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICIP.2016.7532801"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1145\/2897824.2925954"},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.592"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.308"},{"key":"e_1_3_2_1_33_1","first-page":"12077","article-title":"SegFormer: Simple and efficient design for semantic segmentation with transformers","volume":"34","author":"Xie Enze","year":"2021","unstructured":"Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose\u00a0M Alvarez, and Ping Luo. 2021. SegFormer: Simple and efficient design for semantic segmentation with transformers.
Advances in Neural Information Processing Systems 34 (2021), 12077\u201312090.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.93"},{"key":"e_1_3_2_1_35_1","volume-title":"Sketch-a-net: A deep neural network that beats humans. International journal of computer vision 122","author":"Yu Qian","year":"2017","unstructured":"Qian Yu, Yongxin Yang, Feng Liu, Yi-Zhe Song, Tao Xiang, and Timothy\u00a0M Hospedales. 2017. Sketch-a-net: A deep neural network that beats humans. International journal of computer vision 122 (2017), 411\u2013425."},{"key":"e_1_3_2_1_36_1","volume-title":"Sketch-a-net that beats humans. arXiv preprint arXiv:1501.07873","author":"Yu Qian","year":"2015","unstructured":"Qian Yu, Yongxin Yang, Yi-Zhe Song, Tao Xiang, and Timothy Hospedales. 2015. Sketch-a-net that beats humans. arXiv preprint arXiv:1501.07873 (2015)."},{"key":"e_1_3_2_1_37_1","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence, Vol.\u00a034","author":"Zhang Zhaolong","year":"2020","unstructured":"Zhaolong Zhang, Yuejie Zhang, Rui Feng, Tao Zhang, and Weiguo Fan. 2020. Zero-shot sketch-based image retrieval via graph convolution network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol.\u00a034.
12943\u201312950."},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-018-1140-0"},{"key":"e_1_3_2_1_39_1","volume-title":"Proceedings of the european conference on computer vision (ECCV). 421\u2013436","author":"Zou Changqing","year":"2018","unstructured":"Changqing Zou, Qian Yu, Ruofei Du, Haoran Mo, Yi-Zhe Song, Tao Xiang, Chengying Gao, Baoquan Chen, and Hao Zhang. 2018. Sketchyscene: Richly-annotated scene sketches. In Proceedings of the european conference on computer vision (ECCV). 421\u2013436."}],"event":{"name":"MMAsia '23: ACM Multimedia Asia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Tainan Taiwan","acronym":"MMAsia '23"},"container-title":["ACM Multimedia Asia 2023"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3595916.3626394","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3595916.3626394","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:35:55Z","timestamp":1750178155000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3595916.3626394"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,12,6]]},"references-count":39,"alternative-id":["10.1145\/3595916.3626394","10.1145\/3595916"],"URL":"https:\/\/doi.org\/10.1145\/3595916.3626394","relation":{},"subject":[],"published":{"date-parts":[[2023,12,6]]},"assertion":[{"value":"2024-01-01","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}