{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,1]],"date-time":"2025-10-01T15:18:53Z","timestamp":1759331933494,"version":"3.41.0"},"reference-count":65,"publisher":"Association for Computing Machinery (ACM)","issue":"1","license":[{"start":{"date-parts":[[2023,5,26]],"date-time":"2023-05-26T00:00:00Z","timestamp":1685059200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["No. U21B2009, 62172039 and L1924068"],"award-info":[{"award-number":["No. U21B2009, 62172039 and L1924068"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Key R\\&D Plan","award":["No. 2020AAA0106600"],"award-info":[{"award-number":["No. 2020AAA0106600"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2023,5,26]]},"abstract":"<jats:p>Recently, to improve the unsupervised image retrieval performance, plenty of unsupervised hashing methods have been proposed by designing a semantic similarity matrix, which is based on the similarities between image features extracted by a pre-trained CNN model. However, most of these methods tend to ignore high-level abstract semantic concepts contained in images. Intuitively, concepts play an important role in calculating the similarity among images. In real-world scenarios, each image is associated with some concepts, and the similarity between two images will be larger if they share more identical concepts. Inspired by the above intuition, in this work, we propose a novel Unsupervised Hashing with Semantic Concept Mining, called UHSCM, which leverages a VLP model to construct a high-quality similarity matrix. Specifically, a set of randomly chosen concepts is first collected. Then, by employing a vision-language pretraining (VLP) model with the prompt engineering which has shown strong power in visual representation learning, the set of concepts is denoised according to the training images. Next, the proposed method UHSCM applies the VLP model with prompting again to mine the concept distribution of each image and construct a high-quality semantic similarity matrix based on the mined concept distributions. Finally, with the semantic similarity matrix as guiding information, a novel hashing loss with a modified contrastive loss based regularization item is proposed to optimize the hashing network. Extensive experiments on three benchmark datasets show that the proposed method outperforms the state-of-the-art baselines in the image retrieval task.<\/jats:p>","DOI":"10.1145\/3588683","type":"journal-article","created":{"date-parts":[[2023,5,30]],"date-time":"2023-05-30T17:42:05Z","timestamp":1685468525000},"page":"1-19","source":"Crossref","is-referenced-by-count":9,"title":["Unsupervised Hashing with Semantic Concept Mining"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9567-159X","authenticated-orcid":false,"given":"Rong-Cheng","family":"Tu","sequence":"first","affiliation":[{"name":"Beijing Institute of Technology, Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications, &amp; Beijing Institute of Technology Southeast Academy of Information Technology, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-6795-2311","authenticated-orcid":false,"given":"Xian-Ling","family":"Mao","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-2568-2346","authenticated-orcid":false,"given":"Kevin Qinghong","family":"Lin","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-7785-4118","authenticated-orcid":false,"given":"Chengfei","family":"Cai","sequence":"additional","affiliation":[{"name":"Zhejiang University, Zhejiang, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7005-7614","authenticated-orcid":false,"given":"Weize","family":"Qin","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4488-0102","authenticated-orcid":false,"given":"Wei","family":"Wei","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8230-9471","authenticated-orcid":false,"given":"Hongfa","family":"Wang","sequence":"additional","affiliation":[{"name":"Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0320-7520","authenticated-orcid":false,"given":"Heyan","family":"Huang","sequence":"additional","affiliation":[{"name":"Beijing Institute of Technology, Beijing, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2023,5,30]]},"reference":[{"key":"e_1_2_2_1_1","volume-title":"Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. arXiv preprint arXiv:2104.11178","author":"Akbari Hassan","year":"2021","unstructured":"Hassan Akbari, Linagzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. 2021. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. arXiv preprint arXiv:2104.11178 (2021)."},{"key":"e_1_2_2_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00134"},{"key":"e_1_2_2_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.598"},{"key":"e_1_2_2_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1646396.1646452"},{"key":"e_1_2_2_5_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_2_2_6_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.01196"},{"key":"e_1_2_2_8_1","first-page":"518","article-title":"Similarity search in high dimensions via hashing","volume":"99","author":"Gionis Aristides","year":"1999","unstructured":"Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. 1999. Similarity search in high dimensions via hashing. In Vldb, Vol. 99. 518--529.","journal-title":"Vldb"},{"key":"e_1_2_2_9_1","volume-title":"Proceedings of the thirteenth international conference on artificial intelligence and statistics. 249--256","author":"Glorot Xavier","year":"2010","unstructured":"Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. 249--256."},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2012.193"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/1835804.1835946"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00537"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3126686.3126773"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1460096.1460104"},{"key":"e_1_2_2_15_1","volume-title":"Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918","author":"Jia Chao","year":"2021","unstructured":"Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918 (2021)."},{"key":"e_1_2_2_16_1","unstructured":"Alex Krizhevsky Geoffrey Hinton et al. 2009. Learning multiple layers of features from tiny images. Technical Report. Citeseer."},{"key":"e_1_2_2_17_1","volume-title":"Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, Vol. 25 (2012), 1097--1105."},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00446"},{"key":"e_1_2_2_19_1","volume-title":"Advances in Neural Information Processing Systems","volume":"34","author":"Li Junnan","year":"2021","unstructured":"Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, Vol. 34 (2021)."},{"key":"e_1_2_2_20_1","volume-title":"Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557","author":"Li Liunian Harold","year":"2019","unstructured":"Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)."},{"key":"e_1_2_2_21_1","volume-title":"Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence. AAAI Press, 1711--1717","author":"Li Wu-Jun","year":"2016","unstructured":"Wu-Jun Li, Sheng Wang, and Wang-Cheng Kang. 2016. Feature learning based deep supervised hashing with pairwise labels. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence. AAAI Press, 1711--1717."},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.133"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3482247"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/3459637.3482247"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_2_2_26_1","volume-title":"Discrete graph hashing. Advances in neural information processing systems","author":"Liu Wei","year":"2014","unstructured":"Wei Liu, Cun Mu, Sanjiv Kumar, and Shih-Fu Chang. 2014. Discrete graph hashing. Advances in neural information processing systems, Vol. 27 (2014)."},{"key":"e_1_2_2_27_1","doi-asserted-by":"publisher","DOI":"10.5555\/2354409.2355047"},{"key":"e_1_2_2_28_1","unstructured":"Wei Liu Jun Wang Sanjiv Kumar and Shih-Fu Chang. 2011. Hashing with graphs. In Icml."},{"key":"e_1_2_2_29_1","volume-title":"Compact hyperplane hashing with bilinear functions. arXiv preprint arXiv:1206.4618","author":"Liu Wei","year":"2012","unstructured":"Wei Liu, Jun Wang, Yadong Mu, Sanjiv Kumar, and Shih-Fu Chang. 2012b. Compact hyperplane hashing with bilinear functions. arXiv preprint arXiv:1206.4618 (2012)."},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/MMUL.2016.39"},{"key":"e_1_2_2_31_1","volume-title":"A survey on deep hashing methods. ACM Transactions on Knowledge Discovery from Data (TKDD)","author":"Luo Xiao","year":"2020","unstructured":"Xiao Luo, Haixin Wang, Daqing Wu, Chong Chen, Minghua Deng, Jianqiang Huang, and Xian-Sheng Hua. 2020a. A survey on deep hashing methods. ACM Transactions on Knowledge Discovery from Data (TKDD) (2020)."},{"key":"e_1_2_2_32_1","volume-title":"Cimon: Towards high-quality hash codes. arXiv preprint arXiv:2010.07804","author":"Luo Xiao","year":"2020","unstructured":"Xiao Luo, Daqing Wu, Zeyu Ma, Chong Chen, Minghua Deng, Jinwen Ma, Zhongming Jin, Jianqiang Huang, and Xian-Sheng Hua. 2020b. Cimon: Towards high-quality hash codes. arXiv preprint arXiv:2010.07804 (2020)."},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2019.2892703"},{"key":"e_1_2_2_34_1","volume-title":"5th Berkeley Symp. Math. Statist. Probability. 281--297","author":"MacQueen J","year":"1967","unstructured":"J MacQueen. 1967. Classification and analysis of multivariate observations. In 5th Berkeley Symp. Math. Statist. Probability. 281--297."},{"key":"e_1_2_2_35_1","unstructured":"Zexuan Qiu Qinliang Su Zijing Ou Jianxing Yu and Changyou Chen. 2020. Unsupervised Hashing with Contrastive Information Bottleneck. In IJCAI."},{"key":"e_1_2_2_36_1","volume-title":"Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al.","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020 (2021)."},{"key":"e_1_2_2_37_1","doi-asserted-by":"crossref","unstructured":"Olga Russakovsky Jia Deng Hao Su Jonathan Krause Sanjeev Satheesh Sean Ma Zhiheng Huang Andrej Karpathy Aditya Khosla Michael Bernstein et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision Vol. 115 3 (2015) 211--252.","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298598"},{"key":"e_1_2_2_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00289"},{"key":"e_1_2_2_40_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_2_2_41_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v32i1.11276"},{"key":"e_1_2_2_42_1","volume-title":"Proceedings of the 32nd International Conference on Neural Information Processing Systems. 806--815","author":"Su Shupeng","year":"2018","unstructured":"Shupeng Su, Chao Zhang, Kai Han, and Yonghong Tian. 2018. Greedy hash: Towards fast optimization for accurate hash coding in cnn. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 806--815."},{"key":"e_1_2_2_43_1","volume-title":"Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530","author":"Su Weijie","year":"2019","unstructured":"Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019)."},{"key":"e_1_2_2_44_1","volume-title":"Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490","author":"Tan Hao","year":"2019","unstructured":"Hao Tan and Mohit Bansal. 2019. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490 (2019)."},{"key":"e_1_2_2_45_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.66"},{"key":"e_1_2_2_46_1","doi-asserted-by":"crossref","unstructured":"Rong-Cheng Tu Xianling Mao and Wei Wei. 2020a. MLS3RDUH: Deep Unsupervised Hashing via Manifold based Local Semantic Similarity Structure Reconstructing.. In IJCAI. 3466--3472.","DOI":"10.24963\/ijcai.2020\/479"},{"key":"e_1_2_2_47_1","doi-asserted-by":"crossref","unstructured":"Rong-Cheng Tu Xian-Ling Mao Bo-Si Feng and Yu Shu-Ying. 2018. Object detection based deep unsupervised hashing. In IJCAI. 3606--3612.","DOI":"10.24963\/ijcai.2019\/500"},{"key":"e_1_2_2_48_1","doi-asserted-by":"publisher","DOI":"10.1145\/3442381.3449825"},{"key":"e_1_2_2_49_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475498"},{"key":"e_1_2_2_50_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2020.2987312"},{"key":"e_1_2_2_51_1","unstructured":"Rong-Cheng Tu Xian-Ling Mao Rong-Xin Tu Binbin Bian Chengfei Cai Wei Wei Heyan Huang et al. 2022. Deep cross-modal proxy hashing. IEEE Transactions on Knowledge and Data Engineering (2022)."},{"key":"e_1_2_2_52_1","article-title":"Visualizing data using t-SNE","volume":"9","author":"der Maaten Laurens Van","year":"2008","unstructured":"Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research, Vol. 9, 11 (2008).","journal-title":"Journal of machine learning research"},{"key":"e_1_2_2_53_1","volume-title":"Supervised deep hashing for hierarchical labeled data. arXiv preprint arXiv:1704.02088","author":"Wang Dan","year":"2017","unstructured":"Dan Wang, Heyan Huang, Chi Lu, Bo-Si Feng, Liqiang Nie, Guihua Wen, and Xian-Ling Mao. 2017. Supervised deep hashing for hierarchical labeled data. arXiv preprint arXiv:1704.02088 (2017)."},{"key":"e_1_2_2_54_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2015.2487976"},{"key":"e_1_2_2_55_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.377"},{"key":"e_1_2_2_56_1","unstructured":"Yair Weiss Antonio Torralba and Rob Fergus. 2009. Spectral hashing. In Advances in neural information processing systems. 1753--1760."},{"key":"e_1_2_2_57_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2018.2793863"},{"key":"e_1_2_2_58_1","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2018\/148"},{"key":"e_1_2_2_59_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v31i1.10719"},{"key":"e_1_2_2_60_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00306"},{"key":"e_1_2_2_61_1","volume-title":"International conference on machine learning. 946--954","author":"Yu Felix","year":"2014","unstructured":"Felix Yu, Sanjiv Kumar, Yunchao Gong, and Shih-Fu Chang. 2014. Circulant binary embedding. In International conference on machine learning. 946--954."},{"key":"e_1_2_2_62_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR42600.2020.00315"},{"key":"e_1_2_2_63_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jvcir.2016.08.016"},{"key":"e_1_2_2_64_1","doi-asserted-by":"publisher","DOI":"10.1145\/2911451.2911502"},{"key":"e_1_2_2_65_1","volume-title":"Chen Change Loy, and Ziwei Liu","author":"Zhou Kaiyang","year":"2021","unstructured":"Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2021. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134 (2021)."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3588683","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3588683","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:47:13Z","timestamp":1750178833000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3588683"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,26]]},"references-count":65,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2023,5,26]]}},"alternative-id":["10.1145\/3588683"],"URL":"https:\/\/doi.org\/10.1145\/3588683","relation":{},"ISSN":["2836-6573"],"issn-type":[{"type":"electronic","value":"2836-6573"}],"subject":[],"published":{"date-parts":[[2023,5,26]]}}}