{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,11]],"date-time":"2025-12-11T20:52:46Z","timestamp":1765486366391,"version":"build-2065373602"},"reference-count":55,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2019,12,25]],"date-time":"2019-12-25T00:00:00Z","timestamp":1577232000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Remote Sensing"],"abstract":"<jats:p>With the rapid progress of remote sensing (RS) observation technologies, cross-modal RS image-sound retrieval has attracted some attention in recent years. However, these methods perform cross-modal image-sound retrieval by leveraging high-dimensional real-valued features, which can require more storage than low-dimensional binary features (i.e., hash codes). Moreover, these methods cannot directly encode relative semantic similarity relationships. To tackle these issues, we propose a new, deep, cross-modal RS image-sound hashing approach, called deep triplet-based hashing (DTBH), to integrate hash code learning and relative semantic similarity relationship learning into an end-to-end network. Specifically, the proposed DTBH method designs a triplet selection strategy to select effective triplets. Moreover, in order to encode relative semantic similarity relationships, we propose an objective function that makes sure that the anchor images are more similar to the positive sounds than to the negative sounds. In addition, a triplet regularized loss term leverages an approximate l1-norm of hash-like codes and hash codes and can effectively reduce the information loss between hash-like codes and hash codes. Extensive experimental results showed that the DTBH method could achieve a superior performance to other state-of-the-art cross-modal image-sound retrieval methods. 
For the sound query RS image task, the proposed approach achieved a mean average precision (mAP) of up to 60.13% on the UCM dataset, 87.49% on the Sydney dataset, and 22.72% on the RSICD dataset. For the RS image query sound task, the proposed approach achieved an mAP of 64.27% on the UCM dataset, 92.45% on the Sydney dataset, and 23.46% on the RSICD dataset. Future work will focus on how to consider the balance property of hash codes to improve image-sound retrieval performance.<\/jats:p>","DOI":"10.3390\/rs12010084","type":"journal-article","created":{"date-parts":[[2019,12,25]],"date-time":"2019-12-25T11:07:48Z","timestamp":1577272068000},"page":"84","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":41,"title":["A Deep Hashing Technique for Remote Sensing Image-Sound Retrieval"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2903-6723","authenticated-orcid":false,"given":"Yaxiong","family":"Chen","sequence":"first","affiliation":[{"name":"Key Laboratory of Spectral Imaging Technology CAS, Xi\u2019an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi\u2019an 710119, China"},{"name":"University of Chinese Academy of Sciences, Beijing 100049, China"}]},{"given":"Xiaoqiang","family":"Lu","sequence":"additional","affiliation":[{"name":"Key Laboratory of Spectral Imaging Technology CAS, Xi\u2019an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi\u2019an 710119, China"}]}],"member":"1968","published-online":{"date-parts":[[2019,12,25]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"47","DOI":"10.1016\/j.future.2014.10.029","article-title":"Remote Sensing Big Data Computing: Challenges and Opportunities","volume":"51","author":"Ma","year":"2015","journal-title":"Future Gener. Comput. Syst."},{"doi-asserted-by":"crossref","unstructured":"Mandal, D., Chaudhury, K.N., and Biswas, S. 
(2017, January 21\u201326). Generalized semantic preserving hashing for n-label cross-modal retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","key":"ref_2","DOI":"10.1109\/CVPR.2017.282"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1603","DOI":"10.1109\/TGRS.2010.2088404","article-title":"Entropy-Balanced Bitmap Tree for Shape-Based Object Retrieval From Large-Scale Satellite Imagery Databases","volume":"49","author":"Scott","year":"2011","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"892","DOI":"10.1109\/TGRS.2015.2469138","article-title":"Hashing-Based Scalable Remote Sensing Image Search and Retrieval in Large Archives","volume":"54","author":"Demir","year":"2016","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"464","DOI":"10.1109\/LGRS.2017.2651056","article-title":"Partial Randomness Hashing for Large-Scale Remote Sensing Image Retrieval","volume":"14","author":"Peng","year":"2017","journal-title":"IEEE Geosci. Remote Sens. Lett."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"950","DOI":"10.1109\/TGRS.2017.2756911","article-title":"Large-Scale Remote Sensing Image Retrieval by Deep Hashing Neural Networks","volume":"54","author":"Li","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"818","DOI":"10.1109\/TGRS.2012.2205158","article-title":"Geographic image retrieval using local invariant features","volume":"51","author":"Yang","year":"2012","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"3023","DOI":"10.1109\/TGRS.2013.2268736","article-title":"Remote sensing image retrieval with global morphological texture descriptors","volume":"52","author":"Aptoula","year":"2013","journal-title":"IEEE Trans. Geosci. 
Remote Sens."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"1465","DOI":"10.1109\/TIP.2008.925367","article-title":"Indexing of satellite images with different resolutions by wavelet features","volume":"17","author":"Luo","year":"2008","journal-title":"IEEE Trans. Image Process."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"248","DOI":"10.1109\/TGRS.2016.2604680","article-title":"Structure tensor Riemannian statistical models for CBIR and classification of remote sensing images","volume":"55","author":"Rosu","year":"2016","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"531","DOI":"10.14358\/PERS.72.5.531","article-title":"Automated Feature Generation in Large-Scale Geospatial Libraries for Content-Based Indexing","volume":"72","author":"Tobin","year":"2006","journal-title":"Photogramm. Eng. Remote Sens."},{"key":"ref_12","first-page":"2564","article-title":"GeoIRIS: Geospatial Information Retrieval and Indexing System-Content Mining, Semantics Modeling, and Complex Queries","volume":"102","author":"Shyu","year":"2013","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"doi-asserted-by":"crossref","unstructured":"Ye, D., Li, Y., Tao, C., Xie, X., and Wang, X. (2017). Multiple feature hashing learning for large-scale remote sensing image retrieval. ISPRS Int. J. Geo-Inf., 6.","key":"ref_13","DOI":"10.3390\/ijgi6110364"},{"unstructured":"Guo, M., Yuan, Y., and Lu, X. (2018, January 19\u201320). Deep Cross-Modal Retrieval for Remote Sensing Image and Audio. Proceedings of the 2018 10th IAPR Workshop on Pattern Recognition in Remote Sensing (PRRS), Beijing, China.","key":"ref_14"},{"doi-asserted-by":"crossref","unstructured":"Jiang, Q.Y., and Li, W.J. (2017, January 21\u201326). Deep Cross-Modal Hashing. 
Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA.","key":"ref_15","DOI":"10.1109\/CVPR.2017.348"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1825","DOI":"10.1109\/TPAMI.2016.2610969","article-title":"Linear subspace ranking hashing for cross-modal retrieval","volume":"39","author":"Li","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"3157","DOI":"10.1109\/TIP.2016.2564638","article-title":"Supervised matrix factorization hashing for cross-modal retrieval","volume":"25","author":"Tang","year":"2016","journal-title":"IEEE Trans. Image Process."},{"doi-asserted-by":"crossref","unstructured":"Li, D., Dimitrova, N., Li, M., and Sethi, I.K. (2003, January 2\u20138). Multimedia content processing through cross-modal association. Proceedings of the Eleventh ACM International Conference on Multimedia, Berkeley, CA, USA.","key":"ref_18","DOI":"10.1145\/957013.957143"},{"doi-asserted-by":"crossref","unstructured":"Zhang, H., Zhuang, Y., and Wu, F. (2007, January 24\u201329). Cross-modal correlation learning for clustering on image-audio dataset. Proceedings of the 15th ACM international conference on Multimedia, Augsburg, Germany.","key":"ref_19","DOI":"10.1145\/1291233.1291290"},{"doi-asserted-by":"crossref","unstructured":"Song, Y., Morency, L.P., and Davis, R. (2012, January 22\u201326). Multimodal human behavior analysis: Learning correlation and interaction across modalities. 
Proceedings of the 14th ACM International Conference on Multimodal Interaction, New York, NY, USA.","key":"ref_20","DOI":"10.1145\/2388676.2388684"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"22081","DOI":"10.1109\/ACCESS.2017.2761539","article-title":"3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition","volume":"5","author":"Torfi","year":"2017","journal-title":"IEEE Access"},{"doi-asserted-by":"crossref","unstructured":"Arandjelovi, R., and Zisserman, A. (2017, January 21\u201326). Look, Listen and Learn. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","key":"ref_22","DOI":"10.1109\/ICCV.2017.73"},{"doi-asserted-by":"crossref","unstructured":"Nagrani, A., Albanie, S., and Zisserman, A. (2018, January 18\u201322). Seeing Voices and Hearing Faces: Cross-modal biometric matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","key":"ref_23","DOI":"10.1109\/CVPR.2018.00879"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"6521","DOI":"10.1109\/TGRS.2018.2839705","article-title":"Learning source-invariant deep hashing convolutional neural networks for cross-source remote sensing image retrieval","volume":"56","author":"Li","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"doi-asserted-by":"crossref","unstructured":"Salem, T., Zhai, M., Workman, S., and Jacobs, N. (2018, January 18\u201322). A multimodal approach to mapping soundscapes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","key":"ref_25","DOI":"10.1109\/IGARSS.2018.8517977"},{"doi-asserted-by":"crossref","unstructured":"Gu, J., Cai, J., Joty, S., Niu, L., and Wang, G. (2018, January 18\u201322). Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","key":"ref_26","DOI":"10.1109\/CVPR.2018.00750"},{"doi-asserted-by":"crossref","unstructured":"Lin, K., Yang, H.F., Hsiao, J.H., and Chen, C.S. (2015, January 7\u201312). Deep learning of binary hash codes for fast image retrieval. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","key":"ref_27","DOI":"10.1109\/CVPRW.2015.7301269"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"1824","DOI":"10.1109\/JSTARS.2017.2664119","article-title":"SAR Image Content Retrieval Based on Fuzzy Similarity and Relevance Feedback","volume":"10","author":"Xu","year":"2017","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"2979","DOI":"10.1109\/JSTARS.2018.2849408","article-title":"Preliminary Analysis of the Potential and Limitations of MICAP for the Retrieval of Sea Surface Salinity","volume":"11","author":"Zhang","year":"2018","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"3782","DOI":"10.1109\/JSTARS.2018.2861828","article-title":"High-Resolution Three-Dimensional Displacement Retrieval of Mining Areas From a Single SAR Amplitude Pair Using the SPIKE Algorithm","volume":"11","author":"Yang","year":"2018","journal-title":"IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens."},{"doi-asserted-by":"crossref","unstructured":"Lu, X., Chen, Y., and Li, X. (2019). Siamese Dilated Inception Hashing With Intra-Group Correlation Enhancement for Image Retrieval. IEEE Trans. Neural Netw. Learn. Syst.","key":"ref_31","DOI":"10.1109\/TNNLS.2019.2935118"},{"doi-asserted-by":"crossref","unstructured":"Chen, Y., Lu, X., and Feng, Y. (2019, January 8\u201311). Deep Voice-Visual Cross-Modal Retrieval with Deep Feature Similarity Learning. 
Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision, Xi\u2019an, China.","key":"ref_32","DOI":"10.1007\/978-3-030-31726-3_39"},{"key":"ref_33","first-page":"615","article-title":"Hybrid voice controller for intelligent wheelchair and rehabilitation robot using voice recognition and embedded technologies","volume":"20","author":"Ruzaij","year":"2016","journal-title":"J. Adv. Comput. Intell."},{"unstructured":"Harwath, D., and Glass, J.R. (August, January 30). Learning Word-Like Units from Joint Audio-Visual Analysis. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada.","key":"ref_34"},{"key":"ref_35","first-page":"6","article-title":"Spoken digits recognition using weighted MFCC and improved features for dynamic time warping","volume":"40","author":"Chapaneri","year":"2012","journal-title":"Int. J. Comput. Appl."},{"key":"ref_36","first-page":"444","article-title":"Heart Rate Monitoring using Human Speech Features Extraction: A Review","volume":"4","author":"Chahal","year":"2017","journal-title":"Heart"},{"key":"ref_37","first-page":"74","article-title":"Recent Survey on Feature Extraction Methods for Voice Pathology and Voice Disorder","volume":"6","author":"Selvakumari","year":"2017","journal-title":"Int. J. Comput. Math. Sci."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"132","DOI":"10.1016\/j.csl.2014.02.001","article-title":"Automatic intelligibility classification of sentence-level pathological speech","volume":"29","author":"Kim","year":"2015","journal-title":"Comput. Speech Lang."},{"unstructured":"Harwath, D.F., Torralba, A., and Glass, J.R. (2016, January 5\u201310). Unsupervised Learning of Spoken Language with Visual Context. 
Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain.","key":"ref_39"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"4766","DOI":"10.1109\/TIP.2015.2467315","article-title":"Bit-Scalable Deep Hashing With Regularized Similarity Learning for Image Retrieval and Person Re-Identification","volume":"24","author":"Zhang","year":"2015","journal-title":"IEEE Trans. Image Process."},{"unstructured":"Liu, H., Wang, R., Shan, S., and Chen, X. (July, January 26). Deep Supervised Hashing for Fast Image Retrieval. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","key":"ref_41"},{"doi-asserted-by":"crossref","unstructured":"Gong, Y., and Lazebnik, S. (2011, January 20\u201325). Iterative quantization: A procrustean approach to learning binary codes. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA.","key":"ref_42","DOI":"10.1109\/CVPR.2011.5995432"},{"doi-asserted-by":"crossref","unstructured":"Zhu, H., Long, M., Wang, J., and Cao, Y. (2016, January 12\u201317). Deep hashing network for efficient similarity retrieval. Proceedings of the AAAI, Phoenix, AZ, USA.","key":"ref_43","DOI":"10.1609\/aaai.v30i1.10235"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"106","DOI":"10.1109\/TIP.2017.2755766","article-title":"Hierarchical Recurrent Neural Hashing for Image Retrieval with Hierarchical Convolutional Features","volume":"27","author":"Lu","year":"2018","journal-title":"IEEE Trans. Image Process."},{"doi-asserted-by":"crossref","unstructured":"Hyv\u00e4rinen, A., Hurri, J., and Hoyer, P.O. (2009). Natural Image Statistics: A Probabilistic Approach to Early Computational Vision, Springer Science&Business Media.","key":"ref_45","DOI":"10.1007\/978-1-84882-491-1"},{"unstructured":"Kingma, D.P., and Ba, J. (2015, January 7\u20139). Adam: A Method for Stochastic Optimization. 
Proceedings of the International Conference for Learning Representations, San Diego, CA, USA.","key":"ref_46"},{"doi-asserted-by":"crossref","unstructured":"Qu, B., Li, X., Tao, D., and Lu, X. (2016, January 6\u20138). Deep semantic understanding of high resolution remote sensing image. Proceedings of the International Conference on Computer, Information and Telecommunication Systems, Kunming, China.","key":"ref_47","DOI":"10.1109\/CITS.2016.7546397"},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"2183","DOI":"10.1109\/TGRS.2017.2776321","article-title":"Exploring Models and Data for Remote Sensing Image Caption Generation","volume":"56","author":"Lu","year":"2017","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"2173","DOI":"10.1109\/TGRS.2018.2871830","article-title":"An Approach to Improve Leaf Pigment Content Retrieval by Removing Specular Reflectance Through Polarization Measurements","volume":"57","author":"Li","year":"2018","journal-title":"IEEE Trans. Geosci. Remote Sens."},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"202","DOI":"10.1016\/j.rse.2017.04.020","article-title":"Soil moisture retrieval from AMSR-E and ASCAT microwave observation synergy. Part 2: Product evaluation","volume":"195","author":"Kolassa","year":"2017","journal-title":"Remote Sens. Environ."},{"key":"ref_51","first-page":"406","article-title":"Retrieval of remote sensing images based on semisupervised deep learning","volume":"21","author":"Zhang","year":"2017","journal-title":"J. Remote Sens."},{"doi-asserted-by":"crossref","unstructured":"Imbriaco, R., Sebastian, C., Bondarev, E., and De With, P.H.N. (2019). Aggregated Deep Local Features for Remote Sensing Image Retrieval. 
Remote Sens., 11.","key":"ref_52","DOI":"10.3390\/rs11050493"},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"870","DOI":"10.1109\/TR.2018.2847247","article-title":"Finding bugs in cryptographic hash function implementations","volume":"67","author":"Mouha","year":"2018","journal-title":"IEEE Trans. Reliab."},{"doi-asserted-by":"crossref","unstructured":"Guo, M., Zhou, C., and Liu, J. (2019). Jointly Learning of Visual and Auditory: A New Approach for RS Image and Audio Cross-Modal Retrieval. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.","key":"ref_54","DOI":"10.1109\/JSTARS.2019.2949220"},{"unstructured":"Pan, P., Xu, Z., Yang, Y., Wu, F., and Zhuang, Y. (July, January 26). Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning. Proceedings of the International Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","key":"ref_55"}],"container-title":["Remote Sensing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2072-4292\/12\/1\/84\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T13:45:32Z","timestamp":1760190332000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2072-4292\/12\/1\/84"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,12,25]]},"references-count":55,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2020,1]]}},"alternative-id":["rs12010084"],"URL":"https:\/\/doi.org\/10.3390\/rs12010084","relation":{},"ISSN":["2072-4292"],"issn-type":[{"type":"electronic","value":"2072-4292"}],"subject":[],"published":{"date-parts":[[2019,12,25]]}}}