{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T16:08:34Z","timestamp":1772208514016,"version":"3.50.1"},"reference-count":40,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2021,5,31]],"date-time":"2021-05-31T00:00:00Z","timestamp":1622419200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,5,31]],"date-time":"2021-05-31T00:00:00Z","timestamp":1622419200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61802116"],"award-info":[{"award-number":["61802116"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62072157"],"award-info":[{"award-number":["62072157"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Training Plan of Young Backbone Teachers in Universities of Henan Province","award":["2020GGJS263"],"award-info":[{"award-number":["2020GGJS263"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Data Sci. Eng."],"published-print":{"date-parts":[[2021,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Cross-modal similarity query has become a highlighted research topic for managing multimodal datasets such as images and texts. Existing researches generally focus on query accuracy by designing complex deep neural network models and hardly consider query efficiency and interpretability simultaneously, which are vital properties of cross-modal semantic query processing system on large-scale datasets. In this work, we investigate multi-grained common semantic embedding representations of images and texts and integrate interpretable query index into the deep neural network by developing a novel Multi-grained Cross-modal Query with Interpretability (MCQI) framework. The main contributions are as follows: (1) By integrating coarse-grained and fine-grained semantic learning models, a multi-grained cross-modal query processing architecture is proposed to ensure the adaptability and generality of query processing. (2) In order to capture the latent semantic relation between images and texts, the framework combines LSTM and attention mode, which enhances query accuracy for the cross-modal query and constructs the foundation for interpretable query processing. (3) Index structure and corresponding nearest neighbor query algorithm are proposed to boost the efficiency of interpretable queries. (4) A distributed query algorithm is proposed to improve the scalability of our framework. Comparing with state-of-the-art methods on widely used cross-modal datasets, the experimental results show the effectiveness of our MCQI approach.<\/jats:p>","DOI":"10.1007\/s41019-021-00162-4","type":"journal-article","created":{"date-parts":[[2021,5,31]],"date-time":"2021-05-31T17:04:52Z","timestamp":1622480692000},"page":"280-293","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["Scalable Multi-grained Cross-modal Similarity Query with Interpretability"],"prefix":"10.1007","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-1019-6997","authenticated-orcid":false,"given":"Mingdong","family":"Zhu","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Derong","family":"Shen","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lixin","family":"Xu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Xianfang","family":"Wang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2021,5,31]]},"reference":[{"issue":"9","key":"162_CR1","doi-asserted-by":"publisher","first-page":"2372","DOI":"10.1109\/TCSVT.2017.2705068","volume":"28","author":"Y Peng","year":"2018","unstructured":"Peng Y, Huang X, Zhao Y (2018) An over view of cross-media retrieval: Concepts, methodologies, benchmarks and challenges. IEEE Trans Circuits Syst Video Technol 28(9):2372\u20132385","journal-title":"IEEE Trans Circuits Syst Video Technol"},{"key":"162_CR2","doi-asserted-by":"crossref","unstructured":"He X, Peng Y, Xi L (2019) A new benchmark and approach for fine-grained cross-media retrieval. In: 27th ACM international conference on multimedia, ACM. pp 1740\u20131748","DOI":"10.1145\/3343031.3350974"},{"key":"162_CR3","doi-asserted-by":"crossref","unstructured":"Rasiwasia N, Pereira J, Coviello E et al (2010) A new approach to cross-modal multimedia retrieval. In: 18th international conference on multimedia, ACM. pp 251\u2013260","DOI":"10.1145\/1873951.1873987"},{"issue":"6","key":"162_CR4","doi-asserted-by":"publisher","first-page":"965","DOI":"10.1109\/TCSVT.2013.2276704","volume":"24","author":"X Zhai","year":"2014","unstructured":"Zhai X, Peng Y, Xiao J (2014) Learning cross-media joint representation with sparse and semisupervised regularization. IEEE Trans Circuits Syst Video Technol 24(6):965\u2013978","journal-title":"IEEE Trans Circuits Syst Video Technol"},{"issue":"3","key":"162_CR5","doi-asserted-by":"publisher","first-page":"583","DOI":"10.1109\/TCSVT.2015.2400779","volume":"26","author":"Y Peng","year":"2016","unstructured":"Peng Y, Zhai X, Zhao Y, Huang X (2016) Semi-supervised cross-media feature learning with unified patch graph regularization. IEEE Trans Circuits Syst Video Technol 26(3):583\u2013596","journal-title":"IEEE Trans Circuits Syst Video Technol"},{"key":"162_CR6","doi-asserted-by":"crossref","unstructured":"Yan F, Mikolajczyk K (2015) Deep correlation for matching images and text. In: IEEE conference on computer vision and pattern recognition, IEEE. pp 3441\u20133450","DOI":"10.1109\/CVPR.2015.7298966"},{"key":"162_CR7","doi-asserted-by":"crossref","unstructured":"He L, Xu X, Lu H et al (2017) Unsupervised cross-modal retrieval through adversarial learning. In: IEEE international conference on multimedia and expo, IEEE. pp 1153\u20131158","DOI":"10.1109\/ICME.2017.8019549"},{"issue":"4","key":"162_CR8","doi-asserted-by":"publisher","first-page":"1173","DOI":"10.1109\/TCSVT.2019.2900171","volume":"30","author":"J Chi","year":"2020","unstructured":"Chi J, Peng Y (2020) Zero-shot cross-media embedding learning with dual adversarial distribution network. IEEE Trans Circuits Syst Video Technol 30(4):1173\u20131187","journal-title":"IEEE Trans Circuits Syst Video Technol"},{"key":"162_CR9","unstructured":"Andrej K, Armand J, Li F (2014) Deep fragment embeddings for bidirectional image sentence mapping. In: 27th international conference on neural information processing systems, ACM. pp 1889\u20131897"},{"issue":"4","key":"162_CR10","doi-asserted-by":"publisher","first-page":"664","DOI":"10.1109\/TPAMI.2016.2598339","volume":"39","author":"K Andrej","year":"2017","unstructured":"Andrej K, Li F (2017) Deep Visual-Semantic Alignments for Generating Image Descriptions. IEEE Trans Pattern Anal Mach Intell 39(4):664\u2013676","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"key":"162_CR11","unstructured":"Xu K, Ba J, Kiros R et al (2015) Show, attend and tell: neural image caption generation with visual attention. In: 2015 international conference on machine learning, IEEE. pp 2048\u20132057"},{"key":"162_CR12","doi-asserted-by":"crossref","unstructured":"Wang X, Wang Y, Wan W (2018) Watch, listen and describe: globally and locally aligned cross-modal attentions for video captioning. In: Proceedings of 2018 conference of the North American chapter of the association for computational linguistics, ACL. pp 795\u2013801","DOI":"10.18653\/v1\/N18-2125"},{"key":"162_CR13","doi-asserted-by":"crossref","unstructured":"Jiang Q, Li W (2017) Deep cross-modal hashing. In: 2017 IEEE conference on computer vision and pattern recognition, IEEE. pp 3270\u20133278","DOI":"10.1109\/CVPR.2017.348"},{"key":"162_CR14","doi-asserted-by":"crossref","unstructured":"Cao Y, Long M, Wang J et al (2016) Correlation autoencoder hashing for supervised cross-modal search. In: international conference on multimedia retrieval, ACM. pp 197\u2013204","DOI":"10.1145\/2911996.2912000"},{"key":"162_CR15","doi-asserted-by":"crossref","unstructured":"Cao Y, Long M, Wang J (2017) Correlation hashing network for efficient cross-modal retrieval. In: 28th British machine vision conference, BMVA. pp 1\u201312","DOI":"10.5244\/C.31.128"},{"key":"162_CR16","doi-asserted-by":"crossref","unstructured":"Yang E, Deng C, Liu W et al (2017) Pairwise relationship guided deep hashing for cross-modal retrieval. In: 31st conference on artificial intelligence, AAAI. pp 1618\u20131625","DOI":"10.1609\/aaai.v31i1.10719"},{"key":"162_CR17","doi-asserted-by":"crossref","unstructured":"Zhang J, Peng Y, Yuan M et al (2018) Unsupervised generative adversarial cross-modal hashing. In 32nd conference on artificial intelligence, AAAI. pp 539\u2013546","DOI":"10.1609\/aaai.v32i1.11263"},{"issue":"4","key":"162_CR18","first-page":"1","volume":"4","author":"K Yang","year":"2019","unstructured":"Yang K, Ding X, Zhang Y et al (2019) Distributed similarity queries in metric spaces. Data Science and Engineering 4(4):1\u201316","journal-title":"Data Science and Engineering"},{"key":"162_CR19","doi-asserted-by":"crossref","unstructured":"Batko M (2004) Distributed and scalable similarity searching in metric spaces. In: 9th EDBT, ACM. pp 44\u2013153","DOI":"10.1007\/978-3-540-30192-9_5"},{"issue":"4","key":"162_CR20","doi-asserted-by":"publisher","first-page":"721","DOI":"10.1016\/j.is.2010.10.002","volume":"36","author":"D Novak","year":"2011","unstructured":"Novak D, Batko M (2011) Zezula P, Metric index: An efficient and scalable solution for precise and approximate similarity search. Inf Syst 36(4):721\u2013733","journal-title":"Inf Syst"},{"key":"162_CR21","doi-asserted-by":"crossref","unstructured":"Wang J, Wu S, Gao H et al (2010) Indexing multi-dimensional data in a cloud system. In: SIGMOD, ACM. pp 591\u2013602","DOI":"10.1145\/1807167.1807232"},{"key":"162_CR22","doi-asserted-by":"crossref","unstructured":"Wu S, Jiang D, Ooi B, Wu K (2010) Efficient B-tree based indexing for cloud data processing. In: 36th VLDB, ACM. pp 1207\u20131218","DOI":"10.14778\/1920841.1920991"},{"issue":"2","key":"162_CR23","doi-asserted-by":"publisher","first-page":"165","DOI":"10.1007\/s00778-005-0001-y","volume":"16","author":"E Tanin","year":"2007","unstructured":"Tanin E, Harwood A, Samet H (2007) Using a distributed quadtree index in peer-to-peer networks. VLDB J 16(2):165\u2013178","journal-title":"VLDB J"},{"key":"162_CR24","doi-asserted-by":"crossref","unstructured":"Bennanismires K, Musat C, Hossmann A et al (2018) Simple Unsupervised Keyphrase Extraction using Sentence Embeddings. In: conference on computational natural language learning, ACL. pp 221\u2013229","DOI":"10.18653\/v1\/K18-1022"},{"key":"162_CR25","doi-asserted-by":"crossref","unstructured":"Shen Y, He X, Gao, J et al (2014) A latent semantic model with convolutional-pooling structure for information retrieval. In: conference on information and knowledge management, ACM. pp 101\u2013110","DOI":"10.1145\/2661829.2661935"},{"key":"162_CR26","doi-asserted-by":"crossref","unstructured":"Cheng B, Wei Y, Shi H et al (2018) Revisiting RCNN: On awakening the classification power of faster RCNN. In: European conference on computer vision, Springer. pp 473\u2013490","DOI":"10.1007\/978-3-030-01267-0_28"},{"key":"162_CR27","doi-asserted-by":"crossref","unstructured":"Cer D, Yang Y, Kong S et al (2018) Universal Sentence Encoder. arXiv: Computation and Language. https:\/\/arxiv.org\/abs\/1803.11175v2. Accessed 12 April 2018","DOI":"10.18653\/v1\/D18-2029"},{"key":"162_CR28","unstructured":"Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: 13th international conference on artificial intelligence and statistics, JMLR. pp 249\u2013256"},{"issue":"1","key":"162_CR29","first-page":"49","volume":"12","author":"M Zhu","year":"2018","unstructured":"Zhu M, Xu L, Shen D et al (2018) Methods for similarity query on uncertain data with cosine similarity constraints. Journal of Frontiers of Computer Science and Technology 12(1):49\u201364","journal-title":"Journal of Frontiers of Computer Science and Technology"},{"issue":"1","key":"162_CR30","doi-asserted-by":"publisher","first-page":"853","DOI":"10.1613\/jair.3994","volume":"47","author":"M Hodosh","year":"2013","unstructured":"Hodosh M, Young P, Hockenmaier J (2013) Framing image description as a ranking task: data, models and evaluation metrics. Journal of Artificial Intelligence Research 47(1):853\u2013899","journal-title":"Journal of Artificial Intelligence Research"},{"issue":"2","key":"162_CR31","doi-asserted-by":"publisher","first-page":"67","DOI":"10.1162\/tacl_a_00166","volume":"7","author":"P Young","year":"2014","unstructured":"Young P, Lai A, Hodosh M et al (2014) From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 7(2):67\u201378","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"162_CR32","doi-asserted-by":"crossref","unstructured":"Chua T, Tan J, Hong R et al (2009) NUS-WIDE: a real-world web image database from national university of Singapore. In: 8th conference on image and video retrieval, ACM. pp 1\u20139","DOI":"10.1145\/1646396.1646452"},{"key":"162_CR33","doi-asserted-by":"crossref","unstructured":"Lin T, Maire M, Belongie S (2014) Microsoft coco: Common objects in context. In: 13th European conference on Computer Vision (ECCV), Springer. pp 740\u2013755","DOI":"10.1007\/978-3-319-10602-1_48"},{"issue":"2","key":"162_CR34","doi-asserted-by":"publisher","first-page":"405","DOI":"10.1109\/TMM.2017.2742704","volume":"20","author":"Y Peng","year":"2018","unstructured":"Peng Y, Qi J, Huang X et al (2018) CCL: Cross-modal correlation learning with multigrained fusion by hierarchical network. IEEE Trans Multimedia 20(2):405\u2013420","journal-title":"IEEE Trans Multimedia"},{"key":"162_CR35","doi-asserted-by":"crossref","unstructured":"Chen T, Wu W, Gao Y et al (2018) Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding. In: 26th ACM multimedia, ACM. pp 2023\u20132031","DOI":"10.1145\/3240508.3240523"},{"key":"162_CR36","doi-asserted-by":"crossref","unstructured":"Lee K, Chen X, Hua G et al (2018) Stacked cross attention for image-text matching. In: European conference on computer vision, Springer. pp 212\u2013228","DOI":"10.1007\/978-3-030-01225-0_13"},{"issue":"3","key":"162_CR37","doi-asserted-by":"publisher","first-page":"370","DOI":"10.1109\/TMM.2015.2390499","volume":"17","author":"C Kang","year":"2015","unstructured":"Kang C, Xiang S, Liao S et al (2015) Learning Consistent Feature Representation for Cross-Modal Multimedia Retrieval. IEEE Trans Multimedia 17(3):370\u2013381","journal-title":"IEEE Trans Multimedia"},{"issue":"12","key":"162_CR38","doi-asserted-by":"publisher","first-page":"2639","DOI":"10.1162\/0899766042321814","volume":"16","author":"D Hardoon","year":"2004","unstructured":"Hardoon D, Szedmak S, Shawetaylor J et al (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Comput 16(12):2639\u20132664","journal-title":"Neural Comput"},{"key":"162_CR39","doi-asserted-by":"crossref","unstructured":"Akdogan A, Demiryurek U, Kashani FB et al (2010) Voronoi-based geospatial query processing with mapreduce. In: 2nd international conference of cloud Computing(CloudCom), IEEE. pp 9\u201316","DOI":"10.1109\/CloudCom.2010.92"},{"key":"162_CR40","unstructured":"Abadi M, Barham, P, Chen J et al (2016) TensorFlow: A system for large-scale machine learning. In: 12th USENIX conference on operating systems design and implementation, ACM. pp 265\u2013283"}],"container-title":["Data Science and Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41019-021-00162-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s41019-021-00162-4\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41019-021-00162-4.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,29]],"date-time":"2022-12-29T05:33:59Z","timestamp":1672292039000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s41019-021-00162-4"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,5,31]]},"references-count":40,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2021,9]]}},"alternative-id":["162"],"URL":"https:\/\/doi.org\/10.1007\/s41019-021-00162-4","relation":{},"ISSN":["2364-1185","2364-1541"],"issn-type":[{"value":"2364-1185","type":"print"},{"value":"2364-1541","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,5,31]]},"assertion":[{"value":"19 January 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 April 2021","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 April 2021","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"31 May 2021","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Avoid reviewers from Henan Institute of Technology and Northeastern University.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}},{"value":"All authors consent to participate in this work.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent to participate"}},{"value":"All authors consent to publish the paper.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}}]}}