{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,9]],"date-time":"2026-04-09T08:04:45Z","timestamp":1775721885891,"version":"3.50.1"},"reference-count":27,"publisher":"MDPI AG","issue":"8","license":[{"start":{"date-parts":[[2021,7,26]],"date-time":"2021-07-26T00:00:00Z","timestamp":1627257600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Imaging"],"abstract":"<jats:p>Visual-semantic embedding (VSE) networks create joint image\u2013text representations to map images and texts in a shared embedding space to enable various information retrieval-related tasks, such as image\u2013text retrieval, image captioning, and visual question answering. The most recent state-of-the-art VSE-based networks are: VSE++, SCAN, VSRN, and UNITER. This study evaluates the performance of those VSE networks for the task of image-to-text retrieval and identifies and analyses their strengths and limitations to guide future research on the topic. The experimental results on Flickr30K revealed that the pre-trained network, UNITER, achieved 61.5% on average Recall@5 for the task of retrieving all relevant descriptions. The traditional networks, VSRN, SCAN, and VSE++, achieved 50.3%, 47.1%, and 29.4% on average Recall@5, respectively, for the same task. An additional analysis was performed on image\u2013text pairs from the top 25 worst-performing classes using a subset of the Flickr30K-based dataset to identify the limitations of the performance of the best-performing models, VSRN and UNITER. These limitations are discussed from the perspective of image scenes, image objects, image semantics, and basic functions of neural networks. This paper discusses the strengths and limitations of VSE networks to guide further research into the topic of using VSE networks for cross-modal information retrieval tasks.<\/jats:p>","DOI":"10.3390\/jimaging7080125","type":"journal-article","created":{"date-parts":[[2021,7,26]],"date-time":"2021-07-26T04:19:30Z","timestamp":1627273170000},"page":"125","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval"],"prefix":"10.3390","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0003-2853-2108","authenticated-orcid":false,"given":"Yan","family":"Gong","sequence":"first","affiliation":[{"name":"Department of Computer Science, School of Science, Loughborough University, Loughborough LE11 3TT, UK"}]},{"given":"Georgina","family":"Cosma","sequence":"additional","affiliation":[{"name":"Department of Computer Science, School of Science, Loughborough University, Loughborough LE11 3TT, UK"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9365-7420","authenticated-orcid":false,"given":"Hui","family":"Fang","sequence":"additional","affiliation":[{"name":"Department of Computer Science, School of Science, Loughborough University, Loughborough LE11 3TT, UK"}]}],"member":"1968","published-online":{"date-parts":[[2021,7,26]]},"reference":[{"key":"ref_1","unstructured":"Faghri, F., Fleet, D.J., Kiros, J.R., and Fidler, S. (2018, January 3\u20136). VSE++: Improving visual-semantic embeddings with hard negatives. 
Proceedings of the British Machine Vision Conference, Newcastle, UK."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Lee, K.H., Chen, X., Hua, G., Hu, H., and He, X. (2018, January 8\u201314). Stacked cross attention for image-text matching. Proceedings of the European Conference on Computer Vision, Munich, Germany.","DOI":"10.1007\/978-3-030-01225-0_13"},{"key":"ref_3","unstructured":"Li, K., Zhang, Y., Li, K., Li, Y., and Fu, Y. (November, January 27). Visual semantic reasoning for image-text matching. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. (2020, January 23\u201328). UNITER: Universal image-text representation learning. Proceedings of the European Conference on Computer Vision, Glasgow, UK.","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015, January 7\u201312). Show and tell: A neural image caption generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"ref_6","unstructured":"Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015, January 6\u201311). Show, attend and tell: Neural image caption generation with visual attention. Proceedings of the International Conference on Machine Learning, PMLR, Lille, France."},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Karpathy, A., and Fei-Fei, L. (2015, January 7\u201312). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"291","DOI":"10.1016\/j.neucom.2018.05.080","article-title":"A survey on automatic image caption generation","volume":"311","author":"Bai","year":"2018","journal-title":"Neurocomputing"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., and Parikh, D. (2015, January 7\u201313). VQA: Visual question answering. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.279"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. (2017, January 21\u201326). Making the V in VQA matter: Elevating the role of image understanding in visual question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.670"},{"key":"ref_11","unstructured":"Kipf, T.N., and Welling, M. (2017, January 24\u201326). Semi-supervised classification with graph convolutional networks. Proceedings of the 5th International Conference on Learning Representations, Conference Track Proceedings (2017), Toulon, France."},{"key":"ref_12","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of deep bidirectional transformers for language understanding. 
Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics (Volume 1: Long and Short Papers), Minneapolis, MN, USA."},{"key":"ref_13","unstructured":"Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Doll\u00e1r, P., and Zitnick, C.L. Microsoft COCO: Common objects in context. Proceedings of the European Conference on Computer Vision."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1162\/tacl_a_00166","article-title":"From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions","volume":"2","author":"Young","year":"2014","journal-title":"Trans. Assoc. Comput. Linguist."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Wang, Y., Yang, H., Bai, X., Qian, X., Ma, L., Lu, J., Li, B., and Fan, X. (2020). PFAN++: Bi-Directional Image-Text retrieval with position focused attention network. IEEE Trans. Multimed.","DOI":"10.1109\/TMM.2020.3024822"},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"1137","DOI":"10.1109\/TPAMI.2016.2577031","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"39","author":"Ren","year":"2016","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016, January 27\u201330). You Only Look Once: Unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.91"},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Lin, T.Y., Goyal, P., Girshick, R., He, K., and Doll\u00e1r, P. (2017, January 22\u201329). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.324"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Cho, K., van Merrienboer, B., Gulcehre, C., Bougares, F., Schwenk, H., and Bengio, Y. (2014, January 25\u201329). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar.","DOI":"10.3115\/v1\/D14-1179"},{"key":"ref_20","unstructured":"Simonyan, K., and Zisserman, A. (2015, January 7\u20139). Very deep convolutional networks for large-scale image recognition. Proceedings of the 3rd International Conference on Learning Representations, Conference Track Proceedings (2015), San Diego, CA, USA."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"He, K., Zhang, X., Ren, S., and Sun, J. (2016, January 27\u201330). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.90"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","article-title":"Visual genome: Connecting language and vision using crowdsourced dense image annotations","volume":"123","author":"Krishna","year":"2017","journal-title":"Int. J. Comput. Vis."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Sharma, P., Ding, N., Goodman, S., and Soricut, R. (2018, January 15\u201320). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. 
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia.","DOI":"10.18653\/v1\/P18-1238"},{"key":"ref_24","first-page":"1143","article-title":"Im2text: Describing images using 1 million captioned photographs","volume":"24","author":"Ordonez","year":"2011","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Manning, C.D., Raghavan, P., and Sch\u00fctze, H. (2008). Introduction to Information Retrieval, Cambridge University Press.","DOI":"10.1017\/CBO9780511809071"},{"key":"ref_26","first-page":"35","article-title":"Evaluation of information retrieval systems","volume":"4","author":"Zuva","year":"2012","journal-title":"Int. J. Comput. Sci. Inf. Technol."},{"key":"ref_27","first-page":"1097","article-title":"Imagenet classification with deep convolutional neural networks","volume":"25","author":"Krizhevsky","year":"2012","journal-title":"Adv. Neural Inf. Process. Syst."}],"container-title":["Journal of Imaging"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2313-433X\/7\/8\/125\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:34:50Z","timestamp":1760164490000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2313-433X\/7\/8\/125"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,26]]},"references-count":27,"journal-issue":{"issue":"8","published-online":{"date-parts":[[2021,8]]}},"alternative-id":["jimaging7080125"],"URL":"https:\/\/doi.org\/10.3390\/jimaging7080125","relation":{},"ISSN":["2313-433X"],"issn-type":[{"value":"2313-433X","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,7,26]]}}}
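
This record follows the Crossref REST API "work" schema, and the same JSON can be fetched live from https://api.crossref.org/works/10.3390/jimaging7080125. Below is a minimal sketch of retrieving and walking the fields shown above; the endpoint and field names come from the record itself, while the use of the `requests` library, the timeout value, and the printing logic are illustrative choices (note that mutable fields such as `is-referenced-by-count` change over time):

```python
import requests

# Crossref REST API endpoint for a single work, keyed by DOI.
DOI = "10.3390/jimaging7080125"
URL = f"https://api.crossref.org/works/{DOI}"

resp = requests.get(URL, timeout=30)
resp.raise_for_status()
work = resp.json()["message"]  # the "message" object holds the work metadata

# Scalar fields present in the record above.
print(work["title"][0])                 # Crossref stores titles as a list
print(work["DOI"], work["publisher"])   # "10.3390/jimaging7080125", "MDPI AG"
print(work["is-referenced-by-count"])   # citation count as of indexing

# Authors: "given"/"family" appear for all three authors here; ORCID is optional.
for author in work.get("author", []):
    orcid = author.get("ORCID", "no ORCID")
    print(f'{author["given"]} {author["family"]} ({orcid})')

# References: some entries carry a "DOI" or "article-title",
# others only free-text "unstructured" citations.
for ref in work.get("reference", []):
    label = ref.get("DOI") or ref.get("article-title") or ref.get("unstructured", "")
    print(ref["key"], "->", label[:60])
```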
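
The abstract reports average Recall@5 scores for "retrieving all relevant descriptions" on Flickr30K, where each image has five ground-truth captions. As a point of reference only, here is a generic Recall@K sketch under that multi-ground-truth reading; the paper's exact evaluation protocol is defined in the article itself, and the `similarity` matrix, `relevant` sets, and function name below are hypothetical illustrations rather than the authors' code:

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, relevant: list[set[int]], k: int = 5) -> float:
    """Average Recall@K for image-to-text retrieval.

    similarity: (n_images, n_captions) score matrix from some VSE model.
    relevant:   relevant[i] is the set of caption indices describing image i.
    Returns the mean fraction of each image's relevant captions in its top-K.
    """
    recalls = []
    for i, scores in enumerate(similarity):
        top_k = set(np.argsort(-scores)[:k])     # indices of the K highest-scoring captions
        hits = len(top_k & relevant[i])          # relevant captions actually retrieved
        recalls.append(hits / len(relevant[i]))  # recall for this query image
    return float(np.mean(recalls))

# Toy example: 2 images, 10 captions, 5 relevant captions per image
# (mirroring Flickr30K's five captions per image).
rng = np.random.default_rng(0)
sim = rng.random((2, 10))
gt = [set(range(0, 5)), set(range(5, 10))]
print(f"Recall@5 = {recall_at_k(sim, gt, k=5):.3f}")
```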