{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T15:51:22Z","timestamp":1777477882442,"version":"3.51.4"},"reference-count":54,"publisher":"Cambridge University Press (CUP)","issue":"3","license":[{"start":{"date-parts":[[2018,3,28]],"date-time":"2018-03-28T00:00:00Z","timestamp":1522195200000},"content-version":"unspecified","delay-in-days":0,"URL":"https:\/\/www.cambridge.org\/core\/terms"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Nat. Lang. Eng."],"published-print":{"date-parts":[[2018,5]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>A growing body of recent work focuses on the challenging problem of scene understanding using a variety of cross-modal methods which fuse techniques from image and text processing. In this paper, we develop representations for the semantics of scenes by explicitly encoding the objects detected in them and their spatial relations. We represent image content via two well-known types of tree representations, namely constituents and dependencies. Our representations are created deterministically, can be applied to any image dataset irrespective of the task at hand, and are amenable to standard NLP tools developed for tree-based structures. We show that we can apply syntax-based SMT and tree kernel methods in order to build models for image description generation and image-based retrieval. 
Experimental results on real-world images demonstrate the effectiveness of the framework.<\/jats:p>","DOI":"10.1017\/s1351324918000104","type":"journal-article","created":{"date-parts":[[2018,3,28]],"date-time":"2018-03-28T10:17:50Z","timestamp":1522232270000},"page":"441-465","source":"Crossref","is-referenced-by-count":3,"title":["Understanding visual scenes"],"prefix":"10.1017","volume":"24","author":[{"given":"CARINA","family":"SILBERER","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"JASPER","family":"UIJLINGS","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"MIRELLA","family":"LAPATA","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"56","published-online":{"date-parts":[[2018,3,28]]},"reference":[{"key":"S1351324918000104_ref012","unstructured":"Elliott D. , Lavrenko V. and Keller F. 2014. Query-by-example image retrieval using visual dependency representations. In COLING 2014, 25th International Conference on Computational Linguistics, pp. 109\u201320."},{"key":"S1351324918000104_ref022","doi-asserted-by":"crossref","unstructured":"Koehn P. , Hoang H. , Birch A. , Callison-Burch C. , Federico M. , Bertoldi N. , Cowan B. , Shen W. , Moran C. , Zens R. , Dyer C. , Bojar O. , Constantin A. , and Herbst E. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pp. 177\u201380.","DOI":"10.3115\/1557769.1557821"},{"key":"S1351324918000104_ref021","doi-asserted-by":"publisher","DOI":"10.1145\/201019.201022"},{"key":"S1351324918000104_ref017","unstructured":"Huang T.-H. (Kenneth), Ferraro F. , Mostafazadeh N. , Misra I. , Agrawal A. , Devlin J. , Girshick R. , He X. , Kohli P. , Batra D. , Zitnick C. L. , Parikh D. , Vanderwende L. , Galley M. , and Mitchell M. 2016. Visual storytelling. 
In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: HLT, pp. 1233\u20139."},{"key":"S1351324918000104_ref016","unstructured":"Huang L. 2006. Statistical syntax-directed translation with extended domain of locality. In Proceedings of the Association for Machine Translation in the Americas, pp. 66\u201373."},{"key":"S1351324918000104_ref049","doi-asserted-by":"crossref","unstructured":"Yatskar M. , Ordonez V. and Farhadi A. 2016a. Stating the obvious: extracting visual common sense knowledge. In Proceedings of the 2016 Conference of the NAACL: Human Language Technologies, pp. 193\u20138.","DOI":"10.18653\/v1\/N16-1023"},{"key":"S1351324918000104_ref037","doi-asserted-by":"crossref","unstructured":"Philbin J. , Chum O. , Isard M. , Sivic J. , and Zisserman A. 2008. Lost in quantization: improving particular object retrieval in large scale image databases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.","DOI":"10.1109\/CVPR.2008.4587635"},{"key":"S1351324918000104_ref005","doi-asserted-by":"crossref","unstructured":"Coyne B. and Sproat R. 2001. WordsEye: an automatic text-to-scene conversion system. In SIGGRAPH '01: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques.","DOI":"10.1145\/383259.383316"},{"key":"S1351324918000104_ref028","doi-asserted-by":"crossref","unstructured":"Lin D. , Fidler S. , Kong C. and Urtasun R. 2015. Generating multi-sentence lingual descriptions of indoor scenes. In Proceedings of the British Machine Vision Conference.","DOI":"10.5244\/C.29.93"},{"key":"S1351324918000104_ref038","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0965-7"},{"key":"S1351324918000104_ref008","unstructured":"Devlin J. , Gupta S. , Girshick R. B. , Mitchell M. and Zitnick C. L. 2015. Exploring nearest neighbor approaches for image captioning. 
ArXiv e-prints, abs\/1505.04467."},{"key":"S1351324918000104_ref015","unstructured":"Heafield K. , Pouzyrevsky I. , Clark J. H. and Koehn P. 2013. Scalable modified Kneser\u2013Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pp. 690\u20136."},{"key":"S1351324918000104_ref011","unstructured":"Elliott D. and Keller F. 2013 (October). Image description using visual dependency representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1292\u20131302."},{"key":"S1351324918000104_ref010","doi-asserted-by":"crossref","unstructured":"Elliott D. and de Vries A. 2015. Describing images using inferred visual dependency representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 42\u201352.","DOI":"10.3115\/v1\/P15-1005"},{"key":"S1351324918000104_ref034","doi-asserted-by":"publisher","DOI":"10.1162\/0891201053630264"},{"key":"S1351324918000104_ref004","unstructured":"Collins M. and Duffy N. 2001. Convolution kernels for natural language. In Proceedings of the 14th International Conference on Advances in Neural Information Processing Systems: Natural and Synthetic, pp. 625\u201332."},{"key":"S1351324918000104_ref029","unstructured":"Mitchell M. , Han X. , Dodge J. , Mensch A. , Goyal A. , Berg A. , Yamaguchi K. , Berg T. , Stratos K. , and Daum\u00e9 H. III 2012. Midge: generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 747\u201356."},{"key":"S1351324918000104_ref019","doi-asserted-by":"crossref","unstructured":"Johnson J. , Krishna R. , Stark M. , Li L.-J. , Shamma D. A. , Bernstein M. S. , and Fei-Fei L. 2015. Image retrieval using scene graphs. 
In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 3668\u201378.","DOI":"10.1109\/CVPR.2015.7298990"},{"key":"S1351324918000104_ref054","first-page":"627","article-title":"Adopting abstract images for semantic scene understanding","volume":"38","author":"Zitnick","year":"2016","journal-title":"IEEE"},{"key":"S1351324918000104_ref006","doi-asserted-by":"crossref","unstructured":"Culotta A. and Sorensen J. 2004. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics.","DOI":"10.3115\/1218955.1219009"},{"key":"S1351324918000104_ref001","unstructured":"Aditya S. , Baral C. , Yang Y. , Aloimonos Y. , and Fermuller C. 2016. DeepIU: an Architecture for image understanding. In Proceedings of Advances in Cognitive Systems."},{"key":"S1351324918000104_ref002","doi-asserted-by":"crossref","unstructured":"Antol S. , Agrawal A. , Lu J. , Mitchell M. , Batra D. , Lawrence Zitnick C. , and Parikh D. 2015. VQA: visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).","DOI":"10.1109\/ICCV.2015.279"},{"key":"S1351324918000104_ref003","unstructured":"Chen X. , Fang H. , Lin T.-Y. , Vedantam R. , Gupta S. , Doll\u00e1r P. , and Zitnick C. L. 2015. Microsoft COCO captions: data collection and evaluation server. ArXiv e-prints, abs\/1504.00325v2."},{"key":"S1351324918000104_ref007","unstructured":"Deng Y. , Kanervisto A. , Ling J. and Rush A. M. 2017. Image-to-markup generation with coarse-to-fine attention. In Proceedings of the 34th International Conference on Machine Learning, pp. 980\u201389."},{"key":"S1351324918000104_ref009","volume-title":"Structured Representation of Images for Language Generation and Image Retrieval","author":"Elliott","year":"2015"},{"key":"S1351324918000104_ref036","doi-asserted-by":"crossref","unstructured":"Philbin J. , Chum O. , Isard M. , Sivic J. , and Zisserman A. 2007. 
Object retrieval with large vocabularies and fast spatial matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.","DOI":"10.1109\/CVPR.2007.383172"},{"key":"S1351324918000104_ref013","doi-asserted-by":"crossref","unstructured":"Girshick R. 2015. Fast R-CNN. In Proceedings of the International Conference on Computer Vision (ICCV), pp. 1440\u20138.","DOI":"10.1109\/ICCV.2015.169"},{"key":"S1351324918000104_ref014","unstructured":"Gupta S. and Malik J. 2015. Visual Semantic Role Labeling. ArXiv e-prints, abs\/1505.04474v1."},{"key":"S1351324918000104_ref018","doi-asserted-by":"crossref","unstructured":"J\u00e9gou H. , Douze M. and Schmid C. 2008. Hamming Embedding and Weak Geometry Consistency for Large Scale Image Search \u2013 Extended Version. Research Report 6709. Inria Grenoble, Rh\u00f4ne-Alpes, France.","DOI":"10.1007\/978-3-540-88682-2_24"},{"key":"S1351324918000104_ref032","doi-asserted-by":"publisher","DOI":"10.1162\/089120103321337421"},{"key":"S1351324918000104_ref023","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0981-7"},{"key":"S1351324918000104_ref024","doi-asserted-by":"crossref","unstructured":"Kruskal J. B. 1956. On the shortest spanning subtree of a graph and the traveling salesman problem. In Proceedings of the American Mathematical Society, 7.","DOI":"10.1090\/S0002-9939-1956-0078686-7"},{"key":"S1351324918000104_ref025","doi-asserted-by":"crossref","unstructured":"Kulkarni G. , Premraj V. , Dhar S. , Li S. , Choi Y. , Berg A. C. , and Berg T. L. 2011. Baby talk: understanding and generating simple image descriptions. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, pp. 1601\u20138.","DOI":"10.1109\/CVPR.2011.5995466"},{"key":"S1351324918000104_ref026","doi-asserted-by":"crossref","unstructured":"Lan T. , Yang W. , Wang Y. and Mori G. 2012. Image retrieval with structured object queries using latent ranking SVM. 
In Proceedings of the 12th European Conference on Computer Vision, pp. 129\u201342.","DOI":"10.1007\/978-3-642-33783-3_10"},{"key":"S1351324918000104_ref027","unstructured":"Li S. , Kulkarni G. , Berg T. L. , Berg A. C. and Choi Y. 2011. Composing simple image descriptions using web-scale N-grams. In Proceedings of the 15th Conference on Computational Natural Language Learning, pp. 220\u20138."},{"key":"S1351324918000104_ref033","unstructured":"Ortiz L. G. M. , Wolff C. and Lapata M. 2015. Learning to interpret and describe abstract scenes. In Proceedings of the 2015 North American Chapter of the Association for Computational Linguistics: HLT, pp. 1505\u201315."},{"key":"S1351324918000104_ref039","doi-asserted-by":"publisher","DOI":"10.1002\/j.1538-7305.1957.tb01515.x"},{"key":"S1351324918000104_ref040","doi-asserted-by":"crossref","unstructured":"Roth M. and Lapata M. 2016. Neural semantic role labeling with dependency path embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), Berlin, Germany, pp. 1192\u20131202.","DOI":"10.18653\/v1\/P16-1113"},{"key":"S1351324918000104_ref041","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"S1351324918000104_ref042","doi-asserted-by":"crossref","unstructured":"Schuster S. , Krishna R. , Chang A. , Fei-Fei L. , and Manning C. D. 2015. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the 4th Workshop on Vision and Language, pp. 70\u201380.","DOI":"10.18653\/v1\/W15-2812"},{"key":"S1351324918000104_ref043","unstructured":"Simonyan K. and Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition. ArXiv e-prints, abs\/1409.1556v6."},{"key":"S1351324918000104_ref044","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0620-5"},{"key":"S1351324918000104_ref045","doi-asserted-by":"crossref","unstructured":"Vedantam R. , Zitnick C. L. and Parikh D. 2015. 
CIDEr: consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566\u201375.","DOI":"10.1109\/CVPR.2015.7299087"},{"key":"S1351324918000104_ref046","doi-asserted-by":"crossref","unstructured":"Vinyals O. , Toshev A. , Bengio S. and Erhan D. 2015. Show and tell: a neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156\u201364.","DOI":"10.1109\/CVPR.2015.7298935"},{"key":"S1351324918000104_ref047","unstructured":"Vishwanathan S. V. N. and Smola A. J. 2002. Fast kernels for string and tree matching. Advances in Neural Information Processing Systems 15: Annual Conference on Neural Information Processing Systems, pp. 569\u201376."},{"key":"S1351324918000104_ref048","unstructured":"Xu K. , Ba J. , Kiros R. , Cho K. , Courville A. , Salakhudinov R. , Zemel R. , and Bengio Y. 2015. Show, attend and tell: neural image caption generation with visual attention. In D. Blei , and F. Bach (eds.), Proceedings of the 32nd International Conference on Machine Learning, pp. 2048\u201357. JMLR Workshop and Conference Proceedings."},{"key":"S1351324918000104_ref051","doi-asserted-by":"crossref","first-page":"67","DOI":"10.1162\/tacl_a_00166","article-title":"From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions","volume":"2","author":"Young","year":"2014","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"S1351324918000104_ref052","doi-asserted-by":"crossref","unstructured":"Yu L. , Poirson P. , Yang S. , Berg A. C. , and Berg T. L. 2016. Modeling context in referring expressions. In ECCV.","DOI":"10.1007\/978-3-319-46475-6_5"},{"key":"S1351324918000104_ref053","unstructured":"Zampogiannis K. , Yang Y. , Ferm\u00fcller C. , and Aloimonos Y. 2015. Learning the spatial semantics of manipulation actions through preposition grounding. 
In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1389\u201396."},{"key":"S1351324918000104_ref030","doi-asserted-by":"crossref","unstructured":"Moschitti A. 2006a. Efficient convolution kernels for dependency and constituent syntactic trees. In Proceedings of the 17th European Conference on Machine Learning, pp. 318\u201329.","DOI":"10.1007\/11871842_32"},{"key":"S1351324918000104_ref020","doi-asserted-by":"crossref","unstructured":"Kafle K. and Kanan C. 2016. Answer-type prediction for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4976\u201384.","DOI":"10.1109\/CVPR.2016.538"},{"key":"S1351324918000104_ref035","unstructured":"Papineni K. , Roukos S. , Ward T. and Zhu W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311\u20138."},{"key":"S1351324918000104_ref031","unstructured":"Moschitti A. 2006b. Making tree kernels practical for natural language learning. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pp. 113\u201320."},{"key":"S1351324918000104_ref050","doi-asserted-by":"crossref","unstructured":"Yatskar M. , Zettlemoyer L. and Farhadi A. 2016b. Situation recognition: visual semantic role labeling for image understanding. 
In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).","DOI":"10.1109\/CVPR.2016.597"}],"container-title":["Natural Language Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.cambridge.org\/core\/services\/aop-cambridge-core\/content\/view\/S1351324918000104","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,10,13]],"date-time":"2019-10-13T16:09:29Z","timestamp":1570982969000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.cambridge.org\/core\/product\/identifier\/S1351324918000104\/type\/journal_article"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,3,28]]},"references-count":54,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2018,5]]}},"alternative-id":["S1351324918000104"],"URL":"https:\/\/doi.org\/10.1017\/s1351324918000104","relation":{},"ISSN":["1351-3249","1469-8110"],"issn-type":[{"value":"1351-3249","type":"print"},{"value":"1469-8110","type":"electronic"}],"subject":[],"published":{"date-parts":[[2018,3,28]]}}}