{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T10:09:30Z","timestamp":1777716570356,"version":"3.51.4"},"reference-count":71,"publisher":"SAGE Publications","issue":"1-3","license":[{"start":{"date-parts":[[2015,10,13]],"date-time":"2015-10-13T00:00:00Z","timestamp":1444694400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["The International Journal of Robotics Research"],"published-print":{"date-parts":[[2016,1]]},"abstract":"<jats:p>We address the problem of retrieving and detecting objects based on open-vocabulary natural language queries: given a phrase describing a specific object, for example \u201cthe corn flakes box\u201d, the task is to find the best match in a set of images containing candidate objects. When naming objects, humans tend to use natural language with rich semantics, including basic-level categories, fine-grained categories, and instance-level concepts such as brand names. Existing approaches to large-scale object recognition fail in this scenario, as they expect queries that map directly to a fixed set of pre-trained visual categories, for example ImageNet synset tags. We address this limitation by introducing a novel object retrieval method. Given a candidate object image, we first map it to a set of words that are likely to describe it, using several learned image-to-text projections. We also propose a method for handling open vocabularies, that is, words not contained in the training data. We then compare the natural language query to the sets of words predicted for each candidate and select the best match. Our method can combine category- and instance-level semantics in a common representation. We present extensive experimental results on several datasets using both instance-level and category-level matching and show that our approach can accurately retrieve objects based on extremely varied open-vocabulary queries. Furthermore, we show how to process queries referring to objects within scenes, using state-of-the-art adapted detectors. The source code of our approach will be publicly available together with pre-trained models at http:\/\/openvoc.berkeleyvision.org and could be directly used for robotics applications.<\/jats:p>","DOI":"10.1177\/0278364915602059","type":"journal-article","created":{"date-parts":[[2015,10,13]],"date-time":"2015-10-13T21:39:04Z","timestamp":1444772344000},"page":"265-280","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":5,"title":["Understanding object descriptions in robotics by open-vocabulary object retrieval and detection"],"prefix":"10.1177","volume":"35","author":[{"given":"Sergio","family":"Guadarrama","sequence":"first","affiliation":[{"name":"EECS Department, University of California at Berkeley, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Erik","family":"Rodner","sequence":"additional","affiliation":[{"name":"Friedrich Schiller University of Jena, Jena, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kate","family":"Saenko","sequence":"additional","affiliation":[{"name":"CS Department, University of Massachusetts Lowell, MA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Trevor","family":"Darrell","sequence":"additional","affiliation":[{"name":"EECS Department, University of California at Berkeley, CA, USA"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","published-online":{"date-parts":[[2015,10,13]]},"reference":[{"key":"bibr1-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1145\/2488388.2488391"},{"key":"bibr2-0278364915602059","author":"Arandjelovic R","year":"2012","journal-title":"BMVC"},{"key":"bibr3-0278364915602059","author":"Arbel\u00e1ez P","year":"2014","journal-title":"Computer vision and pattern recognition (CVPR)"},{"key":"bibr4-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2001.937654"},{"key":"bibr5-0278364915602059","unstructured":"Berg A, Farrell R, Khosla A, (2013) Fine-grained challenge 2013. Available at: https:\/\/sites.google.com\/site\/fgcomp2013\/."},{"key":"bibr6-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1145\/860435.860460"},{"key":"bibr7-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2013.433"},{"key":"bibr8-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1145\/1376616.1376746"},{"key":"bibr9-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1145\/2071389.2071390"},{"key":"bibr10-0278364915602059","first-page":"432","author":"Chatfield K","year":"2013","journal-title":"ACCV 2012"},{"key":"bibr11-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2007.4408891"},{"key":"bibr12-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1145\/2505515.2507880"},{"key":"bibr13-0278364915602059","author":"Dean T","year":"2013","journal-title":"Computer vision and pattern recognition (CVPR)"},{"key":"bibr14-0278364915602059","author":"Deng J","year":"2010","journal-title":"ECCV"},{"key":"bibr15-0278364915602059","author":"Deng J","year":"2012","journal-title":"Computer vision and pattern recognition (CVPR)"},{"key":"bibr16-0278364915602059","unstructured":"Donahue J, Jia Y, Vinyals O, (2013) DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv:1310.1531."},{"key":"bibr17-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-009-0275-4"},{"key":"bibr18-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-15561-1_2"},{"key":"bibr19-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2011.6126238"},{"key":"bibr20-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10593-2_37"},{"key":"bibr21-0278364915602059","author":"Frome A","year":"2013","journal-title":"Advances in neural information processing systems (NIPS)"},{"key":"bibr22-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1145\/32206.32212"},{"key":"bibr23-0278364915602059","author":"Girshick R","year":"2014","journal-title":"Computer vision and pattern recognition (CVPR)"},{"key":"bibr24-0278364915602059","author":"Gordoa A","year":"2012","journal-title":"Computer vision and pattern recognition (CVPR)"},{"key":"bibr25-0278364915602059","author":"Grangier D","year":"2007","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)"},{"key":"bibr26-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/IROS.2013.6696569"},{"key":"bibr27-0278364915602059","volume-title":"Proceedings of robotics: Science and systems (RSS)","author":"Guadarrama S","year":"2014"},{"key":"bibr28-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10584-0_23"},{"key":"bibr29-0278364915602059","first-page":"5035","volume":"1407","author":"Hoffman J","year":"2014","journal-title":"arXiv:"},{"key":"bibr30-0278364915602059","doi-asserted-by":"publisher","DOI":"10.5244\/C.28.24"},{"key":"bibr31-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2011.190"},{"key":"bibr32-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1145\/775047.775067"},{"key":"bibr33-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299063"},{"key":"bibr34-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_47"},{"key":"bibr35-0278364915602059","author":"Krapac J","year":"2010","journal-title":"Computer vision and pattern recognition (CVPR)"},{"key":"bibr36-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00220"},{"key":"bibr37-0278364915602059","author":"Krizhevsky A","year":"2012","journal-title":"Advances in neural information processing systems (NIPS)"},{"key":"bibr38-0278364915602059","author":"Kulkarni G","year":"2011","journal-title":"Computer vision and pattern recognition (CVPR)"},{"key":"bibr39-0278364915602059","author":"Kuznetsova P","year":"2013","journal-title":"ACL"},{"key":"bibr40-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.179"},{"key":"bibr41-0278364915602059","author":"Lin Y","year":"2011","journal-title":"Computer vision and pattern recognition (CVPR)"},{"key":"bibr42-0278364915602059","author":"Liu Y","year":"2009","journal-title":"ACM-MM"},{"key":"bibr43-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1145\/1282280.1282366"},{"key":"bibr44-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.1999.790410"},{"key":"bibr45-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-642-33718-5_10"},{"key":"bibr46-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511809071"},{"key":"bibr47-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-540-24670-1_5"},{"key":"bibr48-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1145\/1291233.1291448"},{"key":"bibr49-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1007\/11788034_15"},{"key":"bibr50-0278364915602059","author":"Nister D","year":"2006","journal-title":"Computer vision and pattern recognition (CVPR)"},{"key":"bibr51-0278364915602059","author":"Ordonez V","year":"2013","journal-title":"ICCV"},{"key":"bibr52-0278364915602059","author":"Parkhi OM","year":"2012","journal-title":"Computer vision and pattern recognition (CVPR)"},{"key":"bibr53-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1016\/j.cviu.2013.12.006"},{"key":"bibr54-0278364915602059","author":"Philbin J","year":"2007","journal-title":"Computer vision and pattern recognition (CVPR)"},{"key":"bibr55-0278364915602059","author":"Platt JC","year":"1999","journal-title":"Advances in large-margin classifiers"},{"key":"bibr56-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/TRO.2008.915445"},{"key":"bibr57-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2007.900138"},{"key":"bibr58-0278364915602059","first-page":"1","author":"Russakovsky O","year":"2014","journal-title":"International Journal of Computer Vision"},{"key":"bibr59-0278364915602059","author":"Sharma A","year":"2012","journal-title":"Computer vision and pattern recognition (CVPR)"},{"key":"bibr60-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2003.1238663"},{"key":"bibr61-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2007.900156"},{"key":"bibr62-0278364915602059","author":"Socher R","year":"2012","journal-title":"EMNLP"},{"key":"bibr63-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/ICRA.2012.6224891"},{"key":"bibr64-0278364915602059","author":"Tellex S","year":"2011","journal-title":"AAAI"},{"key":"bibr65-0278364915602059","author":"Tellex S","year":"2012","journal-title":"Toward a probabilistic approach to acquiring information from human partners using language"},{"key":"bibr66-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0620-5"},{"key":"bibr67-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2009.2019809"},{"key":"bibr68-0278364915602059","unstructured":"Vedaldi A (2014) VisualIndex \u2013 A simple image indexing engine in MATLAB. Available at: https:\/\/github.com\/vedaldi\/visualindex."},{"key":"bibr69-0278364915602059","author":"Weston J","year":"2011","journal-title":"IJCAI"},{"key":"bibr70-0278364915602059","first-page":"2214","volume-title":"2013 IEEE\/RSJ international conference on intelligent robots and systems (IROS)","author":"Xie Z","year":"2013"},{"key":"bibr71-0278364915602059","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_26"}],"container-title":["The International Journal of Robotics Research"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/0278364915602059","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/0278364915602059","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/0278364915602059","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T10:18:56Z","timestamp":1777457936000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/0278364915602059"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,10,13]]},"references-count":71,"journal-issue":{"issue":"1-3","published-print":{"date-parts":[[2016,1]]}},"alternative-id":["10.1177\/0278364915602059"],"URL":"https:\/\/doi.org\/10.1177\/0278364915602059","relation":{},"ISSN":["0278-3649","1741-3176"],"issn-type":[{"value":"0278-3649","type":"print"},{"value":"1741-3176","type":"electronic"}],"subject":[],"published":{"date-parts":[[2015,10,13]]}}}