{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T04:07:23Z","timestamp":1750306043109,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":32,"publisher":"ACM","license":[{"start":{"date-parts":[[2017,6,6]],"date-time":"2017-06-06T00:00:00Z","timestamp":1496707200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Defense Advanced Research Projects Agency","award":["FA8750-16-2-0204"],"award-info":[{"award-number":["FA8750-16-2-0204"]}]},{"name":"Air Force Research Laboratory","award":["FA8750-16-2-0204"],"award-info":[{"award-number":["FA8750-16-2-0204"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2017,6,6]]},"DOI":"10.1145\/3078971.3078976","type":"proceedings-article","created":{"date-parts":[[2017,5,25]],"date-time":"2017-05-25T16:27:32Z","timestamp":1495729652000},"page":"23-31","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":24,"title":["MSRC"],"prefix":"10.1145","author":[{"given":"Kan","family":"Chen","sequence":"first","affiliation":[{"name":"University of Southern California, Los Angeles, CA, USA"}]},{"given":"Rama","family":"Kovvuri","sequence":"additional","affiliation":[{"name":"University of Southern California, Los Angeles, CA, USA"}]},{"given":"Jiyang","family":"Gao","sequence":"additional","affiliation":[{"name":"University of Southern California, Los Angeles, CA, USA"}]},{"given":"Ram","family":"Nevatia","sequence":"additional","affiliation":[{"name":"University of Southern California, Los Angeles, CA, USA"}]}],"member":"320","published-online":{"date-parts":[[2017,6,6]]},"reference":[{"key":"e_1_3_2_1_1_1","unstructured":"K. Andrej and F.-F. Li. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.  K. Andrej and F.-F. Li. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR."},{"key":"e_1_3_2_1_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_3_2_1_3_1","unstructured":"K. Chen J. Wang L.-C. Chen H. Gao W. Xu and R. Nevatia. 2015. ABC-CNN: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960 (2015).  K. Chen J. Wang L.-C. Chen H. Gao W. Xu and R. Nevatia. 2015. ABC-CNN: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960 (2015)."},{"key":"e_1_3_2_1_4_1","volume-title":"Imagenet: A large-scale hierarchical image database. In CVPR.","author":"Deng J.","year":"2009","unstructured":"J. Deng , W. Dong , R. Socher , L.-J. Li , K. Li , and F.-F Li . 2009 . Imagenet: A large-scale hierarchical image database. In CVPR. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F Li. 2009. Imagenet: A large-scale hierarchical image database. In CVPR."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-009-0275-4"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"crossref","unstructured":"H. Fang S. Gupta F. Iandola R. K. Srivastava L. Deng P. Doll\u00e1r J. Gao X. He M. Mitchell J. C Platt and others. 2015. From captions to visual concepts and back. In CVPR.  H. Fang S. Gupta F. Iandola R. K. Srivastava L. Deng P. Doll\u00e1r J. Gao X. He M. Mitchell J. C Platt and others. 2015. From captions to visual concepts and back. In CVPR.","DOI":"10.1109\/CVPR.2015.7298754"},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"crossref","unstructured":"A. Fukui D. H Park D. Yang A. Rohrbach T. Darrell and M. Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. EMNLP (2016).  A. Fukui D. H Park D. Yang A. Rohrbach T. Darrell and M. Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. EMNLP (2016).","DOI":"10.18653\/v1\/D16-1044"},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.169"},{"key":"e_1_3_2_1_9_1","unstructured":"X. Glorot and Y. Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks.. In Aistats.  X. Glorot and Y. Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks.. In Aistats."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"crossref","unstructured":"A. Gordo J. Almaz\u00e1n J. Revaud and D. Larlus. 2016. Deep image retrieval: Learning global representations for image search. In ECCV.  A. Gordo J. Almaz\u00e1n J. Revaud and D. Larlus. 2016. Deep image retrieval: Learning global representations for image search. In ECCV.","DOI":"10.1007\/978-3-319-46466-4_15"},{"key":"e_1_3_2_1_11_1","volume":"201","author":"He K.","unstructured":"K. He , X. Zhang , S. Ren , and J. Sun. 201 5. Delving deep into recitifers: Surpassing human-level performance on imagenet classification. In CVPR. K. He, X. Zhang, S. Ren, and J. Sun. 2015. Delving deep into recitifers: Surpassing human-level performance on imagenet classification. In CVPR.","journal-title":"J. Sun."},{"key":"e_1_3_2_1_12_1","volume":"199","author":"Hochreiter S.","unstructured":"S. Hochreiter and J. Schmidhuber. 199 7. Long short-term memory. Neural computation (1997). S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural computation (1997).","journal-title":"J. Schmidhuber."},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"crossref","unstructured":"R. Hu H. Xu M. Rohrbach J. Feng K. Saenko and T. Darrell. 2016. Natural language object retrieval. In CVPR.  R. Hu H. Xu M. Rohrbach J. Feng K. Saenko and T. Darrell. 2016. Natural language object retrieval. In CVPR.","DOI":"10.1109\/CVPR.2016.493"},{"key":"e_1_3_2_1_14_1","volume-title":"Densecap: Fully convolutional localization networks for dense captioning. In CVPR.","author":"Justin J.","year":"2016","unstructured":"J. Justin , K. Andrej , and F.-F. Li . 2016 . Densecap: Fully convolutional localization networks for dense captioning. In CVPR. J. Justin, K. Andrej, and F.-F. Li. 2016. Densecap: Fully convolutional localization networks for dense captioning. In CVPR."},{"key":"e_1_3_2_1_15_1","unstructured":"Sahar K. Vicente O. Mark M. and Tamara L. B. 2014. ReferIt Game: Referring to Objects in Photographs of Natural Scenes. In EMNLP.  Sahar K. Vicente O. Mark M. and Tamara L. B. 2014. ReferIt Game: Referring to Objects in Photographs of Natural Scenes. In EMNLP."},{"key":"e_1_3_2_1_16_1","volume-title":"Contextlocnet: Context-aware deep network models for weakly supervised localization. In ECCV.","author":"Kantorov V.","year":"2016","unstructured":"V. Kantorov , M. Oquab , M. Cho , and I. Laptev . 2016 . Contextlocnet: Context-aware deep network models for weakly supervised localization. In ECCV. V. Kantorov, M. Oquab, M. Cho, and I. Laptev. 2016. Contextlocnet: Context-aware deep network models for weakly supervised localization. In ECCV."},{"key":"e_1_3_2_1_17_1","unstructured":"A. Karpathy A. Joulin and F.-F. Li. 2014. Deep fragment embeddings for bidirec- tional image sentence mapping. In NIPS.   A. Karpathy A. Joulin and F.-F. Li. 2014. Deep fragment embeddings for bidirec- tional image sentence mapping. In NIPS."},{"key":"e_1_3_2_1_18_1","volume":"201","author":"Kingma D.","unstructured":"D. Kingma and J. Ba. 201 4. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). D. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).","journal-title":"J. Ba."},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"crossref","unstructured":"J. Krishnamurthy and T. Kollar. 2013. Jointly learning to parse and perceive: Connecting natural language to the physical world. TACL (2013).  J. Krishnamurthy and T. Kollar. 2013. Jointly learning to parse and perceive: Connecting natural language to the physical world. TACL (2013).","DOI":"10.1162\/tacl_a_00220"},{"key":"e_1_3_2_1_20_1","volume-title":"SSD: Single shot multibox detector. In ECCV.","author":"Liu W.","year":"2016","unstructured":"W. Liu , D. Anguelov , D. Erhan , C. Szegedy , S. Reed , C.-Y. Fu , and A. C. Berg . 2016 . SSD: Single shot multibox detector. In ECCV. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. 2016. SSD: Single shot multibox detector. In ECCV."},{"key":"e_1_3_2_1_21_1","unstructured":"C. Matuszek N. FitzGerald L. Zettlemoyer L. Bo and D. Fox. 2012. A joint model of language and perception for grounded attribute learning. ICML (2012).   C. Matuszek N. FitzGerald L. Zettlemoyer L. Bo and D. Fox. 2012. A joint model of language and perception for grounded attribute learning. ICML (2012)."},{"key":"e_1_3_2_1_22_1","doi-asserted-by":"crossref","unstructured":"V. K Nagaraja V. I Morariu and L. S Davis. 2016. Modeling context between objects for referring expression understanding. In ECCV.  V. K Nagaraja V. I Morariu and L. S Davis. 2016. Modeling context between objects for referring expression understanding. In ECCV.","DOI":"10.1007\/978-3-319-46493-0_48"},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0965-7"},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"crossref","unstructured":"F. Radenovi\u0107 G. Tolias and O. Chum. 2016. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In ECCV.  F. Radenovi\u0107 G. Tolias and O. Chum. 2016. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In ECCV.","DOI":"10.1007\/978-3-319-46448-0_1"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"crossref","unstructured":"J. Redmon S. Divvala R. Girshick and A. Farhadi. 2016. You only look once: Unified real-time object detection. In CVPR.  J. Redmon S. Divvala R. Girshick and A. Farhadi. 2016. You only look once: Unified real-time object detection. In CVPR.","DOI":"10.1109\/CVPR.2016.91"},{"key":"e_1_3_2_1_26_1","volume":"201","author":"Ren S.","unstructured":"S. Ren , K. He , R. Girshick , and J. Sun. 201 5. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS. S. Ren, K. He, R. Girshick, and J. Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS.","journal-title":"J. Sun."},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"crossref","unstructured":"A. Rohrbach M. Rohrbach R. Hu T. Darrell and B. Schiele. 2016. Grounding of textual phrases in images by reconstruction. In ECCV.  A. Rohrbach M. Rohrbach R. Hu T. Darrell and B. Schiele. 2016. Grounding of textual phrases in images by reconstruction. In ECCV.","DOI":"10.1007\/978-3-319-46448-0_49"},{"key":"e_1_3_2_1_28_1","unstructured":"K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR (2014).  K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR (2014)."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0620-5"},{"key":"e_1_3_2_1_30_1","volume":"201","author":"Wang M.","unstructured":"M. Wang , M. Azab , N. Kojima , R. Mihalcea , and J. Deng. 201 6. Structured matching for phrase localization. In ECCV. M. Wang, M. Azab, N. Kojima, R. Mihalcea, and J. Deng. 2016. Structured matching for phrase localization. In ECCV.","journal-title":"J. Deng."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"crossref","unstructured":"L. Yu P. Poirson S. Yang A. C Berg and T. L Berg. 2016. Modeling context in referring expressions. In ECCV.  L. Yu P. Poirson S. Yang A. C Berg and T. L Berg. 2016. Modeling context in referring expressions. In ECCV.","DOI":"10.1007\/978-3-319-46475-6_5"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"crossref","unstructured":"C. L. Zitnick and P. Doll\u00e1r. 2014. Edge boxes: Locating object proposals from edges. In ECCV  C. L. Zitnick and P. Doll\u00e1r. 2014. Edge boxes: Locating object proposals from edges. In ECCV","DOI":"10.1007\/978-3-319-10602-1_26"}],"event":{"name":"ICMR '17: International Conference on Multimedia Retrieval","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Bucharest Romania","acronym":"ICMR '17"},"container-title":["Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3078971.3078976","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3078971.3078976","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3078971.3078976","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T03:03:09Z","timestamp":1750215789000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3078971.3078976"}},"subtitle":["Multimodal Spatial Regression with Semantic Context for Phrase Grounding"],"short-title":[],"issued":{"date-parts":[[2017,6,6]]},"references-count":32,"alternative-id":["10.1145\/3078971.3078976","10.1145\/3078971"],"URL":"https:\/\/doi.org\/10.1145\/3078971.3078976","relation":{},"subject":[],"published":{"date-parts":[[2017,6,6]]},"assertion":[{"value":"2017-06-06","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}