{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,15]],"date-time":"2026-05-15T05:50:18Z","timestamp":1778824218903,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":45,"publisher":"ACM","license":[{"start":{"date-parts":[[2019,6,5]],"date-time":"2019-06-05T00:00:00Z","timestamp":1559692800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/100011039","name":"Intelligence Advanced Research Projects Activity","doi-asserted-by":"publisher","award":["60NANB17D156"],"award-info":[{"award-number":["60NANB17D156"]}],"id":[{"id":"10.13039\/100011039","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000185","name":"Defense Advanced Research Projects Agency","doi-asserted-by":"publisher","award":["FA8750-18-2-0018"],"award-info":[{"award-number":["FA8750-18-2-0018"]}],"id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2019,6,5]]},"DOI":"10.1145\/3323873.3325043","type":"proceedings-article","created":{"date-parts":[[2019,6,10]],"date-time":"2019-06-10T12:10:58Z","timestamp":1560168658000},"page":"244-252","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":6,"title":["Improving What Cross-Modal Retrieval Models Learn through Object-Oriented Inter- and Intra-Modal Attention Networks"],"prefix":"10.1145","author":[{"given":"Po-Yao","family":"Huang","sequence":"first","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"family":"Vaibhav","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"given":"Xiaojun","family":"Chang","sequence":"additional","affiliation":[{"name":"Monash University, Melbourne, Australia"}]},{"given":"Alexander G.","family":"Hauptmann","sequence":"additional","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}]}],"member":"320","published-online":{"date-parts":[[2019,6,5]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"crossref","unstructured":"Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR .  Peter Anderson Xiaodong He Chris Buehler Damien Teney Mark Johnson Stephen Gould and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR .","DOI":"10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_1_2_1","volume-title":"Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473","author":"Bahdanau Dzmitry","year":"2014","unstructured":"Dzmitry Bahdanau , Kyunghyun Cho , and Yoshua Bengio . 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 ( 2014 ). Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)."},{"key":"e_1_3_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1179"},{"key":"e_1_3_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-2123"},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.201"},{"key":"e_1_3_2_1_6_1","volume-title":"Jamie Ryan Kiros, and Sanja Fidler","author":"Faghri Fartash","year":"2018","unstructured":"Fartash Faghri , David J Fleet , Jamie Ryan Kiros, and Sanja Fidler . 2018 . VSE Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE"},{"key":"e_1_3_2_1_7_1","unstructured":": Improving Visual-Semantic Embeddings with Hard Negatives. (2018). https:\/\/github.com\/fartashf\/vsepp  : Improving Visual-Semantic Embeddings with Hard Negatives. (2018). https:\/\/github.com\/fartashf\/vsepp"},{"key":"e_1_3_2_1_8_1","volume-title":"Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems","author":"Frome Andrea","year":"2013","unstructured":"Andrea Frome , Gregory S. Corrado , Jonathon Shlens , Samy Bengio , Jeffrey Dean , Marc'Aurelio Ranzato , and Tomas Mikolov . 2013. DeViSE: A Deep Visual-Semantic Embedding Model . In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013 . Proceedings of a meeting held December 5--8, 2013, Lake Tahoe, Nevada, United States . 2121--2129. Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5--8, 2013, Lake Tahoe, Nevada, United States. 2121--2129."},{"key":"e_1_3_2_1_9_1","volume-title":"Proceedings of the thirteenth international conference on artificial intelligence and statistics. 249--256","author":"Glorot Xavier","year":"2010","unstructured":"Xavier Glorot and Yoshua Bengio . 2010 . Understanding the difficulty of training deep feedforward neural networks . In Proceedings of the thirteenth international conference on artificial intelligence and statistics. 249--256 . Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. 249--256."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00750"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_3_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.5555\/2566972.2566993"},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3206025.3206079"},{"key":"e_1_3_2_1_15_1","volume-title":"Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 7254--7262","author":"Huang Yan","year":"2017","unstructured":"Yan Huang , Wei Wang , and Liang Wang . 2017 a. Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 7254--7262 . Yan Huang, Wei Wang, and Liang Wang. 2017a. Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 7254--7262."},{"key":"e_1_3_2_1_16_1","volume-title":"Learning semantic concepts and order for image and sentence matching. arXiv preprint arXiv:1712.02036","author":"Huang Yan","year":"2017","unstructured":"Yan Huang , Qi Wu , and Liang Wang . 2017b. Learning semantic concepts and order for image and sentence matching. arXiv preprint arXiv:1712.02036 ( 2017 ). Yan Huang, Qi Wu, and Liang Wang. 2017b. Learning semantic concepts and order for image and sentence matching. arXiv preprint arXiv:1712.02036 (2017)."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1215"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/2733373.2806240"},{"key":"e_1_3_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298932"},{"key":"e_1_3_2_1_20_1","unstructured":"Andrej Karpathy Armand Joulin and Li F Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in neural information processing systems. 1889--1897.   Andrej Karpathy Armand Joulin and Li F Fei-Fei. 2014. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in neural information processing systems. 1889--1897."},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1181"},{"key":"e_1_3_2_1_22_1","volume-title":"Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980","author":"Kingma Diederik P","year":"2014","unstructured":"Diederik P Kingma and Jimmy Ba . 2014 . Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014). Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)."},{"key":"e_1_3_2_1_23_1","volume-title":"Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. NIPS Workshop","author":"Kiros Ryan","year":"2014","unstructured":"Ryan Kiros , Ruslan Salakhutdinov , and Richard S. Zemel . 2014 . Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. NIPS Workshop ( 2014 ). Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. 2014. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models. NIPS Workshop (2014)."},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299073"},{"key":"e_1_3_2_1_25_1","unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097--1105.   Alex Krizhevsky Ilya Sutskever and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097--1105."},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N16-1030"},{"key":"e_1_3_2_1_27_1","volume-title":"Stacked Cross Attention for Image-Text Matching. arXiv preprint arXiv:1803.08024","author":"Lee Kuang-Huei","year":"2018","unstructured":"Kuang-Huei Lee , Xi Chen , Gang Hua , Houdong Hu , and Xiaodong He. 2018. Stacked Cross Attention for Image-Text Matching. arXiv preprint arXiv:1803.08024 ( 2018 ). Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked Cross Attention for Image-Text Matching. arXiv preprint arXiv:1803.08024 (2018)."},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.301"},{"key":"e_1_3_2_1_30_1","unstructured":"Tomas Mikolov Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111--3119.   Tomas Mikolov Ilya Sutskever Kai Chen Greg S Corrado and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111--3119."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.232"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.208"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1162"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2007.383266"},{"key":"e_1_3_2_1_35_1","unstructured":"Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91--99.   Shaoqing Ren Kaiming He Ross Girshick and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91--99."},{"key":"e_1_3_2_1_36_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman . 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 ( 2014 ). Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_3_2_1_37_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008.   Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N Gomez \u0141ukasz Kaiser and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998--6008."},{"key":"e_1_3_2_1_38_1","volume-title":"Order-Embeddings of Images and Language. CoRR","author":"Vendrov Ivan","year":"2015","unstructured":"Ivan Vendrov , Ryan Kiros , Sanja Fidler , and Raquel Urtasun . 2015a. Order-Embeddings of Images and Language. CoRR , Vol. abs\/ 1511 .06361 ( 2015 ). Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2015a. Order-Embeddings of Images and Language. CoRR, Vol. abs\/1511.06361 (2015)."},{"key":"e_1_3_2_1_39_1","volume-title":"Order-embeddings of images and language. arXiv preprint arXiv:1511.06361","author":"Vendrov Ivan","year":"2015","unstructured":"Ivan Vendrov , Ryan Kiros , Sanja Fidler , and Raquel Urtasun . 2015b. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361 ( 2015 ). Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2015b. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361 (2015)."},{"key":"e_1_3_2_1_40_1","volume-title":"Learning two-branch neural networks for image-text matching tasks","author":"Wang Liwei","year":"2018","unstructured":"Liwei Wang , Yin Li , Jing Huang , and Svetlana Lazebnik . 2018. Learning two-branch neural networks for image-text matching tasks . IEEE Transactions on Pattern Analysis and Machine Intelligence ( 2018 ). Liwei Wang, Yin Li, Jing Huang, and Svetlana Lazebnik. 2018. Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)."},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.541"},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298966"},{"key":"e_1_3_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00166"},{"key":"e_1_3_2_1_44_1","volume-title":"Deep learning for sentiment analysis: A survey","author":"Zhang Lei","year":"2018","unstructured":"Lei Zhang , Shuai Wang , and Bing Liu . 2018. Deep learning for sentiment analysis: A survey . Wiley Interdisciplinary Reviews : Data Mining and Knowledge Discovery ( 2018 ). Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (2018)."},{"key":"e_1_3_2_1_45_1","volume-title":"Dual-Path Convolutional Image-Text Embedding. CoRR","author":"Zheng Zhedong","year":"2017","unstructured":"Zhedong Zheng , Liang Zheng , Michael Garrett , Yi Yang , and Yi-Dong Shen . 2017. Dual-Path Convolutional Image-Text Embedding. CoRR , Vol. abs\/ 1711 .05535 ( 2017 ). Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, and Yi-Dong Shen. 2017. Dual-Path Convolutional Image-Text Embedding. CoRR, Vol. abs\/1711.05535 (2017)."}],"event":{"name":"ICMR '19: International Conference on Multimedia Retrieval","location":"Ottawa ON Canada","acronym":"ICMR '19","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2019 on International Conference on Multimedia Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3323873.3325043","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3323873.3325043","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3323873.3325043","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T23:54:12Z","timestamp":1750204452000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3323873.3325043"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,6,5]]},"references-count":45,"alternative-id":["10.1145\/3323873.3325043","10.1145\/3323873"],"URL":"https:\/\/doi.org\/10.1145\/3323873.3325043","relation":{},"subject":[],"published":{"date-parts":[[2019,6,5]]},"assertion":[{"value":"2019-06-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}