{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,9]],"date-time":"2026-01-09T19:38:54Z","timestamp":1767987534149,"version":"3.49.0"},"publisher-location":"New York, NY, USA","reference-count":42,"publisher":"ACM","license":[{"start":{"date-parts":[[2019,6,5]],"date-time":"2019-06-05T00:00:00Z","timestamp":1559692800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2019,6,5]]},"DOI":"10.1145\/3323873.3325035","type":"proceedings-article","created":{"date-parts":[[2019,6,10]],"date-time":"2019-06-10T12:10:58Z","timestamp":1560168658000},"page":"182-186","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":9,"title":["Self-Supervised Visual Representations for Cross-Modal Retrieval"],"prefix":"10.1145","author":[{"given":"Yash","family":"Patel","sequence":"first","affiliation":[{"name":"Carnegie Mellon University, Pittsburgh, PA, USA"}]},{"given":"Lluis","family":"Gomez","sequence":"additional","affiliation":[{"name":"Universitat Autonoma de Barcelona, Barcelona, Spain"}]},{"given":"Mar\u00e7al","family":"Rusi\u00f1ol","sequence":"additional","affiliation":[{"name":"Universitat Autonoma de Barcelona, Barcelona, Spain"}]},{"given":"Dimosthenis","family":"Karatzas","sequence":"additional","affiliation":[{"name":"Universitat Autonoma de Barcelona, Barcelona, Spain"}]},{"given":"C.V.","family":"Jawahar","sequence":"additional","affiliation":[{"name":"CVIT, KCIS, IIIT, Hyderabad, India, Hyderabad, India"}]}],"member":"320","published-online":{"date-parts":[[2019,6,5]]},"reference":[{"key":"e_1_3_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.13"},{"key":"e_1_3_2_1_2_1","unstructured":"Galen Andrew Raman Arora Jeff Bilmes and Karen 
Livescu. 2013. Deep canonical correlation analysis. In ICML."},{"key":"e_1_3_2_1_3_1","volume-title":"Latent Dirichlet allocation. Journal of Machine Learning Research","author":"Blei David M","year":"2003","unstructured":"David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research (2003)."},{"key":"e_1_3_2_1_4_1","volume-title":"Imagenet: A large-scale hierarchical image database. In CVPR.","author":"Deng Jia","year":"2009","unstructured":"Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In CVPR."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"crossref","unstructured":"Carl Doersch Abhinav Gupta and Alexei A Efros. 2015. Unsupervised visual representation learning by context prediction. In ICCV.","DOI":"10.1109\/ICCV.2015.167"},{"key":"e_1_3_2_1_6_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-009-0275-4"},{"key":"e_1_3_2_1_7_1","volume-title":"Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth.","author":"Farhadi Ali","year":"2010","unstructured":"Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. 
Every picture tells a story: Generating sentences from images. In ECCV."},{"key":"e_1_3_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/2647868.2654902"},{"key":"e_1_3_2_1_9_1","volume-title":"Dimosthenis Karatzas, and CV Jawahar.","author":"Gomez Lluis","year":"2017","unstructured":"Lluis Gomez, Yash Patel, Mar\u00e7al Rusi\u00f1ol, Dimosthenis Karatzas, and CV Jawahar. 2017. Self-supervised learning of visual features through embedding images into text topic spaces. In CVPR."},{"key":"e_1_3_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-013-0658-4"},{"key":"e_1_3_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1162\/0899766042321814"},{"key":"e_1_3_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/2806416.2806469"},{"key":"e_1_3_2_1_13_1","unstructured":"Philipp Kr\u00e4henb\u00fchl Carl Doersch Jeff Donahue and Trevor Darrell. 2015. Data-dependent initializations of convolutional neural networks. In ICLR."},{"key":"e_1_3_2_1_14_1","unstructured":"Alex Krizhevsky Ilya Sutskever and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NIPS."},{"key":"e_1_3_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/957013.957143"},{"key":"e_1_3_2_1_16_1","unstructured":"Tsung-Yi Lin Michael Maire Serge Belongie James Hays Pietro Perona Deva Ramanan Piotr Doll\u00e1r and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV.
"},{"key":"e_1_3_2_1_17_1","unstructured":"Jiquan Ngiam Aditya Khosla Mingyu Kim Juhan Nam Honglak Lee and Andrew Y Ng. 2011. Multimodal deep learning. In ICML."},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"crossref","unstructured":"Andrew Owens Jiajun Wu Josh H McDermott William T Freeman and Antonio Torralba. 2016. Ambient sound provides supervision for visual learning. In ECCV.","DOI":"10.1007\/978-3-319-46448-0_48"},{"key":"e_1_3_2_1_19_1","volume-title":"Dimosthenis Karatzas, and CV Jawahar.","author":"Patel Yash","year":"2018","unstructured":"Yash Patel, Lluis Gomez, Raul Gomez, Mar\u00e7al Rusi\u00f1ol, Dimosthenis Karatzas, and CV Jawahar. 2018. TextTopicNet-Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces. arXiv preprint arXiv:1807.02110 (2018)."},{"key":"e_1_3_2_1_20_1","doi-asserted-by":"crossref","unstructured":"Yash Patel Lluis Gomez Mar\u00e7al Rusi\u00f1ol and Dimosthenis Karatzas. 2016. Dynamic Lexicon Generation for Natural Scene Images. 
In ECCV.","DOI":"10.1007\/978-3-319-46604-0_29"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"crossref","unstructured":"Deepak Pathak Philipp Kr\u00e4henb\u00fchl Jeff Donahue Trevor Darrell and Alexei A Efros. 2016. Context encoders: Feature learning by inpainting. In CVPR.","DOI":"10.1109\/CVPR.2016.278"},{"key":"e_1_3_2_1_22_1","unstructured":"Yuxin Peng Xin Huang and Jinwei Qi. 2016. Cross-Media Shared Representation by Hierarchical Learning with Multiple Deep Networks. In IJCAI."},{"key":"e_1_3_2_1_23_1","volume-title":"Emanuele Coviello, Gabriel Doyle, Gert RG Lanckriet, Roger Levy, and Nuno Vasconcelos.","author":"Rasiwasia Nikhil","year":"2010","unstructured":"Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert RG Lanckriet, Roger Levy, and Nuno Vasconcelos. 2010. A new approach to cross-modal multimedia retrieval. In ACM-MM."},{"key":"e_1_3_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1007\/11752790_2"},{"key":"e_1_3_2_1_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-015-0816-y"},{"key":"e_1_3_2_1_26_1","doi-asserted-by":"crossref","unstructured":"Abhishek Sharma Abhishek Kumar Hal Daume and David W Jacobs. 2012. Generalized multiview analysis: A discriminative latent space. 
In CVPR.","DOI":"10.1109\/CVPR.2012.6247923"},{"key":"e_1_3_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2016.2612883"},{"key":"e_1_3_2_1_28_1","volume-title":"Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556","author":"Simonyan Karen","year":"2014","unstructured":"Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1145\/2463676.2465274"},{"key":"e_1_3_2_1_30_1","unstructured":"Nitish Srivastava and Ruslan R Salakhutdinov. 2012. Multimodal learning with deep boltzmann machines. In NIPS."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3123266.3123326"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2016.06.001"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2015.2505311"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2013.261"},{"key":"e_1_3_2_1_35_1","volume-title":"A Comprehensive Survey on Cross-modal Retrieval. CoRR","author":"Wang Kaiye","year":"2016","unstructured":"Kaiye Wang, Qiyue Yin, Wei Wang, Shu Wu, and Liang Wang. 2016b. A Comprehensive Survey on Cross-modal Retrieval. CoRR (2016)."},{"key":"e_1_3_2_1_36_1","doi-asserted-by":"crossref","unstructured":"Xiaolong Wang and Abhinav Gupta. 2015. 
Unsupervised learning of visual representations using videos. In CVPR.","DOI":"10.1109\/ICCV.2015.320"},{"key":"e_1_3_2_1_37_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIP.2017.2676345"},{"key":"e_1_3_2_1_38_1","doi-asserted-by":"crossref","unstructured":"Fei Yan and Krystian Mikolajczyk. 2015. Deep correlation for matching images and text. In CVPR.","DOI":"10.1109\/CVPR.2015.7298966"},{"key":"e_1_3_2_1_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2013.2276704"},{"key":"e_1_3_2_1_40_1","unstructured":"Junbo Zhao Michael Mathieu Ross Goroshin and Yann Lecun. 2016. Stacked what-where auto-encoders. In ICLR."},{"key":"e_1_3_2_1_41_1","unstructured":"Bolei Zhou Agata Lapedriza Jianxiong Xiao Antonio Torralba and Aude Oliva. 2014. Learning deep features for scene recognition using places database. 
In NIPS."},{"key":"e_1_3_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/2502081.2502107"}],"event":{"name":"ICMR '19: International Conference on Multimedia Retrieval","location":"Ottawa ON Canada","acronym":"ICMR '19","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 2019 on International Conference on Multimedia Retrieval"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3323873.3325035","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3323873.3325035","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T01:02:22Z","timestamp":1750208542000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3323873.3325035"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2019,6,5]]},"references-count":42,"alternative-id":["10.1145\/3323873.3325035","10.1145\/3323873"],"URL":"https:\/\/doi.org\/10.1145\/3323873.3325035","relation":{},"subject":[],"published":{"date-parts":[[2019,6,5]]},"assertion":[{"value":"2019-06-05","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}