{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,22]],"date-time":"2026-04-22T19:33:34Z","timestamp":1776886414403,"version":"3.51.2"},"reference-count":74,"publisher":"MIT Press - Journals","license":[{"start":{"date-parts":[[2022,5,5]],"date-time":"2022-05-05T00:00:00Z","timestamp":1651708800000},"content-version":"vor","delay-in-days":124,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,5,4]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image. While offering unmatched retrieval performance, such models: 1) are typically pretrained from scratch and thus less scalable, 2) suffer from huge retrieval latency and inefficiency issues, which makes them impractical in realistic applications. To address these crucial gaps towards both improved and efficient cross-modal retrieval, we propose a novel fine-tuning framework that turns any pretrained text-image multi-modal model into an efficient retrieval model. The framework is based on a cooperative retrieve-and-rerank approach that combines: 1) twin networks (i.e., a bi-encoder) to separately encode all items of a corpus, enabling efficient initial retrieval, and 2) a cross-encoder component for a more nuanced (i.e., smarter) ranking of the retrieved small set of items. We also propose to jointly fine-tune the two components with shared weights, yielding a more parameter-efficient model. 
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups, demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.<\/jats:p>","DOI":"10.1162\/tacl_a_00473","type":"journal-article","created":{"date-parts":[[2022,5,5]],"date-time":"2022-05-05T19:12:01Z","timestamp":1651777921000},"page":"503-521","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":28,"title":["Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval"],"prefix":"10.1162","volume":"10","author":[{"given":"Gregor","family":"Geigle","sequence":"first","affiliation":[{"name":"Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Jonas","family":"Pfeiffer","sequence":"additional","affiliation":[{"name":"Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Nils","family":"Reimers","sequence":"additional","affiliation":[{"name":"Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ivan","family":"Vuli\u0107","sequence":"additional","affiliation":[{"name":"Language Technology Lab, University of Cambridge, United Kingdom"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Iryna","family":"Gurevych","sequence":"additional","affiliation":[{"name":"Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt, Germany"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"281","published-online":{"date-parts":[[2022,5,4]]},"reference":[{"key":"2022050519113777900_bib1","doi-asserted-by":"publisher","first-page":"6077","DOI":"10.1109\/CVPR.2018.00636","article-title":"Bottom-up and 
top-down attention for image captioning and visual question answering","volume-title":"2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018","author":"Anderson","year":"2018"},{"issue":"1","key":"2022050519113777900_bib2","doi-asserted-by":"publisher","first-page":"117","DOI":"10.1145\/1327452.1327494","article-title":"Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions","volume":"51","author":"Andoni","year":"2008","journal-title":"2006 47th annual IEEE symposium on foundations of computer science (FOCS\u201906)"},{"issue":"6","key":"2022050519113777900_bib3","doi-asserted-by":"publisher","first-page":"891","DOI":"10.1145\/293347.293348","article-title":"An optimal algorithm for approximate nearest neighbor searching fixed dimensions","volume":"45","author":"Arya","year":"1998","journal-title":"Journal of the ACM (JACM)"},{"key":"2022050519113777900_bib4","doi-asserted-by":"publisher","first-page":"304","DOI":"10.18653\/v1\/W18-6402","article-title":"Findings of the third shared task on multimodal machine translation","volume-title":"Proceedings of the Third Conference on Machine Translation: Shared Task Papers","author":"Barrault","year":"2018"},{"key":"2022050519113777900_bib5","doi-asserted-by":"publisher","first-page":"1977","DOI":"10.1145\/3397271.3401194","article-title":"MarkedBERT: Integrating traditional IR cues in pre-trained language models for passage retrieval","volume-title":"Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020","author":"Boualili","year":"2020"},{"key":"2022050519113777900_bib6","doi-asserted-by":"publisher","first-page":"978","DOI":"10.1162\/tacl_a_00408","article-title":"Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language 
BERTs","volume":"9","author":"Bugliarello","year":"2021","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2022050519113777900_bib7","doi-asserted-by":"publisher","first-page":"197","DOI":"10.1007\/978-3-030-58548-8_12","article-title":"Learning to scale multilingual representations for vision-language tasks","volume-title":"Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IV","author":"Burns","year":"2020"},{"key":"2022050519113777900_bib8","doi-asserted-by":"publisher","first-page":"1109","DOI":"10.1007\/978-3-642-02172-5_2","article-title":"Large scale online learning of image similarity through ranking","volume":"11","author":"Chechik","year":"2010","journal-title":"Journal of Machine Learning Research"},{"key":"2022050519113777900_bib9","doi-asserted-by":"publisher","first-page":"104","DOI":"10.1007\/978-3-030-58577-8_7","article-title":"UNITER: Universal image-text representation learning","volume-title":"Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX","author":"Chen","year":"2020"},{"key":"2022050519113777900_bib10","doi-asserted-by":"publisher","first-page":"250","DOI":"10.18653\/v1\/W19-4330","article-title":"Learning cross-lingual sentence representations via a multi-task dual-encoder model","volume-title":"Proceedings of the 4th Workshop on Representation Learning for NLP, RepL4NLP@ACL 2019, Florence, Italy, August 2, 2019","author":"Chidambaram","year":"2019"},{"key":"2022050519113777900_bib11","doi-asserted-by":"publisher","first-page":"8440","DOI":"10.18653\/v1\/2020.acl-main.747","article-title":"Unsupervised cross-lingual representation learning at scale","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 
2020","author":"Conneau","year":"2020"},{"key":"2022050519113777900_bib12","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2022050519113777900_bib13","doi-asserted-by":"publisher","first-page":"215","DOI":"10.18653\/v1\/W17-4718","article-title":"Findings of the second shared task on multimodal machine translation and multilingual image description","volume-title":"Proceedings of the Second Conference on Machine Translation","author":"Elliott","year":"2017"},{"key":"2022050519113777900_bib14","doi-asserted-by":"publisher","first-page":"70","DOI":"10.18653\/v1\/W16-3210","article-title":"Multi30K: Multilingual English-German image descriptions","volume-title":"Proceedings of the 5th Workshop on Vision and Language","author":"Elliott","year":"2016"},{"key":"2022050519113777900_bib15","first-page":"12","article-title":"VSE++: Improving visual-semantic embeddings with hard negatives","volume-title":"British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018","author":"Faghri","year":"2018"},{"key":"2022050519113777900_bib16","article-title":"Language-agnostic BERT sentence embedding","author":"Feng","year":"2020","journal-title":"arXiv preprint arXiv:2007.01852"},{"key":"2022050519113777900_bib17","first-page":"2121","article-title":"DeViSE: A deep visual-semantic embedding model","volume-title":"Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. 
Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States","author":"Frome","year":"2013"},{"key":"2022050519113777900_bib18","article-title":"Large-scale adversarial training for vision-and-language representation learning","volume-title":"Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual","author":"Gan","year":"2020"},{"key":"2022050519113777900_bib19","doi-asserted-by":"publisher","first-page":"2839","DOI":"10.18653\/v1\/D17-1303","article-title":"Image pivoting for learning multilingual multimodal representations","volume-title":"Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP","author":"Gella","year":"2017"},{"key":"2022050519113777900_bib20","doi-asserted-by":"publisher","first-page":"165","DOI":"10.18653\/v1\/W18-6317","article-title":"Effective parallel corpus mining using bilingual sentence embeddings","volume-title":"Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018","author":"Guo","year":"2018"},{"key":"2022050519113777900_bib21","first-page":"1312","article-title":"Fast approximate nearest-neighbor search with k-nearest neighbor graph","volume-title":"Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011","author":"Hajebi","year":"2011"},{"key":"2022050519113777900_bib22","doi-asserted-by":"publisher","first-page":"2161","DOI":"10.18653\/v1\/2020.findings-emnlp.196","article-title":"ConveRT: Efficient and accurate conversational representations from transformers","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, EMNLP 2020, Online Event, 16-20 November 
2020","author":"Henderson","year":"2020"},{"key":"2022050519113777900_bib23","doi-asserted-by":"publisher","first-page":"5392","DOI":"10.18653\/v1\/P19-1536","article-title":"Training neural response selection for task-oriented dialogue systems","volume-title":"Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers","author":"Henderson","year":"2019"},{"key":"2022050519113777900_bib24","article-title":"In defense of the triplet loss for person re-identification","author":"Hermans","year":"2017","journal-title":"arXiv preprint arXiv:1703.07737"},{"key":"2022050519113777900_bib25","article-title":"Improving efficient neural ranking models with cross-architecture knowledge distillation","author":"Hofst\u00e4tter","year":"2020","journal-title":"arXiv preprint arXiv:2010.02666"},{"key":"2022050519113777900_bib26","article-title":"Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring","volume-title":"8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020","author":"Humeau","year":"2020"},{"key":"2022050519113777900_bib27","article-title":"Distilling knowledge from reader to retriever for question answering","volume-title":"9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021","author":"Izacard","year":"2021"},{"key":"2022050519113777900_bib28","first-page":"4904","article-title":"Scaling up visual and vision-language representation learning with noisy text supervision","volume-title":"Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event","author":"Jia","year":"2021"},{"issue":"3","key":"2022050519113777900_bib29","doi-asserted-by":"publisher","first-page":"535","DOI":"10.1109\/TBDATA.2019.2921572","article-title":"Billion-scale similarity search with 
gpus","volume":"7","author":"Johnson","year":"2021","journal-title":"IEEE Transactions on Big Data"},{"key":"2022050519113777900_bib30","doi-asserted-by":"publisher","first-page":"402","DOI":"10.18653\/v1\/K18-1039","article-title":"Lessons learned in multilingual grounded language learning","volume-title":"Proceedings of the 22nd Conference on Computational Natural Language Learning, CoNLL 2018, Brussels, Belgium, October 31 - November 1, 2018","author":"K\u00e1d\u00e1r","year":"2018"},{"key":"2022050519113777900_bib31","doi-asserted-by":"publisher","first-page":"6769","DOI":"10.18653\/v1\/2020.emnlp-main.550","article-title":"Dense passage retrieval for open-domain question answering","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP","author":"Karpukhin","year":"2020"},{"key":"2022050519113777900_bib32","doi-asserted-by":"publisher","first-page":"11254","DOI":"10.1609\/aaai.v34i07.6785","article-title":"MULE: Multimodal universal language embedding","volume-title":"The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020","author":"Kim","year":"2020"},{"issue":"2","key":"2022050519113777900_bib33","doi-asserted-by":"publisher","first-page":"457","DOI":"10.1137\/S0097539798347177","article-title":"Efficient search for approximate nearest neighbor in high dimensional spaces","volume":"30","author":"Kushilevitz","year":"2000","journal-title":"SIAM Journal on Computing"},{"key":"2022050519113777900_bib34","first-page":"6086","article-title":"Latent retrieval for weakly supervised open domain question answering","volume-title":"Proceedings of the 57th Conference of the Association for Computational Linguistics, 
ACL","author":"Lee","year":"2019"},{"key":"2022050519113777900_bib35","doi-asserted-by":"publisher","first-page":"212","DOI":"10.1007\/978-3-030-01225-0_13","article-title":"Stacked cross attention for image-text matching","volume-title":"Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IV","author":"Lee","year":"2018"},{"key":"2022050519113777900_bib36","doi-asserted-by":"publisher","first-page":"11336","DOI":"10.1609\/aaai.v34i07.6795","article-title":"Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training","volume-title":"The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020","author":"Li","year":"2020"},{"key":"2022050519113777900_bib37","article-title":"Align before fuse: Vision and language representation learning with momentum distillation","author":"Li","year":"2021","journal-title":"arXiv preprint arXiv:2107.07651"},{"key":"2022050519113777900_bib38","doi-asserted-by":"publisher","first-page":"121","DOI":"10.1007\/978-3-030-58577-8_8","article-title":"Oscar: Object-semantics aligned pre-training for vision-language tasks","volume-title":"Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX","author":"Li","year":"2020"},{"key":"2022050519113777900_bib39","doi-asserted-by":"publisher","first-page":"740","DOI":"10.1007\/978-3-319-10602-1_48","article-title":"Microsoft COCO: Common objects in context","volume-title":"Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part 
V","author":"Lin","year":"2014"},{"key":"2022050519113777900_bib40","doi-asserted-by":"publisher","first-page":"342","DOI":"10.1007\/978-3-030-72113-8_23","article-title":"Evaluating multilingual text encoders for unsupervised cross-lingual retrieval","volume-title":"Advances in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Proceedings, Part I","author":"Litschko","year":"2021"},{"key":"2022050519113777900_bib41","first-page":"825","article-title":"An investigation of practical approximate nearest neighbor algorithms","volume-title":"Advances in Neural Information Processing Systems 17 [Neural Information Processing Systems, NIPS 2004, December 13-18, 2004, Vancouver, British Columbia, Canada]","author":"Liu","year":"2004"},{"key":"2022050519113777900_bib42","article-title":"Decoupled weight decay regularization","volume-title":"7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019","author":"Loshchilov","year":"2019"},{"key":"2022050519113777900_bib43","first-page":"13","article-title":"VilBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks","volume-title":"Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada","author":"Jiasen","year":"2019"},{"key":"2022050519113777900_bib44","first-page":"5020","article-title":"VisualSparta: An Embarrassingly Simple Approach to Large-scale Text-to-Image Search with Weighted Bag-of-words","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL\/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 
2021","author":"Xiaopeng","year":"2021"},{"key":"2022050519113777900_bib45","doi-asserted-by":"publisher","first-page":"4561","DOI":"10.1109\/ICCVW.2019.00557","article-title":"Joint Wasserstein autoencoders for aligning multimodal embeddings","volume-title":"2019 IEEE\/CVF International Conference on Computer Vision Workshops, ICCV Workshops 2019, Seoul, Korea (South), October 27-28, 2019","author":"Mahajan","year":"2019"},{"key":"2022050519113777900_bib46","first-page":"2156","article-title":"Dual attention networks for multimodal reasoning and matching","volume-title":"2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017","author":"Nam","year":"2017"},{"key":"2022050519113777900_bib47","first-page":"3977","article-title":"M3P: Learning universal representations via multitask multilingual multimodal pre-training","volume-title":"IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021","author":"Ni","year":"2021"},{"key":"2022050519113777900_bib48","article-title":"Passage re-ranking with BERT","author":"Nogueira","year":"2019","journal-title":"arXiv preprint arXiv:1901.04085"},{"key":"2022050519113777900_bib49","article-title":"Multi-stage document ranking with BERT","author":"Nogueira","year":"2019","journal-title":"arXiv preprint arXiv:1910.14424"},{"key":"2022050519113777900_bib50","first-page":"1143","article-title":"Im2Text: Describing images using 1 million captioned photographs","volume-title":"Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. 
Proceedings of a meeting held 12-14 December 2011, Granada, Spain","author":"Ordonez","year":"2011"},{"key":"2022050519113777900_bib51","doi-asserted-by":"publisher","first-page":"2641","DOI":"10.1109\/ICCV.2015.303","article-title":"Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models","volume-title":"2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015","author":"Plummer","year":"2015"},{"key":"2022050519113777900_bib52","first-page":"5835","article-title":"RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT","author":"Yingqi","year":"2021"},{"key":"2022050519113777900_bib53","first-page":"8748","article-title":"Learning transferable visual models from natural language supervision","volume-title":"Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event","author":"Radford","year":"2021"},{"key":"2022050519113777900_bib54","doi-asserted-by":"publisher","first-page":"3980","DOI":"10.18653\/v1\/D19-1410","article-title":"Sentence-BERT: Sentence embeddings using siamese bert-networks","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019","author":"Reimers","year":"2019"},{"key":"2022050519113777900_bib55","first-page":"91","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume-title":"Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, 
Canada","author":"Ren","year":"2015"},{"key":"2022050519113777900_bib56","first-page":"6648","article-title":"End-to-end training of neural retrievers for open-domain question answering","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL\/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021","author":"Sachan","year":"2021"},{"key":"2022050519113777900_bib57","article-title":"End-to-end training of multi-document reader and retriever for open-domain question answering","author":"Sachan","year":"2021","journal-title":"arXiv preprint arXiv:2106.05346"},{"key":"2022050519113777900_bib58","doi-asserted-by":"publisher","first-page":"2556","DOI":"10.18653\/v1\/P18-1238","article-title":"Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning","volume-title":"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Sharma","year":"2018"},{"key":"2022050519113777900_bib59","first-page":"5182","article-title":"Knowledge aware semantic concept expansion for image-text matching","volume-title":"Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019","author":"Shi","year":"2019"},{"issue":"1","key":"2022050519113777900_bib60","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1145\/363707.363732","article-title":"Answering English questions by computer: A survey","volume":"8","author":"Simmons","year":"1965","journal-title":"Communications of the ACM"},{"key":"2022050519113777900_bib61","article-title":"VL-BERT: Pre-training of generic visual-linguistic representations","volume-title":"8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 
2020","author":"Weijie","year":"2020"},{"key":"2022050519113777900_bib62","doi-asserted-by":"publisher","first-page":"5099","DOI":"10.18653\/v1\/D19-1514","article-title":"LXMERT: Learning cross-modality encoder representations from transformers","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019","author":"Tan","year":"2019"},{"key":"2022050519113777900_bib63","first-page":"5998","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA","author":"Vaswani","year":"2017"},{"issue":"2","key":"2022050519113777900_bib64","doi-asserted-by":"publisher","first-page":"394","DOI":"10.1109\/TPAMI.2018.2797921","article-title":"Learning two-branch neural networks for image-text matching tasks","volume":"41","author":"Wang","year":"2019","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"2022050519113777900_bib65","doi-asserted-by":"publisher","first-page":"3792","DOI":"10.24963\/ijcai.2019\/526","article-title":"Position focused attention network for image-text matching","volume-title":"Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019","author":"Wang","year":"2019"},{"key":"2022050519113777900_bib66","doi-asserted-by":"publisher","first-page":"5803","DOI":"10.1109\/ICCV.2019.00590","article-title":"Language-agnostic visual-semantic embeddings","volume-title":"2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 
2019","author":"Wehrmann","year":"2019"},{"key":"2022050519113777900_bib67","doi-asserted-by":"publisher","first-page":"4602","DOI":"10.18653\/v1\/P19-1453","article-title":"Simple and effective paraphrastic similarity from parallel translations","volume-title":"Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers","author":"Wieting","year":"2019"},{"key":"2022050519113777900_bib68","article-title":"Approximate nearest neighbor negative contrastive learning for dense text retrieval","volume-title":"9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021","author":"Xiong","year":"2021"},{"key":"2022050519113777900_bib69","first-page":"4555","article-title":"A unified pretraining framework for passage ranking and expansion","volume-title":"Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021","author":"Yan","year":"2021"},{"key":"2022050519113777900_bib70","article-title":"Is retriever merely an approximator of reader?","author":"Yang","year":"2020","journal-title":"arXiv preprint arXiv:2010.10999"},{"key":"2022050519113777900_bib71","doi-asserted-by":"publisher","first-page":"72","DOI":"10.18653\/v1\/N19-4013","article-title":"End-to-end open-domain question answering with bertserini","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT","author":"Yang","year":"2019"},{"key":"2022050519113777900_bib72","doi-asserted-by":"publisher","first-page":"164","DOI":"10.18653\/v1\/W18-3022","article-title":"Learning semantic textual similarity from 
conversations","volume-title":"Proceedings of The Third Workshop on Representation Learning for NLP, Rep4NLP@ACL 2018, Melbourne, Australia, July 20, 2018","author":"Yang","year":"2018"},{"key":"2022050519113777900_bib73","doi-asserted-by":"publisher","first-page":"2666","DOI":"10.1145\/3404835.3462812","article-title":"Pretrained transformers for text ranking: BERT and beyond","volume-title":"SIGIR \u201921: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021","author":"Yates","year":"2021"},{"issue":"2","key":"2022050519113777900_bib74","doi-asserted-by":"publisher","first-page":"51:1","DOI":"10.1145\/3383184","article-title":"Dual-path convolutional image-text embeddings with instance loss","volume":"16","author":"Zheng","year":"2020","journal-title":"ACM Transactions On Multimedia Computing, Communications And Applications"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00473\/2020706\/tacl_a_00473.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00473\/2020706\/tacl_a_00473.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,5,5]],"date-time":"2022-05-05T19:12:52Z","timestamp":1651777972000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00473\/110994\/Retrieve-Fast-Rerank-Smart-Cooperative-and-Joint"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022]]},"references-count":74,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00473","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"dat
e-parts":[[2022]]},"published":{"date-parts":[[2022]]}}}