{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T04:12:51Z","timestamp":1750219971101,"version":"3.41.0"},"publisher-location":"New York, NY, USA","reference-count":43,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3548757","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:35Z","timestamp":1665416555000},"page":"6925-6929","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Multi-modal Learning Algorithms and Network Architectures for Information Extraction and Retrieval"],"prefix":"10.1145","author":[{"given":"Maurits","family":"Bleeker","sequence":"first","affiliation":[{"name":"University of Amsterdam, Amsterdam, Netherlands"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_1_1_1","volume-title":"Exploring the limits of large scale pre-training. arXiv preprint arXiv:2110.02095","author":"Abnar Samira","year":"2021","unstructured":"Samira Abnar , Mostafa Dehghani , Behnam Neyshabur , and Hanie Sedghi . 2021. Exploring the limits of large scale pre-training. arXiv preprint arXiv:2110.02095 ( 2021 ). Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, and Hanie Sedghi. 2021. Exploring the limits of large scale pre-training. 
arXiv preprint arXiv:2110.02095 (2021)."},{"key":"e_1_3_2_1_2_1","unstructured":"Jean-Baptiste Alayrac Jeff Donahue Pauline Luc Antoine Miech Iain Barr Yana Hasson Karel Lenc Arthur Mensch Katie Millican Malcolm Reynolds etal 2022. Flamingo: a Visual Language Model for Few-Shot Learning. arXiv preprint arXiv:2204.14198 (2022).  Jean-Baptiste Alayrac Jeff Donahue Pauline Luc Antoine Miech Iain Barr Yana Hasson Karel Lenc Arthur Mensch Katie Millican Malcolm Reynolds et al. 2022. Flamingo: a Visual Language Model for Few-Shot Learning. arXiv preprint arXiv:2204.14198 (2022)."},{"key":"e_1_3_2_1_3_1","volume-title":"Multimodal machine learning: A survey and taxonomy","author":"Baltruaitis Tadas","year":"2018","unstructured":"Tadas Baltruaitis , Chaitanya Ahuja , and Louis-Philippe Morency . 2018. Multimodal machine learning: A survey and taxonomy . IEEE transactions on pattern analysis and machine intelligence 41, 2 ( 2018 ), 423--443. Tadas Baltruaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence 41, 2 (2018), 423--443."},{"key":"e_1_3_2_1_4_1","volume-title":"ECAI 2020: 24th European Conference on Artificial Intelligence. IOS Press, 2664--2672","author":"Bleeker Maurits","year":"2020","unstructured":"Maurits Bleeker and Maarten de Rijke . 2020 . Bidirectional Scene Text Recognition with a Single Decoder . In ECAI 2020: 24th European Conference on Artificial Intelligence. IOS Press, 2664--2672 . Maurits Bleeker and Maarten de Rijke. 2020. Bidirectional Scene Text Recognition with a Single Decoder. In ECAI 2020: 24th European Conference on Artificial Intelligence. IOS Press, 2664--2672."},{"key":"e_1_3_2_1_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-99736-6_36"},{"key":"e_1_3_2_1_6_1","volume-title":"Keep the Caption Information: Preventing Shortcut Learning in Contrastive Image-Caption Retrieval. 
arXiv preprint arXiv:2204.13382","author":"Bleeker Maurits","year":"2022","unstructured":"Maurits Bleeker , Andrew Yates , and Maarten de Rijke . 2022. Keep the Caption Information: Preventing Shortcut Learning in Contrastive Image-Caption Retrieval. arXiv preprint arXiv:2204.13382 ( 2022 ). Maurits Bleeker, Andrew Yates, and Maarten de Rijke. 2022. Keep the Caption Information: Preventing Shortcut Learning in Contrastive Image-Caption Retrieval. arXiv preprint arXiv:2204.13382 (2022)."},{"key":"e_1_3_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58545-7_39"},{"key":"e_1_3_2_1_8_1","unstructured":"Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell etal 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901.  Tom Brown Benjamin Mann Nick Ryder Melanie Subbiah Jared D Kaplan Prafulla Dhariwal Arvind Neelakantan Pranav Shyam Girish Sastry Amanda Askell et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020) 1877--1901."},{"key":"e_1_3_2_1_9_1","volume-title":"IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020","author":"Chen Hui","year":"2020","unstructured":"Hui Chen , Guiguang Ding , Xudong Liu , Zijia Lin , Ji Liu , and Jungong Han . 2020 . IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020 , Seattle, WA, USA , June 13-19, 2020. Computer Vision Foundation \/ IEEE, 12652--12660. https:\/\/doi.org\/10.1109\/CVPR42600.2020.01267 Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, and Jungong Han. 2020. 
IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation \/ IEEE, 12652--12660. https:\/\/doi.org\/10.1109\/CVPR42600.2020.01267"},{"key":"e_1_3_2_1_10_1","volume-title":"Adaptive Offline Quintuplet Loss for Image-Text Matching. In European Conference on Computer Vision (ECCV). Springer, 549--565","author":"Chen Tianlang","year":"2020","unstructured":"Tianlang Chen , Jiajun Deng , and Jiebo Luo . 2020 . Adaptive Offline Quintuplet Loss for Image-Text Matching. In European Conference on Computer Vision (ECCV). Springer, 549--565 . Tianlang Chen, Jiajun Deng, and Jiebo Luo. 2020. Adaptive Offline Quintuplet Loss for Image-Text Matching. In European Conference on Computer Vision (ECCV). Springer, 549--565."},{"key":"e_1_3_2_1_11_1","volume-title":"UK","volume":"565","author":"Chen Tianlang","year":"2020","unstructured":"Tianlang Chen , Jiajun Deng , and Jiebo Luo . 2020 . Adaptive Offline Quintuplet Loss for Image-Text Matching. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow , UK , August 23-28, 2020, Proceedings, Part XIII (Lecture Notes in Computer Science , Vol. 12358), Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer, 549-- 565 . https:\/\/doi.org\/10.1007\/978-3-030-58601-0_33 Tianlang Chen, Jiajun Deng, and Jiebo Luo. 2020. Adaptive Offline Quintuplet Loss for Image-Text Matching. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XIII (Lecture Notes in Computer Science, Vol. 12358), Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer, 549--565. https:\/\/doi.org\/10.1007\/978-3-030-58601-0_33"},{"key":"e_1_3_2_1_12_1","volume-title":"International conference on machine learning. 
PMLR, 1597--1607","author":"Chen Ting","year":"2020","unstructured":"Ting Chen , Simon Kornblith , Mohammad Norouzi , and Geoffrey Hinton . 2020 . A simple framework for contrastive learning of visual representations . In International conference on machine learning. PMLR, 1597--1607 . Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597--1607."},{"key":"e_1_3_2_1_13_1","volume-title":"Intriguing Properties of Contrastive Losses. arXiv preprint arXiv:2011.02803","author":"Chen Ting","year":"2020","unstructured":"Ting Chen and Lala Li. 2020. Intriguing Properties of Contrastive Losses. arXiv preprint arXiv:2011.02803 ( 2020 ). Ting Chen and Lala Li. 2020. Intriguing Properties of Contrastive Losses. arXiv preprint arXiv:2011.02803 (2020)."},{"key":"e_1_3_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_3_2_1_15_1","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805","author":"Devlin Jacob","year":"2018","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2018 . Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)."},{"key":"e_1_3_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i2.16209"},{"key":"e_1_3_2_1_17_1","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly etal 2020. An image is worth 16x16 words: Transformers for image recognition at scale. 
arXiv preprint arXiv:2010.11929 (2020).  Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)."},{"key":"e_1_3_2_1_18_1","volume-title":"Proceedings of the British Machine Vision Conference (BMVC).","author":"Faghri Fartash","year":"2018","unstructured":"Fartash Faghri , David J Fleet , Jamie Ryan Kiros , and Sanja Fidler . 2018 . VSE: Improving Visual-Semantic Embeddings with Hard Negatives . In Proceedings of the British Machine Vision Conference (BMVC). Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2018. VSE: Improving Visual-Semantic Embeddings with Hard Negatives. In Proceedings of the British Machine Vision Conference (BMVC)."},{"key":"e_1_3_2_1_19_1","volume-title":"Scaling up visual and vision language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918","author":"Jia Chao","year":"2021","unstructured":"Chao Jia , Yinfei Yang , Ye Xia , Yi-Ting Chen , Zarana Parekh , Hieu Pham , Quoc V Le , Yunhsuan Sung , Zhen Li , and Tom Duerig . 2021. Scaling up visual and vision language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918 ( 2021 ). Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918 (2021)."},{"key":"e_1_3_2_1_20_1","volume-title":"Munich","volume":"228","author":"Lee Kuang-Huei","year":"2018","unstructured":"Kuang-Huei Lee , Xi Chen , Gang Hua , Houdong Hu , and Xiaodong He . 2018 . Stacked Cross Attention for Image-Text Matching. 
In Computer Vision - ECCV 2018 - 15th European Conference , Munich , Germany, September 8-14, 2018, Proceedings, Part IV (Lecture Notes in Computer Science , Vol. 11208), Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (Eds.). Springer, 212-- 228 . https:\/\/doi.org\/10.1007\/978-3-030-01225-0_13 Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked Cross Attention for Image-Text Matching. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IV (Lecture Notes in Computer Science, Vol. 11208), Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (Eds.). Springer, 212--228. https:\/\/doi.org\/10.1007\/978-3-030-01225-0_13"},{"key":"e_1_3_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00475"},{"key":"e_1_3_2_1_22_1","volume-title":"Making contrastive learning robust to shortcuts. arXiv preprint arXiv:2012.09962","author":"Li Tianhong","year":"2020","unstructured":"Tianhong Li , Lijie Fan , Yuan Yuan , Hao He , Yonglong Tian , Rogerio Feris , Piotr Indyk , and Dina Katabi . 2020. Making contrastive learning robust to shortcuts. arXiv preprint arXiv:2012.09962 ( 2020 ). Tianhong Li, Lijie Fan, Yuan Yuan, Hao He, Yonglong Tian, Rogerio Feris, Piotr Indyk, and Dina Katabi. 2020. Making contrastive learning robust to shortcuts. arXiv preprint arXiv:2012.09962 (2020)."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-10602-1_48"},{"key":"e_1_3_2_1_24_1","volume-title":"Graph Structured Network for Image-Text Matching. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020","author":"Liu Chunxiao","year":"2020","unstructured":"Chunxiao Liu , Zhendong Mao , Tianzhu Zhang , Hongtao Xie , Bin Wang , and Yongdong Zhang . 2020 . Graph Structured Network for Image-Text Matching. 
In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020 , Seattle, WA, USA , June 13-19, 2020. Computer Vision Foundation \/ IEEE, 10918--10927. https:\/\/doi.org\/10.1109\/CVPR42600.2020.01093 Chunxiao Liu, Zhendong Mao, Tianzhu Zhang, Hongtao Xie, Bin Wang, and Yongdong Zhang. 2020. Graph Structured Network for Image-Text Matching. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation \/ IEEE, 10918--10927. https:\/\/doi.org\/10.1109\/CVPR42600.2020.01093"},{"key":"e_1_3_2_1_25_1","volume-title":"Less is More: Pre-training a Strong Siamese Encoder Using a Weak Decoder. CoRR abs\/2102.09206","author":"Lu Shuqi","year":"2021","unstructured":"Shuqi Lu , Chenyan Xiong , Di He , Guolin Ke , Waleed Malik , Zhicheng Dou , Paul Bennett , Tie-Yan Liu , and Arnold Overwijk . 2021. Less is More: Pre-training a Strong Siamese Encoder Using a Weak Decoder. CoRR abs\/2102.09206 ( 2021 ). arXiv:2102.09206 https:\/\/arxiv.org\/abs\/2102.09206 Shuqi Lu, Chenyan Xiong, Di He, Guolin Ke, Waleed Malik, Zhicheng Dou, Paul Bennett, Tie-Yan Liu, and Arnold Overwijk. 2021. Less is More: Pre-training a Strong Siamese Encoder Using a Weak Decoder. CoRR abs\/2102.09206 (2021). arXiv:2102.09206 https:\/\/arxiv.org\/abs\/2102.09206"},{"key":"e_1_3_2_1_26_1","volume-title":"Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders. CoRR abs\/2008.05231","author":"Messina Nicola","year":"2020","unstructured":"Nicola Messina , Giuseppe Amato , Andrea Esuli , Fabrizio Falchi , Claudio Gennaro , and St\u00e9phane Marchand-Maillet . 2020. Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders. CoRR abs\/2008.05231 ( 2020 ). arXiv:2008.05231 https:\/\/arxiv.org\/abs\/2008.05231 Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, and St\u00e9phane Marchand-Maillet. 2020. 
Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders. CoRR abs\/2008.05231 (2020). arXiv:2008.05231 https:\/\/arxiv.org\/abs\/2008.05231"},{"key":"e_1_3_2_1_27_1","volume-title":"Transformer Reasoning Network for Image-Text Matching and Retrieval. In 25th International Conference on Pattern Recognition, ICPR 2020","author":"Messina Nicola","year":"2020","unstructured":"Nicola Messina , Fabrizio Falchi , Andrea Esuli , and Giuseppe Amato . 2020 . Transformer Reasoning Network for Image-Text Matching and Retrieval. In 25th International Conference on Pattern Recognition, ICPR 2020 , Virtual Event \/ Milan, Italy, January 10--15 , 2021. IEEE, 5222--5229. https:\/\/doi.org\/10.1109\/ICPR48806.2021.9413172 Nicola Messina, Fabrizio Falchi, Andrea Esuli, and Giuseppe Amato. 2020. Transformer Reasoning Network for Image-Text Matching and Retrieval. In 25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event \/ Milan, Italy, January 10--15, 2021. IEEE, 5222--5229. https:\/\/doi.org\/10.1109\/ICPR48806.2021.9413172"},{"key":"e_1_3_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICPR48806.2021.9413172"},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58595-2_41"},{"key":"e_1_3_2_1_30_1","volume-title":"International Conference on Machine Learning. PMLR, 8748--8763","author":"Radford Alec","year":"2021","unstructured":"Alec Radford , Jong Wook Kim , Chris Hallacy , Aditya Ramesh , Gabriel Goh , Sandhini Agarwal , Girish Sastry , Amanda Askell , Pamela Mishkin , Jack Clark , 2021 . Learning transferable visual models from natural language supervision . In International Conference on Machine Learning. PMLR, 8748--8763 . Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. 
In International Conference on Machine Learning. PMLR, 8748--8763."},{"key":"e_1_3_2_1_31_1","volume-title":"Can contrastive learning avoid shortcut solutions? arXiv preprint arXiv:2106.11230","author":"Robinson Joshua","year":"2021","unstructured":"Joshua Robinson , Li Sun , Ke Yu , Kayhan Batmanghelich , Stefanie Jegelka , and Suvrit Sra . 2021. Can contrastive learning avoid shortcut solutions? arXiv preprint arXiv:2106.11230 ( 2021 ). Joshua Robinson, Li Sun, Ke Yu, Kayhan Batmanghelich, Stefanie Jegelka, and Suvrit Sra. 2021. Can contrastive learning avoid shortcut solutions? arXiv preprint arXiv:2106.11230 (2021)."},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7298682"},{"key":"e_1_3_2_1_33_1","volume-title":"arXiv preprint arXiv:1806.00926","author":"Sheng Fenfen","year":"2018","unstructured":"Fenfen Sheng , Zhineng Chen , and Bo Xu. 2018. NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition . arXiv preprint arXiv:1806.00926 ( 2018 ). Fenfen Sheng, Zhineng Chen, and Bo Xu. 2018. NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition. arXiv preprint arXiv:1806.00926 (2018)."},{"key":"e_1_3_2_1_34_1","unstructured":"Baoguang Shi et al. 2016. Robust scene text recognition with automatic rectification. In CVPR. 4168--4176.  Baoguang Shi et al. 2016. Robust scene text recognition with automatic rectification. In CVPR. 4168--4176."},{"key":"e_1_3_2_1_35_1","volume-title":"Aster: An attentional scene text recognizer with flexible rectification","author":"Baoguang Shi","year":"2018","unstructured":"Baoguang Shi et al. 2018 . Aster: An attentional scene text recognizer with flexible rectification . IEEE transactions on pattern analysis and machine intelligence (2018). Baoguang Shi et al. 2018. Aster: An attentional scene text recognizer with flexible rectification. 
IEEE transactions on pattern analysis and machine intelligence (2018)."},{"key":"e_1_3_2_1_36_1","volume-title":"Attention is all you need. Advances in neural information processing systems 30","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N Gomez , Lukasz Kaiser , and Illia Polosukhin . 2017. Attention is all you need. Advances in neural information processing systems 30 ( 2017 ). Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017)."},{"key":"e_1_3_2_1_37_1","volume-title":"Crossmodal Scene Graph Matching for Relationship-aware Image-Text Retrieval. In IEEE Winter Conference on Applications of Computer Vision, WACV 2020","author":"Wang Ruiping","year":"2020","unstructured":"Sijin Wang, Ruiping Wang , Ziwei Yao , Shiguang Shan , and Xilin Chen . 2020 . Crossmodal Scene Graph Matching for Relationship-aware Image-Text Retrieval. In IEEE Winter Conference on Applications of Computer Vision, WACV 2020 , Snowmass Village, CO, USA, March 1--5 , 2020. IEEE, 1497--1506. https:\/\/doi.org\/10.1109\/WACV45572.2020.9093614 Sijin Wang, Ruiping Wang, Ziwei Yao, Shiguang Shan, and Xilin Chen. 2020. Crossmodal Scene Graph Matching for Relationship-aware Image-Text Retrieval. In IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, March 1--5, 2020. IEEE, 1497--1506. https:\/\/doi.org\/10.1109\/WACV45572.2020.9093614"},{"key":"e_1_3_2_1_38_1","volume-title":"CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019","author":"Wang Zihao","year":"2019","unstructured":"Zihao Wang , Xihui Liu , Hongsheng Li , Lu Sheng , Junjie Yan , Xiaogang Wang , and Jing Shao . 2019 . 
CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019 , Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 5763--5772. https:\/\/doi.org\/10.1109\/ICCV.2019.00586 Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, and Jing Shao. 2019. CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 5763--5772. https:\/\/doi.org\/10.1109\/ICCV.2019.00586"},{"key":"e_1_3_2_1_39_1","volume-title":"Symmetry-constrained Rectification Network for Scene Text Recognition. arXiv preprint arXiv:1908.01957","author":"Yang MingKun","year":"2019","unstructured":"MingKun Yang , Yushuo Guan , Minghui Liao , Xin He , Kaigui Bian , Song Bai , Cong Yao , and Xiang Bai . 2019. Symmetry-constrained Rectification Network for Scene Text Recognition. arXiv preprint arXiv:1908.01957 ( 2019 ). MingKun Yang, Yushuo Guan, Minghui Liao, Xin He, Kaigui Bian, Song Bai, Cong Yao, and Xiang Bai. 2019. Symmetry-constrained Rectification Network for Scene Text Recognition. arXiv preprint arXiv:1908.01957 (2019)."},{"key":"e_1_3_2_1_40_1","doi-asserted-by":"crossref","unstructured":"Xiao Yang Dafang He Zihan Zhou Daniel Kifer and C Lee Giles. 2017. Learning to Read Irregular Text with Attention Mechanisms. In IJCAI. 3280--3286.  Xiao Yang Dafang He Zihan Zhou Daniel Kifer and C Lee Giles. 2017. Learning to Read Irregular Text with Attention Mechanisms. In IJCAI. 3280--3286.","DOI":"10.24963\/ijcai.2017\/458"},{"key":"e_1_3_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00166"},{"key":"e_1_3_2_1_42_1","volume-title":"Heterogeneous Attention Network for Effective and Efficient Cross-modal Retrieval. 
In SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Yu Tan","year":"2021","unstructured":"Tan Yu , Yi Yang , Yi Li , Lin Liu , Hongliang Fei , and Ping Li . 2021 . Heterogeneous Attention Network for Effective and Efficient Cross-modal Retrieval. In SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval , Virtual Event, Canada, July 11--15 , 2021, Fernando Diaz, Chirag Shah, Torsten Suel, Pablo Castells, Rosie Jones, and Tetsuya Sakai (Eds.). ACM, 1146--1156. https:\/\/doi.org\/10.1145\/3404835.3462924 Tan Yu, Yi Yang, Yi Li, Lin Liu, Hongliang Fei, and Ping Li. 2021. Heterogeneous Attention Network for Effective and Efficient Cross-modal Retrieval. In SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11--15, 2021, Fernando Diaz, Chirag Shah, Torsten Suel, Pablo Castells, Rosie Jones, and Tetsuya Sakai (Eds.). ACM, 1146--1156. https:\/\/doi.org\/10.1145\/3404835.3462924"},{"key":"e_1_3_2_1_43_1","volume-title":"Esir: End-to-end scene text recognition via iterative image rectification. arXiv preprint arXiv:1812.05824","author":"Zhan Fangneng","year":"2018","unstructured":"Fangneng Zhan and Shijian Lu . 2018 . Esir: End-to-end scene text recognition via iterative image rectification. arXiv preprint arXiv:1812.05824 (2018). Fangneng Zhan and Shijian Lu. 2018. Esir: End-to-end scene text recognition via iterative image rectification. 
arXiv preprint arXiv:1812.05824 (2018)."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","sponsor":["SIGMM ACM Special Interest Group on Multimedia"],"location":"Lisboa Portugal","acronym":"MM '22"},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548757","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548757","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T17:49:17Z","timestamp":1750182557000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548757"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":43,"alternative-id":["10.1145\/3503161.3548757","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3548757","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}