{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,28]],"date-time":"2026-04-28T20:10:41Z","timestamp":1777407041856,"version":"3.51.4"},"publisher-location":"New York, NY, USA","reference-count":37,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"National Research Foundation, Singapore under its Strategic Capability Research Centres Funding Initiative"}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3547870","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:42:35Z","timestamp":1665416555000},"page":"2061-2069","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":36,"title":["A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA"],"prefix":"10.1145","author":[{"given":"Yangyang","family":"Guo","sequence":"first","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Liqiang","family":"Nie","sequence":"additional","affiliation":[{"name":"Harbin Institute of Technology (Shenzhen), Shenzhen, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yongkang","family":"Wong","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yibing","family":"Liu","sequence":"additional","affiliation":[{"name":"City University of Hong Kong, Hong Kong, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhiyong","family":"Cheng","sequence":"additional","affiliation":[{"name":"Qilu University of Technology (Shandong Academy of Sciences), Jinan, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mohan","family":"Kankanhalli","sequence":"additional","affiliation":[{"name":"National University of Singapore, Singapore, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","doi-asserted-by":"publisher","DOI":"10.1007\/s11263-016-0966-6"},{"key":"e_1_3_2_2_2_1","volume-title":"Computer Vision and Pattern Recognition","author":"Anderson Peter","unstructured":"Peter Anderson , Xiaodong He , Chris Buehler , Damien Teney , Mark Johnson , Stephen Gould , and Lei Zhang . 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering . In Computer Vision and Pattern Recognition . IEEE , 6077--6086. Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Computer Vision and Pattern Recognition. IEEE, 6077--6086."},{"key":"e_1_3_2_2_3_1","volume-title":"MUTAN: Multimodal Tucker Fusion for Visual Question Answering. In International Conference on Computer Vision. IEEE, 2631--2639","author":"Hedi","year":"2017","unstructured":"Hedi Ben-younes, R\u00e9mi Cad\u00e8ne , Matthieu Cord , and Nicolas Thome . 2017 . MUTAN: Multimodal Tucker Fusion for Visual Question Answering. In International Conference on Computer Vision. IEEE, 2631--2639 . Hedi Ben-younes, R\u00e9mi Cad\u00e8ne, Matthieu Cord, and Nicolas Thome. 2017. MUTAN: Multimodal Tucker Fusion for Visual Question Answering. In International Conference on Computer Vision. 
IEEE, 2631--2639."},{"key":"e_1_3_2_2_4_1","volume-title":"Conference on Neural Information Processing Systems.","author":"Brown Tom B.","year":"2020","unstructured":"Tom B. Brown , Benjamin Mann , Nick Ryder , Melanie Subbiah , Jared Kaplan , Prafulla Dhariwal , Arvind Neelakantan , Pranav Shyam , Girish Sastry , Amanda Askell , Sandhini Agarwal , Ariel Herbert-Voss , Gretchen Krueger , Tom Henighan , Rewon Child , Aditya Ramesh , Daniel M. Ziegler , Jeffrey Wu , Clemens Winter , Christopher Hesse , Mark Chen , Eric Sigler , Mateusz Litwin , Scott Gray , Benjamin Chess , Jack Clark , Christopher Berner , Sam McCandlish , Alec Radford , Ilya Sutskever , and Dario Amodei . 2020 . Language Models are Few-Shot Learners . In Conference on Neural Information Processing Systems. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Conference on Neural Information Processing Systems."},{"key":"e_1_3_2_2_5_1","volume-title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics. ACL , 4171--4186. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 
In North American Chapter of the Association for Computational Linguistics. ACL, 4171--4186."},{"key":"e_1_3_2_2_6_1","volume-title":"International Conference on Learning Representations.","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy , Lucas Beyer , Alexander Kolesnikov , Dirk Weissenborn , Xiaohua Zhai , Thomas Unterthiner , Mostafa Dehghani , Matthias Minderer , Georg Heigold , Sylvain Gelly , Jakob Uszkoreit , and Neil Houlsby . 2021 . An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale . In International Conference on Learning Representations. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations."},{"key":"e_1_3_2_2_7_1","volume-title":"ConceptBert: Concept-Aware Representation for Visual Question Answering. In Findings of Conference on Empirical Methods in Natural Language Processing. ACL, 489--498","author":"Gard\u00e8res Fran\u00e7ois","year":"2020","unstructured":"Fran\u00e7ois Gard\u00e8res, Maryam Ziaeefard , Baptiste Abeloos , and Freddy L\u00e9cu\u00e9. 2020 . ConceptBert: Concept-Aware Representation for Visual Question Answering. In Findings of Conference on Empirical Methods in Natural Language Processing. ACL, 489--498 . Fran\u00e7ois Gard\u00e8res, Maryam Ziaeefard, Baptiste Abeloos, and Freddy L\u00e9cu\u00e9. 2020. ConceptBert: Concept-Aware Representation for Visual Question Answering. In Findings of Conference on Empirical Methods in Natural Language Processing. ACL, 489--498."},{"key":"e_1_3_2_2_8_1","volume-title":"Focal and Composed Vision-semantic Modeling for Visual Question Answering. In ACM Multimedia Conference. 
ACM, 4528--4536","author":"Han Yudong","year":"2021","unstructured":"Yudong Han , Yangyang Guo , Jianhua Yin , Meng Liu , Yupeng Hu , and Liqiang Nie . 2021 . Focal and Composed Vision-semantic Modeling for Visual Question Answering. In ACM Multimedia Conference. ACM, 4528--4536 . Yudong Han, Yangyang Guo, Jianhua Yin, Meng Liu, Yupeng Hu, and Liqiang Nie. 2021. Focal and Composed Vision-semantic Modeling for Visual Question Answering. In ACM Multimedia Conference. ACM, 4528--4536."},{"key":"e_1_3_2_2_9_1","volume-title":"GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 6700--6709","author":"Drew","unstructured":"Drew A. Hudson and Christopher D. Manning. 2019 . GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 6700--6709 . Drew A. Hudson and Christopher D. Manning. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 6700--6709."},{"key":"e_1_3_2_2_10_1","volume-title":"CSKG: The CommonSense Knowledge Graph","author":"Ilievski Filip","year":"2021","unstructured":"Filip Ilievski , Pedro A. Szekely , and Bin Zhang . 2021 . CSKG: The CommonSense Knowledge Graph . In ESWC. Springer , 680--696. Filip Ilievski, Pedro A. Szekely, and Bin Zhang. 2021. CSKG: The CommonSense Knowledge Graph. In ESWC. Springer, 680--696."},{"key":"e_1_3_2_2_11_1","volume-title":"International Conference on Learning Representations.","author":"Izacard Gautier","year":"2021","unstructured":"Gautier Izacard and Edouard Grave . 2021 . Distilling Knowledge from Reader to Retriever for Question Answering . In International Conference on Learning Representations. Gautier Izacard and Edouard Grave. 2021. Distilling Knowledge from Reader to Retriever for Question Answering. 
In International Conference on Learning Representations."},{"key":"e_1_3_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.13"},{"key":"e_1_3_2_2_13_1","volume-title":"Dense Passage Retrieval for Open-Domain Question Answering. In Conference on Empirical Methods in Natural Language Processing. ACL, 6769--6781","author":"Karpukhin Vladimir","year":"2020","unstructured":"Vladimir Karpukhin , Barlas Oguz , Sewon Min , Patrick S. H. Lewis , Ledell Wu , Sergey Edunov , Danqi Chen , and Wen-tau Yih. 2020 . Dense Passage Retrieval for Open-Domain Question Answering. In Conference on Empirical Methods in Natural Language Processing. ACL, 6769--6781 . Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Conference on Empirical Methods in Natural Language Processing. ACL, 6769--6781."},{"key":"e_1_3_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/3397271.3401075"},{"key":"e_1_3_2_2_15_1","volume-title":"Bilinear Attention Networks. In Conference on Neural Information Processing Systems. 1571--1581","author":"Kim Jin-Hwa","year":"2018","unstructured":"Jin-Hwa Kim , Jaehyun Jun , and Byoung-Tak Zhang . 2018 . Bilinear Attention Networks. In Conference on Neural Information Processing Systems. 1571--1581 . Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear Attention Networks. In Conference on Neural Information Processing Systems. 1571--1581."},{"key":"e_1_3_2_2_16_1","volume-title":"International Conference on Machine Learning. PMLR, 5583--5594","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim , Bokyung Son , and Ildoo Kim . 2021 . ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision . In International Conference on Machine Learning. PMLR, 5583--5594 . Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. 
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. In International Conference on Machine Learning. PMLR, 5583--5594."},{"key":"e_1_3_2_2_17_1","volume-title":"Boosting Visual Question Answering with Context-aware Knowledge Aggregation. In ACM Multimedia Conference. ACM, 1227--1235","author":"Li Guohao","year":"2020","unstructured":"Guohao Li , Xin Wang , and Wenwu Zhu . 2020 . Boosting Visual Question Answering with Context-aware Knowledge Aggregation. In ACM Multimedia Conference. ACM, 1227--1235 . Guohao Li, Xin Wang, and Wenwu Zhu. 2020. Boosting Visual Question Answering with Context-aware Knowledge Aggregation. In ACM Multimedia Conference. ACM, 1227--1235."},{"key":"e_1_3_2_2_18_1","volume-title":"VisualBERT: A Simple and Performant Baseline for Vision and Language. CoRR","author":"Li Liunian Harold","year":"2019","unstructured":"Liunian Harold Li , Mark Yatskar , Da Yin , Cho-Jui Hsieh , and Kai-Wei Chang . 2019. VisualBERT: A Simple and Performant Baseline for Vision and Language. CoRR , Vol. abs\/ 1908 .03557 ( 2019 ). Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A Simple and Performant Baseline for Vision and Language. CoRR, Vol. abs\/1908.03557 (2019)."},{"key":"e_1_3_2_2_19_1","volume-title":"Interest-aware Message-Passing GCN for Recommendation. In The Web Conference","author":"Liu Fan","year":"2021","unstructured":"Fan Liu , Zhiyong Cheng , Lei Zhu , Zan Gao , and Liqiang Nie . 2021 . Interest-aware Message-Passing GCN for Recommendation. In The Web Conference 2021. ACM, 1296--1305. Fan Liu, Zhiyong Cheng, Lei Zhu, Zan Gao, and Liqiang Nie. 2021. Interest-aware Message-Passing GCN for Recommendation. In The Web Conference 2021. ACM, 1296--1305."},{"key":"e_1_3_2_2_20_1","volume-title":"ConceptNet-A Practical Commonsense Reasoning Tool-kit. BT technology journal","author":"Liu Hugo","year":"2004","unstructured":"Hugo Liu and Push Singh . 2004. 
ConceptNet-A Practical Commonsense Reasoning Tool-kit. BT technology journal , Vol. 22 , 4 ( 2004 ), 211--226. Hugo Liu and Push Singh. 2004. ConceptNet-A Practical Commonsense Reasoning Tool-kit. BT technology journal, Vol. 22, 4 (2004), 211--226."},{"key":"e_1_3_2_2_21_1","volume-title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR","author":"Liu Yinhan","year":"2019","unstructured":"Yinhan Liu , Myle Ott , Naman Goyal , Jingfei Du , Mandar Joshi , Danqi Chen , Omer Levy , Mike Lewis , Luke Zettlemoyer , and Veselin Stoyanov . 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR , Vol. abs\/ 1907 .11692 ( 2019 ). Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, Vol. abs\/1907.11692 (2019)."},{"key":"e_1_3_2_2_22_1","volume-title":"Decoupled Weight Decay Regularization. In International Conference on Learning Representations.","author":"Loshchilov Ilya","year":"2019","unstructured":"Ilya Loshchilov and Frank Hutter . 2019 . Decoupled Weight Decay Regularization. In International Conference on Learning Representations. Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations."},{"key":"e_1_3_2_2_23_1","volume-title":"Vilbert: Pretraining Task-agnostic Visiolinguistic Representations for Vision-and-language Tasks. Conference on Neural Information Processing Systems","author":"Lu Jiasen","year":"2019","unstructured":"Jiasen Lu , Dhruv Batra , Devi Parikh , and Stefan Lee . 2019 . Vilbert: Pretraining Task-agnostic Visiolinguistic Representations for Vision-and-language Tasks. Conference on Neural Information Processing Systems (2019), 13--23. Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining Task-agnostic Visiolinguistic Representations for Vision-and-language Tasks. 
Conference on Neural Information Processing Systems (2019), 13--23."},{"key":"e_1_3_2_2_24_1","volume-title":"Hierarchical Question-Image Co-Attention for Visual Question Answering. In Conference on Neural Information Processing Systems. 289--297","author":"Lu Jiasen","year":"2016","unstructured":"Jiasen Lu , Jianwei Yang , Dhruv Batra , and Devi Parikh . 2016 . Hierarchical Question-Image Co-Attention for Visual Question Answering. In Conference on Neural Information Processing Systems. 289--297 . Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical Question-Image Co-Attention for Visual Question Answering. In Conference on Neural Information Processing Systems. 289--297."},{"key":"e_1_3_2_2_25_1","volume-title":"KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA. In Computer Vision and Pattern Recognition","author":"Marino Kenneth","year":"2021","unstructured":"Kenneth Marino , Xinlei Chen , Devi Parikh , Abhinav Gupta , and Marcus Rohrbach . 2021 . KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA. In Computer Vision and Pattern Recognition . IEEE , 14111--14121. Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, and Marcus Rohrbach. 2021. KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA. In Computer Vision and Pattern Recognition. IEEE, 14111--14121."},{"key":"e_1_3_2_2_26_1","volume-title":"Computer Vision and Pattern Recognition","author":"Marino Kenneth","unstructured":"Kenneth Marino , Mohammad Rastegari , Ali Farhadi , and Roozbeh Mottaghi . 2019. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge . In Computer Vision and Pattern Recognition . IEEE , 3195--3204. Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. In Computer Vision and Pattern Recognition. 
IEEE, 3195--3204."},{"key":"e_1_3_2_2_27_1","volume-title":"Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering. In Conference on Neural Information Processing Systems. 2659--2670","author":"Narasimhan Medhini","unstructured":"Medhini Narasimhan , Svetlana Lazebnik , and Alexander G. Schwing . 2018 . Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering. In Conference on Neural Information Processing Systems. 2659--2670 . Medhini Narasimhan, Svetlana Lazebnik, and Alexander G. Schwing. 2018. Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering. In Conference on Neural Information Processing Systems. 2659--2670."},{"key":"e_1_3_2_2_28_1","volume-title":"Schwing","author":"Narasimhan Medhini","year":"2018","unstructured":"Medhini Narasimhan and Alexander G . Schwing . 2018 . Straight to the Facts : Learning Knowledge Base Retrieval for Factual Visual Question Answering. In European Conference on Computer Vision. Springer , 460--477. Medhini Narasimhan and Alexander G. Schwing. 2018. Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering. In European Conference on Computer Vision. Springer, 460--477."},{"key":"e_1_3_2_2_29_1","volume-title":"End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering. In Conference on Neural Information Processing Systems.","author":"Sachan Devendra Singh","year":"2021","unstructured":"Devendra Singh Sachan , Siva Reddy , William L. Hamilton , Chris Dyer , and Dani Yogatama . 2021 . End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering. In Conference on Neural Information Processing Systems. Devendra Singh Sachan, Siva Reddy, William L. Hamilton, Chris Dyer, and Dani Yogatama. 2021. End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering. 
In Conference on Neural Information Processing Systems."},{"key":"e_1_3_2_2_30_1","volume-title":"Image Alt-text Dataset For Automatic Image Captioning. In Annual Meeting of the Association for Computational Linguistics. ACL, 2556--2565","author":"Sharma Piyush","year":"2018","unstructured":"Piyush Sharma , Nan Ding , Sebastian Goodman , and Radu Soricut . 2018 . Conceptual Captions: A Cleaned, Hypernymed , Image Alt-text Dataset For Automatic Image Captioning. In Annual Meeting of the Association for Computational Linguistics. ACL, 2556--2565 . Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Annual Meeting of the Association for Computational Linguistics. ACL, 2556--2565."},{"key":"e_1_3_2_2_31_1","volume-title":"VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In International Conference on Learning Representations.","author":"Su Weijie","year":"2019","unstructured":"Weijie Su , Xizhou Zhu , Yue Cao , Bin Li , Lewei Lu , Furu Wei , and Jifeng Dai . 2019 . VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In International Conference on Learning Representations. Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In International Conference on Learning Representations."},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_3_2_2_33_1","volume-title":"Conference on Neural Information Processing Systems. 5998--6008","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N. Gomez , Lukasz Kaiser , and Illia Polosukhin . 2017 . Attention is All you Need . In Conference on Neural Information Processing Systems. 5998--6008 . 
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Conference on Neural Information Processing Systems. 5998--6008."},{"key":"e_1_3_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2629489"},{"key":"e_1_3_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2017.2754246"},{"key":"e_1_3_2_2_36_1","volume-title":"Multi-Modal Answer Validation for Knowledge-Based VQA. In AAAI Conference on Artificial Intelligence.","author":"Wu Jialin","year":"2022","unstructured":"Jialin Wu , Jiasen Lu , Ashish Sabharwal , and Roozbeh Mottaghi . 2022 . Multi-Modal Answer Validation for Knowledge-Based VQA. In AAAI Conference on Artificial Intelligence. Jialin Wu, Jiasen Lu, Ashish Sabharwal, and Roozbeh Mottaghi. 2022. Multi-Modal Answer Validation for Knowledge-Based VQA. In AAAI Conference on Artificial Intelligence."},{"key":"e_1_3_2_2_37_1","volume-title":"Towards Knowledge-Augmented Visual Question Answering. In International Conference on Computational Linguistics. ACL","author":"Ziaeefard Maryam","year":"2020","unstructured":"Maryam Ziaeefard and Freddy L\u00e9cu\u00e9 . 2020 . Towards Knowledge-Augmented Visual Question Answering. In International Conference on Computational Linguistics. ACL , 1863--1873. Maryam Ziaeefard and Freddy L\u00e9cu\u00e9. 2020. Towards Knowledge-Augmented Visual Question Answering. In International Conference on Computational Linguistics. 
ACL, 1863--1873."}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547870","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3547870","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:02:35Z","timestamp":1750186955000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3547870"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":37,"alternative-id":["10.1145\/3503161.3547870","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3547870","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}