{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,8,5]],"date-time":"2025-08-05T12:47:53Z","timestamp":1754398073881},"publisher-location":"California","reference-count":0,"publisher":"International Joint Conferences on Artificial Intelligence Organization","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,8]]},"abstract":"<jats:p>We study visual question answering in a setting where the answer has to be mined from a pool of relevant and irrelevant images given as a context. For such a setting, a model must first retrieve relevant images from the pool and answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (or RETVQA in short). The RETVQA is distinctively different and more challenging than the traditionally-studied Visual Question Answering (VQA), where a given question has to be answered with a single relevant image in context. Towards solving the RETVQA task, we propose a unified Multi Image BART (MI-BART) that takes a question and retrieved images using our relevance encoder for free-form fluent answer generation. Further, we introduce the largest dataset in this space, namely RETVQA, which has the following salient features: multi-image and retrieval requirement for VQA, metadata-independent questions over a pool of heterogeneous images, expecting a mix of classification-oriented and open-ended generative answers. Our proposed framework achieves an accuracy of 76.5% and a fluency of 79.3% on the proposed dataset, namely RETVQA and also outperforms state-of-the-art methods by 4.9% and 11.8% on the image segment of the publicly available WebQA dataset on the accuracy and fluency metrics, respectively.<\/jats:p>","DOI":"10.24963\/ijcai.2023\/146","type":"proceedings-article","created":{"date-parts":[[2023,8,11]],"date-time":"2023-08-11T08:31:30Z","timestamp":1691742690000},"page":"1312-1321","source":"Crossref","is-referenced-by-count":6,"title":["Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering"],"prefix":"10.24963","author":[{"given":"Abhirama Subramanyam","family":"Penamakuri","sequence":"first","affiliation":[{"name":"Indian Institute of Technology Jodhpur"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Manish","family":"Gupta","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mithun Das","family":"Gupta","sequence":"additional","affiliation":[{"name":"Microsoft"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Anand","family":"Mishra","sequence":"additional","affiliation":[{"name":"Indian Institute of Technology Jodhpur"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"10584","event":{"number":"32","sponsor":["International Joint Conferences on Artificial Intelligence Organization (IJCAI)"],"acronym":"IJCAI-2023","name":"Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}","start":{"date-parts":[[2023,8,19]]},"theme":"Artificial Intelligence","location":"Macau, SAR China","end":{"date-parts":[[2023,8,25]]}},"container-title":["Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence"],"original-title":[],"deposited":{"date-parts":[[2023,8,11]],"date-time":"2023-08-11T08:37:39Z","timestamp":1691743059000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.ijcai.org\/proceedings\/2023\/146"}},"subtitle":[],"proceedings-subject":"Artificial Intelligence Research Articles","short-title":[],"issued":{"date-parts":[[2023,8]]},"references-count":0,"URL":"https:\/\/doi.org\/10.24963\/ijcai.2023\/146","relation":{},"subject":[],"published":{"date-parts":[[2023,8]]}}}