{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,15]],"date-time":"2025-10-15T00:44:08Z","timestamp":1760489048435,"version":"build-2065373602"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"10","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,10,31]]},"abstract":"<jats:p>Textual Question Answering (TQA) remains a formidable challenge, despite over a decade of research. The integration of transformer networks and external knowledge via pre-trained models has marked a significant advancement in TQA. Yet, a crucial element often overlooked is the incorporation of external visual understanding. In this study, we introduce an innovative TQA approach that equips machines with the capability for on-demand visual grounding, thereby enriching their comprehension of questions and enhancing the relevance of generated answers. Our methodology utilizes  web image search to tap into a vast pool of global knowledge and employs a novel technique for determining the most appropriate answer through on-demand visual grounding. We present a variety of multimedia model configurations, showcasing that our proposed method not only surpasses existing systems without necessitating pre-training but also achieves performance comparable to fine-tuned models 30 times its size as well as closed-source LLMs such as GPT-4o, a testament to its efficiency. Furthermore, an interpretability analysis reveals the integral role of visual grounding in the model\u2019s decision-making process. This research offers a fresh outlook on augmenting TQA performance by harnessing the potential of visual grounding, with broad implications for natural language processing and artificial intelligence.<\/jats:p>","DOI":"10.1145\/3729231","type":"journal-article","created":{"date-parts":[[2025,4,15]],"date-time":"2025-04-15T13:18:16Z","timestamp":1744723096000},"page":"1-25","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Elevating Textual Question Answering with On-Demand Visual Augmentation"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-6009-7612","authenticated-orcid":false,"given":"Sina","family":"Ehsani","sequence":"first","affiliation":[{"name":"Department of Systems and Industrial Engineering, The University of Arizona, Tucson, Arizona, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0268-2941","authenticated-orcid":false,"given":"Jian","family":"Liu","sequence":"additional","affiliation":[{"name":"Department of Systems and Industrial Engineering, The University of Arizona, Tucson, Arizona, USA"}]}],"member":"320","published-online":{"date-parts":[[2025,10,14]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"Josh Achiam Steven Adler Sandhini Agarwal Lama Ahmad Ilge Akkaya Florencia Leoni Aleman Diogo Almeida Janko Altenschmidt Sam Altman Shyamal Anadkat et al. 2023. GPT-4 technical report. arXiv:2303.08774. Retrieved from https:\/\/arxiv.org\/abs\/2303.08774"},{"key":"e_1_3_2_3_2","first-page":"1877","article-title":"Language models are few-shot learners","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. 
Language models are few-shot learners. In Proceedings of the Advances in Neural Information Processing Systems 33, 1877\u20131901.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems 33"},{"key":"e_1_3_2_4_2","doi-asserted-by":"crossref","unstructured":"Simone Caldarella Massimiliano Mancini Elisa Ricci and Rahaf Aljundi. 2024. The phantom menace: Unmasking privacy leakages in vision-language models. arXiv:2408.01228. Retrieved from https:\/\/arxiv.org\/abs\/2408.01228","DOI":"10.1007\/978-3-031-92648-8_26"},{"key":"e_1_3_2_5_2","unstructured":"Aakanksha Chowdhery Sharan Narang Jacob Devlin Maarten Bosma Gaurav Mishra Adam Roberts Paul Barham Hyung Won Chung Charles Sutton Sebastian Gehrmann et al. 2022. PaLM: Scaling language modeling with pathways. arXiv:2204.02311. Retrieved from https:\/\/arxiv.org\/abs\/2204.02311"},{"key":"e_1_3_2_6_2","first-page":"92420","article-title":"ConStat: Performance-based contamination detection in large language models","author":"Dekoninck Jasper","year":"2024","unstructured":"Jasper Dekoninck, Mark M\u00fcller, and Martin Vechev. 2024. ConStat: Performance-based contamination detection in large language models. In Proceedings of the Advances in Neural Information Processing Systems 37, 92420\u201392464.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems 37"},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"e_1_3_2_8_2","unstructured":"Jacob Devlin Ming-Wei Chang Kenton Lee and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https:\/\/arxiv.org\/abs\/1810.04805"},{"key":"e_1_3_2_9_2","unstructured":"Alexey Dosovitskiy Lucas Beyer Alexander Kolesnikov Dirk Weissenborn Xiaohua Zhai Thomas Unterthiner Mostafa Dehghani Matthias Minderer Georg Heigold Sylvain Gelly et al. 2020. An image is worth 16\u2009\u00d7\u200916 words: Transformers for image recognition at scale. arXiv:2010.11929. Retrieved from https:\/\/arxiv.org\/abs\/2010.11929"},{"key":"e_1_3_2_10_2","unstructured":"Abhimanyu Dubey Abhinav Jauhri Abhinav Pandey Abhishek Kadian Ahmad Al-Dahle Aiesha Letman Akhil Mathur Alan Schelten Amy Yang Angela Fan et al. 2024. The Llama 3 herd of models. arXiv:2407.21783. Retrieved from https:\/\/arxiv.org\/abs\/2407.21783"},{"key":"e_1_3_2_11_2","volume-title":"Bootstrap Methods: Another Look at the Jackknife","author":"Efron Bradley","year":"1992","unstructured":"Bradley Efron. 1992. Bootstrap Methods: Another Look at the Jackknife. Springer."},{"key":"e_1_3_2_12_2","doi-asserted-by":"crossref","unstructured":"Deepanway Ghosal Navonil Majumder Rada Mihalcea and Soujanya Poria. 2022. Two is better than many? Binary classification as an effective approach to multi-choice question answering. arXiv:2210.16495. Retrieved from https:\/\/arxiv.org\/abs\/2210.16495","DOI":"10.18653\/v1\/2022.emnlp-main.691"},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1007\/s42979-020-00312-x"},{"key":"e_1_3_2_14_2","unstructured":"Pengcheng He Xiaodong Liu Jianfeng Gao and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv:2006.03654. Retrieved from https:\/\/arxiv.org\/abs\/2006.03654"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","unstructured":"Matthew Honnibal Ines Montani Sofie Van Landeghem and Adriane Boyd. 2020. spaCy: Industrial-Strength Natural Language Processing in Python. 
DOI: 10.5281\/zenodo.1212303","DOI":"10.5281\/zenodo.1212303"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2020.2988903"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/N18-1023"},{"key":"e_1_3_2_18_2","doi-asserted-by":"crossref","unstructured":"Daniel Khashabi Sewon Min Tushar Khot Ashish Sabharwal Oyvind Tafjord Peter Clark and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing format boundaries with a single QA system. arXiv:2005.00700. Retrieved from https:\/\/arxiv.org\/abs\/2005.00700","DOI":"10.18653\/v1\/2020.findings-emnlp.171"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6319"},{"key":"e_1_3_2_20_2","doi-asserted-by":"publisher","DOI":"10.1162\/coli.2007.33.1.147"},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00023"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01234-2_34"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.00515"},{"key":"e_1_3_2_24_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02520"},{"key":"e_1_3_2_25_2","unstructured":"Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Mike Lewis Luke Zettlemoyer and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692. Retrieved from https:\/\/arxiv.org\/abs\/1907.11692"},{"key":"e_1_3_2_26_2","unstructured":"Tomas Mikolov Kai Chen Greg Corrado and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv:1301.3781. Retrieved from https:\/\/arxiv.org\/abs\/1301.3781"},{"key":"e_1_3_2_27_2","unstructured":"Alec Radford Karthik Narasimhan Tim Salimans and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI blog."},{"issue":"140","key":"e_1_3_2_28_2","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 140 (2020), 1\u201367. Retrieved from http:\/\/jmlr.org\/papers\/v21\/20-074.html","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.1145\/2939672.2939778"},{"key":"e_1_3_2_30_2","unstructured":"Damien Sileo. 2021. Visual grounding strategies for text-only natural language processing. arXiv:2103.13942. Retrieved from https:\/\/arxiv.org\/abs\/2103.13942"},{"key":"e_1_3_2_31_2","unstructured":"Dingjie Song Sicheng Lai Shunian Chen Lichao Sun and Benyou Wang. 2024. Both text and images leaked! A systematic analysis of multimodal LLM data contamination. arXiv:2411.03823. Retrieved from https:\/\/arxiv.org\/abs\/2411.03823"},{"key":"e_1_3_2_32_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/W16-2346"},{"key":"e_1_3_2_33_2","doi-asserted-by":"crossref","unstructured":"Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. arXiv:1908.07490. 
Retrieved from https:\/\/arxiv.org\/abs\/1908.07490","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_3_2_34_2","unstructured":"Hugo Touvron Thibaut Lavril Gautier Izacard Xavier Martinet Marie-Anne Lachaux Timoth\u00e9e Lacroix Baptiste Rozi\u00e8re Naman Goyal Eric Hambro Faisal Azhar et al. 2023. LLaMA: Open and efficient foundation language models. arXiv:2302.13971. Retrieved from https:\/\/arxiv.org\/abs\/2302.13971"},{"key":"e_1_3_2_35_2","article-title":"Attention is all you need","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems 30"},{"key":"e_1_3_2_36_2","article-title":"SuperGLUE: A stickier benchmark for general-purpose language understanding systems","author":"Wang Alex","year":"2019","unstructured":"Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Proceedings of the Advances in Neural Information Processing Systems 32.","journal-title":"Proceedings of the Advances in Neural Information Processing Systems 32"},{"key":"e_1_3_2_37_2","doi-asserted-by":"crossref","unstructured":"Vikas Yadav Steven Bethard and Mihai Surdeanu. 2020. Unsupervised alignment-based iterative evidence retrieval for multi-hop question answering. arXiv:2005.01218. Retrieved from https:\/\/arxiv.org\/abs\/2005.01218","DOI":"10.18653\/v1\/2020.acl-main.414"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.1145\/3316767"},{"key":"e_1_3_2_39_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2019.03.005"},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1007\/s13735-023-00267-8"},{"key":"e_1_3_2_41_2","unstructured":"Fengbin Zhu Wenqiang Lei Chao Wang Jianming Zheng Soujanya Poria and Tat-Seng Chua. 2021. Retrieving and reading: A comprehensive survey on open-domain question answering. arXiv:2101.00774. 
Retrieved from https:\/\/arxiv.org\/abs\/2101.00774"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3729231","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,14]],"date-time":"2025-10-14T21:23:55Z","timestamp":1760477035000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3729231"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,14]]},"references-count":40,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2025,10,31]]}},"alternative-id":["10.1145\/3729231"],"URL":"https:\/\/doi.org\/10.1145\/3729231","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2025,10,14]]},"assertion":[{"value":"2024-05-22","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-03-21","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-10-14","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}
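The record above is shaped like a response from the public Crossref REST API (GET https://api.crossref.org/works/{DOI} returns {"status": "ok", "message-type": "work", "message": {...}}). The sketch below shows one way to fetch and summarize such a record; it is a minimal illustration, not part of the record itself. It assumes network access and the third-party `requests` library; the helper name `fetch_work` and the `mailto` contact address are hypothetical stand-ins (supplying a `mailto` parameter is Crossref's documented way of opting into its "polite" request pool).

```python
import requests

CROSSREF_API = "https://api.crossref.org/works/"

def fetch_work(doi: str, mailto: str = "you@example.org") -> dict:
    """Fetch a Crossref work record for `doi` and return its 'message' object."""
    resp = requests.get(CROSSREF_API + doi, params={"mailto": mailto}, timeout=30)
    resp.raise_for_status()  # HTTP 404 here means the DOI is not registered with Crossref
    payload = resp.json()
    assert payload["message-type"] == "work"
    return payload["message"]

# Pull out the fields used in a citation; all of these keys appear in the record above.
work = fetch_work("10.1145/3729231")
title = work["title"][0]                       # titles arrive as a list
journal = work["container-title"][0]
year = work["issued"]["date-parts"][0][0]      # date-parts is a list of [Y, M, D] lists
authors = ", ".join(f'{a["given"]} {a["family"]}' for a in work["author"])
print(f"{authors}. {year}. {title}. {journal}. https://doi.org/{work['DOI']}")
```

For this record the printed line would read: "Sina Ehsani, Jian Liu. 2025. Elevating Textual Question Answering with On-Demand Visual Augmentation. ACM Transactions on Multimedia Computing, Communications, and Applications. https://doi.org/10.1145/3729231".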