{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,16]],"date-time":"2026-01-16T08:41:56Z","timestamp":1768552916944,"version":"3.49.0"},"reference-count":46,"publisher":"MDPI AG","issue":"4","license":[{"start":{"date-parts":[[2023,8,7]],"date-time":"2023-08-07T00:00:00Z","timestamp":1691366400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Robotics"],"abstract":"<jats:p>Visual Question Answering (VQA) models fail catastrophically on questions related to the reading of text-carrying images. However, TextVQA aims to answer questions by understanding the scene texts in an image\u2013question context, such as the brand name of a product or the time on a clock from an image. Most TextVQA approaches focus on objects and scene text detection, which are then integrated with the words in a question by a simple transformer encoder. The focus of these approaches is to use shared weights during the training of a multi-modal dataset, but it fails to capture the semantic relations between an image and a question. In this paper, we proposed a Scene Graph-Based Co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among the objects, the Optical Character Recognition (OCR) tokens and the question words. It is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image. We create a guided-attention module to capture the intra-modal interplay between the language and the vision as a guidance for inter-modal interactions. To permit explicit teaching of the relations between the two modalities, we propose and integrate two attention modules, namely a scene graph-based semantic relation-aware attention and a positional relation-aware attention. 
We conduct extensive experiments on two widely used benchmark datasets, TextVQA and ST-VQA. The results show that our SceneGATE method outperforms existing ones because of the scene graph and its attention modules.<\/jats:p>","DOI":"10.3390\/robotics12040114","type":"journal-article","created":{"date-parts":[[2023,8,7]],"date-time":"2023-08-07T06:37:31Z","timestamp":1691390251000},"page":"114","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["SceneGATE: Scene-Graph Based Co-Attention Networks for Text Visual Question Answering"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4910-5925","authenticated-orcid":false,"given":"Feiqi","family":"Cao","sequence":"first","affiliation":[{"name":"School of Computer Science, Faculty of Engineering, University of Sydney, Camperdown, NSW 2006, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0480-1991","authenticated-orcid":false,"given":"Siwen","family":"Luo","sequence":"additional","affiliation":[{"name":"School of Computer Science, Faculty of Engineering, University of Sydney, Camperdown, NSW 2006, Australia"}]},{"given":"Felipe","family":"Nunez","sequence":"additional","affiliation":[{"name":"School of Computer Science, Faculty of Engineering, University of Sydney, Camperdown, NSW 2006, Australia"}]},{"given":"Zean","family":"Wen","sequence":"additional","affiliation":[{"name":"School of Computer Science, Faculty of Engineering, University of Sydney, Camperdown, NSW 2006, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3371-8628","authenticated-orcid":false,"given":"Josiah","family":"Poon","sequence":"additional","affiliation":[{"name":"School of Computer Science, Faculty of Engineering, University of Sydney, Camperdown, NSW 2006, Australia"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1948-6819","authenticated-orcid":false,"given":"Soyeon 
Caren","family":"Han","sequence":"additional","affiliation":[{"name":"School of Computer Science, Faculty of Engineering, University of Sydney, Camperdown, NSW 2006, Australia"},{"name":"Department of Computer Science and Software Engineering, School of Physics, Maths and Computing, University of Western Australia, Crawley, WA 6009, Australia"}]}],"member":"1968","published-online":{"date-parts":[[2023,8,7]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M. (2019, January 15). Towards vqa models that can read. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00851"},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Hu, R., Singh, A., Darrell, T., and Rohrbach, M. (2020, January 13\u201319). Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01001"},{"key":"ref_3","first-page":"3608","article-title":"Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps","volume":"35","author":"Zhu","year":"2021","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Gao, D., Li, K., Wang, R., Shan, S., and Chen, X. (2020, January 13\u201319). Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA.","DOI":"10.1109\/CVPR42600.2020.01276"},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Kant, Y., Batra, D., Anderson, P., Schwing, A., Parikh, D., Lu, J., and Agrawal, H. (2020, January 23\u201328). Spatially aware multimodal transformers for textvqa. 
Proceedings of the Computer Vision\u2013ECCV 2020: 16th European Conference, Glasgow, UK. Proceedings, Part IX 16.","DOI":"10.1007\/978-3-030-58545-7_41"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Han, W., Huang, H., and Han, T. (2020, January 8\u201313). Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering. Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain.","DOI":"10.18653\/v1\/2020.coling-main.278"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"9603","DOI":"10.1109\/TPAMI.2021.3132034","article-title":"Structured Multimodal Attentions for TextVQA","volume":"44","author":"Gao","year":"2022","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Johnson, J., Gupta, A., and Fei-Fei, L. (2018, January 18\u201323). Image generation from scene graphs. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00133"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Lu, X., Fan, Z., Wang, Y., Oh, J., and Ros\u00e9, C.P. (2021, January 11\u201317). Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling. Proceedings of the 2021 IEEE\/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada.","DOI":"10.1109\/ICCVW54120.2021.00297"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Zeng, G., Zhang, Y., Zhou, Y., and Yang, X. (2021, January 20\u201324). Beyond OCR + VQA: Involving OCR into the Flow for Robust and Accurate TextVQA. Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event China.","DOI":"10.1145\/3474085.3475606"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., and Parikh, D. (2015, January 7\u201313). VQA: Visual Question Answering. 
Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.","DOI":"10.1109\/ICCV.2015.279"},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Shrestha, R., Kafle, K., and Kanan, C. (2019, January 15\u201320). Answer them all! toward universal visual question answering models. Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01072"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Ben-Younes, H., Cadene, R., Thome, N., and Cord, M. (2019, January 27\u201328). Block: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA.","DOI":"10.1609\/aaai.v33i01.33018102"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Cadene, R., Ben-Younes, H., Cord, M., and Thome, N. (2019, January 15\u201320). Murel: Multimodal relational reasoning for visual question answering. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00209"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Urooj, A., Mazaheri, A., Da vitoria lobo, N., and Shah, M. (2020, January 16\u201320). MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual Question Answering. Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event.","DOI":"10.18653\/v1\/2020.findings-emnlp.417"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Han, Y., Guo, Y., Yin, J., Liu, M., Hu, Y., and Nie, L. (2021, January 20\u201324). Focal and Composed Vision-Semantic Modeling for Visual Question Answering. Proceedings of the 29th ACM International Conference on Multimedia (MM \u201921), Virtual Event China.","DOI":"10.1145\/3474085.3475609"},{"key":"ref_17","unstructured":"Hudson, D.A., and Manning, C.D. 
(May, January 30). Compositional Attention Networks for Machine Reasoning. Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018, January 18\u201323). Bottom-up and top-down attention for image captioning and visual question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00636"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Yu, Z., Yu, J., Cui, Y., Tao, D., and Tian, Q. (2019, January 15\u201320). Deep modular co-attention networks for visual question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00644"},{"key":"ref_20","unstructured":"Gao, P., You, H., Zhang, Z., Wang, X., and Li, H. (November, January 27). Multi-modality latent interaction network for visual question answering. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Nguyen, D.K., and Okatani, T. (2018, January 18\u201323). Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00637"},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Rahman, T., Chou, S.H., Sigal, L., and Carenini, G. (2021, January 19\u201325). An Improved Attention for Visual Question Answering. 
Proceedings of the 2021 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA.","DOI":"10.1109\/CVPRW53098.2021.00181"},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Yang, X., Tang, K., Zhang, H., and Cai, J. (2019, January 15\u201320). Auto-encoding scene graphs for image captioning. Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01094"},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Gu, J., Joty, S., Cai, J., Zhao, H., Yang, X., and Wang, G. (2019, January 15\u201320). Unpaired image captioning via scene graph alignments. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Long Beach, CA, USA.","DOI":"10.1109\/ICCV.2019.01042"},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Han, C., Long, S., Luo, S., Wang, K., and Poon, J. (2020, January 8\u201313). VICTR: Visual Information Captured Text Representation for Text-to-Vision Multimodal Tasks. Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain.","DOI":"10.18653\/v1\/2020.coling-main.277"},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Wang, S., Wang, R., Yao, Z., Shan, S., and Chen, X. (2020, January 1\u20135). Cross-modal scene graph matching for relationship-aware image-text retrieval. Proceedings of the IEEE\/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA.","DOI":"10.1109\/WACV45572.2020.9093614"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Luo, S., Han, S.C., Sun, K., and Poon, J. (2020, January 18\u201322). REXUP: I REason, I EXtract, I UPdate with Structured Compositional Reasoning for Visual Question Answering. 
Proceedings of the International Conference on Neural Information Processing, Bangkok, Thailand.","DOI":"10.1007\/978-3-030-63830-6_44"},{"key":"ref_28","unstructured":"Hudson, D., and Manning, C.D. (2019, January 8\u201314). Learning by abstraction: The neural state machine. Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Haurilet, M., Roitberg, A., and Stiefelhagen, R. (2019, January 15\u201320). It\u2019s Not About the Journey; It\u2019s About the Destination: Following Soft Paths Under Question-Guidance for Visual Reasoning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00203"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Nuthalapati, S.V., Chandradevan, R., Giunchiglia, E., Li, B., Kayser, M., Lukasiewicz, T., and Yang, C. (2021, January 1\u20135). Lightweight Visual Question Answering Using Scene Graphs. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual Event.","DOI":"10.1145\/3459637.3482218"},{"key":"ref_31","first-page":"5914","article-title":"SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning","volume":"36","author":"Wang","year":"2022","journal-title":"Proc. AAAI Conf. Artif. Intell."},{"key":"ref_32","unstructured":"Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019, January 2\u20137). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistic (NAACL), Minneapolis, MN, USA."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"32","DOI":"10.1007\/s11263-016-0981-7","article-title":"Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations","volume":"123","author":"Krishna","year":"2017","journal-title":"Int. J. Comput. Vis."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"2552","DOI":"10.1109\/TPAMI.2014.2339814","article-title":"Word Spotting and Recognition with Embedded Attributes","volume":"36","author":"Gordo","year":"2014","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_35","unstructured":"Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (2017, January 4\u20139). Attention is All you Need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA."},{"key":"ref_36","unstructured":"Krasin, I., Duerig, T., Alldrin, N., Ferrari, V., Abu-El-Haija, S., Kuznetsova, A., Rom, H., Uijlings, J., Popov, S., and Veit, A. (2017). OpenImages: A Public Dataset for Large-Scale Multi-Label and Multi-Class Image Classification. 2, 18. Available online: https:\/\/github.com\/openimages."},{"key":"ref_37","unstructured":"Biten, A.F., Tito, R., Mafla, A., Gomez, L., Rusinol, M., Valveny, E., Jawahar, C., and Karatzas, D. (November, January 27). Scene text visual question answering. Proceedings of the IEEE\/CVF International Conference on Computer Vision, Seoul, Republic of Korea."},{"key":"ref_38","unstructured":"Veit, A., Matera, T., Neumann, L., Matas, J., and Belongie, S. (2016). COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images. arXiv."},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grauman, K., Luo, J., and Bigham, J.P. (2018, January 18\u201323). 
VizWiz Grand Challenge: Answering Visual Questions from Blind People. Proceedings of the 2018 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00380"},{"key":"ref_40","doi-asserted-by":"crossref","unstructured":"Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Bigorda, L.G.i., Mestre, S.R., Mas, J., Mota, D.F., Almaz\u00e0n, J.A., and de las Heras, L.P. (2013, January 25\u201328). ICDAR 2013 Robust Reading Competition. Proceedings of the 2013 12th International Conference on Document Analysis and Recognition (ICDAR \u201913), Washington, DC, USA.","DOI":"10.1109\/ICDAR.2013.221"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., and Lu, S. (2015, January 23\u201326). ICDAR 2015 competition on Robust Reading. Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia.","DOI":"10.1109\/ICDAR.2015.7333942"},{"key":"ref_42","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Fei-Fei, L. (2009, January 20\u201325). ImageNet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_43","doi-asserted-by":"crossref","unstructured":"Mishra, A., Alahari, K., and Jawahar, C. (2013, January 1\u20138). Image Retrieval Using Textual Cues. Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia.","DOI":"10.1109\/ICCV.2013.378"},{"key":"ref_44","doi-asserted-by":"crossref","unstructured":"Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. (2017, January 21\u201326). Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. 
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.670"},{"key":"ref_45","doi-asserted-by":"crossref","unstructured":"Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., and Girshick, R. (2017, January 21\u201326). CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.215"},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Hudson, D.A., and Manning, C.D. (2019, January 15\u201320). GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. Proceedings of the 2019 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00686"}],"container-title":["Robotics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2218-6581\/12\/4\/114\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:27:09Z","timestamp":1760128029000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2218-6581\/12\/4\/114"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,8,7]]},"references-count":46,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2023,8]]}},"alternative-id":["robotics12040114"],"URL":"https:\/\/doi.org\/10.3390\/robotics12040114","relation":{},"ISSN":["2218-6581"],"issn-type":[{"value":"2218-6581","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,8,7]]}}}