{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:40:45Z","timestamp":1760060445335,"version":"build-2065373602"},"reference-count":39,"publisher":"MDPI AG","issue":"9","license":[{"start":{"date-parts":[[2025,9,1]],"date-time":"2025-09-01T00:00:00Z","timestamp":1756684800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Computers"],"abstract":"<jats:p>Question answering from visually rich documents (VRDs) is the task of retrieving the correct answer to a natural language question by considering the content of textual and visual elements in the document, as well as the pages\u2019 layout. To answer closed-ended questions that require a deep understanding of the hierarchical relationships between the elements, i.e., the full document-level understanding (FDU) task, state-of-the-art graph-based approaches to FDU model the pairwise element relationships in a graph model. Although they incorporate logical links (e.g., a caption refers to a figure) and spatial ones (e.g., a caption is placed below the figure), they currently disregard the semantic similarity among multimodal document elements, thus potentially yielding suboptimal scoring of the elements\u2019 relevance to the input question. In this paper, we propose GATS-FDU, a new graph attention network tailored to FDU. GATS-FDU is trained to jointly consider multiple document facets, i.e., the logical, spatial, and semantic elements\u2019 relationships. 
The results show that our approach achieves superior performance compared to several baseline methods.<\/jats:p>","DOI":"10.3390\/computers14090362","type":"journal-article","created":{"date-parts":[[2025,9,2]],"date-time":"2025-09-02T08:23:38Z","timestamp":1756801418000},"page":"362","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["A Graph Attention Network Combining Multifaceted Element Relationships for Full Document-Level Understanding"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-3605-1577","authenticated-orcid":false,"given":"Lorenzo","family":"Vaiani","sequence":"first","affiliation":[{"name":"Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi, 24, 10129 Torino, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9077-4103","authenticated-orcid":false,"given":"Davide","family":"Napolitano","sequence":"additional","affiliation":[{"name":"Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi, 24, 10129 Torino, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-7185-5247","authenticated-orcid":false,"given":"Luca","family":"Cagliero","sequence":"additional","affiliation":[{"name":"Dipartimento di Automatica e Informatica, Politecnico di Torino, Corso Duca degli Abruzzi, 24, 10129 Torino, Italy"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,9,1]]},"reference":[{"key":"ref_1","unstructured":"Ding, Y., Han, S.C., Lee, J., and Hovy, E. (2025). Deep Learning based Visually Rich Document Content Understanding: A Survey. arXiv."},{"key":"ref_2","unstructured":"Barboule, C., Piwowarski, B., and Chabot, Y. (2025). Survey on Question Answering over Visually Rich Documents: Methods, Challenges, and Trends. 
arXiv."},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"De Francisci Morales, G., Perlich, C., Ruchansky, N., Kourtellis, N., Baralis, E., and Bonchi, F. (2023, January 18\u201322). PDF-VQA: A New Dataset for Real-World VQA on PDF Documents. Proceedings of the Machine Learning and Knowledge Discovery in Databases: Applied Data Science and Demo Track, Turin, Italy.","DOI":"10.1007\/978-3-031-43430-3"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Han, S.C., Ding, Y., Luo, S., Poon, J., Yoon, H., Huang, Z., Duuring, P., and Holden, E.J. (2023). Workshop on Document Intelligence Understanding. arXiv.","DOI":"10.1145\/3583780.3615312"},{"key":"ref_5","unstructured":"Napolitano, D., Vaiani, L., and Cagliero, L. (2023). Enhancing BERT-Based Visual Question Answering through Keyword-Driven Sentence Selection. arXiv."},{"key":"ref_6","unstructured":"Gu, N., Gao, Y., and Hahnloser, R.H.R. (2023). MemSum-DQA: Adapting An Efficient Long Document Extractive Summarizer for Document Question Answering. arXiv."},{"key":"ref_7","unstructured":"Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., and Clark, J. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv."},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Huang, Y., Lv, T., Cui, L., Lu, Y., and Wei, F. (2022). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. arXiv.","DOI":"10.1145\/3503161.3548112"},{"key":"ref_9","unstructured":"Ku, L., Martins, A., and Srikumar, V. (2024, January 11\u201316). 3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding. Proceedings of the Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Tan, H., and Bansal, M. (2019). LXMERT: Learning Cross-Modality Encoder Representations from Transformers. 
arXiv.","DOI":"10.18653\/v1\/D19-1514"},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Hu, R., Singh, A., Darrell, T., and Rohrbach, M. (2020, January 13\u201319). Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA. Proceedings of the 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA.","DOI":"10.1109\/CVPR42600.2020.01001"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"109834","DOI":"10.1016\/j.patcog.2023.109834","article-title":"Hierarchical multimodal transformers for Multipage DocVQA","volume":"144","author":"Tito","year":"2023","journal-title":"Pattern Recogn."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Chen, J., Lv, T., Cui, L., Zhang, C., and Wei, F. (2022). XDoc: Unified Pre-training for Cross-Format Document Understanding. arXiv.","DOI":"10.18653\/v1\/2022.findings-emnlp.71"},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Kang, L., Tito, R., Valveny, E., and Karatzas, D. (2024). Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism. arXiv.","DOI":"10.1007\/978-3-031-70552-6_13"},{"key":"ref_15","unstructured":"Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., and Chang, K.W. (2019). VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Peng, Q., Pan, Y., Wang, W., Luo, B., Zhang, Z., Huang, Z., Hu, T., Yin, W., Chen, Y., and Zhang, Y. (2022). ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding. arXiv.","DOI":"10.18653\/v1\/2022.findings-emnlp.274"},{"key":"ref_17","unstructured":"Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P.J. (2023). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 
arXiv."},{"key":"ref_18","doi-asserted-by":"crossref","unstructured":"Blau, T., Fogel, S., Ronen, R., Golts, A., Ganz, R., Avraham, E.B., Aberdam, A., Tsiper, S., and Litman, R. (2024). GRAM: Global Reasoning for Multi-Page VQA. arXiv.","DOI":"10.1109\/CVPR52733.2024.01477"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Luo, C., Shen, Y., Zhu, Z., Zheng, Q., Yu, Z., and Yao, C. (2024). LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding. arXiv.","DOI":"10.1109\/CVPR52733.2024.01480"},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Liu, C., Yin, K., Cao, H., Jiang, X., Li, X., Liu, Y., Jiang, D., Sun, X., and Xu, L. (2024). HRVDA: High-Resolution Visual Document Assistant. arXiv.","DOI":"10.1109\/CVPR52733.2024.01471"},{"key":"ref_21","unstructured":"Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv."},{"key":"ref_22","unstructured":"Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv."},{"key":"ref_23","unstructured":"Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., and Cao, Y. (2022). EVA: Exploring the Limits of Masked Visual Representation Learning at Scale. arXiv.","DOI":"10.1109\/CVPR52729.2023.01855"},{"key":"ref_25","unstructured":"Li, J., Li, D., Xiong, C., and Hoi, S.C.H. (2022). BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. 
arXiv."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Jiang, T., Song, M., Zhang, Z., Huang, H., Deng, W., Sun, F., Zhang, Q., Wang, D., and Zhuang, F. (2024). E5-V: Universal Embeddings with Multimodal Large Language Models. arXiv.","DOI":"10.18653\/v1\/2024.findings-emnlp.181"},{"key":"ref_27","unstructured":"Kim, W., Son, B., and Kim, I. (2021). ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. arXiv."},{"key":"ref_28","unstructured":"Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., and Tang, J. (2025). Qwen2.5-VL Technical Report. arXiv."},{"key":"ref_29","unstructured":"Koukounas, A., Mastrapas, G., G\u00fcnther, M., Wang, B., Martens, S., Mohr, I., Sturua, S., Akram, M.K., Mart\u00ednez, J.F., and Ognawala, S. (2024). Jina CLIP: Your CLIP Model Is Also Your Text Retriever. arXiv."},{"key":"ref_30","unstructured":"Sun, Q., Fang, Y., Wu, L., Wang, X., and Cao, Y. (2023). EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv."},{"key":"ref_31","unstructured":"Veli\u010dkovi\u0107, P., Cucurull, G., Casanova, A., Romero, A., Li\u00f2, P., and Bengio, Y. (2018). Graph Attention Networks. arXiv."},{"key":"ref_32","unstructured":"Loshchilov, I., and Hutter, F. (2017). Fixing Weight Decay Regularization in Adam. arXiv."},{"key":"ref_33","unstructured":"Islam, P., Kannappan, A., Kiela, D., Qian, R., Scherrer, N., and Vidgen, B. (2023). FinanceBench: A New Benchmark for Financial Question Answering. arXiv."},{"key":"ref_34","unstructured":"Lai, M., Menini, S., Polignano, M., Russo, V., Sprugnoli, R., and Venturi, G. (2023, January 7\u20138). PoliTo at MULTI-Fake-DetectiVE: Improving FND-CLIP for Multimodal Italian Fake News Detection. Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), Parma, Italy."},{"key":"ref_35","unstructured":"Korhonen, A., Traum, D., and M\u00e0rquez, L. 
(2019, July 28\u2013August 2). BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy."},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Masry, A., Long, D., Tan, J.Q., Joty, S., and Hoque, E. (2022, January 22\u201327). ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland.","DOI":"10.18653\/v1\/2022.findings-acl.177"},{"key":"ref_37","unstructured":"Ferro, N., Maistro, M., Pasi, G., Alonso, O., Trotman, A., and Verberne, S. (2025, January 13\u201318). Graph-Based Multimodal Contrastive Learning for Chart Question Answering. Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2025, Padua, Italy."},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Huang, Z., Chen, K., He, J., Bai, X., Karatzas, D., Lu, S., and Jawahar, C.V. (2019, January 20\u201325). ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction. Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia.","DOI":"10.1109\/ICDAR.2019.00244"},{"key":"ref_39","doi-asserted-by":"crossref","unstructured":"Van Landeghem, J., Borchmann, L., Tito, R., Pietruszka, M., Jurkiewicz, D., Powalski, R., Joziak, P., Biswas, S., Coustaty, M., and Stanis\u0142awek, T. (2023, January 21\u201326). ICDAR 2023 Competition on Document UnderstanDing of Everything (DUDE). 
Proceedings of the ICDAR 2023, San Jos\u00e9, CA, USA.","DOI":"10.1007\/978-3-031-41679-8_24"}],"container-title":["Computers"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/9\/362\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:36:54Z","timestamp":1760035014000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2073-431X\/14\/9\/362"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,1]]},"references-count":39,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2025,9]]}},"alternative-id":["computers14090362"],"URL":"https:\/\/doi.org\/10.3390\/computers14090362","relation":{},"ISSN":["2073-431X"],"issn-type":[{"type":"electronic","value":"2073-431X"}],"subject":[],"published":{"date-parts":[[2025,9,1]]}}}