{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,14]],"date-time":"2026-02-14T05:31:59Z","timestamp":1771047119723,"version":"3.50.1"},"publisher-location":"New York, NY, USA","reference-count":40,"publisher":"ACM","license":[{"start":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T00:00:00Z","timestamp":1665360000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"NSFC","award":["No.61906018"],"award-info":[{"award-number":["No.61906018"]}]},{"name":"NSFC","award":["No.62076032"],"award-info":[{"award-number":["No.62076032"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,10,10]]},"DOI":"10.1145\/3503161.3548172","type":"proceedings-article","created":{"date-parts":[[2022,10,10]],"date-time":"2022-10-10T15:43:12Z","timestamp":1665416592000},"page":"4909-4920","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":5,"title":["A Region-based Document VQA"],"prefix":"10.1145","author":[{"given":"Xinya","family":"Wu","sequence":"first","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, China"}]},{"given":"Duo","family":"Zheng","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, China"}]},{"given":"Ruonan","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, China"}]},{"given":"Jiashen","family":"Sun","sequence":"additional","affiliation":[{"name":"Meituan Group, Beijing, China"}]},{"given":"Minzhen","family":"Hu","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, 
China"}]},{"given":"Fangxiang","family":"Feng","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, China"}]},{"given":"Xiaojie","family":"Wang","sequence":"additional","affiliation":[{"name":"Beijing University of Posts and Telecommunications, Beijing, China"}]},{"given":"Huixing","family":"Jiang","sequence":"additional","affiliation":[{"name":"Meituan Group, Beijing, China"}]},{"given":"Fan","family":"Yang","sequence":"additional","affiliation":[{"name":"Meituan Group, Beijing, China"}]}],"member":"320","published-online":{"date-parts":[[2022,10,10]]},"reference":[{"key":"e_1_3_2_2_1_1","volume-title":"Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018","author":"Anderson Peter","year":"2018","unstructured":"Peter Anderson , Xiaodong He , Chris Buehler , Damien Teney , Mark Johnson , Stephen Gould , and Lei Zhang . 2018 . Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018 , Salt Lake City, UT, USA, June 18--22 , 2018. IEEE Computer Society, 6077--6086. https:\/\/doi.org\/10.1109\/CVPR.2018.00636 Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18--22, 2018. IEEE Computer Society, 6077--6086. https:\/\/doi.org\/10.1109\/CVPR.2018.00636"},{"key":"e_1_3_2_2_2_1","volume-title":"VQA: Visual Question Answering. In 2015 IEEE International Conference on Computer Vision, ICCV 2015","author":"Antol Stanislaw","year":"2015","unstructured":"Stanislaw Antol , Aishwarya Agrawal , Jiasen Lu , Margaret Mitchell , Dhruv Batra , C. 
Lawrence Zitnick , and Devi Parikh . 2015 . VQA: Visual Question Answering. In 2015 IEEE International Conference on Computer Vision, ICCV 2015 , Santiago, Chile, December 7--13 , 2015. IEEE Computer Society, 2425--2433. https:\/\/doi.org\/10.1109\/ICCV.2015.279 Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7--13, 2015. IEEE Computer Society, 2425--2433. https:\/\/doi.org\/10.1109\/ICCV.2015.279"},{"key":"e_1_3_2_2_3_1","unstructured":"Srikar Appalaraju Bhavan Jasani Bhargava Urala Kota Yusheng Xie and R. Manmatha. 20. DocFormer: End-to-End Transformer for Document Understanding. ArXiv preprint Vol. abs\/ (20). https:\/\/arxiv.org\/abs\/  Srikar Appalaraju Bhavan Jasani Bhargava Urala Kota Yusheng Xie and R. Manmatha. 20. DocFormer: End-to-End Transformer for Document Understanding. ArXiv preprint Vol. abs\/ (20). https:\/\/arxiv.org\/abs\/"},{"key":"e_1_3_2_2_4_1","volume-title":"Scene Text Visual Question Answering. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019","author":"Biten Ali Furkan","year":"2019","unstructured":"Ali Furkan Biten , Rub\u00e8 n Tito , Andr\u00e9 s Mafla , Llu'i s G\u00f3 mez i Bigorda , Marcc al Rusi n ol, C. V. Jawahar , Ernest Valveny , and Dimosthenis Karatzas . 2019 . Scene Text Visual Question Answering. In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019 , Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 4290--4300. https:\/\/doi.org\/10.1109\/ICCV.2019.00439 Ali Furkan Biten, Rub\u00e8 n Tito, Andr\u00e9 s Mafla, Llu'i s G\u00f3 mez i Bigorda, Marcc al Rusi n ol, C. V. Jawahar, Ernest Valveny, and Dimosthenis Karatzas. 2019. Scene Text Visual Question Answering. 
In 2019 IEEE\/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 4290--4300. https:\/\/doi.org\/10.1109\/ICCV.2019.00439"},{"key":"e_1_3_2_2_5_1","volume-title":"DMRM: A Dual-Channel Multi-Hop Reasoning Model for Visual Dialog. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI","author":"Chen Feilong","year":"2020","unstructured":"Feilong Chen , Fandong Meng , Jiaming Xu , Peng Li , Bo Xu , and Jie Zhou . 2020 . DMRM: A Dual-Channel Multi-Hop Reasoning Model for Visual Dialog. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7--12, 2020. AAAI Press , 7504--7511. https:\/\/aaai.org\/ojs\/index.php\/AAAI\/article\/view\/6248 Feilong Chen, Fandong Meng, Jiaming Xu, Peng Li, Bo Xu, and Jie Zhou. 2020. DMRM: A Dual-Channel Multi-Hop Reasoning Model for Visual Dialog. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7--12, 2020. AAAI Press, 7504--7511. https:\/\/aaai.org\/ojs\/index.php\/AAAI\/article\/view\/6248"},{"key":"e_1_3_2_2_6_1","volume-title":"Visual Dialog. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017","author":"Das Abhishek","year":"2017","unstructured":"Abhishek Das , Satwik Kottur , Khushi Gupta , Avi Singh , Deshraj Yadav , Jos\u00e9 M. F. Moura , Devi Parikh , and Dhruv Batra . 2017 . Visual Dialog. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 , Honolulu, HI, USA, July 21--26 , 2017. IEEE Computer Society, 1080--1089. 
https:\/\/doi.org\/10.1109\/CVPR.2017.121 Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Jos\u00e9 M. F. Moura, Devi Parikh, and Dhruv Batra. 2017. Visual Dialog. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. IEEE Computer Society, 1080--1089. https:\/\/doi.org\/10.1109\/CVPR.2017.121"},{"key":"e_1_3_2_2_7_1","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin , Ming-Wei Chang , Kenton Lee , and Kristina Toutanova . 2019 . BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171--4186. https:\/\/doi.org\/10. 18653\/v1\/N19--1423 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171--4186. https:\/\/doi.org\/10.18653\/v1\/N19--1423"},{"key":"e_1_3_2_2_8_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1044"},{"key":"e_1_3_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1648"},{"key":"e_1_3_2_2_10_1","volume-title":"LAMBERT: Layout-Aware language Modeling using BERT for information extraction.","author":"Garncarek Lukasz","year":"2020","unstructured":"Lukasz Garncarek , Rafa? 
Powalski, Tomasz Stanisawek , Bartosz Topolski , Piotr Halama , and Filip Grali'ski . 2020 . LAMBERT: Layout-Aware language Modeling using BERT for information extraction. Lukasz Garncarek, Rafa? Powalski, Tomasz Stanisawek, Bartosz Topolski, Piotr Halama, and Filip Grali'ski. 2020. LAMBERT: Layout-Aware language Modeling using BERT for information extraction."},{"key":"e_1_3_2_2_11_1","volume-title":"Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017","author":"Goyal Yash","year":"2017","unstructured":"Yash Goyal , Tejas Khot , Douglas Summers-Stay , Dhruv Batra , and Devi Parikh . 2017 . Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 , Honolulu, HI, USA, July 21--26 , 2017. IEEE Computer Society, 6325--6334. https:\/\/doi.org\/10.1109\/CVPR.2017.670 Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. IEEE Computer Society, 6325--6334. https:\/\/doi.org\/10.1109\/CVPR.2017.670"},{"key":"e_1_3_2_2_12_1","volume-title":"VizWiz Grand Challenge: Answering Visual Questions From Blind People. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018","author":"Gurari Danna","year":"2018","unstructured":"Danna Gurari , Qing Li , Abigale J. Stangl , Anhong Guo , Chi Lin , Kristen Grauman , Jiebo Luo , and Jeffrey P. Bigham . 2018 . VizWiz Grand Challenge: Answering Visual Questions From Blind People. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018 , Salt Lake City, UT, USA, June 18--22 , 2018 . 
IEEE Computer Society, 3608--3617. https:\/\/doi.org\/10.1109\/CVPR.2018.00380 Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. 2018. VizWiz Grand Challenge: Answering Visual Questions From Blind People. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18--22, 2018. IEEE Computer Society, 3608--3617. https:\/\/doi.org\/10.1109\/CVPR.2018.00380"},{"key":"e_1_3_2_2_13_1","volume-title":"Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020","author":"Hu Ronghang","year":"2020","unstructured":"Ronghang Hu , Amanpreet Singh , Trevor Darrell , and Marcus Rohrbach . 2020 . Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020 , Seattle, WA, USA, June 13--19 , 2020. IEEE, 9989--9999. https:\/\/doi.org\/10.1109\/CVPR42600.2020.01001 Ronghang Hu, Amanpreet Singh, Trevor Darrell, and Marcus Rohrbach. 2020. Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA. In 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13--19, 2020. IEEE, 9989--9999. https:\/\/doi.org\/10.1109\/CVPR42600.2020.01001"},{"key":"e_1_3_2_2_14_1","volume-title":"CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017","author":"Johnson Justin","year":"2017","unstructured":"Justin Johnson , Bharath Hariharan , Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. 2017 . CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. 
In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 , Honolulu, HI, USA, July 21--26 , 2017 . IEEE Computer Society, 1988--1997. https:\/\/doi.org\/10.1109\/CVPR.2017.215 Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. 2017. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. IEEE Computer Society, 1988--1997. https:\/\/doi.org\/10.1109\/CVPR.2017.215"},{"key":"e_1_3_2_2_15_1","volume-title":"ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In 8th International Conference on Learning Representations, ICLR 2020","author":"Lan Zhenzhong","year":"2020","unstructured":"Zhenzhong Lan , Mingda Chen , Sebastian Goodman , Kevin Gimpel , Piyush Sharma , and Radu Soricut . 2020 . ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In 8th International Conference on Learning Representations, ICLR 2020 , Addis Ababa, Ethiopia, April 26--30 , 2020. OpenReview.net. https:\/\/openreview.net\/forum?id=H1eA7AEtvS Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26--30, 2020. OpenReview.net. https:\/\/openreview.net\/forum?id=H1eA7AEtvS"},{"key":"e_1_3_2_2_16_1","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Li Chenliang","year":"2021","unstructured":"Chenliang Li , Bin Bi , Ming Yan , Wei Wang , Songfang Huang , Fei Huang , and Luo Si. 2021. 
StructuralLM: Structural Pre-training for Form Understanding . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) . Association for Computational Linguistics , Online , 6309--6318. https:\/\/doi.org\/10. 1865 3\/v1\/2021.acl-long.493 Chenliang Li, Bin Bi, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, and Luo Si. 2021. StructuralLM: Structural Pre-training for Form Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 6309--6318. https:\/\/doi.org\/10.18653\/v1\/2021.acl-long.493"},{"key":"e_1_3_2_2_17_1","unstructured":"Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Michael Lewis Luke Zettlemoyer and Veselin Stoyanov. 2020. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv preprint Vol. abs\/ (2020). https:\/\/arxiv.org\/abs\/  Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Michael Lewis Luke Zettlemoyer and Veselin Stoyanov. 2020. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv preprint Vol. abs\/ (2020). https:\/\/arxiv.org\/abs\/"},{"key":"e_1_3_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-70139-4"},{"key":"e_1_3_2_2_19_1","unstructured":"Minesh Mathew Dimosthenis Karatzas and C. V. Jawahar. 20. DocVQA: A Dataset for VQA on Document Images. ArXiv preprint Vol. abs\/ (20). https:\/\/arxiv.org\/abs\/  Minesh Mathew Dimosthenis Karatzas and C. V. Jawahar. 20. DocVQA: A Dataset for VQA on Document Images. ArXiv preprint Vol. abs\/ (20). 
https:\/\/arxiv.org\/abs\/"},{"key":"e_1_3_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-86331-9_47"},{"key":"e_1_3_2_2_21_1","doi-asserted-by":"crossref","unstructured":"Yuxi Qian Yuncong Hu Ruonan Wang Fangxiang Feng and Xiaojie Wang. 2022. Question-Driven Graph Fusion Network For Visual Question Answering. arxiv: 2204.00975 [cs.CV]  Yuxi Qian Yuncong Hu Ruonan Wang Fangxiang Feng and Xiaojie Wang. 2022. Question-Driven Graph Fusion Network For Visual Question Answering. arxiv: 2204.00975 [cs.CV]","DOI":"10.1109\/ICME52920.2022.9859591"},{"key":"e_1_3_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P18-2124"},{"key":"e_1_3_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1264"},{"key":"e_1_3_2_2_24_1","volume-title":"5th International Conference on Learning Representations, ICLR","author":"Seo Min Joon","year":"2017","unstructured":"Min Joon Seo , Aniruddha Kembhavi , Ali Farhadi , and Hannaneh Hajishirzi . 2017. Bidirectional Attention Flow for Machine Comprehension . In 5th International Conference on Learning Representations, ICLR 2017 , Toulon, France, April 24--26, 2017, Conference Track Proceedings. OpenReview .net. https:\/\/openreview.net\/forum?id=HJ0UKP9ge Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional Attention Flow for Machine Comprehension. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24--26, 2017, Conference Track Proceedings. OpenReview.net. https:\/\/openreview.net\/forum?id=HJ0UKP9ge"},{"key":"e_1_3_2_2_25_1","volume-title":"Peng Fu, and Weiping Wang.","author":"Si Qingyi","year":"2021","unstructured":"Qingyi Si , Zheng Lin , Ming yu Zheng , Peng Fu, and Weiping Wang. 2021 . Check It Again:Progressive Visual Question Answering via Visual Entailment. 
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 4101--4110. https:\/\/doi.org\/10.18653\/v1\/2021.acl-long.317 Qingyi Si, Zheng Lin, Ming yu Zheng, Peng Fu, and Weiping Wang. 2021. Check It Again:Progressive Visual Question Answering via Visual Entailment. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 4101--4110. https:\/\/doi.org\/10.18653\/v1\/2021.acl-long.317"},{"key":"e_1_3_2_2_26_1","volume-title":"Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards VQA Models That Can Read. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019","author":"Singh Amanpreet","year":"2019","unstructured":"Amanpreet Singh , Vivek Natarajan , Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards VQA Models That Can Read. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019 , Long Beach, CA, USA, June 16--20 , 2019 . Computer Vision Foundation \/ IEEE, 8317--8326. https:\/\/doi.org\/10.1109\/CVPR.2019.00851 Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards VQA Models That Can Read. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16--20, 2019. Computer Vision Foundation \/ IEEE, 8317--8326. https:\/\/doi.org\/10.1109\/CVPR.2019.00851"},{"key":"e_1_3_2_2_27_1","unstructured":"Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is All you Need. 
In Neural Information Processing Systems.  Ashish Vaswani Noam Shazeer Niki Parmar Jakob Uszkoreit Llion Jones Aidan N. Gomez Lukasz Kaiser and Illia Polosukhin. 2017. Attention is All you Need. In Neural Information Processing Systems."},{"key":"e_1_3_2_2_28_1","doi-asserted-by":"crossref","unstructured":"Ruonan Wang Yuxi Qian Fangxiang Feng Xiaojie Wang and Huixing Jiang. 2022. Co-VQA : Answering by Interactive Sub Question Sequence. https:\/\/doi.org\/10.48550\/ARXIV.2204.00879  Ruonan Wang Yuxi Qian Fangxiang Feng Xiaojie Wang and Huixing Jiang. 2022. Co-VQA : Answering by Interactive Sub Question Sequence. https:\/\/doi.org\/10.48550\/ARXIV.2204.00879","DOI":"10.18653\/v1\/2022.findings-acl.188"},{"key":"e_1_3_2_2_29_1","volume-title":"LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining","author":"Xu Yiheng","year":"2020","unstructured":"Yiheng Xu , Minghao Li , Lei Cui , Shaohan Huang , Furu Wei , and Ming Zhou . 2020 . LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , Virtual Event, CA, USA, August 23--27 , 2020, Rajesh Gupta, Yan Liu, Jiliang Tang, and B. Aditya Prakash (Eds.). ACM, 1192--1200. https:\/\/dl.acm.org\/doi\/10.1145\/3394486.3403172 Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23--27, 2020, Rajesh Gupta, Yan Liu, Jiliang Tang, and B. Aditya Prakash (Eds.). ACM, 1192--1200. 
https:\/\/dl.acm.org\/doi\/10.1145\/3394486.3403172"},{"key":"e_1_3_2_2_30_1","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Xu Yang","year":"2021","unstructured":"Yang Xu , Yiheng Xu , Tengchao Lv , Lei Cui , Furu Wei , Guoxin Wang , Yijuan Lu , Dinei Florencio , Cha Zhang , Wanxiang Che , Min Zhang , and Lidong Zhou . 2021b. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) . Association for Computational Linguistics , Online , 2579--2591. https:\/\/doi.org\/10.18653\/v1\/2021.acl-long.201 Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, and Lidong Zhou. 2021b. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 2579--2591. https:\/\/doi.org\/10.18653\/v1\/2021.acl-long.201"},{"key":"e_1_3_2_2_31_1","volume-title":"Proceedings of the 32nd British Machine Vision Conference (BMVC).","author":"Xu Zipeng","year":"2021","unstructured":"Zipeng Xu , Fandong Meng , Xiaojie Wang , Duo Zheng , Chenxu Lv , and Jie Zhou . 2021 a. modeling Explicit Concerning States for Reinforcement Learning in Visual Dialogue . In Proceedings of the 32nd British Machine Vision Conference (BMVC). Zipeng Xu, Fandong Meng, Xiaojie Wang, Duo Zheng, Chenxu Lv, and Jie Zhou. 2021a. modeling Explicit Concerning States for Reinforcement Learning in Visual Dialogue. 
In Proceedings of the 32nd British Machine Vision Conference (BMVC)."},{"key":"e_1_3_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.523"},{"key":"e_1_3_2_2_33_1","volume-title":"Le","author":"Yu Adams Wei","year":"2018","unstructured":"Adams Wei Yu , David Dohan , Minh-Thang Luong , Rui Zhao , Kai Chen , Mohammad Norouzi , and Quoc V . Le . 2018 . QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview .net. https:\/\/openreview.net\/forum?id=B14TlG-RW Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net. https:\/\/openreview.net\/forum?id=B14TlG-RW"},{"key":"e_1_3_2_2_34_1","volume-title":"Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering. In IEEE International Conference on Computer Vision, ICCV 2017","author":"Yu Zhou","year":"2017","unstructured":"Zhou Yu , Jun Yu , Jianping Fan , and Dacheng Tao . 2017 . Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering. In IEEE International Conference on Computer Vision, ICCV 2017 , Venice, Italy, October 22--29 , 2017. IEEE Computer Society, 1839--1848. https:\/\/doi.org\/10.1109\/ICCV.2017.202 Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. 2017. Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22--29, 2017. IEEE Computer Society, 1839--1848. 
https:\/\/doi.org\/10.1109\/ICCV.2017.202"},{"key":"e_1_3_2_2_35_1","unstructured":"Li Yulin Yuxi Qian Yuechen Yu Xiameng Qin Chengquan Zhang Yan Liu Yao Kun Han Junyu Jingtuo Liu and Errui Ding. 20. StrucTexT: Structured Text Understanding with Multi-Modal Transformers. ArXiv preprint Vol. abs\/ (20). https:\/\/arxiv.org\/abs\/  Li Yulin Yuxi Qian Yuechen Yu Xiameng Qin Chengquan Zhang Yan Liu Yao Kun Han Junyu Jingtuo Liu and Errui Ding. 20. StrucTexT: Structured Text Understanding with Multi-Modal Transformers. ArXiv preprint Vol. abs\/ (20). https:\/\/arxiv.org\/abs\/"},{"key":"e_1_3_2_2_36_1","doi-asserted-by":"crossref","unstructured":"Zhenrong Zhang Jiefeng Ma Jun Du Licheng Wang and Jianshu Zhang. 2022. Multimodal Pre-training Based on Graph Attention Network for Document Understanding.  Zhenrong Zhang Jiefeng Ma Jun Du Licheng Wang and Jianshu Zhang. 2022. Multimodal Pre-training Based on Graph Attention Network for Document Understanding.","DOI":"10.1109\/TMM.2022.3214102"},{"key":"e_1_3_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6510"},{"key":"e_1_3_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6511"},{"key":"e_1_3_2_2_39_1","volume-title":"Retrospective reader for machine reading comprehension. ArXiv preprint","author":"Zhang Zhuosheng","year":"2020","unstructured":"Zhuosheng Zhang , Junjie Yang , and Hai Zhao . 2020c. Retrospective reader for machine reading comprehension. ArXiv preprint , Vol. abs\/ 2001 .09694 ( 2020 ). https:\/\/arxiv.org\/abs\/2001.09694 Zhuosheng Zhang, Junjie Yang, and Hai Zhao. 2020c. Retrospective reader for machine reading comprehension. ArXiv preprint, Vol. abs\/2001.09694 (2020). https:\/\/arxiv.org\/abs\/2001.09694"},{"key":"e_1_3_2_2_40_1","doi-asserted-by":"crossref","unstructured":"Duo Zheng Zipeng Xu Fandong Meng Xiaojie Wang Jiaan Wang and Jie Zhou. 2021. Enhancing Visual Dialog Questioner with Entity-based Strategy Learning and Augmented Guesser.. 
In Empirical Methods in Natural Language Processing.  Duo Zheng Zipeng Xu Fandong Meng Xiaojie Wang Jiaan Wang and Jie Zhou. 2021. Enhancing Visual Dialog Questioner with Entity-based Strategy Learning and Augmented Guesser.. In Empirical Methods in Natural Language Processing.","DOI":"10.18653\/v1\/2021.findings-emnlp.158"}],"event":{"name":"MM '22: The 30th ACM International Conference on Multimedia","location":"Lisboa Portugal","acronym":"MM '22","sponsor":["SIGMM ACM Special Interest Group on Multimedia"]},"container-title":["Proceedings of the 30th ACM International Conference on Multimedia"],"original-title":[],"link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548172","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3503161.3548172","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T19:00:20Z","timestamp":1750186820000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3503161.3548172"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,10,10]]},"references-count":40,"alternative-id":["10.1145\/3503161.3548172","10.1145\/3503161"],"URL":"https:\/\/doi.org\/10.1145\/3503161.3548172","relation":{},"subject":[],"published":{"date-parts":[[2022,10,10]]},"assertion":[{"value":"2022-10-10","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}