{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,17]],"date-time":"2025-09-17T15:12:19Z","timestamp":1758121939868,"version":"3.41.0"},"reference-count":67,"publisher":"Association for Computing Machinery (ACM)","issue":"3","license":[{"start":{"date-parts":[[2025,2,19]],"date-time":"2025-02-19T00:00:00Z","timestamp":1739923200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"crossref","award":["62441604 and 62476093"],"award-info":[{"award-number":["62441604 and 62476093"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2025,3,31]]},"abstract":"<jats:p>\n            Visual Information Extraction (VIE) has experienced substantial growth and heightened interest due to its pivotal role in intelligent document processing. However, most existing related pre-trained models typically can only process the data from a certain (set of) language(s)\u2014often just English, representing a distinct limitation. To solve it, we present a\n            <jats:italic>L<\/jats:italic>\n            anguage-subst\n            <jats:italic>i<\/jats:italic>\n            tutable\n            <jats:italic>L<\/jats:italic>\n            ayout-image\n            <jats:italic>T<\/jats:italic>\n            ransformer (LiLTv2). It can be pre-trained just once on mono-lingual documents and then collaborate with off-the-shelf textual models in other languages during fine-tuning. Firstly, LiLTv2 utilizes a new dual-stream model architecture, one stream for substitutable text information and the other for layout and image information. Then, LiLTv2 has improved upon the optimization strategy and the diverse tasks adopted in the pre-training stage. Finally, we innovatively propose a teacher-student knowledge distillation learning with segment-level multi-modal features named SegKD. Extensive experimental results on widely used benchmarks can demonstrate the superior effectiveness of our method.\n          <\/jats:p>","DOI":"10.1145\/3708351","type":"journal-article","created":{"date-parts":[[2024,12,11]],"date-time":"2024-12-11T18:18:50Z","timestamp":1733941130000},"page":"1-27","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":4,"title":["LiLTv2: Language-substitutable Layout-image Transformer for Visual Information Extraction"],"prefix":"10.1145","volume":"21","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2060-3488","authenticated-orcid":false,"given":"Jiapeng","family":"Wang","sequence":"first","affiliation":[{"name":"Department of Electronics and Information, South China University of Technology, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0009-8681-311X","authenticated-orcid":false,"given":"Zening","family":"Lin","sequence":"additional","affiliation":[{"name":"Department of Electronics and Information, South China University of Technology, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-8248-2906","authenticated-orcid":false,"given":"Dayi","family":"Huang","sequence":"additional","affiliation":[{"name":"Kingsoft Office, Zhuhai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-2029-8391","authenticated-orcid":false,"given":"Longfei","family":"Xiong","sequence":"additional","affiliation":[{"name":"Kingsoft Office, Zhuhai, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5456-0957","authenticated-orcid":false,"given":"Lianwen","family":"Jin","sequence":"additional","affiliation":[{"name":"South China University of Technology, Guangzhou, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"320","published-online":{"date-parts":[[2025,2,19]]},"reference":[{"key":"e_1_3_3_2_2","first-page":"993","article-title":"DocFormer: End-to-end transformer for document understanding","author":"Appalaraju Srikar","year":"2021","unstructured":"Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, and R. Manmatha. 2021. DocFormer: End-to-end transformer for document understanding. In ICCV, 993\u20131003.","journal-title":"ICCV"},{"key":"e_1_3_3_3_2","first-page":"21","article-title":"Layer normalization","volume":"1050","author":"Ba Jimmy Lei","year":"2016","unstructured":"Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. Stat 1050 (2016), 21.","journal-title":"Stat"},{"key":"e_1_3_3_4_2","first-page":"642","article-title":"UniLMv2: Pseudo-masked language models for unified language model pre-training","author":"Bao Hangbo","year":"2020","unstructured":"Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Jianfeng Gao, Songhao Piao, Ming Zhou, et al. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training. In ICML, 642\u2013652.","journal-title":"ICML"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00051"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1145\/1870121.1870123"},{"key":"e_1_3_3_7_2","doi-asserted-by":"publisher","DOI":"10.1145\/2808200"},{"key":"e_1_3_3_8_2","first-page":"3576","article-title":"InfoXLM: An information-theoretic framework for cross-lingual language model pre-training","author":"Chi Zewen","year":"2021","unstructured":"Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2021. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. In NAACL-HLT, 3576\u20133588.","journal-title":"NAACL-HLT"},{"key":"e_1_3_3_9_2","first-page":"688","article-title":"The significance of reading order in document recognition and its evaluation","author":"Clausner Christian","year":"2013","unstructured":"Christian Clausner, Stefan Pletschacher, and Apostolos Antonacopoulos. 2013. The significance of reading order in document recognition and its evaluation. In ICDAR, 688\u2013692.","journal-title":"ICDAR"},{"key":"e_1_3_3_10_2","first-page":"8440","article-title":"Unsupervised cross-lingual representation learning at scale","author":"Conneau Alexis","year":"2020","unstructured":"Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm\u00e1n, \u00c9douard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In ACL, 8440\u20138451.","journal-title":"ACL"},{"key":"e_1_3_3_11_2","first-page":"433","article-title":"Smartfix: A requirements-driven system for document analysis and understanding","author":"Dengel Andreas R.","year":"2002","unstructured":"Andreas R. Dengel and Bertin Klein. 2002. Smartfix: A requirements-driven system for document analysis and understanding. In DAS Workshop, 433\u2013444.","journal-title":"DAS Workshop"},{"key":"e_1_3_3_12_2","article-title":"BERTgrid: Contextualized embedding for 2D document representation and understanding","author":"Denk Timo I.","year":"2019","unstructured":"Timo I. Denk and Christian Reisswig. 2019. BERTgrid: Contextualized embedding for 2D document representation and understanding. In NeurIPS Document Intelligence Workshop.","journal-title":"NeurIPS Document Intelligence Workshop"},{"key":"e_1_3_3_13_2","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 4171\u20134186.","journal-title":"NAACL-HLT"},{"key":"e_1_3_3_14_2","article-title":"An image is worth 16x16 words: Transformers for image recognition at scale","author":"Dosovitskiy Alexey","year":"2020","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR.","journal-title":"ICLR"},{"key":"e_1_3_3_15_2","first-page":"39","article-title":"UniDoc: Unified pretraining framework for document understanding","volume":"34","author":"Gu Jiuxiang","year":"2021","unstructured":"Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Rajiv Jain, Nikolaos Barmpalios, Ani Nenkova, and Tong Sun. 2021. UniDoc: Unified pretraining framework for document understanding. NeurIPS 34 (2021), 39\u201350.","journal-title":"NeurIPS"},{"key":"e_1_3_3_16_2","first-page":"4583","article-title":"XYLayoutLM: Towards layout-aware multimodal networks for visually-rich document understanding","author":"Gu Zhangxuan","year":"2022","unstructured":"Zhangxuan Gu, Changhua Meng, Ke Wang, Jun Lan, Weiqiang Wang, Ming Gu, and Liqing Zhang. 2022. XYLayoutLM: Towards layout-aware multimodal networks for visually-rich document understanding. In CVPR, 4583\u20134592.","journal-title":"CVPR"},{"key":"e_1_3_3_17_2","first-page":"952","article-title":"Recursive XY cut using bounding boxes of connected components","volume":"2","author":"Ha Jaekyu","year":"1995","unstructured":"Jaekyu Ha, Robert M. Haralick, and Ihsin T. Phillips. 1995. Recursive XY cut using bounding boxes of connected components. In ICDAR, Vol. 2, 952\u2013955.","journal-title":"ICDAR"},{"key":"e_1_3_3_18_2","article-title":"Distilling the knowledge in a neural network","author":"Hinton Geoffrey E.","year":"2014","unstructured":"Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2014. Distilling the knowledge in a neural network. In NeurIPS Deep Learning Workshop.","journal-title":"NeurIPS Deep Learning Workshop"},{"key":"e_1_3_3_19_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v36i10.21322"},{"key":"e_1_3_3_20_2","article-title":"LoRA: Low-rank adaptation of large language models","author":"Hu Edward J.","year":"2021","unstructured":"Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. In ICLR.","journal-title":"ICLR"},{"key":"e_1_3_3_21_2","first-page":"4083","article-title":"LayoutLMv3: Pre-training for document AI with unified text and image masking","author":"Huang Yupan","year":"2022","unstructured":"Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. LayoutLMv3: Pre-training for document AI with unified text and image masking. In ACM Multimedia, 4083\u20134091.","journal-title":"ACM Multimedia"},{"key":"e_1_3_3_22_2","first-page":"1516","article-title":"ICDAR2019 competition on scanned receipt OCR and information extraction","author":"Huang Zheng","year":"2019","unstructured":"Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. 2019. ICDAR2019 competition on scanned receipt OCR and information extraction. In ICDAR, 1516\u20131520.","journal-title":"ICDAR"},{"key":"e_1_3_3_23_2","first-page":"246","article-title":"Learning information extraction patterns from examples","author":"Huffman Scott B.","year":"1995","unstructured":"Scott B. Huffman. 1995. Learning information extraction patterns from examples. In IJCAI, 246\u2013260.","journal-title":"IJCAI"},{"key":"e_1_3_3_24_2","first-page":"1","article-title":"FUNSD: A dataset for form understanding in noisy scanned documents","volume":"2","author":"Jaume Guillaume","year":"2019","unstructured":"Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. FUNSD: A dataset for form understanding in noisy scanned documents. In ICDAR Workshops, Vol. 2, 1\u20136.","journal-title":"ICDAR Workshops"},{"key":"e_1_3_3_25_2","first-page":"4163","article-title":"TinyBERT: Distilling BERT for natural language understanding","author":"Jiao Xiaoqi","year":"2020","unstructured":"Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. In Findings of EMNLP, 4163\u20134174.","journal-title":"Findings of EMNLP"},{"key":"e_1_3_3_26_2","first-page":"4459","article-title":"Chargrid: Towards understanding 2D documents","author":"Katti Anoop R.","year":"2018","unstructured":"Anoop R. Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes H\u00f6hne, and Jean Baptiste Faddoul. 2018. Chargrid: Towards understanding 2D documents. In EMNLP, 4459\u20134469.","journal-title":"EMNLP"},{"key":"e_1_3_3_27_2","first-page":"389","article-title":"VisualWordGrid: Information extraction from scanned documents using a multimodal approach","author":"Kerroumi Mohamed","year":"2021","unstructured":"Mohamed Kerroumi, Othmane Sayem, and Aymen Shabou. 2021. VisualWordGrid: Information extraction from scanned documents using a multimodal approach. In ICDAR, 389\u2013402.","journal-title":"ICDAR"},{"key":"e_1_3_3_28_2","first-page":"5583","article-title":"ViLT: Vision-and-language transformer without convolution or region supervision","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-language transformer without convolution or region supervision. In ICML, 5583\u20135594.","journal-title":"ICML"},{"key":"e_1_3_3_29_2","article-title":"Adam: A method for stochastic optimization","author":"Kingma Diederik P.","year":"2015","unstructured":"Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.","journal-title":"ICLR"},{"key":"e_1_3_3_30_2","first-page":"260","article-title":"Neural architectures for named entity recognition","author":"Lample Guillaume","year":"2016","unstructured":"Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL-HLT, 260\u2013270.","journal-title":"NAACL-HLT"},{"key":"e_1_3_3_31_2","first-page":"3735","article-title":"FormNet: Structural encoding beyond sequential modeling in form document information extraction","author":"Lee Chen-Yu","year":"2022","unstructured":"Chen-Yu Lee, Chun-Liang Li, Timothy Dozat, Vincent Perot, Guolong Su, Nan Hua, Joshua Ainslie, Renshen Wang, Yasuhisa Fujii, and Tomas Pfister. 2022. FormNet: Structural encoding beyond sequential modeling in form document information extraction. In ACL, 3735\u20133754.","journal-title":"ACL"},{"key":"e_1_3_3_32_2","doi-asserted-by":"crossref","first-page":"665","DOI":"10.1145\/1148170.1148307","article-title":"Building a test collection for complex document information processing","author":"Lewis David","year":"2006","unstructured":"David Lewis, Gady Agam, Shlomo Argamon, Ophir Frieder, David Grossman, and Jefferson Heard. 2006. Building a test collection for complex document information processing. In ACM SIGIR, 665\u2013666.","journal-title":"ACM SIGIR"},{"key":"e_1_3_3_33_2","first-page":"6309","article-title":"StructuralLM: Structural pre-training for form understanding","author":"Li Chenliang","year":"2021","unstructured":"Chenliang Li, Bin Bi, Ming Yan, Wei Wang, Songfang Huang, Fei Huang, and Luo Si. 2021. StructuralLM: Structural pre-training for form understanding. In ACL, 6309\u20136318.","journal-title":"ACL"},{"key":"e_1_3_3_34_2","first-page":"5652","article-title":"SelfDoc: Self-supervised document representation learning","author":"Li Peizhao","year":"2021","unstructured":"Peizhao Li, Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Rajiv Jain, Varun Manjunatha, and Hongfu Liu. 2021. SelfDoc: Self-supervised document representation learning. In CVPR, 5652\u20135660.","journal-title":"CVPR"},{"key":"e_1_3_3_35_2","first-page":"1912","article-title":"StrucTexT: Structured text understanding with multi-modal transformers","author":"Li Yulin","year":"2021","unstructured":"Yulin Li, Yuxi Qian, Yuechen Yu, Xiameng Qin, Chengquan Zhang, Yan Liu, Kun Yao, Junyu Han, Jingtuo Liu, and Errui Ding. 2021. StrucTexT: Structured text understanding with multi-modal transformers. In ACM Multimedia, 1912\u20131920.","journal-title":"ACM Multimedia"},{"key":"e_1_3_3_36_2","first-page":"19584","article-title":"DocTr: Document transformer for structured information extraction in documents","author":"Liao Haofu","year":"2023","unstructured":"Haofu Liao, Aruni RoyChowdhury, Weijian Li, Ankan Bansal, Yuting Zhang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, and Vijay Mahadevan. 2023. DocTr: Document transformer for structured information extraction in documents. In ICCV, 19584\u201319594.","journal-title":"ICCV"},{"key":"e_1_3_3_37_2","first-page":"548","article-title":"ViBERTgrid: A jointly trained multi-modal 2D document representation for key information extraction from documents","author":"Lin Weihong","year":"2021","unstructured":"Weihong Lin, Qifang Gao, Lei Sun, Zhuoyao Zhong, Kai Hu, Qin Ren, and Qiang Huo. 2021. ViBERTgrid: A jointly trained multi-modal 2D document representation for key information extraction from documents. In ICDAR, 548\u2013563.","journal-title":"ICDAR"},{"key":"e_1_3_3_38_2","first-page":"32","article-title":"Graph convolution for multimodal information extraction from visually rich documents","author":"Liu Xiaojing","year":"2019","unstructured":"Xiaojing Liu, Feiyu Gao, Qiong Zhang, and Huasha Zhao. 2019. Graph convolution for multimodal information extraction from visually rich documents. In NAACL-HLT, 32\u201339.","journal-title":"NAACL-HLT"},{"key":"e_1_3_3_39_2","unstructured":"Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Mike Lewis Luke Zettlemoyer and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692. Retrieved from https:\/\/arxiv.org\/abs\/1907.11692"},{"key":"e_1_3_3_40_2","article-title":"Decoupled weight decay regularization","author":"Loshchilov Ilya","year":"2018","unstructured":"Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In ICLR.","journal-title":"ICLR"},{"key":"e_1_3_3_41_2","first-page":"7092","article-title":"GeoLayoutLM: Geometric pre-training for visual information extraction","author":"Luo Chuwei","year":"2023","unstructured":"Chuwei Luo, Changxu Cheng, Qi Zheng, and Cong Yao. 2023. GeoLayoutLM: Geometric pre-training for visual information extraction. In CVPR, 7092\u20137101.","journal-title":"CVPR"},{"key":"e_1_3_3_42_2","unstructured":"Aaron van den Oord Yazhe Li and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv:1807.03748. Retrieved from https:\/\/arxiv.org\/abs\/1807.03748"},{"key":"e_1_3_3_43_2","article-title":"CORD: A consolidated receipt dataset for post-OCR parsing","author":"Park Seunghyun","year":"2019","unstructured":"Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jaeheung Surh, Minjoon Seo, and Hwalsuk Lee. 2019. CORD: A consolidated receipt dataset for post-OCR parsing. In NeurIPS Document Intelligence Workshop.","journal-title":"NeurIPS Document Intelligence Workshop"},{"key":"e_1_3_3_44_2","first-page":"732","article-title":"Going full-tilt boogie on document understanding with text-image-layout transformer","author":"Powalski Rafa\u0142","year":"2021","unstructured":"Rafa\u0142 Powalski, \u0141ukasz Borchmann, Dawid Jurkiewicz, Tomasz Dwojak, Micha\u0142 Pietruszka, and Gabriela Pa\u0142ka. 2021. Going full-tilt boogie on document understanding with text-image-layout transformer. In ICDAR, 732\u2013747.","journal-title":"ICDAR"},{"key":"e_1_3_3_45_2","first-page":"751","article-title":"GraphIE: A graph-based framework for information extraction","author":"Qian Yujie","year":"2019","unstructured":"Yujie Qian, Enrico Santus, Zhijing Jin, Jiang Guo, and Regina Barzilay. 2019. GraphIE: A graph-based framework for information extraction. In NAACL-HLT, 751\u2013761.","journal-title":"NAACL-HLT"},{"key":"e_1_3_3_46_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2016.2577031"},{"key":"e_1_3_3_47_2","first-page":"2","article-title":"Automatically constructing a dictionary for information extraction tasks","volume":"1","author":"Riloff Ellen","year":"1993","unstructured":"Ellen Riloff. 1993. Automatically constructing a dictionary for information extraction tasks. In AAAI, Vol. 1, 2\u20131.","journal-title":"AAAI"},{"key":"e_1_3_3_48_2","doi-asserted-by":"publisher","DOI":"10.1145\/3267127"},{"key":"e_1_3_3_49_2","first-page":"101","article-title":"Intellix\u2013End-user trained information extraction for document archiving","author":"Schuster Daniel","year":"2013","unstructured":"Daniel Schuster, Klemens Muthmann, Daniel Esser, Alexander Schill, Michael Berger, Christoph Weidling, Kamil Aliyev, and A. Hofmeier. 2013. Intellix\u2013End-user trained information extraction for document archiving. In ICDAR, 101\u2013105.","journal-title":"ICDAR"},{"key":"e_1_3_3_50_2","doi-asserted-by":"publisher","DOI":"10.1145\/2379790.2379795"},{"key":"e_1_3_3_51_2","first-page":"1039","article-title":"MatchVIE: Exploiting match relevancy between entities for visual information extraction","author":"Tang Guozhi","year":"2021","unstructured":"Guozhi Tang, Lele Xie, Lianwen Jin, Jiapeng Wang, Jingdong Chen, Zhen Xu, Qianying Wang, Yaqiang Wu, and Hui Li. 2021. MatchVIE: Exploiting match relevancy between entities for visual information extraction. In IJCAI, 1039\u20131045.","journal-title":"IJCAI"},{"key":"e_1_3_3_52_2","first-page":"19254","article-title":"Unifying vision, text, and layout for universal document processing","author":"Tang Zineng","year":"2023","unstructured":"Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. 2023. Unifying vision, text, and layout for universal document processing. In CVPR, 19254\u201319264.","journal-title":"CVPR"},{"key":"e_1_3_3_53_2","first-page":"15200","article-title":"LayoutMask: Enhance text-layout interaction in multi-modal pre-training for document understanding","author":"Tu Yi","year":"2023","unstructured":"Yi Tu, Ya Guo, Huan Chen, and Jinyang Tang. 2023. LayoutMask: Enhance text-layout interaction in multi-modal pre-training for document understanding. In ACL, 15200\u201315212.","journal-title":"ACL"},{"key":"e_1_3_3_54_2","first-page":"6000","article-title":"Attention is all you need","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, 6000\u20136010.","journal-title":"NIPS"},{"key":"e_1_3_3_55_2","first-page":"7747","article-title":"LiLT: A simple yet effective language-independent layout transformer for structured document understanding","author":"Wang Jiapeng","year":"2022","unstructured":"Jiapeng Wang, Lianwen Jin, and Kai Ding. 2022. LiLT: A simple yet effective language-independent layout transformer for structured document understanding. In ACL, 7747\u20137757.","journal-title":"ACL"},{"key":"e_1_3_3_56_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v35i4.16378"},{"key":"e_1_3_3_57_2","first-page":"1082","article-title":"Tag, copy or predict: A unified weakly-supervised learning framework for visual information extraction using sequences","author":"Wang Jiapeng","year":"2021","unstructured":"Jiapeng Wang, Tianwei Wang, Guozhi Tang, Lianwen Jin, Weihong Ma, Kai Ding, and Yichao Huang. 2021. Tag, copy or predict: A unified weakly-supervised learning framework for visual information extraction using sequences. In IJCAI, 1082\u20131090.","journal-title":"IJCAI"},{"key":"e_1_3_3_58_2","first-page":"4735","article-title":"LayoutReader: Pre-training of text and layout for reading order detection","author":"Wang Zilong","year":"2021","unstructured":"Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, and Furu Wei. 2021. LayoutReader: Pre-training of text and layout for reading order detection. In EMNLP, 4735\u20134744.","journal-title":"EMNLP"},{"key":"e_1_3_3_59_2","first-page":"1192","article-title":"LayoutLM: Pre-training of text and layout for document image understanding","author":"Xu Yiheng","year":"2020","unstructured":"Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. LayoutLM: Pre-training of text and layout for document image understanding. In ACM SIGKDD, 1192\u20131200.","journal-title":"ACM SIGKDD"},{"key":"e_1_3_3_60_2","unstructured":"Yiheng Xu Tengchao Lv Lei Cui Guoxin Wang Yijuan Lu Dinei Florencio Cha Zhang and Furu Wei. 2021. LayoutXLM: Multimodal pre-training for multilingual visually-rich document understanding. arXiv:2104.08836. Retrieved from https:\/\/arxiv.org\/abs\/2104.08836"},{"key":"e_1_3_3_61_2","first-page":"2579","article-title":"LayoutLMv2: Multi-modal pre-training for visually-rich document understanding","author":"Xu Yang","year":"2021","unstructured":"Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, et al. 2021. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In ACL, 2579\u20132591.","journal-title":"ACL"},{"key":"e_1_3_3_62_2","doi-asserted-by":"publisher","DOI":"10.1145\/2240136.2240141"},{"key":"e_1_3_3_63_2","first-page":"15358","article-title":"Modeling entities as semantic points for visual information extraction in the wild","author":"Yang Zhibo","year":"2023","unstructured":"Zhibo Yang, Rujiao Long, Pengfei Wang, Sibo Song, Humen Zhong, Wenqing Cheng, Xiang Bai, and Cong Yao. 2023. Modeling entities as semantic points for visual information extraction in the wild. In CVPR, 15358\u201315367.","journal-title":"CVPR"},{"key":"e_1_3_3_64_2","first-page":"4363","article-title":"PICK: Processing key information extraction from documents using improved graph learning-convolutional networks","author":"Yu Wenwen","year":"2021","unstructured":"Wenwen Yu, Ning Lu, Xianbiao Qi, Ping Gong, and Rong Xiao. 2021. PICK: Processing key information extraction from documents using improved graph learning-convolutional networks. In ICPR, 4363\u20134370.","journal-title":"ICPR"},{"key":"e_1_3_3_65_2","article-title":"StrucTexTv2: Masked visual-textual prediction for document image pre-training","author":"Yu Yuechen","year":"2023","unstructured":"Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, and Jingdong Wang. 2023. StrucTexTv2: Masked visual-textual prediction for document image pre-training. In ICLR.","journal-title":"ICLR"},{"key":"e_1_3_3_66_2","first-page":"13716","article-title":"Reading order matters: Information extraction from visually-rich documents by token path prediction","author":"Zhang Chong","year":"2023","unstructured":"Chong Zhang, Ya Guo, Yi Tu, Huan Chen, Jinyang Tang, Huijia Zhu, Qi Zhang, and Tao Gui. 2023. Reading order matters: Information extraction from visually-rich documents by token path prediction. In EMNLP, 13716\u201313730.","journal-title":"EMNLP"},{"key":"e_1_3_3_67_2","first-page":"1413","article-title":"TRIE: End-to-end text reading and information extraction for document understanding","author":"Zhang Peng","year":"2020","unstructured":"Peng Zhang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Jing Lu, Liang Qiao, Yi Niu, and Fei Wu. 2020. TRIE: End-to-end text reading and information extraction for document understanding. In ACM Multimedia, 1413\u20131422.","journal-title":"ACM Multimedia"},{"key":"e_1_3_3_68_2","doi-asserted-by":"publisher","DOI":"10.1109\/TMM.2022.3214102"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3708351","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3708351","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,19]],"date-time":"2025-06-19T01:09:45Z","timestamp":1750295385000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3708351"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,19]]},"references-count":67,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,3,31]]}},"alternative-id":["10.1145\/3708351"],"URL":"https:\/\/doi.org\/10.1145\/3708351","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"type":"print","value":"1551-6857"},{"type":"electronic","value":"1551-6865"}],"subject":[],"published":{"date-parts":[[2025,2,19]]},"assertion":[{"value":"2024-07-24","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2024-11-18","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-02-19","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}