{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,1]],"date-time":"2026-06-01T23:35:01Z","timestamp":1780356901944,"version":"3.54.1"},"reference-count":318,"publisher":"Association for Computing Machinery (ACM)","issue":"1","funder":[{"DOI":"10.13039\/501100001809","name":"National Science Foundation of China","doi-asserted-by":"crossref","award":["62376057"],"award-info":[{"award-number":["62376057"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/501100012226","name":"Fundamental Research Funds for the Central Universities","doi-asserted-by":"crossref","award":["2242025K30024"],"award-info":[{"award-number":["2242025K30024"]}],"id":[{"id":"10.13039\/501100012226","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Start-up Research Fund of Southeast University","award":["RF1028623234"],"award-info":[{"award-number":["RF1028623234"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Inf. Syst."],"published-print":{"date-parts":[[2026,1,31]]},"abstract":"<jats:p>The rapid proliferation of documents has made document intelligence increasingly critical across various industries. In recent years, Large Language Models (LLMs) have dramatically transformed the field of document intelligence, allowing for more advanced and accurate document processing solutions. Despite these advancements, most existing surveys have failed to focus on these breakthroughs, instead concentrating on traditional methods and earlier machine learning techniques. This survey seeks to fill that gap by offering an in-depth analysis of approximately 300 papers published between 2021 and mid-2025, thus providing a comprehensive overview of the impact of LLMs in document intelligence. The key topics explored include Retrieval-Augmented Generation (RAG), long-context processing, and fine-tuning LLMs for document comprehension. 
Furthermore, the survey highlights essential datasets, practical applications, current challenges, and future research directions, offering critical insights for both researchers and industry practitioners looking to advance the field.<\/jats:p>","DOI":"10.1145\/3768156","type":"journal-article","created":{"date-parts":[[2025,9,16]],"date-time":"2025-09-16T13:18:09Z","timestamp":1758028689000},"page":"1-64","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":11,"title":["Large Language Models in Document Intelligence: A Comprehensive Survey, Recent Advances, Challenges, and Future Trends"],"prefix":"10.1145","volume":"44","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7352-1710","authenticated-orcid":false,"given":"Wenjun","family":"Ke","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, Southeast University, Nanjing, China and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, Nanjing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-9474-7377","authenticated-orcid":false,"given":"Yifan","family":"Zheng","sequence":"additional","affiliation":[{"name":"Beijing Institute of Computer Technology and Application, Beijing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0000-6191-7232","authenticated-orcid":false,"given":"Yining","family":"Li","sequence":"additional","affiliation":[{"name":"College of Software Engineering, Southeast University, Nanjing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0001-0373-5956","authenticated-orcid":false,"given":"Hengyuan","family":"Xu","sequence":"additional","affiliation":[{"name":"College of Software Engineering, Southeast University, Nanjing, 
China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0385-8988","authenticated-orcid":false,"given":"Dong","family":"Nie","sequence":"additional","affiliation":[{"name":"Meta Inc, Palo Alto, California, USA"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8782-857X","authenticated-orcid":false,"given":"Peng","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Southeast University, Nanjing, China and Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, Nanjing, China"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-3238-5884","authenticated-orcid":false,"given":"Yao","family":"He","sequence":"additional","affiliation":[{"name":"Institute of Collaborative Innovation, University of Macau, Macau, China"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"320","published-online":{"date-parts":[[2025,11,14]]},"reference":[{"key":"e_1_3_2_2_2","unstructured":"2025. Azure AI Services: Document Intelligence. Retrieved from https:\/\/learn.microsoft.com\/en-us\/azure\/ai-services\/document-intelligence\/model-overview"},{"key":"e_1_3_2_3_2","doi-asserted-by":"crossref","first-page":"1180962","DOI":"10.3389\/fphar.2023.1180962","article-title":"Approach to machine learning for extraction of real-world data variables from electronic health records","volume":"14","author":"Adamson Blythe","year":"2023","unstructured":"Blythe Adamson, Michael Waskom, Auriane Blarre, Jonathan Kelly, Konstantin Krismer, Sheila Nemeth, James Gippetti, John Ritten, Katherine Harrison, George Ho, et al. 2023. Approach to machine learning for extraction of real-world data variables from electronic health records. 
Frontiers in Pharmacology 14 (2023), 1180962.","journal-title":"Frontiers in Pharmacology"},{"key":"e_1_3_2_4_2","unstructured":"Meta AI. 2023. LLaMA: A Foundational 65-Billion-Parameter Large Language Model from Meta AI. Retrieved from https:\/\/ai.meta.com\/blog\/large-language-model-llama-meta-ai\/"},{"key":"e_1_3_2_5_2","unstructured":"Alibaba. 2023. Qwen technical report. arXiv:2309.16609. Retrieved from https:\/\/arxiv.org\/abs\/2309.16609"},{"key":"e_1_3_2_6_2","first-page":"2226","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Asai Akari","year":"2022","unstructured":"Akari Asai, Matt Gardner, and Hannaneh Hajishirzi. 2022. Evidentiality-guided generation for Knowledge-Intensive NLP tasks. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2226\u20132243. DOI: 10.18653\/v1\/2022.naacl-main.162"},{"key":"e_1_3_2_7_2","first-page":"228","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Ayala Orlando","year":"2024","unstructured":"Orlando Ayala and Patrice Bechard. 2024. Reducing hallucination in structured outputs via retrieval-augmented generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 228\u2013238. DOI: 10.18653\/v1\/2024.naacl-industry.19"},{"key":"e_1_3_2_8_2","unstructured":"Jinze Bai Shuai Bai Shusheng Yang Shijie Wang Sinan Tan Peng Wang Junyang Lin Chang Zhou and Jingren Zhou. 2024. Qwen-VL: A Versatile Vision-Language Model for Understanding Localization Text Reading and Beyond. 
Retrieved from https:\/\/openreview.net\/forum?id=qrGjFJVl3m"},{"key":"e_1_3_2_9_2","unstructured":"Shuai Bai Keqin Chen Xuejing Liu Jialin Wang Wenbin Ge Sibo Song Kai Dang Peng Wang Shijie Wang Jun Tang et al. 2025. Qwen2.5-VL Technical Report. arXiv:2502.13923. Retrieved from https:\/\/arxiv.org\/abs\/2502.13923"},{"key":"e_1_3_2_10_2","unstructured":"Iz Beltagy Matthew E. Peters and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv:2004.05150. Retrieved from https:\/\/arxiv.org\/abs\/2004.05150"},{"key":"e_1_3_2_11_2","first-page":"35522","article-title":"Unlimiformer: Long-range transformers with unlimited length input","volume":"36","author":"Bertsch Amanda","year":"2024","unstructured":"Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew Gormley. 2024. Unlimiformer: Long-range transformers with unlimited length input. Advances in Neural Information Processing Systems 36 (2024), 35522\u201335543.","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"6","key":"e_1_3_2_12_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3355610","article-title":"Document layout analysis: A comprehensive survey","volume":"52","author":"Binmakhashen Galal M.","year":"2019","unstructured":"Galal M. Binmakhashen and Sabri A. Mahmoud. 2019. Document layout analysis: A comprehensive survey. ACM Computing Surveys 52, 6 (2019), 1\u201336.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_2_13_2","unstructured":"Anjanava Biswas and Wrick Talukdar. 2024. Robustness of structured data extraction from in-plane rotated documents using multi-modal large language models (LLM). arXiv: 2406.10295. Retrieved from https:\/\/arxiv.org\/abs\/2406.10295"},{"key":"e_1_3_2_14_2","volume-title":"The 12th International Conference on Learning Representations (ICLR)","author":"Blecher Lukas","year":"2024","unstructured":"Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. 2024. 
Nougat: Neural optical understanding for academic documents. In The 12th International Conference on Learning Representations (ICLR). OpenReview.net. Retrieved from http:\/\/dblp.uni-trier.de\/db\/conf\/iclr\/iclr2024.html#BlecherCSS24"},{"key":"e_1_3_2_15_2","volume-title":"International Conference on Learning Representations","author":"Bolya Daniel","year":"2023","unstructured":"Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. 2023. Token merging: Your ViT but faster. In International Conference on Learning Representations."},{"key":"e_1_3_2_16_2","doi-asserted-by":"crossref","unstructured":"Luiz Bonifacio Hugo Abonizio Marzieh Fadaee and Rodrigo Nogueira. 2022. Inpars: Data augmentation for information retrieval using large language models. arXiv:2202.05144. Retrieved from https:\/\/arxiv.org\/abs\/2202.05144","DOI":"10.1145\/3477495.3531863"},{"key":"e_1_3_2_17_2","first-page":"1877","article-title":"Language models are few-shot learners","volume":"33","author":"Brown Tom","year":"2020","unstructured":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877\u20131901. Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2020\/file\/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf","journal-title":"Advances in Neural Information Processing Systems"},{"issue":"2","key":"e_1_3_2_18_2","first-page":"221","article-title":"What is a digital document","volume":"2","author":"Buckland Michael","year":"1998","unstructured":"Michael Buckland. 1998. What is a digital document. 
Document Num\u00e9rique 2, 2 (1998), 221\u2013230.","journal-title":"Document Num\u00e9rique"},{"key":"e_1_3_2_19_2","first-page":"1818","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops","author":"Caffagni Davide","year":"2024","unstructured":"Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sarto Sara, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. 2024. Wiki-LLaVA: Hierarchical retrieval-augmented generation for multimodal LLMs. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 1818\u20131826."},{"key":"e_1_3_2_20_2","first-page":"14536","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Cao Yihan","year":"2023","unstructured":"Yihan Cao, Shuyi Chen, Ryan Liu, Zhiruo Wang, and Daniel Fried. 2023. API-assisted code generation for question answering on varied table structures. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 14536\u201314548. DOI: 10.18653\/v1\/2023.emnlp-main.897"},{"key":"e_1_3_2_21_2","first-page":"45","volume-title":"Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR \u201923)","author":"Chakraborty Sagar","year":"2023","unstructured":"Sagar Chakraborty, Gaurav Harit, and Saptarshi Ghosh. 2023. TransDocAnalyser: A framework for semi-structured offline handwritten documents analysis with an application to legal domain. In Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR \u201923), Part I, 45\u201362. DOI: 10.1007\/978-3-031-41676-7_3"},{"key":"e_1_3_2_22_2","unstructured":"Harrison Chase. 2022. LangChain. Retrieved from https:\/\/github.com\/hwchase17\/langchain"},{"key":"e_1_3_2_23_2","unstructured":"Guanhua Chen Wenhan Yu and Lei Sha. 2024. 
Unlocking multi-view insights in knowledge-dense retrieval-augmented generation. arXiv:2404.12879. Retrieved from https:\/\/arxiv.org\/abs\/2404.12879"},{"key":"e_1_3_2_24_2","unstructured":"Shouyuan Chen Sherman Wong Liangjian Chen and Yuandong Tian. 2023. Extending context window of large language models via positional interpolation. arXiv:2306.15595. Retrieved from https:\/\/arxiv.org\/abs\/2306.15595"},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-eacl.83"},{"key":"e_1_3_2_26_2","doi-asserted-by":"crossref","first-page":"5558","DOI":"10.18653\/v1\/2022.emnlp-main.375","volume-title":"Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing","author":"Chen Wenhu","year":"2022","unstructured":"Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, and William Cohen. 2022. MuRAG: Multimodal retrieval-augmented generator for open question answering over images and text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 5558\u20135570. DOI: 10.18653\/v1\/2022.emnlp-main.375"},{"key":"e_1_3_2_27_2","unstructured":"Wenhu Chen Hongmin Wang Jianshu Chen Yunkai Zhang Hong Wang Shiyang Li Xiyou Zhou and William Yang Wang. 2019. Tabfact: A large-scale dataset for table-based fact verification. arXiv:1909.02164. Retrieved from https:\/\/arxiv.org\/abs\/1909.02164"},{"key":"e_1_3_2_28_2","unstructured":"Xinyue Chen Pengyu Gao Jiangjiang Song and Xiaoyang Tan. 2024. HiQA: A hierarchical contextual augmentation RAG for massive documents QA. arXiv:2402.01767. Retrieved from https:\/\/arxiv.org\/abs\/2402.01767"},{"issue":"2","key":"e_1_3_2_29_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3440756","article-title":"Text recognition in the wild: A survey","volume":"54","author":"Chen Xiaoxue","year":"2021","unstructured":"Xiaoxue Chen, Lianwen Jin, Yuanzhi Zhu, Canjie Luo, and Tianwei Wang. 2021. Text recognition in the wild: A survey. 
ACM Computing Surveys 54, 2 (2021), 1\u201335.","journal-title":"ACM Computing Surveys"},{"key":"e_1_3_2_30_2","unstructured":"Yufan Chen Ruiping Liu Junwei Zheng Di Wen Kunyu Peng Jiaming Zhang and Rainer Stiefelhagen. 2025. Graph-based document structure analysis. arXiv:2502.02501. Retrieved from https:\/\/arxiv.org\/abs\/2502.02501"},{"key":"e_1_3_2_31_2","unstructured":"Yukang Chen Shengju Qian Haotian Tang Xin Lai Zhijian Liu Song Han and Jiaya Jia. 2023. Longlora: Efficient fine-tuning of long-context large language models. arXiv:2309.12307. Retrieved from https:\/\/arxiv.org\/abs\/2309.12307"},{"key":"e_1_3_2_32_2","unstructured":"Zhe Chen Weiyun Wang Yue Cao Yangzhou Liu Zhangwei Gao Erfei Cui Jinguo Zhu Shenglong Ye Hao Tian Zhaoyang Liu et al. 2024. Expanding performance boundaries of open-source multimodal models with model data and test-time scaling. arXiv:2412.05271. Retrieved from https:\/\/arxiv.org\/abs\/arXiv.2412.05271"},{"issue":"12","key":"e_1_3_2_33_2","doi-asserted-by":"crossref","first-page":"220101","DOI":"10.1007\/s11432-024-4231-5","article-title":"How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites","volume":"67","author":"Chen Zhe","year":"2024","unstructured":"Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. 2024. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences 67, 12 (2024), 220101.","journal-title":"Science China Information Sciences"},{"key":"e_1_3_2_34_2","first-page":"24185","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Chen Zhe","year":"2024","unstructured":"Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu et al. 2024. 
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 24185\u201324198."},{"key":"e_1_3_2_35_2","first-page":"15138","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Cheng Hiuyi","year":"2023","unstructured":"Hiuyi Cheng, Peirong Zhang, Sihang Wu, Jiaxin Zhang, Qiyuan Zhu, Zecheng Xie, Jing Li, Kai Ding, and Lianwen Jin. 2023. M6doc: A large-scale multi-format, multi-type, multi-layout, multi-language, multi-annotation category dataset for modern document layout analysis. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 15138\u201315147."},{"key":"e_1_3_2_36_2","doi-asserted-by":"crossref","first-page":"3829","DOI":"10.18653\/v1\/2023.emnlp-main.232","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Chevalier Alexis","year":"2023","unstructured":"Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. 2023. Adapting language models to compress contexts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 3829\u20133846. DOI: 10.18653\/v1\/2023.emnlp-main.232"},{"key":"e_1_3_2_37_2","unstructured":"Jaemin Cho Debanjan Mahata Ozan Irsoy Yujie He and Mohit Bansal. 2024. M3docrag: Multi-modal retrieval is what you need for multi-page multi-document understanding. arXiv:2411.04952. Retrieved from https:\/\/arxiv.org\/abs\/2411.04952"},{"issue":"70","key":"e_1_3_2_38_2","first-page":"1","article-title":"Scaling instruction-finetuned language models","volume":"25","author":"Won Chung Hyung","year":"2024","unstructured":"Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2024. Scaling instruction-finetuned language models. 
Journal of Machine Learning Research 25, 70 (2024), 1\u201353.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_39_2","unstructured":"Lei Cui. 2021. Document AI: Benchmarks Models and Applications (Presentation@ICDAR 2021). Retrieved from https:\/\/www.microsoft.com\/en-us\/research\/publication\/document-ai-benchmarks-models-and-applications-presentationicdar-2021\/DIL workshop in ICDAR"},{"key":"e_1_3_2_40_2","first-page":"19405","volume-title":"2023 IEEE\/CVF International Conference on Computer Vision (ICCV)","author":"Da Cheng","year":"2023","unstructured":"Cheng Da, Chuwei Luo, Qi Zheng, and Cong Yao. 2023. Vision grid transformer for document layout analysis. In 2023 IEEE\/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, 19405\u201319415. DOI: 10.1109\/ICCV51070.2023.01783"},{"key":"e_1_3_2_41_2","first-page":"19462","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Da Cheng","year":"2023","unstructured":"Cheng Da, Chuwei Luo, Qi Zheng, and Cong Yao. 2023. Vision grid transformer for document layout analysis. In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 19462\u201319472."},{"key":"e_1_3_2_42_2","doi-asserted-by":"crossref","unstructured":"Longchao Da Parth Mitesh Shah Ananya Singh and Hua Wei. 2024. EvidenceChat: A RAG Enhanced LLM Framework for Trustworthy and Evidential Response Generation. In 2024 International Conference on Image Processing Computer Vision and Machine Learning (ICICML) 1532\u20131536. DOI: https:\/\/dx.doi.org\/10.1109\/ICICML63543.2024.10957858","DOI":"10.1109\/ICICML63543.2024.10957858"},{"key":"e_1_3_2_43_2","first-page":"2978","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Dai Zihang","year":"2019","unstructured":"Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. 
Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 2978\u20132988. DOI: 10.18653\/v1\/P19-1285"},{"key":"e_1_3_2_44_2","volume-title":"The 11th International Conference on Learning Representations","author":"Dai Zhuyun","year":"2023","unstructured":"Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith Hall, and Ming-Wei Chang. 2023. Promptagator: Few-shot dense retrieval from 8 examples. In The 11th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=gmL46YMpu2J"},{"key":"e_1_3_2_45_2","first-page":"16344","article-title":"Flashattention: Fast and memory-efficient exact attention with io-awareness","volume":"35","author":"Dao Tri","year":"2022","unstructured":"Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R\u00e9. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344\u201316359.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_46_2","first-page":"4599","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Dasigi Pradeep","year":"2021","unstructured":"Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. 2021. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4599\u20134610. 
DOI: 10.18653\/v1\/2021.naacl-main.365"},{"key":"e_1_3_2_47_2","first-page":"4171","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 4171\u20134186. DOI: 10.18653\/v1\/N19-1423"},{"key":"e_1_3_2_48_2","volume-title":"International Conference on Learning Representations","author":"Dinan Emily","year":"2019","unstructured":"Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019. Wizard of wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=r1l73iRqKm"},{"key":"e_1_3_2_49_2","unstructured":"Jiayu Ding Shuming Ma Li Dong Xingxing Zhang Shaohan Huang Wenhui Wang and Furu Wei. 2023. LongNet: Scaling transformers to 1 000 000 000 tokens. Retrieved from https:\/\/www.microsoft.com\/en-us\/research\/publication\/longnet-scaling-transformers-to-1000000000-tokens\/"},{"key":"e_1_3_2_50_2","doi-asserted-by":"crossref","first-page":"2807","DOI":"10.1145\/3539618.3591886","volume-title":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Ding Yihao","year":"2023","unstructured":"Yihao Ding, Siqu Long, Jiabin Huang, Kaixuan Ren, Xingxiang Luo, Hyunsuk Chung, and Soyeon Caren Han. 2023. Form-NLU: Dataset for the form natural language understanding. 
In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2807\u20132816."},{"key":"e_1_3_2_51_2","unstructured":"Yiran Ding Li Lyna Zhang Chengruidong Zhang Yuanyuan Xu Ning Shang Jiahang Xu Fan Yang and Mao Yang. 2024. LongRoPE: Extending LLM context window beyond 2 million tokens. arXiv:2402.13753. Retrieved from https:\/\/arxiv.org\/abs\/2402.13753"},{"key":"e_1_3_2_52_2","unstructured":"Kuicai Dong Yujing Chang Shijie Huang Yasheng Wang Ruiming Tang and Yong Liu. 2025. Benchmarking retrieval-augmented multimomal generation for document question answering. arXiv:2505.16470. Retrieved from https:\/\/arxiv.org\/abs\/2505.16470"},{"key":"e_1_3_2_53_2","first-page":"57","volume-title":"Proceedings of the 16th IAPR International Workshop on Document Analysis Systems (DAS \u201924)","author":"Dong Qi","year":"2024","unstructured":"Qi Dong, Lei Kang, and Dimosthenis Karatzas. 2024. Multi-page document VQA with recurrent memory transformer. In Proceedings of the 16th IAPR International Workshop on Document Analysis Systems (DAS \u201924), 57\u201370."},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/TPAMI.2025.3545453"},{"key":"e_1_3_2_55_2","first-page":"4026","volume-title":"Proceedings of the Computer Vision and Pattern Recognition Conference","author":"Duan Yuchen","year":"2025","unstructured":"Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, et al. 2025. Docopilot: Improving multimodal models for document-level understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, 4026\u20134037."},{"key":"e_1_3_2_56_2","unstructured":"Emozilla. 2023. Dynamically Scaled RoPE Further Increases Performance of Long Context LLaMA with Zero Fine-Tuning. 
Retrieved from https:\/\/www.reddit.com\/r\/LocalLLaMA\/comments\/14mrgpr\/dynamically_scaled_rope_further_increases\/"},{"key":"e_1_3_2_57_2","unstructured":"Felipe Escall\u00f3n et al. 2025. deepDoctection: A Document AI Python Library for Modular Layout Analysis and Information Extraction Pipelines. Open-source toolkit integrates layout OCR and NLP modules."},{"key":"e_1_3_2_58_2","unstructured":"Yixing Fan Qiang Yan Wenshan Wang Jiafeng Guo Ruqing Zhang and Xueqi Cheng. 2025. TrustRAG: An information assistant with retrieval augmented generation. arXiv:2502.13719. Retrieved from https:\/\/arxiv.org\/abs\/2502.13719"},{"key":"e_1_3_2_59_2","volume-title":"First Conference on Language Modeling","author":"Fang Junjie","year":"2024","unstructured":"Junjie Fang, Likai Tang, Hongzhe Bi, Yujia Qin, Si Sun, Zhenyu Li, Haolun Li, Yongjian Li, Xin Cong, and Yankai Lin, et al. 2024. UniMem: Towards a unified view of long-context large language models. In First Conference on Language Modeling. Retrieved from https:\/\/openreview.net\/forum?id=gQAEGSGVnN"},{"key":"e_1_3_2_60_2","article-title":"Large language models (LLMs) on tabular data: Prediction, generation, and understanding\u2014A survey","author":"Fang Xi","year":"2024","unstructured":"Xi Fang, Weijie Xu, Fiona Anting Tan, Ziqing Hu, Jiani Zhang, Yanjun Qi, Srinivasan H. Sengamedu, and Christos Faloutsos. 2024. Large language models (LLMs) on tabular data: Prediction, generation, and understanding\u2014A survey. Transactions on Machine Learning Research. 2024. Retrieved from https:\/\/openreview.net\/forum?id=IZnrCGF9WI","journal-title":"Transactions on Machine Learning Research"},{"key":"e_1_3_2_61_2","doi-asserted-by":"crossref","unstructured":"Hao Feng Qi Liu Hao Liu Jingqun Tang Wengang Zhou Houqiang Li and Can Huang. 2023. DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding. arXiv:2311.11810. 
Retrieved from https:\/\/arxiv.org\/abs\/2311.11810","DOI":"10.1007\/s11432-024-4250-y"},{"key":"e_1_3_2_62_2","unstructured":"Hao Feng Zijian Wang Jingqun Tang Jinghuie Lu Wengang Zhou Houqiang Li and Can Huang. 2023. Unidoc: A universal large multimodal model for simultaneous text detection recognition spotting and understanding. arXiv:2308.11592. Retrieved from https:\/\/arxiv.org\/abs\/2308.11592"},{"key":"e_1_3_2_63_2","unstructured":"Paulo Finardi Leonardo Avila Rodrigo Castaldoni Pedro Gengo Celio Larcher Marcos Piau Pablo Costa and Vinicius Carid\u00e1. 2024. The chronicles of RAG: The retriever the chunk and the generator. arXiv:2401.07883. Retrieved from https:\/\/arxiv.org\/abs\/2401.07883"},{"key":"e_1_3_2_64_2","unstructured":"Thibault Formal Carlos Lassance Benjamin Piwowarski and St\u00e9phane Clinchant. 2021. SPLADE v2: Sparse lexical and expansion model for information retrieval. arXiv:2109.10086. Retrieved from https:\/\/arxiv.org\/abs\/2109.10086"},{"key":"e_1_3_2_65_2","doi-asserted-by":"crossref","first-page":"2288","DOI":"10.1145\/3404835.3463098","volume-title":"Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Formal Thibault","year":"2021","unstructured":"Thibault Formal, Benjamin Piwowarski, and St\u00e9phane Clinchant. 2021. SPLADE: Sparse lexical and expansion model for first stage ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2288\u20132292."},{"key":"e_1_3_2_66_2","unstructured":"Ling Fu Zhebin Kuang Jiajun Song Mingxin Huang Biao Yang Yuzhe Li Linghao Zhu Qidi Luo Xinyu Wang Hao Lu et al. 2024. OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning. arXiv:2501.00321. 
Retrieved from https:\/\/arxiv.org\/abs\/2501.00321"},{"issue":"8","key":"e_1_3_2_67_2","doi-asserted-by":"crossref","first-page":"2435","DOI":"10.1016\/j.patcog.2008.03.015","article-title":"Forty years of research in character and document recognition\u2014An industrial perspective","volume":"41","author":"Fujisawa Hiromichi","year":"2008","unstructured":"Hiromichi Fujisawa. 2008. Forty years of research in character and document recognition\u2014An industrial perspective. Pattern Recognition 41, 8 (2008), 2435\u20132446.","journal-title":"Pattern Recognition"},{"key":"e_1_3_2_68_2","first-page":"10219","volume-title":"Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING \u201924)","author":"Fujitake Masato","year":"2024","unstructured":"Masato Fujitake. 2024. LayoutLLM: Large language model instruction tuning for visually rich document understanding. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING \u201924), 10219\u201310224. Retrieved from https:\/\/aclanthology.org\/2024.lrec-main.892"},{"key":"e_1_3_2_69_2","first-page":"1762","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Gao Luyu","year":"2023","unstructured":"Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2023. Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1762\u20131777. DOI: 10.18653\/v1\/2023.acl-long.99"},{"key":"e_1_3_2_70_2","unstructured":"Yunfan Gao Yun Xiong Xinyu Gao Kangxiang Jia Jinliu Pan Yuxi Bi Yi Dai Jiawei Sun Meng Wang and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv:2312.10997. 
Retrieved from https:\/\/arxiv.org\/abs\/2312.10997"},{"key":"e_1_3_2_71_2","volume-title":"The 12th International Conference on Learning Representations","author":"Ge Tao","year":"2024","unstructured":"Tao Ge, Jing Hu, Lei Wang, Xun Wang, Si-Qing Chen, and Furu Wei. 2024. In-context autoencoder for context compression in a large language model. In The 12th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=uREj4ZuGJE"},{"key":"e_1_3_2_72_2","first-page":"329","volume-title":"European Conference on Computer Vision","author":"Gemelli Andrea","year":"2022","unstructured":"Andrea Gemelli, Sanket Biswas, Enrico Civitelli, Josep Llad\u00f3s, and Simone Marinai. 2022. Doc2graph: A task agnostic document understanding framework based on graph neural networks. In European Conference on Computer Vision, 329\u2013344."},{"key":"e_1_3_2_73_2","doi-asserted-by":"crossref","first-page":"16137","DOI":"10.18653\/v1\/2023.emnlp-main.1003","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Gemmell Carlos","year":"2023","unstructured":"Carlos Gemmell and Jeff Dalton. 2023. ToolWriter: Question specific tool synthesis for tabular data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 16137\u201316148. DOI: 10.18653\/v1\/2023.emnlp-main.1003"},{"key":"e_1_3_2_74_2","first-page":"2701","volume-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Glass Michael","year":"2022","unstructured":"Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Naik, Pengshan Cai, and Alfio Gliozzo. 2022. Re2G: Retrieve, rerank, generate. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2701\u20132715. 
DOI: 10.18653\/v1\/2022.naacl-main.194"},{"key":"e_1_3_2_75_2","first-page":"3690","volume-title":"International Conference on Machine Learning","author":"Goyal Saurabh","year":"2020","unstructured":"Saurabh Goyal, Anamitra Roy Choudhury, Saurabh Raje, Venkatesan Chakaravarthy, Yogish Sabharwal, and Ashish Verma. 2020. Power-bert: Accelerating bert inference via progressive word-vector elimination. In International Conference on Machine Learning, 3690\u20133699."},{"key":"e_1_3_2_76_2","doi-asserted-by":"crossref","first-page":"1167","DOI":"10.1145\/3616855.3635739","volume-title":"Proceedings of the 17th ACM International Conference on Web Search and Data Mining","author":"Goyal Sagar","year":"2024","unstructured":"Sagar Goyal, Eti Rastogi, Sree Prasanna Rajagopal, Dong Yuan, Fen Zhao, Jai Chintagunta, Gautam Naik, and Jeff Ward. 2024. HealAI: A healthcare LLM for effective medical documentation. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, 1167\u20131168. DOI: 10.1145\/3616855.3635739"},{"key":"e_1_3_2_77_2","first-page":"39","volume-title":"Advances in Neural Information Processing Systems","volume":"34","author":"Gu Jiuxiang","year":"2021","unstructured":"Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Nikolaos Barmpalios, Rajiv Jain, Ani Nenkova, and Tong Sun. 2021. UniDoc: Unified pretraining framework for document understanding. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 39\u201350. Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2021\/file\/0084ae4bc24c0795d1e6a4f58444d39b-Paper.pdf"},{"key":"e_1_3_2_78_2","first-page":"434","volume-title":"Proceedings of the 2024 10th International Conference on Computing and Artificial Intelligence (ICCAI \u201924)","author":"Guan Che","year":"2024","unstructured":"Che Guan, Mengyu Huang, and Peng Zhang. 2024. MFORT-QA: Multi-hop few-shot open rich table question answering. 
In Proceedings of the 2024 10th International Conference on Computing and Artificial Intelligence (ICCAI \u201924). ACM, New York, NY, 434\u2013442. DOI: 10.1145\/3669754.3669822"},{"key":"e_1_3_2_79_2","volume-title":"Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025, Long Papers)","author":"Guan Shuhao","year":"2025","unstructured":"Shuhao Guan, Moule Lin, Cheng Xu, Xinyi Liu, Jinman Zhao, Jiexin Fan, Qi Xu, and Derek Greene. 2025. PrePOCR: A complete pipeline for document image restoration and enhanced OCR accuracy. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025, Long Papers). arXiv:2505.20429. Retrieved from https:\/\/arxiv.org\/abs\/2505.20429"},{"key":"e_1_3_2_80_2","first-page":"14","volume-title":"Proceedings of the 2024 6th International Conference on Pattern Recognition and Intelligent Systems","author":"Guo Jun","year":"2024","unstructured":"Jun Guo, Bojian Chen, Zhichao Zhao, Jindong He, Shichun Chen, Donglan Hu, and Hao Pan. 2024. BKRAG: A BGE reranker RAG for similarity analysis of power project requirements. In Proceedings of the 2024 6th International Conference on Pattern Recognition and Intelligent Systems, 14\u201320."},{"key":"e_1_3_2_81_2","unstructured":"Zengyuan Guo Yuechen Yu Pengyuan Lv Chengquan Zhang Haojie Li Zhihui Wang Kun Yao Jingtuo Liu and Jingdong Wang. 2022. TRUST: An accurate and end-to-end table structure recognizer using splitting-based transformers. arXiv:2208.14687. Retrieved from https:\/\/arxiv.org\/abs\/2208.14687"},{"key":"e_1_3_2_82_2","first-page":"1","volume-title":"2023 42nd IEEE International Conference of the Chilean Computer Science Society (SCCC)","author":"Guti\u00e9rrez-Ben\u00edtez Rodrigo","year":"2023","unstructured":"Rodrigo Guti\u00e9rrez-Ben\u00edtez, Alejandro Vald\u00e9s-Jim\u00e9nez, and Alejandra Segura-Navarrete. 2023. 
A parallel approach to text data augmentation for sentiment analysis using the PoS wise synonym substitution algorithm. In 2023 42nd IEEE International Conference of the Chilean Computer Science Society (SCCC). IEEE, 1\u20135."},{"key":"e_1_3_2_83_2","doi-asserted-by":"crossref","unstructured":"Chi Han Qifan Wang Wenhan Xiong Yu Chen Heng Ji and Sinong Wang. 2024. LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models. Retrieved from https:\/\/openreview.net\/forum?id=pOujzgHIRY","DOI":"10.18653\/v1\/2024.naacl-long.222"},{"key":"e_1_3_2_84_2","unstructured":"Siwei Han Peng Xia Ruiyi Zhang Tong Sun Yun Li Hongtu Zhu and Huaxiu Yao. 2025. Mdocagent: A multi-modal multi-agent framework for document understanding. arXiv:2503.13964. Retrieved from https:\/\/arxiv.org\/abs\/2503.13964"},{"key":"e_1_3_2_85_2","doi-asserted-by":"crossref","first-page":"87663","DOI":"10.1109\/ACCESS.2021.3087865","article-title":"Current status and performance analysis of table recognition in document images with deep neural networks","volume":"9","author":"Hashmi Khurram Azeem","year":"2021","unstructured":"Khurram Azeem Hashmi, Marcus Liwicki, Didier Stricker, Muhammad Adnan Afzal, Muhammad Ahtsham Afzal, and Muhammad Zeshan Afzal. 2021. Current status and performance analysis of table recognition in document images with deep neural networks. IEEE Access 9 (2021), 87663\u201387685.","journal-title":"IEEE Access"},{"key":"e_1_3_2_86_2","first-page":"2961","volume-title":"Proceedings of the IEEE International Conference on Computer Vision (ICCV),","author":"He Kaiming","year":"2017","unstructured":"Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. 2017. Mask R-CNN. 
In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2961\u20132969."},{"key":"e_1_3_2_87_2","volume-title":"Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks","author":"Hendrycks Dan","year":"2021","unstructured":"Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. 2021. CUAD: An expert-annotated NLP dataset for legal contract review. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. Retrieved from https:\/\/datasets-benchmarks-proceedings.neurips.cc\/paper_files\/paper\/2021\/file\/6ea9ab1baa0efb9e19094440c317e21b-Paper-round1.pdf"},{"key":"e_1_3_2_88_2","unstructured":"Anwen Hu Haiyang Xu Jiabo Ye Ming Yan Liang Zhang Bo Zhang Chen Li Ji Zhang Qin Jin Fei Huang et al. 2024. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. arXiv:2403.12895. Retrieved from https:\/\/arxiv.org\/abs\/2403.12895"},{"key":"e_1_3_2_89_2","first-page":"3096","volume-title":"Findings of the Association for Computational Linguistics (EMNLP \u201924)","author":"Hu Anwen","year":"2024","unstructured":"Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. 2024. mPLUG-DocOwl 1.5: Unified structure learning for OCR-free document understanding. In Findings of the Association for Computational Linguistics (EMNLP \u201924), 3096\u20133120."},{"key":"e_1_3_2_90_2","volume-title":"Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 5817\u20135834","author":"Hu Anwen","year":"2025","unstructured":"Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. 2025. mPLUG-DocOwl2: High-resolution compressing for OCR-free multi-page document understanding. 
In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 5817\u20135834."},{"key":"e_1_3_2_91_2","volume-title":"International Conference on Learning Representations","author":"Hu Edward J.","year":"2022","unstructured":"Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=nZeVKeeFYf9"},{"key":"e_1_3_2_92_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2023.110212"},{"key":"e_1_3_2_93_2","unstructured":"Xin Huang Ashish Khetan Milan Cvitkovic and Zohar Karnin. 2020. TabTransformer: Tabular data modeling using contextual embeddings. arXiv:2012.06678. Retrieved from https:\/\/arxiv.org\/abs\/2012.06678"},{"key":"e_1_3_2_94_2","doi-asserted-by":"crossref","first-page":"4083","DOI":"10.1145\/3503161.3548112","volume-title":"Proceedings of the 30th ACM International Conference on Multimedia","author":"Huang Yupan","year":"2022","unstructured":"Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. LayoutLMv3: Pre-training for document AI with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, 4083\u20134091. DOI: 10.1145\/3503161.3548112"},{"key":"e_1_3_2_95_2","unstructured":"Yulong Hui Yao Lu and Huanchen Zhang. 2024. Uda: A benchmark suite for retrieval augmented generation in real-world document analysis. arXiv:2406.15187. Retrieved from https:\/\/arxiv.org\/abs\/2406.15187"},{"key":"e_1_3_2_96_2","unstructured":"Pranab Islam Anand Kannappan Douwe Kiela Rebecca Qian Nino Scherrer and Bertie Vidgen. 2023. FinanceBench: A new benchmark for financial question answering. arXiv:2311.11944. 
Retrieved from https:\/\/arxiv.org\/abs\/2311.11944"},{"key":"e_1_3_2_97_2","unstructured":"Gautier Izacard Mathilde Caron Lucas Hosseini Sebastian Riedel Piotr Bojanowski Armand Joulin and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning. arXiv:2112.09118. Retrieved from https:\/\/arxiv.org\/abs\/2112.09118"},{"key":"e_1_3_2_98_2","doi-asserted-by":"crossref","unstructured":"Juan Izquierdo-Domenech Jordi Linares-Pellicer and Isabel Ferri-Molla. 2024. Virtual Reality and Language Models a New Frontier in Learning. DOI: https:\/\/dx.doi.org\/10.9781\/ijimai.2024.02.007","DOI":"10.9781\/ijimai.2024.02.007"},{"key":"e_1_3_2_99_2","first-page":"1","volume-title":"2019 International Conference on Document Analysis and Recognition Workshops (ICDARW)","author":"Jaume Guillaume","year":"2019","unstructured":"Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. Funsd: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), 1\u20136."},{"key":"e_1_3_2_100_2","unstructured":"Vitor Jeronymo Luiz Bonifacio Hugo Abonizio Marzieh Fadaee Roberto Lotufo Jakub Zavrel and Rodrigo Nogueira. 2023. Inpars-v2: Large language models as efficient dataset generators for information retrieval. arXiv:2301.01820. Retrieved from https:\/\/arxiv.org\/abs\/2301.01820"},{"key":"e_1_3_2_101_2","doi-asserted-by":"crossref","first-page":"13358","DOI":"10.18653\/v1\/2023.emnlp-main.825","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Jiang Huiqiang","year":"2023","unstructured":"Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 13358\u201313376. 
DOI: 10.18653\/v1\/2023.emnlp-main.825"},{"key":"e_1_3_2_102_2","doi-asserted-by":"crossref","first-page":"1658","DOI":"10.18653\/v1\/2024.acl-long.91","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Jiang Huiqiang","year":"2024","unstructured":"Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1658\u20131677. Retrieved from https:\/\/aclanthology.org\/2024.acl-long.91"},{"key":"e_1_3_2_103_2","doi-asserted-by":"crossref","first-page":"9237","DOI":"10.18653\/v1\/2023.emnlp-main.574","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Jiang Jinhao","year":"2023","unstructured":"Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023. StructGPT: A general framework for large language model to reason over structured data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 9237\u20139251. DOI: 10.18653\/v1\/2023.emnlp-main.574"},{"key":"e_1_3_2_104_2","doi-asserted-by":"crossref","first-page":"7969","DOI":"10.18653\/v1\/2023.emnlp-main.495","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Jiang Zhengbao","year":"2023","unstructured":"Zhengbao Jiang, Frank Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 7969\u20137992. 
DOI: 10.18653\/v1\/2023.emnlp-main.495"},{"key":"e_1_3_2_105_2","unstructured":"Hongye Jin Xiaotian Han Jingfeng Yang Zhimeng Jiang Chia-Yuan Chang and Xia Hu. 2024. GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length. Retrieved from https:\/\/openreview.net\/forum?id=vmlwllg7DJ"},{"key":"e_1_3_2_106_2","unstructured":"Mingyu Jin Qinkai Yu Dong Shu Chong Zhang Lizhou Fan Wenyue Hua Suiyuan Zhu Yanda Meng Zhenting Wang Mengnan Du et al. 2024. Health-LLM: Personalized retrieval-augmented disease prediction system. arXiv:2402.00746. Retrieved from https:\/\/arxiv.org\/abs\/2402.00746"},{"key":"e_1_3_2_107_2","unstructured":"Rihui Jin Yu Li Guilin Qi Nan Hu Yuan-Fang Li Jiaoyan Chen Jianan Wang Yongrui Chen Dehai Min and Sheng Bi. 2024. HGT: Leveraging heterogeneous graph-enhanced large language models for few-shot complex table understanding. arXiv:2403.19723. Retrieved from https:\/\/arxiv.org\/abs\/2403.19723"},{"key":"e_1_3_2_108_2","unstructured":"HyoJe Jung Yunha Kim Heejung Choi Hyeram Seo Minkyoung Kim JiYe Han Gaeun Kee Seohyun Park Soyoung Ko Byeolhee Kim et al. 2024. Enhancing clinical efficiency through LLM: Discharge note generation for cardiac patients. arXiv:2404.05144. Retrieved from https:\/\/arxiv.org\/abs\/2404.05144"},{"key":"e_1_3_2_109_2","unstructured":"Samira Ebrahimi Kahou Adam Atkinson Vincent Michalski \u00c1kos K\u00e1d\u00e1r Adam Trischler and Yoshua Bengio. 2018. FigureQA: An Annotated Figure Dataset for Visual Reasoning. Retrieved from https:\/\/openreview.net\/forum?id=SyunbfbAb"},{"key":"e_1_3_2_110_2","first-page":"28","volume-title":"Proceedings of the 42nd European Conference on IR Research on Advances in Information Retrieval (ECIR \u201920)","author":"Kamphuis Chris","year":"2020","unstructured":"Chris Kamphuis, Arjen P. De Vries, Leonid Boytsov, and Jimmy Lin. 2020. Which BM25 do you mean? A large-scale reproducibility study of scoring variants. 
In Proceedings of the 42nd European Conference on IR Research on Advances in Information Retrieval (ECIR \u201920), Part II. Springer, 28\u201334."},{"key":"e_1_3_2_111_2","doi-asserted-by":"crossref","first-page":"376","DOI":"10.1145\/3652583.3658086","volume-title":"Proceedings of the 2024 International Conference on Multimedia Retrieval (ICMR \u201924)","author":"Kang Zuheng","year":"2024","unstructured":"Zuheng Kang, Yayun He, Botao Zhao, Xiaoyang Qu, Junqing Peng, Jing Xiao, and Jianzong Wang. 2024. Retrieval-augmented audio deepfake detection. In Proceedings of the 2024 International Conference on Multimedia Retrieval (ICMR \u201924), 376\u2013384. DOI: 10.1145\/3652583.3658086"},{"key":"e_1_3_2_112_2","doi-asserted-by":"crossref","first-page":"6769","DOI":"10.18653\/v1\/2020.emnlp-main.550","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Karpukhin Vladimir","year":"2020","unstructured":"Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-Tau Yih. 2020. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6769\u20136781. DOI: 10.18653\/v1\/2020.emnlp-main.550"},{"key":"e_1_3_2_113_2","unstructured":"Tristan Kenneweg Philip Kenneweg and Barbara Hammer. 2024. Retrieval augmented generation systems: Automatic dataset creation evaluation and Boolean agent setup. arXiv:2403.00820. Retrieved from https:\/\/arxiv.org\/abs\/2403.00820"},{"issue":"1","key":"e_1_3_2_114_2","first-page":"4","article-title":"A review of machine learning algorithms for text-documents classification","volume":"1","author":"Khan Aurangzeb","year":"2010","unstructured":"Aurangzeb Khan, Baharum Baharudin, Lam Hong Lee, and Khairullah Khan. 2010. A review of machine learning algorithms for text-documents classification. 
Journal of Advances in Information Technology 1, 1 (2010), 4\u201320.","journal-title":"Journal of Advances in Information Technology"},{"key":"e_1_3_2_115_2","first-page":"6501","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Kim Gyuwan","year":"2021","unstructured":"Gyuwan Kim and Kyunghyun Cho. 2021. Length-adaptive transformer: Train once with length drop, use anytime with search. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 6501\u20136511. DOI: 10.18653\/v1\/2021.acl-long.508"},{"key":"e_1_3_2_116_2","first-page":"720","volume-title":"European Conference on Computer Vision","author":"Kim Geewook","year":"2022","unstructured":"Geewook Kim, Dohyung Hong, Wonjoo Kim, and Jaegul Kim. 2022. Donut: Document understanding transformer without OCR. In European Conference on Computer Vision. Springer, 720\u2013738."},{"key":"e_1_3_2_117_2","first-page":"498","volume-title":"Proceedings of the 17th European Conference on Computer Vision (ECCV \u201922)","author":"Kim Geewook","year":"2022","unstructured":"Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. 2022. OCR-Free document understanding transformer. In Proceedings of the 17th European Conference on Computer Vision (ECCV \u201922), Part XXVIII, 498\u2013517. DOI: 10.1007\/978-3-031-19815-1_29"},{"key":"e_1_3_2_118_2","first-page":"1293","volume-title":"Proceedings of the 40th ACM\/SIGAPP Symposium on Applied Computing","author":"Kim Jaewoong","year":"2025","unstructured":"Jaewoong Kim, Minseok Hur, and Moohong Min. 2025. From rag to qa-rag: Integrating generative ai for pharmaceutical regulatory compliance process. 
In Proceedings of the 40th ACM\/SIGAPP Symposium on Applied Computing, 1293\u20131295."},{"key":"e_1_3_2_119_2","first-page":"5583","volume-title":"International Conference on Machine Learning","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning. PMLR, 5583\u20135594."},{"key":"e_1_3_2_120_2","volume-title":"International Conference on Learning Representations (ICLR)","author":"Kitaev Nikita","year":"2020","unstructured":"Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In International Conference on Learning Representations (ICLR). Retrieved from https:\/\/openreview.net\/forum?id=rkgNKkHtvB"},{"key":"e_1_3_2_121_2","volume-title":"The 12th International Conference on Learning Representations","author":"Kong Kezhi","year":"2024","unstructured":"Kezhi Kong, Jiani Zhang, Zhengyuan Shen, Balasubramaniam Srinivasan, Chuan Lei, Christos Faloutsos, Huzefa Rangwala, and George Karypis. 2024. OpenTab: Advancing large language models as open-domain table reasoners. In The 12th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=Qa0ULgosc9"},{"key":"e_1_3_2_122_2","doi-asserted-by":"crossref","first-page":"453","DOI":"10.1162\/tacl_a_00276","article-title":"Natural questions: A benchmark for question answering research","volume":"7","author":"Kwiatkowski Tom","year":"2019","unstructured":"Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: A benchmark for question answering research. 
Transactions of the Association for Computational Linguistics 7 (2019), 453\u2013466.","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"e_1_3_2_123_2","first-page":"9459","article-title":"Retrieval-augmented generation for knowledge-intensive nlp tasks","volume":"33","author":"Lewis Patrick","year":"2020","unstructured":"Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K\u00fcttler, Mike Lewis, Wen-Tau Yih, Tim Rockt\u00e4schel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33 (2020), 9459\u20139474.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_124_2","unstructured":"BUPT LH. 2022. CDLA: A Chinese Document Layout Analysis (CDLA) Dataset. Retrieved from https:\/\/github.com\/buptlihang\/CDLA"},{"key":"e_1_3_2_125_2","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-025-07439-y"},{"key":"e_1_3_2_126_2","first-page":"12888","volume-title":"Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research)","author":"Li Junnan","year":"2022","unstructured":"Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. BLIP: Bootstrapping Language-Image pre-training for unified Vision-Language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research). PMLR, 12888\u201312900. Retrieved from https:\/\/proceedings.mlr.press\/v162\/li22n.html"},{"key":"e_1_3_2_127_2","first-page":"1918","volume-title":"Proceedings of the 12th Language Resources and Evaluation Conference","author":"Li Minghao","year":"2020","unstructured":"Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. 2020. Tablebank: Table benchmark for image-based table detection and recognition. 
In Proceedings of the 12th Language Resources and Evaluation Conference, 1918\u20131925."},{"key":"e_1_3_2_128_2","unstructured":"Minghao Li Tengchao Lv Jingye Chen Lei Cui Yijuan Lu Dinei Florencio Cha Zhang Zhoujun Li and Furu Wei. 2022. TrOCR: Transformer-based optical character recognition with pre-trained models. arXiv:2109.10282. Retrieved from https:\/\/arxiv.org\/abs\/2109.10282"},{"key":"e_1_3_2_129_2","first-page":"949","volume-title":"Proceedings of the 28th International Conference on Computational Linguistics","author":"Li Minghao","year":"2020","unstructured":"Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. 2020. DocBank: A benchmark dataset for document layout analysis. In Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, Barcelona, Spain, 949\u2013960. DOI: 10.18653\/v1\/2020.coling-main.82"},{"key":"e_1_3_2_130_2","doi-asserted-by":"publisher","DOI":"10.1145\/3654979"},{"key":"e_1_3_2_131_2","unstructured":"Siqi Li Yufan Shen Xiangnan Chen Jiayi Chen Hengwei Ju Haodong Duan Song Mao Hongbin Zhou Bo Zhang Bin Fu et al. 2025. GDI-Bench: A benchmark for general document intelligence with vision and reasoning decoupling. arXiv:2505.00063. Retrieved from https:\/\/arxiv.org\/abs\/2505.00063"},{"key":"e_1_3_2_132_2","unstructured":"Yucheng Li. 2023. Unlocking context constraints of llms: Enhancing context efficiency of llms with self-information-based content filtering. arXiv:2304.12102. Retrieved from https:\/\/arxiv.org\/abs\/2304.12102"},{"key":"e_1_3_2_133_2","first-page":"6342","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Li Yucheng","year":"2023","unstructured":"Yucheng Li, Bo Dong, Frank Guerin, and Chenghua Lin. 2023. Compressing context to enhance inference efficiency of large language models. 
In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 6342\u20136353. DOI: 10.18653\/v1\/2023.emnlp-main.391"},{"key":"e_1_3_2_134_2","unstructured":"Zhonghao Li Xuming Hu Aiwei Liu Kening Zheng Sirui Huang and Hui Xiong. 2024. Refiner: Restructure retrieval content efficiently to advance question-answering capabilities. arXiv:2406.11357. Retrieved from https:\/\/arxiv.org\/abs\/2406.11357"},{"key":"e_1_3_2_135_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 26753\u201326763","author":"Li Zhang","year":"2024","unstructured":"Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. 2024. Monkey: Image resolution and text label are important things for large multi-modal models. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 26753\u201326763."},{"key":"e_1_3_2_136_2","doi-asserted-by":"crossref","unstructured":"Wenhui Liao Jiapeng Wang Hongliang Li Chengyu Wang Jun Huang and Lianwen Jin. 2025. DocLayLLM: An efficient multi-modal extension of large language models for text-rich document understanding. arXiv:2408.15045. Retrieved from https:\/\/arxiv.org\/abs\/2408.15045","DOI":"10.1109\/CVPR52734.2025.00382"},{"key":"e_1_3_2_137_2","unstructured":"Bin Lin Tao Peng Chen Zhang Minmin Sun Lanbo Li Hanyu Zhao Wencong Xiao Qi Xu Xiafei Qiu Shen Li et al. 2024. Infinite-LLM: Efficient LLM service for long context with DistAttention and distributed KVCache. arXiv:2401.02669. Retrieved from https:\/\/arxiv.org\/abs\/2401.02669"},{"key":"e_1_3_2_138_2","unstructured":"Chun-Hsien Lin and Pu-Jen Cheng. 2024. Legal documents drafting with fine-tuned pre-trained large language model. arXiv:2406.04202. Retrieved from https:\/\/arxiv.org\/abs\/2406.04202"},{"key":"e_1_3_2_139_2","unstructured":"Demiao Lin. 2024. 
Revolutionizing retrieval-augmented generation with enhanced PDF structure recognition. arXiv:2401.12599. Retrieved from https:\/\/arxiv.org\/abs\/2401.12599"},{"key":"e_1_3_2_140_2","unstructured":"Zening Lin Jiapeng Wang Teng Li Wenhui Liao Dayi Huang Longfei Xiong and Lianwen Jin. 2024. PEneo: Unifying line extraction line grouping and entity linking for end-to-end document pair extraction. arXiv:2401.03472. Retrieved from https:\/\/arxiv.org\/abs\/2401.03472"},{"key":"e_1_3_2_141_2","first-page":"1439","volume-title":"2023 IEEE 12th Data Driven Control and Learning Systems Conference (DDCLS)","author":"Liu Bao","year":"2023","unstructured":"Bao Liu and Jinlei Huang. 2023. Global-local attention mechanism based small object detection. In 2023 IEEE 12th Data Driven Control and Learning Systems Conference (DDCLS), 1439\u20131443. DOI: 10.1109\/DDCLS58216.2023.10165957"},{"issue":"6","key":"e_1_3_2_142_2","doi-asserted-by":"crossref","first-page":"1330","DOI":"10.11834\/jig.210044","article-title":"Deep learning methods for scene text detection and recognition","volume":"26","author":"Liu C. Y.","year":"2021","unstructured":"C. Y. Liu, X. X. Chen, C. J. Luo, L. W. Jin, Y. Xue, and Y. L. Liu. 2021. Deep learning methods for scene text detection and recognition. Journal of Image and Graphics 26, 6 (2021), 1330\u20131367.","journal-title":"Journal of Image and Graphics"},{"key":"e_1_3_2_143_2","article-title":"Frontiers of intelligent document analysis and recognition: Review and prospects","author":"Liu Cheng","year":"2023","unstructured":"Cheng Liu, Lianwen Jin, Bai Xiang, Xiaohui Li, and Yin Fei. 2023. Frontiers of intelligent document analysis and recognition: Review and prospects. Journal of Image and Graphics (2023). 
Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:261135113","journal-title":"Journal of Image and Graphics"},{"key":"e_1_3_2_144_2","unstructured":"Chenglong Liu Haoran Wei Jinyue Chen Lingyu Kong Zheng Ge Zining Zhu Liang Zhao Jianjian Sun Chunrui Han and Xiangyu Zhang. 2024. Focus anywhere for fine-grained multi-page document understanding. arXiv:2405.14295. Retrieved from https:\/\/arxiv.org\/abs\/2405.14295"},{"key":"e_1_3_2_145_2","first-page":"149","volume-title":"AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science","volume":"2013","author":"Liu Hongfang","year":"2013","unstructured":"Hongfang Liu, Suzette J. Bielinski, Sunghwan Sohn, Sean Murphy, Kavishwar B. Wagholikar, Siddhartha R. Jonnalagadda, K. E. Ravikumar, Stephen T. Wu, Iftikhar J. Kullo, and Christopher G. Chute. 2013. An information extraction framework for cohort identification using electronic health records. AMIA Joint Summits on Translational Science Proceedings. AMIA Joint Summits on Translational Science 2013 (Mar. 2013), 149\u2013153."},{"key":"e_1_3_2_146_2","volume-title":"NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following","author":"Liu Hao","year":"2023","unstructured":"Hao Liu, Matei Zaharia, and Pieter Abbeel. 2023. Ring attention with blockwise transformers for near-infinite context. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following. Retrieved from https:\/\/openreview.net\/forum?id=xulyCXgIWH"},{"key":"e_1_3_2_147_2","doi-asserted-by":"publisher","unstructured":"Jerry Liu. 2022. LlamaIndex. DOI: 10.5281\/zenodo.1234","DOI":"10.5281\/zenodo.1234"},{"key":"e_1_3_2_148_2","unstructured":"Lei Liu Xiaoyan Yang Yue Shen Binbin Hu Zhiqiang Zhang Jinjie Gu and Guannan Zhang. 2023. Think-in-memory: Recalling and post-thinking enable LLMs with long-term memory. arXiv:2311.08719. 
Retrieved from https:\/\/arxiv.org\/abs\/2311.08719"},{"key":"e_1_3_2_149_2","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00638"},{"issue":"3","key":"e_1_3_2_150_2","first-page":"225","article-title":"Learning to rank for information retrieval","volume":"3","author":"Liu Tie-Yan","year":"2009","unstructured":"Tie-Yan Liu. 2009. Learning to rank for information retrieval. Foundations and Trends\u00ae in Information Retrieval 3, 3 (2009), 225\u2013331.","journal-title":"Foundations and Trends\u00ae in Information Retrieval"},{"key":"e_1_3_2_151_2","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-46448-0_2"},{"key":"e_1_3_2_152_2","unstructured":"Yinhan Liu Jiatao Gu Naman Goyal Xian Li Sergey Edunov Marjan Ghazvininejad Mike Lewis and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. arXiv:2001.08210. Retrieved from https:\/\/arxiv.org\/abs\/2001.08210"},{"issue":"12","key":"e_1_3_2_153_2","doi-asserted-by":"crossref","first-page":"220102","DOI":"10.1007\/s11432-024-4235-6","article-title":"OCRBench: On the hidden mystery of OCR in large multimodal models","volume":"67","author":"Liu Yuliang","year":"2024","unstructured":"Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, Xiang Bai, et al. 2024. OCRBench: On the hidden mystery of OCR in large multimodal models. Science China Information Sciences 67, 12 (2024), 220102.","journal-title":"Science China Information Sciences"},{"key":"e_1_3_2_154_2","unstructured":"Yuliang Liu Biao Yang Qiang Liu Zhang Li Zhiyin Ma Shuo Zhang and Xiang Bai. 2024. Textmonkey: An ocr-free large multimodal model for understanding document. arXiv:2403.04473. 
Retrieved from https:\/\/arxiv.org\/abs\/2403.04473"},{"key":"e_1_3_2_155_2","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), 86457\u201386478","author":"Liu Ze","year":"2021","unstructured":"Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE\/CVF International Conference on Computer Vision (ICCV), 86457\u201386478."},{"key":"e_1_3_2_156_2","unstructured":"Nikolaos Livathinos Christoph Auer Maksym Lysak Ahmed Nassar Michele Dolfi Panos Vagenas Cesar Berrospi Ramis Matteo Omenetti Kasper Dinkla Yusik Kim et al. 2025. Docling: An efficient open-source toolkit for AI-driven document conversion. arXiv:2501.17887. Retrieved from https:\/\/arxiv.org\/abs\/2501.17887"},{"issue":"10","key":"e_1_3_2_157_2","doi-asserted-by":"crossref","first-page":"110","DOI":"10.3390\/jimaging6100110","article-title":"Deep learning for historical document analysis and recognition\u2014A survey","volume":"6","author":"Lombardi Francesco","year":"2020","unstructured":"Francesco Lombardi and Simone Marinai. 2020. Deep learning for historical document analysis and recognition\u2014A survey. Journal of Imaging 6, 10 (2020), 110.","journal-title":"Journal of Imaging"},{"issue":"1","key":"e_1_3_2_158_2","doi-asserted-by":"crossref","first-page":"161","DOI":"10.1007\/s11263-020-01369-0","article-title":"Scene text detection and recognition: The deep learning era","volume":"129","author":"Long Shangbang","year":"2021","unstructured":"Shangbang Long, Xin He, and Cong Yao. 2021. Scene text detection and recognition: The deep learning era. 
International Journal of Computer Vision 129, 1 (2021), 161\u2013184.","journal-title":"International Journal of Computer Vision"},{"key":"e_1_3_2_159_2","first-page":"6227","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Lu Shuai","year":"2022","unstructured":"Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. 2022. ReACC: A retrieval-augmented code completion framework. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 6227\u20136240. DOI: 10.18653\/v1\/2022.acl-long.431"},{"key":"e_1_3_2_160_2","volume-title":"ICCV Workshops","author":"Lu Yichong","year":"2021","unstructured":"Yichong Lu, Yuntian Zhang, Graham Neubig, and Taylor Berg Kirkpatrick. 2021. Im2markupCAM: Parsing handwritten mathematical expressions with visual attention. In ICCV Workshops."},{"key":"e_1_3_2_161_2","doi-asserted-by":"crossref","unstructured":"Yi Lu Xin Zhou Wei He Jun Zhao Tao Ji Tao Gui Qi Zhang and Xuanjing Huang. 2024. LongHeads: Multi-head attention is secretly a long context processor. Retrieved from https:\/\/openreview.net\/forum?id=lcqbEZkz26 (under review)","DOI":"10.18653\/v1\/2024.findings-emnlp.417"},{"key":"e_1_3_2_162_2","first-page":"15630","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Luo Chuwei","year":"2024","unstructured":"Chuwei Luo, Yufan Shen, Zhaoqing Zhu, Qi Zheng, Zhi Yu, and Cong Yao. 2024. LayoutLLM: Layout instruction tuning with large language models for document understanding. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15630\u201315640."},{"key":"e_1_3_2_163_2","unstructured":"Tengchao Lv Yupan Huang Jingye Chen Lei Cui Shuming Ma Yaoyao Chang Shaohan Huang Wenhui Wang Li Dong Weiyao Luo et al. 
2023. Kosmos-2.5: A Multimodal Literate Model. Retrieved from https:\/\/www.microsoft.com\/en-us\/research\/publication\/kosmos-2-5-a-multimodal-literate-model\/"},{"issue":"1","key":"e_1_3_2_164_2","doi-asserted-by":"crossref","first-page":"65","DOI":"10.2478\/dim-2020-0031","article-title":"Exploring significant characteristics and models for classification of structure function of academic documents","volume":"5","author":"Ma Bowen","year":"2021","unstructured":"Bowen Ma, Chengzhi Zhang, and Yuzhuo Wang. 2021. Exploring significant characteristics and models for classification of structure function of academic documents. Data and Information Management 5, 1 (2021), 65\u201374.","journal-title":"Data and Information Management"},{"key":"e_1_3_2_165_2","unstructured":"Xueguang Ma Shengyao Zhuang Bevan Koopman Guido Zuccon Wenhu Chen and Jimmy Lin. 2024. VISA: Retrieval augmented generation with visual source attribution. arXiv:2412.14457. Retrieved from https:\/\/arxiv.org\/abs\/2412.14457"},{"key":"e_1_3_2_166_2","unstructured":"Matteo Marulli Glauco Panattoni and Marco Bertini. 2025. A document processing pipeline for the construction of a dataset for topic modeling based on the judgments of the Italian Supreme Court. arXiv:2505.08439. Retrieved from https:\/\/arxiv.org\/abs\/2505.08439"},{"key":"e_1_3_2_167_2","doi-asserted-by":"crossref","first-page":"2263","DOI":"10.18653\/v1\/2022.findings-acl.177","volume-title":"Findings of the Association for Computational Linguistics (ACL \u201922)","author":"Masry Ahmed","year":"2022","unstructured":"Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the Association for Computational Linguistics (ACL \u201922), 2263\u20132279. 
DOI: 10.18653\/v1\/2022.findings-acl.177"},{"key":"e_1_3_2_168_2","doi-asserted-by":"crossref","first-page":"2582","DOI":"10.1109\/WACV51458.2022.00264","volume-title":"2022 IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV)","author":"Mathew Minesh","year":"2022","unstructured":"Minesh Mathew, Viraj Bagal, Rub\u00e8n Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. 2022. InfographicVQA. In 2022 IEEE\/CVF Winter Conference on Applications of Computer Vision (WACV), 2582\u20132591. DOI: 10.1109\/WACV51458.2022.00264"},{"key":"e_1_3_2_169_2","doi-asserted-by":"crossref","first-page":"2199","DOI":"10.1109\/WACV48630.2021.00225","volume-title":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","author":"Mathew Minesh","year":"2021","unstructured":"Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. 2021. DocVQA: A dataset for VQA on document images. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2199\u20132208."},{"key":"e_1_3_2_170_2","doi-asserted-by":"crossref","first-page":"53","DOI":"10.1109\/HICSS.1996.495298","volume-title":"Proceedings of HICSS-29: 29th Hawaii International Conference on System Sciences","author":"Meier Johannes","year":"1996","unstructured":"Johannes Meier and Ralph Sprague. 1996. Towards a better understanding of electronic document management. In Proceedings of HICSS-29: 29th Hawaii International Conference on System Sciences. IEEE, 53\u201361."},{"key":"e_1_3_2_171_2","unstructured":"Lingchen Meng Jianwei Yang Rui Tian Xiyang Dai Zuxuan Wu Jianfeng Gao and Yu-Gang Jiang. 2024. DeepStack: Deeply stacking visual tokens is surprisingly simple and effective for LMMs. arXiv:2406.04334. Retrieved from https:\/\/arxiv.org\/abs\/2406.04334"},{"key":"e_1_3_2_172_2","unstructured":"Meta. 2024. Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date. 
Retrieved from https:\/\/ai.meta.com\/blog\/meta-llama-3"},{"key":"e_1_3_2_173_2","volume-title":"The IEEE Winter Conference on Applications of Computer Vision (WACV), 1527\u20131536","author":"Methani Nitesh","year":"2020","unstructured":"Nitesh Methani, Pritha Ganguly, Mitesh M. Khapra, and Pratyush Kumar. 2020. PlotQA: Reasoning over scientific plots. In The IEEE Winter Conference on Applications of Computer Vision (WACV), 1527\u20131536."},{"key":"e_1_3_2_174_2","unstructured":"Ye Mo Zirui Shao Kai Ye Xianwei Mao Bo Zhang Hangdi Xing Peng Ye Gang Huang Kehan Chen Zhou Huan et al. 2025. Doc-CoB: Enhancing multi-modal document understanding with visual chain-of-boxes reasoning. arXiv:2505.18603. Retrieved from https:\/\/arxiv.org\/abs\/2505.18603"},{"key":"e_1_3_2_175_2","first-page":"1","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Modarressi Ali","year":"2022","unstructured":"Ali Modarressi, Hosein Mohebbi, and Mohammad Taher Pilehvar. 2022. AdapLeR: Speeding up inference by adaptive length reduction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 1\u201315. DOI: 10.18653\/v1\/2022.acl-long.1"},{"key":"e_1_3_2_176_2","doi-asserted-by":"crossref","first-page":"1582","DOI":"10.18653\/v1\/2023.semeval-1.218","volume-title":"Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval \u201923)","author":"Modzelewski Arkadiusz","year":"2023","unstructured":"Arkadiusz Modzelewski, Witold Sosnowski, Magdalena Wilczynska, and Adam Wierzbicki. 2023. Dshacker at semeval-2023 task 3: Genres and persuasion techniques detection with multilingual data augmentation through machine translation and text generation. 
In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval \u201923), 1582\u20131591."},{"key":"e_1_3_2_177_2","unstructured":"Ahmad Mohammadshirazi Pinaki Prasad Guha Neogi Ser-Nam Lim and Rajiv Ramnath. 2024. DLaVA: Document language and vision assistant for answer localization with enhanced interpretability and trustworthiness. arXiv:2412.00151. Retrieved from https:\/\/arxiv.org\/abs\/2412.00151"},{"key":"e_1_3_2_178_2","first-page":"54567","article-title":"Random-access infinite context length for transformers","volume":"36","author":"Mohtashami Amirkeivan","year":"2024","unstructured":"Amirkeivan Mohtashami and Martin Jaggi. 2024. Random-access infinite context length for transformers. Advances in Neural Information Processing Systems 36 (2024), 54567\u201354585.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_179_2","first-page":"216","volume-title":"Proceedings of the 14th IAPR International Workshop on Document Analysis Systems (DAS \u201920)","author":"Mondal Ajoy","year":"2020","unstructured":"Ajoy Mondal, Peter Lipps, and C. V. Jawahar. 2020. IIIT-AR-13K: A new dataset for graphical object detection in documents. In Proceedings of the 14th IAPR International Workshop on Document Analysis Systems (DAS \u201920). Springer, 216\u2013230."},{"issue":"7","key":"e_1_3_2_180_2","doi-asserted-by":"crossref","first-page":"1029","DOI":"10.1109\/5.156468","article-title":"Historical review of OCR research and development","volume":"80","author":"Mori Shunji","year":"1992","unstructured":"Shunji Mori, Ching Y. Suen, and Kazuhiko Yamamoto. 1992. Historical review of OCR research and development. 
Proceedings of the IEEE 80, 7 (1992), 1029\u20131058.","journal-title":"Proceedings of the IEEE"},{"key":"e_1_3_2_181_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACCESS.2024.3379530"},{"key":"e_1_3_2_182_2","first-page":"19327","article-title":"Learning to compress prompts with gist tokens","volume":"36","author":"Mu Jesse","year":"2024","unstructured":"Jesse Mu, Xiang Li, and Noah Goodman. 2024. Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems 36 (2024), 19327\u201319352.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_183_2","unstructured":"Manisha Mukherjee Sungchul Kim Xiang Chen Dan Luo Tong Yu and Tung Mai. 2025. From documents to dialogue: Building KG-RAG enhanced AI assistants. arXiv:2502.15237. Retrieved from https:\/\/arxiv.org\/abs\/2502.15237"},{"issue":"1","key":"e_1_3_2_184_2","doi-asserted-by":"crossref","first-page":"38","DOI":"10.1109\/34.824820","article-title":"Twenty years of document image analysis in PAMI","volume":"22","author":"Nagy George","year":"2000","unstructured":"George Nagy. 2000. Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis & Machine Intelligence 22, 1 (2000), 38\u201362.","journal-title":"IEEE Transactions on Pattern Analysis & Machine Intelligence"},{"issue":"15","key":"e_1_3_2_185_2","doi-asserted-by":"crossref","first-page":"7486","DOI":"10.3390\/app12157486","article-title":"Investigating attention mechanism for page object detection in document images","volume":"12","author":"Naik Shivam","year":"2022","unstructured":"Shivam Naik, Khurram Azeem Hashmi, Alain Pagani, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. 2022. Investigating attention mechanism for page object detection in document images. 
Applied Sciences 12, 15 (2022), 7486.","journal-title":"Applied Sciences"},{"key":"e_1_3_2_186_2","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1162\/tacl_a_00446","article-title":"FeTaQA: Free-form table question answering","volume":"10","author":"Nan Linyong","year":"2022","unstructured":"Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kry\u015bci\u0144ski, Hailey Schoelkopf, Riley Kong, Xiangru Tang, et al. 2022. FeTaQA: Free-form table question answering. Transactions of the Association for Computational Linguistics 10 (2022), 35\u201349.","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"e_1_3_2_187_2","unstructured":"Ahmed Nassar Andres Marafioti Matteo Omenetti Maksym Lysak Nikolaos Livathinos Christoph Auer Lucas Morin Rafael Teixeira de Lima Yusik Kim et al. 2025. SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion. arXiv:2503.11576. Retrieved from https:\/\/arxiv.org\/abs\/2503.11576"},{"key":"e_1_3_2_188_2","unstructured":"Tri Nguyen Mir Rosenberg Xia Song Jianfeng Gao Saurabh Tiwary Rangan Majumder and Li Deng. 2016. Ms marco: A human-generated machine reading comprehension dataset. arXiv:1611.09268. Retrieved from https:\/\/arxiv.org\/abs\/1611.09268"},{"key":"e_1_3_2_189_2","first-page":"4114","volume-title":"Proceedings of the 13th Language Resources and Evaluation Conference","author":"Okur Eda","year":"2022","unstructured":"Eda Okur, Saurav Sahay, and Lama Nachman. 2022. Data augmentation with paraphrase generation and entity extraction for multimodal dialogue system. In Proceedings of the 13th Language Resources and Evaluation Conference. European Language Resources Association, Marseille, France, 4114\u20134125. Retrieved from https:\/\/aclanthology.org\/2022.lrec-1.437"},{"key":"e_1_3_2_190_2","unstructured":"OpenAI. 2022. Introducing ChatGPT. 
Retrieved from https:\/\/openai.com\/index\/chatgpt"},{"key":"e_1_3_2_191_2","unstructured":"OpenAI. 2023. GPT-4 Research Overview. Retrieved from https:\/\/openai.com\/index\/gpt-4-research\/"},{"key":"e_1_3_2_192_2","unstructured":"Charles Packer Vivian Fang Shishir G. Patil Kevin Lin Sarah Wooders and Joseph E. Gonzalez. 2023. Memgpt: Towards llms as operating systems. arXiv:2310.08560. Retrieved from https:\/\/arxiv.org\/abs\/2310.08560"},{"key":"e_1_3_2_193_2","unstructured":"Feifei Pan Mustafa Canim Michael Glass Alfio Gliozzo and James Hendler. 2022. End-to-end table question answering via retrieval-augmented generation. arXiv:2203.16714. Retrieved from https:\/\/arxiv.org\/abs\/2203.16714"},{"key":"e_1_3_2_194_2","doi-asserted-by":"crossref","first-page":"1173","DOI":"10.18653\/v1\/2020.emnlp-main.89","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Parikh Ankur","year":"2020","unstructured":"Ankur Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. ToTTo: A controlled table-to-text generation dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1173\u20131186. DOI: 10.18653\/v1\/2020.emnlp-main.89"},{"key":"e_1_3_2_195_2","first-page":"1470","volume-title":"Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)","author":"Pasupat Panupong","year":"2015","unstructured":"Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1470\u20131480. 
DOI: 10.3115\/v1\/P15-1142"},{"key":"e_1_3_2_196_2","volume-title":"The 12th International Conference on Learning Representations","author":"Patnaik Sohan","year":"2024","unstructured":"Sohan Patnaik, Heril Changwal, Milan Aggarwal, Sumit Bhatia, Yaman Kumar, and Balaji Krishnamurthy. 2024. CABINET: Content relevance-based noise reduction for table question answering. In The 12th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=SQrHpTllXa"},{"key":"e_1_3_2_197_2","volume-title":"The 12th International Conference on Learning Representations","author":"Peng Bowen","year":"2024","unstructured":"Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2024. YaRN: Efficient context window extension of large language models. In The 12th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=wHBfxhZu1u"},{"key":"e_1_3_2_198_2","first-page":"2523","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Petroni Fabio","year":"2021","unstructured":"Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. 2021. KILT: A benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2523\u20132544. DOI: 10.18653\/v1\/2021.naacl-main.200"},{"key":"e_1_3_2_199_2","doi-asserted-by":"crossref","first-page":"3743","DOI":"10.1145\/3534678.3539043","volume-title":"Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining","author":"Pfitzmann Birgit","year":"2022","unstructured":"Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar. 
2022. DocLayNet: A large human-annotated dataset for document-layout segmentation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3743\u20133751."},{"issue":"1","key":"e_1_3_2_200_2","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1109\/34.824821","article-title":"Online and off-line handwriting recognition: A comprehensive survey","volume":"22","author":"Plamondon R\u00e9jean","year":"2000","unstructured":"R\u00e9jean Plamondon and Sargur N. Srihari. 2000. Online and off-line handwriting recognition: A comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1 (2000), 63\u201384.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_2_201_2","unstructured":"Raphael Powalski Christian Haug Sandra Scheible Christian Guder Gerwin Bouma and Andreas Dengel. 2021. LiLT: A simple yet effective language-independent layout transformer for structured document understanding. arXiv:2102.09550. Retrieved from https:\/\/arxiv.org\/abs\/2102.09550"},{"key":"e_1_3_2_202_2","unstructured":"Jake Poznanski Jon Borchardt Jason Dunkelberger Regan Huff Daniel Lin Aman Rangapur Christopher Wilhelm Kyle Lo and Luca Soldaini. 2025. olmOCR: Unlocking trillions of tokens in PDFs with vision language models. arXiv:2502.18443. Retrieved from https:\/\/arxiv.org\/abs\/2502.18443"},{"key":"e_1_3_2_203_2","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 572\u2013573","author":"Prasad Devashish","year":"2020","unstructured":"Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. 2020. CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 572\u2013573."},{"key":"e_1_3_2_204_2","volume-title":"International Conference on Learning Representations","author":"Press Ofir","year":"2022","unstructured":"Ofir Press, Noah Smith, and Mike Lewis. 2022. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=R8sQPpGCv0"},{"key":"e_1_3_2_205_2","doi-asserted-by":"crossref","unstructured":"Fabio Quattrini Carmine Zaccagnino Silvia Cascianelli Laura Righi and Rita Cucchiara. 2024. Mugat: Improving single-page document parsing by providing multi-page context. arXiv:2408.15646. Retrieved from https:\/\/arxiv.org\/abs\/2408.15646","DOI":"10.1007\/978-3-031-91572-7_13"},{"key":"e_1_3_2_206_2","unstructured":"Bowen Peng Jeffrey Quesnelle Honglu Fan Enrico Shippole. 2023. YaRN: Efficient Context Window Extension of Large Language Models. arXiv:2309.00071. Retrieved from https:\/\/arxiv.org\/abs\/2309.00071"},{"issue":"7","key":"e_1_3_2_207_2","doi-asserted-by":"crossref","first-page":"1361","DOI":"10.3390\/electronics13071361","article-title":"Web application for retrieval-augmented generation: Implementation and testing","volume":"13","author":"Radeva Irina","year":"2024","unstructured":"Irina Radeva, Ivan Popchev, Lyubka Doukovska, and Miroslava Dimitrova. 2024. Web application for retrieval-augmented generation: Implementation and testing. Electronics 13, 7 (2024), 1361.","journal-title":"Electronics"},{"key":"e_1_3_2_208_2","first-page":"8748","volume-title":"International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. 
Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748\u20138763."},{"issue":"1","key":"e_1_3_2_209_2","first-page":"1","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel Colin","year":"2020","unstructured":"Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, 1, Article 140 (Jan. 2020), 1\u201367.","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_3_2_210_2","doi-asserted-by":"crossref","first-page":"1530","DOI":"10.18653\/v1\/2025.acl-industry.109","volume-title":"Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)","author":"Rajendran Ravi K.","year":"2025","unstructured":"Ravi K. Rajendran, Biplob Debnath, Murugan Sankaradass, and Srimat Chakradhar. 2025. EcoDoc: A cost-efficient multimodal document processing system for enterprises using LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), 1530\u20131537. DOI: 10.18653\/v1\/2025.acl-industry.109"},{"key":"e_1_3_2_211_2","unstructured":"Sukrit Rao Rohith Bollineni Shaan Khosla Ting Fei Qian Wu Kyunghyun Cho and Vladimir A. Kobzar. [n.\u2009d.]. MARKUPMNA: Markup-Based Segmentation of M&A Agreements. Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:264909951"},{"key":"e_1_3_2_212_2","unstructured":"Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An incremental improvement. arXiv:1804.02767. Retrieved from https:\/\/arxiv.org\/abs\/1804.02767"},{"key":"e_1_3_2_213_2","unstructured":"Alibaba Research. 2023. D4LA: A Dataset for Document Layout Analysis. 
Retrieved from https:\/\/github.com\/AlibabaResearch\/AdvancedLiterateMachinery\/issues\/46"},{"key":"e_1_3_2_214_2","doi-asserted-by":"crossref","unstructured":"Shruti Rijhwani Antonios Anastasopoulos and Graham Neubig. 2020. OCR post correction for endangered language texts. arXiv:2011.05402. Retrieved from https:\/\/arxiv.org\/abs\/2011.05402","DOI":"10.18653\/v1\/2020.emnlp-main.478"},{"key":"e_1_3_2_215_2","doi-asserted-by":"publisher","DOI":"10.1561\/1500000019"},{"key":"e_1_3_2_216_2","doi-asserted-by":"crossref","unstructured":"Jon Saad-Falcon Joe Barrow Alexa Siu Ani Nenkova David Seunghyun Yoon Ryan A. Rossi and Franck Dernoncourt. 2023. PDFTriage: Question answering over long structured documents. arXiv:2309.08872. Retrieved from https:\/\/arxiv.org\/abs\/2309.08872","DOI":"10.18653\/v1\/2024.emnlp-industry.13"},{"key":"e_1_3_2_217_2","doi-asserted-by":"crossref","first-page":"7370","DOI":"10.18653\/v1\/2024.acl-long.399","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Salemi Alireza","year":"2024","unstructured":"Alireza Salemi, Sheshera Mysore, Michael Bendersky, and Hamed Zamani. 2024. LaMP: When large language models meet personalization. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Bangkok, Thailand, 7370\u20137392. Retrieved from https:\/\/aclanthology.org\/2024.acl-long.399"},{"key":"e_1_3_2_218_2","first-page":"1","volume-title":"Proceedings of the 3rd International Conference on AI-ML Systems","author":"Sarmah Bhaskarjit","year":"2023","unstructured":"Bhaskarjit Sarmah, Dhagash Mehta, Stefano Pasquali, and Tianjie Zhu. 2023. Towards reducing hallucination in extracting information from financial reports using large language models. 
In Proceedings of the 3rd International Conference on AI-ML Systems, 1\u20135."},{"key":"e_1_3_2_219_2","volume-title":"International Conference on Learning Representations (ICLR)","author":"Sarthi Parth","year":"2024","unstructured":"Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. 2024. RAPTOR: Recursive abstractive processing for tree-organized retrieval. In International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_2_220_2","volume-title":"Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 52\u201357","author":"Schaefer Robin","year":"2020","unstructured":"Robin Schaefer and Clemens Neudecker. 2020. A two-step approach for automatic OCR post-correction. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 52\u201357."},{"issue":"7","key":"e_1_3_2_221_2","doi-asserted-by":"crossref","first-page":"1101","DOI":"10.1109\/5.156473","article-title":"Document analysis-from pixels to contents","volume":"80","author":"Schurmann J.","year":"1992","unstructured":"J. Schurmann, N. Bartneck, T. Bayer, J. Franke, E. Mandler, and M. Oberlander. 1992. Document analysis-from pixels to contents. Proceedings of the IEEE 80, 7 (1992), 1101\u20131119.","journal-title":"Proceedings of the IEEE"},{"key":"e_1_3_2_222_2","doi-asserted-by":"crossref","unstructured":"Shreya Shankar Tristan Chambers Tarak Shah Aditya G. Parameswaran and Eugene Wu. 2025. DocETL: Agentic query rewriting and evaluation for complex document processing. arXiv:2410.12189. Retrieved from https:\/\/arxiv.org\/abs\/2410.12189","DOI":"10.14778\/3746405.3746426"},{"key":"e_1_3_2_223_2","doi-asserted-by":"crossref","unstructured":"Zejiang Shen Ruochen Zhang Melissa Dell Benjamin Charles Germain Lee Jacob Carlson and Weining Li. 2021. 
LayoutParser: A unified toolkit for deep learning based document image analysis. arXiv:2103.15348. Retrieved from https:\/\/arxiv.org\/abs\/2103.15348","DOI":"10.1007\/978-3-030-86549-8_9"},{"key":"e_1_3_2_224_2","unstructured":"Yunxiao Shi Xing Zi Zijing Shi Haimin Zhang Qiang Wu and Min Xu. 2024. Enhancing retrieval and managing retrieval: A four-module synergy for improved quality and efficiency in RAG systems. arXiv:2407.10670. Retrieved from https:\/\/arxiv.org\/abs\/2407.10670"},{"key":"e_1_3_2_225_2","doi-asserted-by":"crossref","first-page":"2872","DOI":"10.1109\/BigData59044.2023.10386313","volume-title":"2023 IEEE International Conference on Big Data (BigData)","author":"Shukla Neelesh K.","year":"2023","unstructured":"Neelesh K. Shukla, Raghu Katikeri, Msp Raja, Sivam Gowtham, et al. 2023. Generative AI approach to distributed summarization of financial narratives. In 2023 IEEE International Conference on Big Data (BigData), 2872\u20132876. DOI: 10.1109\/BigData59044.2023.10386313"},{"key":"e_1_3_2_226_2","first-page":"8317","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Singh Amanpreet","year":"2019","unstructured":"Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition, 8317\u20138326."},{"key":"e_1_3_2_227_2","first-page":"393","volume-title":"Findings of the Association for Computational Linguistics (EACL \u201924)","author":"Song Eui Yul","year":"2024","unstructured":"Eui Yul Song, Sangryul Kim, Haeju Lee, Joonkee Kim, and James Thorne. 2024. Re3val: Reinforced and reranked generative retrieval. In Findings of the Association for Computational Linguistics (EACL \u201924). Association for Computational Linguistics, St. Julian\u2019s, Malta, 393\u2013409. 
Retrieved from https:\/\/aclanthology.org\/2024.findings-eacl.27"},{"key":"e_1_3_2_228_2","first-page":"1","volume-title":"Proceedings of the 37th AAAI Conference on Artificial Intelligence and 35th Conference on Innovative Applications of Artificial Intelligence and 13th Symposium on Educational Advances in Artificial Intelligence","volume":"259","author":"Souibgui Mohamed Ali","year":"2023","unstructured":"Mohamed Ali Souibgui, Sanket Biswas, Andres Mafla, Ali Furkan Biten, Alicia Forn\u00e9s, Yousri Kessentini, Josep Llad\u00f3s, Lluis Gomez, and Dimosthenis Karatzas. 2023. Text-DIAE: A self-supervised degradation invariant autoencoder for text recognition and document enhancement. In Proceedings of the 37th AAAI Conference on Artificial Intelligence and 35th Conference on Innovative Applications of Artificial Intelligence and 13th Symposium on Educational Advances in Artificial Intelligence (AAAI \u201923\/IAAI \u201923\/EAAI \u201923). AAAI Press, Article 259, 1\u20139. DOI: 10.1609\/aaai.v37i2.25328"},{"key":"e_1_3_2_229_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2023.127063"},{"key":"e_1_3_2_230_2","first-page":"12991","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Su Weihang","year":"2024","unstructured":"Weihang Su, Yichen Tang, Qingyao Ai, Zhijing Wu, and Yiqun Liu. 2024. DRAGIN: Dynamic retrieval augmented generation based on the real-time information needs of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Bangkok, Thailand, 12991\u201313013. 
Retrieved from https:\/\/aclanthology.org\/2024.acl-long.702"},{"issue":"4","key":"e_1_3_2_231_2","doi-asserted-by":"crossref","first-page":"469","DOI":"10.1109\/PROC.1980.11675","article-title":"Automatic recognition of handprinted characters\u2014The state of the art","volume":"68","author":"Suen Ching Y.","year":"1980","unstructured":"Ching Y. Suen, Marc Berthod, and Shunji Mori. 1980. Automatic recognition of handprinted characters\u2014The state of the art. Proceedings of the IEEE 68, 4 (1980), 469\u2013487.","journal-title":"Proceedings of the IEEE"},{"key":"e_1_3_2_232_2","doi-asserted-by":"crossref","unstructured":"Yuan Sui Jiaru Zou Mengyu Zhou Xinyi He Lun Du Shi Han and Dongmei Zhang. 2023. TAP4LLM: Table provider on sampling augmenting and packing semi-structured data for large language model reasoning. arXiv:2312.09039. Retrieved from https:\/\/arxiv.org\/abs\/2312.09039","DOI":"10.18653\/v1\/2024.findings-emnlp.603"},{"key":"e_1_3_2_233_2","first-page":"14590","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Sun Yutao","year":"2023","unstructured":"Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. 2023. A length-extrapolatable transformer. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 14590\u201314604. DOI: 10.18653\/v1\/2023.acl-long.816"},{"key":"e_1_3_2_234_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-emnlp.314"},{"key":"e_1_3_2_235_2","doi-asserted-by":"crossref","first-page":"2308","DOI":"10.1109\/COMPSAC61105.2024.00371","volume-title":"2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC)","author":"Sung Chih-Wei","year":"2024","unstructured":"Chih-Wei Sung, Yu-Kai Lee, and Yin-Te Tsai. 2024. 
A new pipeline for generating instruction dataset via RAG and self fine-tuning. In 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 2308\u20132312."},{"key":"e_1_3_2_236_2","unstructured":"Manan Suri Puneet Mathur Franck Dernoncourt Kanika Goswami Ryan A. Rossi and Dinesh Manocha. 2024. VisDoM: Multi-document QA with visually rich elements using multimodal retrieval-augmented generation. arXiv:2412.10704. Retrieved from https:\/\/arxiv.org\/abs\/2412.10704"},{"key":"e_1_3_2_237_2","doi-asserted-by":"publisher","DOI":"10.3390\/computers12040072"},{"key":"e_1_3_2_238_2","unstructured":"Hanzhuo Tan Qi Luo Ling Jiang Zizheng Zhan Jing Li Haotian Zhang and Yuqun Zhang. 2024. Prompt-based code completion via multi-retrieval augmented generation. arXiv:2405.07530. Retrieved from https:\/\/arxiv.org\/abs\/2405.07530"},{"key":"e_1_3_2_239_2","unstructured":"Jiejun Tan Zhicheng Dou Wen Wang Mang Wang Weipeng Chen and Ji-Rong Wen. 2024. Htmlrag: Html is better than plain text for modeling retrieved knowledge in rag systems. arXiv:2411.02959. Retrieved from https:\/\/arxiv.org\/abs\/2411.02959"},{"key":"e_1_3_2_240_2","first-page":"19071","volume-title":"Proceedings of the 38th AAAI Conference on Artificial Intelligence and 36th Conference on Innovative Applications of Artificial Intelligence and 14th Symposium on Educational Advances in Artificial Intelligence","author":"Tanaka Ryota","year":"2024","unstructured":"Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. 2024. InstructDoc: A dataset for zero-shot generalization of visual document understanding with instructions. 
In Proceedings of the 38th AAAI Conference on Artificial Intelligence and 36th Conference on Innovative Applications of Artificial Intelligence and 14th Symposium on Educational Advances in Artificial Intelligence, 19071\u201319079."},{"key":"e_1_3_2_241_2","first-page":"13878","volume-title":"Proceedings of the 35th AAAI Conference on Artificial Intelligence","author":"Tanaka Ryota","year":"2021","unstructured":"Ryota Tanaka, Kyosuke Nishida, and Sen Yoshida. 2021. Visualmrc: Machine reading comprehension on document images. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, 13878\u201313888."},{"key":"e_1_3_2_242_2","unstructured":"Jingqun Tang Chunhui Lin Zhen Zhao Shu Wei Binghong Wu Qi Liu Yangfan He Kuan Lu Hao Feng Yang Li et al. 2024. Textsquare: Scaling up text-centric visual instruction tuning. arXiv:2404.12803. Retrieved from https:\/\/arxiv.org\/abs\/2404.12803"},{"issue":"12","key":"e_1_3_2_243_2","doi-asserted-by":"crossref","first-page":"1931","DOI":"10.1016\/S0031-3203(96)00044-1","article-title":"Automatic document processing: A survey","volume":"29","author":"Tang Yuan Y.","year":"1996","unstructured":"Yuan Y. Tang, Seong-Whan Lee, and Ching Y. Suen. 1996. Automatic document processing: A survey. Pattern Recognition 29, 12 (1996), 1931\u20131952.","journal-title":"Pattern Recognition"},{"key":"e_1_3_2_244_2","doi-asserted-by":"publisher","unstructured":"Yi Tay Mostafa Dehghani Dara Bahri and Donald Metzler. 2022. Efficient transformers: A survey. 55 6 Article 109 (Dec. 2022) 1\u201328. DOI: 10.1145\/3530811","DOI":"10.1145\/3530811"},{"key":"e_1_3_2_245_2","unstructured":"InfiniFlow Team. 2025. Ragflow: A Lightweight and Flexible Framework for Retrieval-Augmented Generation (RAG). Retrieved April 25 2025 from https:\/\/github.com\/infiniflow\/ragflow"},{"key":"e_1_3_2_246_2","unstructured":"PaddleOCR Team. 2023. PaddleOCR Document Understanding Pipeline. 
Visual-language pipeline combining OCR layout parsing and query-based extraction. Retrieved from https:\/\/www.paddleocr.ai\/main\/en\/version3.x\/pipeline_usage\/doc_understanding.html"},{"key":"e_1_3_2_247_2","unstructured":"PaddlePaddle Team. 2025. PaddleOCR: Document Layout Analysis Module (ppstructure). Retrieved April 25 2025 from https:\/\/github.com\/PaddlePaddle\/PaddleOCR\/tree\/main\/ppstructure"},{"key":"e_1_3_2_248_2","volume-title":"The 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)","author":"Thakur Nandan","year":"2021","unstructured":"Nandan Thakur, Nils Reimers, Andreas R\u00fcckl\u00e9, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In The 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). Retrieved from https:\/\/openreview.net\/forum?id=wCu6T5xFjeJ"},{"key":"e_1_3_2_249_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.patcog.2023.109834"},{"issue":"3","key":"e_1_3_2_250_2","doi-asserted-by":"crossref","first-page":"329","DOI":"10.1016\/j.eij.2020.12.004","article-title":"Aligning document layouts extracted with different OCR engines with clustering approach","volume":"22","author":"Tomovic S.","year":"2021","unstructured":"S. Tomovic, K. Pavlovic, and M. Bajceta. 2021. Aligning document layouts extracted with different OCR engines with clustering approach. Egyptian Informatics Journal 22, 3 (2021), 329\u2013338.","journal-title":"Egyptian Informatics Journal"},{"key":"e_1_3_2_251_2","unstructured":"Hugo Touvron Louis Martin Kevin Stone Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra Prajjwal Bhargava Shruti Bhosale et al. 2023. LLaMA 2: Open Foundation and Fine-Tuned Chat Models. 
Retrieved from https:\/\/ai.meta.com\/research\/publications\/llama-2-open-foundation-and-fine-tuned-chat-models\/"},{"key":"e_1_3_2_252_2","doi-asserted-by":"crossref","first-page":"10014","DOI":"10.18653\/v1\/2023.acl-long.557","volume-title":"Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Trivedi Harsh","year":"2023","unstructured":"Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 10014\u201310037. DOI: 10.18653\/v1\/2023.acl-long.557"},{"key":"e_1_3_2_253_2","first-page":"40","volume-title":"Workshop on Open Source Information Retrieval SIGIR","author":"Trotman Andrew","year":"2012","unstructured":"Andrew Trotman, Xiangfei Jia, and Matt Crane. 2012. Towards an efficient and effective search engine. In Workshop on Open Source Information Retrieval SIGIR, 40\u201347."},{"key":"e_1_3_2_254_2","first-page":"42661","volume-title":"Advances in Neural Information Processing Systems","author":"Tworkowski Szymon","year":"2023","unstructured":"Szymon Tworkowski, Konrad Staniszewski, Miko\u0142aj Pacek, Yuhuai Wu, Henryk Michalewski, and Piotr Mi\u0142o\u015b. 2023. Focused transformer: Contrastive training for context scaling. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36, Curran Associates, Inc., 42661\u201342688. 
Retrieved from https:\/\/proceedings.neurips.cc\/paper_files\/paper\/2023\/file\/8511d06d5590f4bda24d42087802cc81-Paper-Conference.pdf"},{"key":"e_1_3_2_255_2","first-page":"19528","volume-title":"Proceedings of the IEEE\/CVF International Conference on Computer Vision","author":"Van Landeghem Jordy","year":"2023","unstructured":"Jordy Van Landeghem, Rub\u00e8n Tito, \u0141ukasz Borchmann, Micha\u0142 Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Jurkiewicz, Micka\u00ebl Coustaty, Bertrand Anckaert, Ernest Valveny, et al. 2023. Document understanding dataset and evaluation (DUDE). In Proceedings of the IEEE\/CVF International Conference on Computer Vision, 19528\u201319540."},{"key":"e_1_3_2_256_2","first-page":"6000","volume-title":"Attention is All You Need (NIPS \u201917)","author":"Vaswani Ashish","year":"2017","unstructured":"Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, \u0141ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need (NIPS \u201917). Curran Associates Inc., Red Hook, NY, 6000\u20136010."},{"key":"e_1_3_2_257_2","unstructured":"Bin Wang Zhuangcheng Gu Guang Liang Chao Xu Bo Zhang Botian Shi and Conghui He. 2024. UniMERNet: A universal network for real-world mathematical expression recognition. arXiv:2404.15254. Retrieved from https:\/\/arxiv.org\/abs\/2404.15254"},{"key":"e_1_3_2_258_2","unstructured":"Bin Wang Fan Wu Linke Ouyang Zhuangcheng Gu Rui Zhang Renqiu Xia Bo Zhang and Conghui He. 2024. Cdm: A reliable metric for fair and accurate formula recognition evaluation. arXiv:2409.03643. Retrieved from https:\/\/arxiv.org\/abs\/2409.03643"},{"key":"e_1_3_2_259_2","unstructured":"Bin Wang Chao Xu Xiaomeng Zhao Linke Ouyang Fan Wu Zhiyuan Zhao Rui Xu Kaiwen Liu Yuan Qu Fukai Shang et al. 2024. MinerU: An open-source solution for precise document content extraction. arXiv:2409.18839. 
Retrieved from https:\/\/arxiv.org\/abs\/2409.18839"},{"key":"e_1_3_2_260_2","doi-asserted-by":"crossref","first-page":"8529","DOI":"10.18653\/v1\/2024.acl-long.463","volume-title":"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Wang Dongsheng","year":"2024","unstructured":"Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. 2024. DocLLM: A Layout-Aware generative language model for multimodal document understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Bangkok, Thailand, 8529\u20138548. Retrieved from https:\/\/aclanthology.org\/2024.acl-long.463"},{"key":"e_1_3_2_261_2","first-page":"53","volume-title":"Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR \u201923)","author":"Wang Jilin","year":"2023","unstructured":"Jilin Wang, Michael Krumdick, Baojia Tong, Halim Hamima, Maxim Sokolov, Vadym Barda, Delphine Vendryes, and Chris Tanner. 2023. A graphical approach to document layout analysis. In Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR \u201923), Part V, 53\u201369. DOI: 10.1007\/978-3-031-41734-4_4"},{"key":"e_1_3_2_262_2","doi-asserted-by":"crossref","first-page":"9414","DOI":"10.18653\/v1\/2023.emnlp-main.585","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Wang Liang","year":"2023","unstructured":"Liang Wang, Nan Yang, and Furu Wei. 2023. Query2doc: Query expansion with large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 9414\u20139423. 
DOI: 10.18653\/v1\/2023.emnlp-main.585"},{"key":"e_1_3_2_263_2","doi-asserted-by":"crossref","unstructured":"Qiuchen Wang Ruixue Ding Zehui Chen Weiqi Wu Shihang Wang Pengjun Xie and Feng Zhao. 2025. ViDoRAG: Visual document retrieval-augmented generation via dynamic iterative reasoning agents. arXiv:2502.18017. Retrieved from https:\/\/arxiv.org\/abs\/2502.18017","DOI":"10.18653\/v1\/2025.emnlp-main.464"},{"key":"e_1_3_2_264_2","first-page":"74530","article-title":"Augmenting language models with long-term memory","volume":"36","author":"Wang Weizhi","year":"2024","unstructured":"Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. 2024. Augmenting language models with long-term memory. Advances in Neural Information Processing Systems 36 (2024), 74530\u201374543.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_3_2_265_2","first-page":"8299","volume-title":"Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI \u201924)","author":"Wang Xindi","year":"2024","unstructured":"Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, and Armaghan Eshaghi. 2024. Beyond the limits: A survey of techniques to extend the context length in large language models. In Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI \u201924). International Joint Conferences on Artificial Intelligence Organization, 8299\u20138307. DOI: 10.24963\/ijcai.2024\/917"},{"key":"e_1_3_2_266_2","volume-title":"The 11th International Conference on Learning Representations","author":"Wang Xuezhi","year":"2023","unstructured":"Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The 11th International Conference on Learning Representations. 
Retrieved from https:\/\/openreview.net\/forum?id=1PL1NIMMrw"},{"key":"e_1_3_2_267_2","unstructured":"Y. Wang D. Ma and D. Cai. 2024. With greater text comes greater necessity: Inference-time training helps long text generation. arXiv:2401.11504. Retrieved from https:\/\/arxiv.org\/abs\/2401.11504"},{"key":"e_1_3_2_268_2","unstructured":"Yonghui Wang Wengang Zhou Hao Feng Keyi Zhou and Houqiang Li. 2023. Towards improving document understanding: An exploration on text-grounding via mllms. arXiv:2311.13194. Retrieved from https:\/\/arxiv.org\/abs\/2311.13194"},{"issue":"4","key":"e_1_3_2_269_2","doi-asserted-by":"crossref","first-page":"600","DOI":"10.1109\/TIP.2003.819861","article-title":"Image quality assessment: From error visibility to structural similarity","volume":"13","author":"Wang Zhou","year":"2004","unstructured":"Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. 2004. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600\u2013612.","journal-title":"IEEE Transactions on Image Processing"},{"key":"e_1_3_2_270_2","unstructured":"Zilong Wang Hao Zhang Chun-Liang Li Julian Martin Eisenschlos Vincent Perot Zifeng Wang Lesly Miculicich Yasuhisa Fujii Jingbo Shang Chen-Yu Lee and Tomas Pfister. 2024. Chain-of-table: Evolving tables in the reasoning chain for table understanding. In International Conference on Learning Representations."},{"key":"e_1_3_2_271_2","doi-asserted-by":"crossref","unstructured":"Navve Wasserman Oliver Heinimann Yuval Golbari Tal Zimbalist Eli Schwartz and Michal Irani. 2025. DocReRank: Single-page hard negative query generation for training multi-modal RAG rerankers. arXiv:2505.22584. Retrieved from https:\/\/arxiv.org\/abs\/2505.22584","DOI":"10.18653\/v1\/2025.emnlp-main.436"},{"key":"e_1_3_2_272_2","unstructured":"Haoran Wei Lingyu Kong Jinyue Chen Liang Zhao Zheng Ge En Yu Jianjian Sun Chunrui Han and Xiangyu Zhang. 2024. 
Small language model meets with reinforced vision vocabulary. arXiv:2401.12503. Retrieved from https:\/\/arxiv.org\/abs\/2401.12503"},{"key":"e_1_3_2_273_2","first-page":"408","volume-title":"Proceedings of the 18th European Conference on Computer Vision (ECCV \u201924)","author":"Wei Haoran","year":"2024","unstructured":"Haoran Wei, Lingyu Kong, Jinyue Chen, Zhao Liang, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. 2024. Vary: Scaling up the vision vocabulary for large vision-language model. In Proceedings of the 18th European Conference on Computer Vision (ECCV \u201924), Part IV. Springer-Verlag, Berlin, 408\u2013424. DOI: 10.1007\/978-3-031-73235-5_23"},{"key":"e_1_3_2_274_2","unstructured":"Haoran Wei Chenglong Liu Jinyue Chen Jia Wang Lingyu Kong Yanming Xu Zheng Ge Liang Zhao Jianjian Sun Yuang Peng et al. 2024. General OCR theory: Towards OCR-2.0 via a unified end-to-end model. arXiv:2409.01704. Retrieved from https:\/\/arxiv.org\/abs\/2409.01704"},{"key":"e_1_3_2_275_2","first-page":"2367","volume-title":"Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Wei Mengxi","year":"2020","unstructured":"Mengxi Wei, Yifan He, and Qiong Zhang. 2020. Robust layout-aware IE for visually rich documents with pre-trained language models. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2367\u20132376."},{"key":"e_1_3_2_276_2","doi-asserted-by":"publisher","DOI":"10.1056\/AIdbp2400537"},{"key":"e_1_3_2_277_2","volume-title":"International Conference on Learning Representations","author":"Wu Yuhuai","year":"2022","unstructured":"Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. 2022. Memorizing transformers. In International Conference on Learning Representations. 
Retrieved from https:\/\/openreview.net\/forum?id=TrjbxzRcnf"},{"key":"e_1_3_2_278_2","volume-title":"The 12th International Conference on Learning Representations","author":"Xiao Guangxuan","year":"2024","unstructured":"Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. Efficient streaming language models with attention sinks. In The 12th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=NG7sS51zVF"},{"key":"e_1_3_2_279_2","unstructured":"Weijian Xie Xuefeng Liang Yuhui Liu Kaihua Ni Hong Cheng and Zetian Hu. 2024. Weknow-rag: An adaptive approach for retrieval-augmented generation integrating web search and knowledge graphs. arXiv:2408.07611. Retrieved from https:\/\/arxiv.org\/abs\/2408.07611"},{"key":"e_1_3_2_280_2","unstructured":"Xudong Xie Hao Yan Liang Yin Liu Yang Jing Ding Minghui Liao Yuliang Liu Wei Chen and Xiang Bai. 2024. WuKong: A large multimodal model for efficient long PDF reading with end-to-end sparse sampling. arXiv:2410.05970. Retrieved from https:\/\/arxiv.org\/abs\/2410.05970"},{"key":"e_1_3_2_281_2","unstructured":"Xudong Xie Hao Yan Liang Yin Yang Liu Jing Ding Minghui Liao Yuliang Liu Wei Chen and Xiang Bai. 2025. PDF-WuKong: A large multimodal model for efficient long PDF reading with end-to-end sparse sampling. arXiv:2410.05970. Retrieved from https:\/\/arxiv.org\/abs\/2410.05970"},{"key":"e_1_3_2_282_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.findings-acl.281"},{"key":"e_1_3_2_283_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.findings-emnlp.695"},{"key":"e_1_3_2_284_2","first-page":"1192","volume-title":"Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","author":"Xu Yiheng","year":"2020","unstructured":"Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, and Ming Zhou. 2020. LayoutLM: Pre-training of text and layout for document image understanding. 
In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1192\u20131200. DOI: 10.1145\/3394486.3403172"},{"key":"e_1_3_2_285_2","first-page":"3214","volume-title":"Findings of the Association for Computational Linguistics (ACL \u201922)","author":"Xu Yiheng","year":"2022","unstructured":"Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, and Furu Wei. 2022. XFUND: A benchmark dataset for multilingual visually rich form understanding. In Findings of the Association for Computational Linguistics (ACL \u201922), 3214\u20133224."},{"key":"e_1_3_2_286_2","doi-asserted-by":"crossref","unstructured":"Shi-Qi Yan Jia-Chen Gu Yun Zhu and Zhen-Hua Ling. 2024. Corrective retrieval augmented generation. arXiv:2401.15884. Retrieved from https:\/\/arxiv.org\/abs\/2401.15884","DOI":"10.2139\/ssrn.5267341"},{"key":"e_1_3_2_287_2","doi-asserted-by":"crossref","first-page":"279","DOI":"10.1109\/ISRITI60336.2023.10467285","volume-title":"2023 6th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI)","author":"Yang Hao","year":"2023","unstructured":"Hao Yang, Min Zhang, and Daimeng Wei. 2023. Srag: speech retrieval augmented generation for spoken language understanding. In 2023 6th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI). IEEE, 279\u2013283."},{"key":"e_1_3_2_288_2","unstructured":"Jianxin Yang. 2023. Longqlora: Efficient and effective method to extend context length of large language models. arXiv:2311.04879. Retrieved from https:\/\/arxiv.org\/abs\/2311.04879"},{"key":"e_1_3_2_289_2","unstructured":"Yazheng Yang Yuqi Wang Sankalok Sen Lei Li and Qi Liu. 2024. Unleashing the potential of large language models for predictive tabular tasks in data science. arXiv:2403.20208. 
Retrieved from https:\/\/arxiv.org\/abs\/2403.20208"},{"key":"e_1_3_2_290_2","unstructured":"Zhibo Yang Jun Tang Zhaohai Li Pengfei Wang Jianqiang Wan Humen Zhong Xuejing Liu Mingkun Yang Peng Wang Shuai Bai et al. 2024. CC-OCR: A comprehensive and challenging OCR benchmark for evaluating large multimodal models in literacy. arXiv:2412.02210. Retrieved from https:\/\/arxiv.org\/abs\/2412.02210"},{"key":"e_1_3_2_291_2","doi-asserted-by":"crossref","first-page":"297","DOI":"10.1007\/s10115-017-1042-4","article-title":"Recent advances in document summarization","volume":"53","author":"Yao Jin-Ge","year":"2017","unstructured":"Jin-Ge Yao, Xiaojun Wan, and Jianguo Xiao. 2017. Recent advances in document summarization. Knowledge and Information Systems 53 (2017), 297\u2013336.","journal-title":"Knowledge and Information Systems"},{"key":"e_1_3_2_292_2","unstructured":"Jiabo Ye Anwen Hu Haiyang Xu Qinghao Ye Ming Yan Yuhao Dan Chenlin Zhao Guohai Xu Chenliang Li Junfeng Tian et al. 2023. mPLUG-DocOwl: Modularized multimodal large language model for document understanding. arXiv:2307.02499. Retrieved from https:\/\/arxiv.org\/abs\/2307.02499"},{"key":"e_1_3_2_293_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-emnlp.187"},{"key":"e_1_3_2_294_2","volume-title":"Proceedings of the International Conference on Learning Representations (ICLR)","author":"Ye Jiabo","year":"2025","unstructured":"Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. 2025. mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models. In Proceedings of the International Conference on Learning Representations (ICLR)."},{"key":"e_1_3_2_295_2","unstructured":"Qinghao Ye Haiyang Xu Guohai Xu Jiabo Ye Ming Yan Yiyang Zhou Junyang Wang Anwen Hu Pengcheng Shi et al. 2024. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv:2304.14178. 
Retrieved from https:\/\/arxiv.org\/abs\/2304.14178"},{"key":"e_1_3_2_296_2","first-page":"174","volume-title":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR \u201923)","author":"Ye Yunhu","year":"2023","unstructured":"Yunhu Ye, Binyuan Hui, Min Yang, Binhua Li, Fei Huang, and Yongbin Li. 2023. Large language models are versatile decomposers: Decomposing evidence and questions for table-based reasoning. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR \u201923). ACM, New York, NY, 174\u2013184. DOI: 10.1145\/3539618.3591708"},{"key":"e_1_3_2_297_2","unstructured":"Antonio Jimeno Yepes Yao You Jan Milczek Sebastian Laverde and Leah Li. 2024. Financial report chunking for effective retrieval augmented generation. arXiv:2402.05131. Retrieved from https:\/\/arxiv.org\/abs\/2402.05131"},{"key":"e_1_3_2_298_2","unstructured":"Wenwen Yu Zhibo Yang Jianqiang Wan Sibo Song Jun Tang Wenqing Cheng Yuliang Liu and Xiang Bai. 2025. OmniParser V2: Structured-points-of-thought for unified visual text parsing and its generality to multimodal large language models. arXiv:2502.16161. Retrieved from https:\/\/arxiv.org\/abs\/2502.16161"},{"key":"e_1_3_2_299_2","unstructured":"Ya-Qi Yu Minghui Liao Jihao Wu Yongxin Liao Xiaoyu Zheng and Wei Zeng. 2024. TextHawk: Exploring efficient fine-grained perception of multimodal large language models. arXiv:2404.09204. Retrieved from https:\/\/arxiv.org\/abs\/2404.09204"},{"key":"e_1_3_2_300_2","unstructured":"Chongjian Yue Xinrun Xu Xiaojun Ma Lun Du Hengyu Liu Zhiming Ding Yanbing Jiang Shi Han and Dongmei Zhang. 2024. Enabling and analyzing how to efficiently extract information from hybrid long documents with LLMs. arXiv:2305.16344. 
Retrieved from https:\/\/arxiv.org\/abs\/2305.16344"},{"key":"e_1_3_2_301_2","unstructured":"Zhenrui Yue Honglei Zhuang Aijun Bai Kai Hui Rolf Jagerman Hansi Zeng Zhen Qin Dong Wang Xuanhui Wang and Michael Bendersky. 2024. Inference scaling for long-context retrieval augmented generation. arXiv:2410.04343. Retrieved from https:\/\/arxiv.org\/abs\/2410.04343"},{"key":"e_1_3_2_302_2","unstructured":"Chi Zhang and Qiyang Chen. 2025. HD-RAG: Retrieval-augmented generation for hybrid documents containing text and hierarchical tables. arXiv:2504.09554. Retrieved from https:\/\/arxiv.org\/abs\/2504.09554"},{"key":"e_1_3_2_303_2","unstructured":"Junyuan Zhang Qintong Zhang Bin Wang Ouyang Linke Zichen Wen Ying Li Ka-Ho Chow Conghui He and Wentao Zhang. 2024. OCR hinders RAG: Evaluating the cascading impact of OCR on retrieval-augmented generation. arXiv:2412.02592. Retrieved from https:\/\/arxiv.org\/abs\/2412.02592"},{"issue":"6","key":"e_1_3_2_304_2","doi-asserted-by":"crossref","first-page":"1245","DOI":"10.1137\/0218082","article-title":"Simple fast algorithms for the editing distance between trees and related problems","volume":"18","author":"Zhang Kaizhong","year":"1989","unstructured":"Kaizhong Zhang and Dennis Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing 18, 6 (1989), 1245\u20131262.","journal-title":"SIAM Journal on Computing"},{"key":"e_1_3_2_305_2","first-page":"115","volume-title":"Proceedings of the 16th International Conference on Document Analysis and Recognition (ICDAR \u201921)","author":"Zhang Peng","year":"2021","unstructured":"Peng Zhang, Can Li, Liang Qiao, Zhanzhan Cheng, Shiliang Pu, Yi Niu, and Fei Wu. 2021. VSR: A unified framework for document layout analysis combining vision, semantics and relations. In Proceedings of the 16th International Conference on Document Analysis and Recognition (ICDAR \u201921), Part I, 115\u2013130. 
DOI: 10.1007\/978-3-030-86549-8_8"},{"key":"e_1_3_2_306_2","unstructured":"Peitian Zhang Zheng Liu Shitao Xiao Ninglu Shao Qiwei Ye and Zhicheng Dou. 2024. Soaring from 4K to 400K: Extending LLM\u2019s context with activation beacon. arXiv:2401.03462. Retrieved from https:\/\/arxiv.org\/abs\/2401.03462"},{"key":"e_1_3_2_307_2","doi-asserted-by":"crossref","unstructured":"Qintong Zhang Bin Wang Victor Shea Jay Huang Junyuan Zhang Zhengren Wang Hao Liang Conghui He and Wentao Zhang. 2025. Document parsing unveiled: Techniques challenges and prospects for structured information extraction. arXiv:2410.21169. Retrieved from https:\/\/arxiv.org\/abs\/2410.21169","DOI":"10.1039\/D5CP02471D"},{"key":"e_1_3_2_308_2","doi-asserted-by":"crossref","first-page":"6024","DOI":"10.18653\/v1\/2024.naacl-long.335","volume-title":"Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)","author":"Zhang Tianshu","year":"2024","unstructured":"Tianshu Zhang, Xiang Yue, Yifei Li, and Huan Sun. 2024. TableLlama: Towards open large generalist models for tables. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational Linguistics, Mexico City, Mexico, 6024\u20136044. Retrieved from https:\/\/aclanthology.org\/2024.naacl-long.335"},{"key":"e_1_3_2_309_2","first-page":"2861","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","author":"Zhang Xiaobing","year":"2020","unstructured":"Xiaobing Zhang, Jian Chen, and Chunhua Shen. 2020. Wildreceipt: Receipt text recognition in the wild with arbitrary orientation and corrupted data. 
In Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2861\u20132870."},{"key":"e_1_3_2_310_2","unstructured":"Xiaokang Zhang Jing Zhang Zeyao Ma Yang Li Bohan Zhang Guanlin Li Zijun Yao Kangli Xu Jinchang Zhou Daniel Zhang-Li et al. 2024. TableLLM: Enabling tabular data manipulation by LLMs in real office usage scenarios. arXiv:2403.19318. Retrieved from https:\/\/arxiv.org\/abs\/2403.19318"},{"key":"e_1_3_2_311_2","first-page":"2214","volume-title":"Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics","author":"Zhang Yuwei","year":"2022","unstructured":"Yuwei Zhang, Yoon Kim, Xingyu Chen, and Robin Jia. 2022. TILT: A transformer-based document information extraction framework via pretraining. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, 2214\u20132229."},{"key":"e_1_3_2_312_2","doi-asserted-by":"crossref","first-page":"14786","DOI":"10.18653\/v1\/2023.emnlp-main.914","volume-title":"Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing","author":"Zhao Bowen","year":"2023","unstructured":"Bowen Zhao, Changkai Ji, Yuejie Zhang, He Wen, Yingwen Wang, Qing Wang, Rui Feng, and Xiaobo Zhang. 2023. Large language models are complex table parsers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 14786\u201314802. DOI: 10.18653\/v1\/2023.emnlp-main.914"},{"key":"e_1_3_2_313_2","first-page":"564","volume-title":"European Conference on Computer Vision","author":"Zhong Xu","year":"2020","unstructured":"Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. 2020. Image-based table recognition: Data, model, and evaluation. In European Conference on Computer Vision. 
Springer, 564\u2013580."},{"key":"e_1_3_2_314_2","doi-asserted-by":"crossref","first-page":"1015","DOI":"10.1109\/ICDAR.2019.00166","volume-title":"2019 International Conference on Document Analysis and Recognition (ICDAR)","author":"Zhong Xu","year":"2019","unstructured":"Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. 2019. Publaynet: Largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 1015\u20131022."},{"key":"e_1_3_2_315_2","unstructured":"Zijie Zhong Hanwen Liu Xiaoya Cui Zhang Xiaofan and Zengchang Qin. 2024. Mix-of-granularity: Optimize the chunking granularity for retrieval-augmented generation. arXiv:2406.00456. Retrieved from https:\/\/arxiv.org\/abs\/2406.00456"},{"key":"e_1_3_2_316_2","volume-title":"The 12th International Conference on Learning Representations","author":"Zhu Dawei","year":"2024","unstructured":"Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. 2024. PoSE: Efficient context window extension of LLMs via positional skip-wise training. In The 12th International Conference on Learning Representations. Retrieved from https:\/\/openreview.net\/forum?id=3Z1gxuAQrA"},{"key":"e_1_3_2_317_2","first-page":"21","volume-title":"Proceedings of the 18th International Conference on Document Analysis and Recognition (ICDAR \u201924)","author":"Zhu Jianhua","year":"2024","unstructured":"Jianhua Zhu, Liangcai Gao, and Wenqi Zhao. 2024. ICAL: Implicit character-aided learning for enhanced handwritten mathematical expression recognition. In Proceedings of the 18th International Conference on Document Analysis and Recognition (ICDAR \u201924), Part V. Springer-Verlag, Berlin, 21\u201337. DOI: 10.1007\/978-3-031-70549-6_2"},{"key":"e_1_3_2_318_2","unstructured":"Wei Zou Runpeng Geng Binghui Wang and Jinyuan Jia. 2024. PoisonedRAG: Knowledge poisoning attacks to retrieval-augmented generation of large language models. arXiv:2402.07867. 
Retrieved from https:\/\/arxiv.org\/abs\/2402.07867"},{"key":"e_1_3_2_319_2","article-title":"Accelerated evidence synthesis in orthopaedics\u2014The roles of natural language processing, expert annotation and large language models","volume":"10","author":"Zsidai B\u00e1lint","year":"2023","unstructured":"B\u00e1lint Zsidai, Janina Kaarre, Ann-Sophie Hilkert, Eric Narup, Eric Hamrin Senorski, Alberto Grassi, Olufemi R. Ayeni, Volker Musahl, Christophe Ley, Elmar Herbst, et al. 2023. Accelerated evidence synthesis in orthopaedics\u2014The roles of natural language processing, expert annotation and large language models. Journal of Experimental Orthopaedics 10 (2023). Retrieved from https:\/\/api.semanticscholar.org\/CorpusID:263095798","journal-title":"Journal of Experimental Orthopaedics"}],"container-title":["ACM Transactions on Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3768156","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,11,14]],"date-time":"2025-11-14T14:12:35Z","timestamp":1763129555000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3768156"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,11,14]]},"references-count":318,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1,31]]}},"alternative-id":["10.1145\/3768156"],"URL":"https:\/\/doi.org\/10.1145\/3768156","relation":{},"ISSN":["1046-8188","1558-2868"],"issn-type":[{"value":"1046-8188","type":"print"},{"value":"1558-2868","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,11,14]]},"assertion":[{"value":"2024-09-24","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-09-03","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2025-11-14","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}