{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T17:46:17Z","timestamp":1770745577241,"version":"3.49.0"},"reference-count":25,"publisher":"Emerald","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2026,2,11]]},"abstract":"<jats:sec>\n                    <jats:title>Purpose<\/jats:title>\n                    <jats:p>The purpose of this study is to address the task of visually rich document understanding, which extracts structured semantics from complex multimodal document images. Core challenges, including feature distortion from geometric deformations and structural deviations from complex layouts, jointly undermine extraction precision and robustness \u2013 the former due to inadequate cross-modal adaptation to local distortions and the latter due to constrained global topology modeling.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Design\/methodology\/approach<\/jats:title>\n                    <jats:p>To address these limitations, the Geometry-Topology Collaborative Parsing Framework is proposed. This framework achieves robust document parsing through dual technical approaches. First, deformable kernels are used to correct geometrically warped text features, with channel attention mechanisms integrated for feature noise suppression. Second, cross-attention mechanisms and the XY-Cut algorithm work collaboratively to construct reading orders adapted to document topological structures. 
This co-design ensures simultaneous optimization of text feature enhancement and reading-order construction.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Findings<\/jats:title>\n                    <jats:p>The framework demonstrates superior performance in document classification and structured information extraction through synergistic optimization of geometric correction and structural relationship decoding, significantly reducing reliance on error-prone optical character recognition (OCR) preprocessing.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Originality\/value<\/jats:title>\n                    <jats:p>This work establishes a novel paradigm for joint spatial-semantic document understanding via three key advancements: adaptive distortion rectification through deformable-convolutional text localization; density-aware reading sequence generation for robust layout parsing; and unified representation learning bridging visual features with structural topology without intermediate OCR dependencies.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1108\/ijwis-07-2025-0204","type":"journal-article","created":{"date-parts":[[2025,12,8]],"date-time":"2025-12-08T08:45:54Z","timestamp":1765183554000},"page":"1-16","source":"Crossref","is-referenced-by-count":0,"title":["Geometry-aware and autonomous reading-order learning for OCR-free document parsing"],"prefix":"10.1108","volume":"22","author":[{"given":"Weina","family":"Zhang","sequence":"first","affiliation":[{"name":"Shanghai University of Electric Power College of Computer Science and Technology, , Shanghai,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Fangyu","family":"Liu","sequence":"additional","affiliation":[{"name":"Shanghai University of Electric Power College of Computer Science and Technology, , 
Shanghai,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhe","family":"Xu","sequence":"additional","affiliation":[{"name":"Shanghai University of Electric Power College of Computer Science and Technology, , Shanghai,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Zhongqin","family":"Bi","sequence":"additional","affiliation":[{"name":"Shanghai University of Electric Power College of Computer Science and Technology, , Shanghai,","place":["China"]}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"140","published-online":{"date-parts":[[2025,12,9]]},"reference":[{"key":"2026020922465767600_ref001","doi-asserted-by":"crossref","first-page":"241","DOI":"10.1007\/978-3-031-73242-3_14","volume-title":"Computer Vision \u2013 ECCV 2024","author":"Abramovich","year":"2025"},{"key":"2026020922465767600_ref002","doi-asserted-by":"crossref","unstructured":"Appalaraju, S.\n          , et al. (2021), \u201cDocFormer: end-to-end transformer for document understanding\u201d, arXiv: 2106.11539 [cs.CV], available at:Link to the cited article.","DOI":"10.1109\/ICCV48922.2021.00103"},{"key":"2026020922465767600_ref003","unstructured":"Cheng, Z.\n          , et al. (2022), \u201cTRIE++: towards end-to-end information extraction from visually rich documents\u201d, arXiv: 2207.06744 [cs.CV], available at:Link to the cited article."},{"key":"2026020922465767600_ref004","unstructured":"Devlin, J.\n          , et al. (2019), \u201cBERT: pre-training of deep bidirectional transformers for language understanding\u201d, arXiv: 1810.04805 [cs.CL], available at:Link to the cited article."},{"key":"2026020922465767600_ref005","doi-asserted-by":"crossref","unstructured":"Feng, H.\n          , et al. 
(2024), \u201cDocPedia: unleashing the power of large multimodal model in the frequency domain for versatile document understanding\u201d, arXiv: 2311.11810 [cs.CV], available at:Link to the cited article.","DOI":"10.1007\/s11432-024-4250-y"},{"key":"2026020922465767600_ref006","doi-asserted-by":"crossref","unstructured":"Gu, Z.\n          , et al. (2022), \u201cXYLayoutLM: towards layout-aware multimodal networks for Visually-Rich document understanding\u201d, available at:Link to the cited article.","DOI":"10.1109\/CVPR52688.2022.00454"},{"key":"2026020922465767600_ref007","doi-asserted-by":"publisher","first-page":"952","DOI":"10.1109\/ICDAR.1995.602059","article-title":"Recursive X-Y cut using bounding boxes of connected components","author":"Ha","year":"1995"},{"key":"2026020922465767600_ref008","doi-asserted-by":"crossref","unstructured":"Huang, Y.\n          , et al. (2022), \u201cLayoutLMv3: pre-training for document AI with unified text and image masking\u201d, arXiv: 2204.08387 [cs.CL], available at:Link to the cited article.","DOI":"10.1145\/3503161.3548112"},{"key":"2026020922465767600_ref009","doi-asserted-by":"publisher","first-page":"1516","DOI":"10.1109\/ICDAR.2019.00244","article-title":"ICDAR2019 competition on scanned receipt OCR and information extraction","author":"Huang","year":"2019"},{"key":"2026020922465767600_ref010","article-title":"Post-OCR parsing: building simple and robust parser via BIO tagging","author":"Hwang","year":"2019"},{"key":"2026020922465767600_ref011","doi-asserted-by":"crossref","unstructured":"Jaume, G., Ekenel, H., Kemal. and Thiran, J.-P. (2019), \u201cFUNSD: a dataset for form understanding in noisy scanned documents\u201d, arXiv: 1905.13538 [cs.IR], available at:Link to the cited article.","DOI":"10.1109\/ICDARW.2019.10029"},{"key":"2026020922465767600_ref012","unstructured":"Kim, G.\n          , et al. 
(2022), \u201cOCR-free document understanding transformer\u201d, arXiv: 2111.15664 [cs.LG], available at:Link to the cited article."},{"key":"2026020922465767600_ref013","doi-asserted-by":"crossref","unstructured":"Kuang, J.\n          , et al. (2023), \u201cVisual information extraction in the wild: practical dataset and end-to-end solution\u201d, arXiv: 2305.07498 [cs.CV], available at:Link to the cited article.","DOI":"10.1007\/978-3-031-41731-3_3"},{"key":"2026020922465767600_ref014","unstructured":"Lee, K.\n          , et al (2023), \u201cPix2Struct: Screenshot parsing as pretraining for visual language understanding\u201d, arXiv: 2210.03347 [cs.CL], available at:Link to the cited article."},{"key":"2026020922465767600_ref015","doi-asserted-by":"publisher","first-page":"6309","DOI":"10.18653\/v1\/2021.acl-long.493","article-title":"StructuralLM: structural pre-training for form understanding","volume":"1","author":"Li","year":"2021","journal-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing"},{"key":"2026020922465767600_ref016","unstructured":"Li, Z.\n          , et al. (2024), \u201cMonkey: image resolution and text label are important things for large multi-modal models\u201d, arXiv: 2311.06607 [cs.CV],available at:Link to the cited article."},{"key":"2026020922465767600_ref017","unstructured":"Liu, Y.\n          , et al. (2024), \u201cTextMonkey: an OCR-free large multimodal model for understanding document\u201d, arXiv: 2403.04473 [cs.CV], available at:Link to the cited article."},{"key":"2026020922465767600_ref018","doi-asserted-by":"crossref","unstructured":"Liu, Z.\n          , et al. 
(2021), \u201cSwin transformer: hierarchical vision transformer using shifted windows\u201d, arXiv: 2103.14030 [cs.CV], available at:Link to the cited article.","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"2026020922465767600_ref019","doi-asserted-by":"publisher","first-page":"3","DOI":"10.1007\/978-3-031-72437-4_1","volume-title":"Linking Theory and Practice of Digital Libraries","author":"Nguyen","year":"2024"},{"key":"2026020922465767600_ref020","first-page":"2579","article-title":"LayoutLMv2: Multi-modal pre-training for visually-rich document understanding","author":"Xu","year":"2021"},{"key":"2026020922465767600_ref021","doi-asserted-by":"publisher","first-page":"1192","DOI":"10.1145\/3394486.3403172","article-title":"LayoutLM: pre-training of text and layout for document image understanding","author":"Xu","year":"2020"},{"key":"2026020922465767600_ref022","doi-asserted-by":"crossref","unstructured":"Ye, J.\n          , et al. (2023a), \u201cUReader: universal OCR-free visually-situated language understanding with multimodal large language model\u201d, arXiv: 2310.05126 [cs.CV], available at:Link to the cited article.","DOI":"10.18653\/v1\/2023.findings-emnlp.187"},{"key":"2026020922465767600_ref023","unstructured":"Ye, Q.\n          , et al. (2023b), \u201cmPLUG-Owl2: revolutionizing multi-modal large language model with modality collaboration\u201d, arXiv: 2311.04257 [cs.CL], available at:https:\/\/arxiv.org\/abs\/2311.04257"},{"key":"2026020922465767600_ref024","unstructured":"Zhang, P.\n          , et al. 
(2021), \u201cTRIE: end-to-end text reading and information extraction for document understanding\u201d, arXiv: 2005.13118 [cs.CV], available at:Link to the cited article."},{"key":"2026020922465767600_ref025","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2407.14439","article-title":"Token-level correlation-guided compression for efficient multimodal document understanding","author":"Zhang","year":"2024"}],"container-title":["International Journal of Web Information Systems"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.emerald.com\/ijwis\/article-pdf\/22\/1\/1\/10966867\/ijwis-07-2025-0204en.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/www.emerald.com\/ijwis\/article-pdf\/22\/1\/1\/10966867\/ijwis-07-2025-0204en.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T03:47:19Z","timestamp":1770695239000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.emerald.com\/ijwis\/article\/22\/1\/1\/1323786\/Geometry-aware-and-autonomous-reading-order"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,9]]},"references-count":25,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,2,11]]}},"URL":"https:\/\/doi.org\/10.1108\/ijwis-07-2025-0204","relation":{},"ISSN":["1744-0084","1744-0092"],"issn-type":[{"value":"1744-0084","type":"print"},{"value":"1744-0092","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,9]]}}}