{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,20]],"date-time":"2025-09-20T14:23:21Z","timestamp":1758378201678,"version":"3.44.0"},"reference-count":52,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:00:00Z","timestamp":1750204800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T00:00:00Z","timestamp":1750204800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"name":"Hochschule RheinMain"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["IJDAR"],"published-print":{"date-parts":[[2025,9]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Deploying state-of-the-art document understanding models remains resource-intensive and impractical in many real-world scenarios, particularly where labeled data is scarce and computational budgets are constrained. To address these challenges, this work proposes a novel approach towards parameter-efficient document understanding models capable of adapting to specific tasks and document types without the need for labeled data. Specifically, we propose an approach coined <jats:italic>SlimDoc<\/jats:italic> to distill multimodal document transformer encoder models into smaller student models, using internal signals at different training stages, followed by external signals. Our approach is inspired by TinyBERT and adapted to the domain of document understanding transformers. We demonstrate SlimDoc to outperform both a single-stage distillation and a direct fine-tuning of the student. 
Experimental results across six document understanding datasets demonstrate our approach\u2019s effectiveness: Our distilled student models achieve on average <jats:inline-formula>\n              <jats:alternatives>\n                <jats:tex-math>$$93.0\\%$$<\/jats:tex-math>\n                <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>93.0<\/mml:mn>\n                    <mml:mo>%<\/mml:mo>\n                  <\/mml:mrow>\n                <\/mml:math>\n              <\/jats:alternatives>\n            <\/jats:inline-formula> of the teacher\u2019s performance, while the fine-tuned students achieve <jats:inline-formula>\n              <jats:alternatives>\n                <jats:tex-math>$$87.0\\%$$<\/jats:tex-math>\n                <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>87.0<\/mml:mn>\n                    <mml:mo>%<\/mml:mo>\n                  <\/mml:mrow>\n                <\/mml:math>\n              <\/jats:alternatives>\n            <\/jats:inline-formula> of the teacher\u2019s performance. 
Without requiring any labeled data, we create a compact student which achieves <jats:inline-formula>\n              <jats:alternatives>\n                <jats:tex-math>$$96.0\\%$$<\/jats:tex-math>\n                <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>96.0<\/mml:mn>\n                    <mml:mo>%<\/mml:mo>\n                  <\/mml:mrow>\n                <\/mml:math>\n              <\/jats:alternatives>\n            <\/jats:inline-formula> of the performance of its supervised-distilled counterpart and <jats:inline-formula>\n              <jats:alternatives>\n                <jats:tex-math>$$86.2\\%$$<\/jats:tex-math>\n                <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                  <mml:mrow>\n                    <mml:mn>86.2<\/mml:mn>\n                    <mml:mo>%<\/mml:mo>\n                  <\/mml:mrow>\n                <\/mml:math>\n              <\/jats:alternatives>\n            <\/jats:inline-formula> of the performance of a supervised-fine-tuned teacher model. We demonstrate our distillation approach to pick up on document geometry and to be effective on the two popular document understanding models LiLT and LayoutLMv3. 
Our implementation and training data are available at <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xlink:href=\"https:\/\/github.com\/marcel-lamott\/SlimDoc\" ext-link-type=\"uri\">https:\/\/github.com\/marcel-lamott\/SlimDoc<\/jats:ext-link>.<\/jats:p>","DOI":"10.1007\/s10032-025-00542-w","type":"journal-article","created":{"date-parts":[[2025,6,18]],"date-time":"2025-06-18T07:41:12Z","timestamp":1750232472000},"page":"457-473","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["SlimDoc: lightweight distillation of document transformer models"],"prefix":"10.1007","volume":"28","author":[{"given":"Marcel","family":"Lamott","sequence":"first","affiliation":[]},{"given":"Muhammad Armaghan","family":"Shakir","sequence":"additional","affiliation":[]},{"given":"Adrian","family":"Ulges","sequence":"additional","affiliation":[]},{"given":"Yves-Noel","family":"Weweler","sequence":"additional","affiliation":[]},{"given":"Faisal","family":"Shafait","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,6,18]]},"reference":[{"key":"542_CR1","unstructured":"Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA (2017)"},{"key":"542_CR2","unstructured":"Bengio, Y., et al.: A neural probabilistic language model. In: Advances in Neural Information Processing Systems 13, Papers from Neural Information Processing Systems (NIPS) 2000, Denver, CO, USA (2000)"},{"key":"542_CR3","unstructured":"OpenAI: ChatGPT. https:\/\/www.openai.com\/chatgpt. Accessed 08 Oct 2024 (2023)"},{"key":"542_CR4","unstructured":"Touvron, H., et al.: LLaMA: Open and Efficient Foundation Language Models. 
CoRR (2023)"},{"key":"542_CR5","doi-asserted-by":"crossref","unstructured":"Xu, Y., et al.: Layoutlm: pre-training of text and layout for document image understanding. In: KDD \u201920: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020 (2020)","DOI":"10.1145\/3394486.3403172"},{"key":"542_CR6","doi-asserted-by":"crossref","unstructured":"Xu, Y., et al.: LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL\/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021 (2021)","DOI":"10.18653\/v1\/2021.acl-long.201"},{"key":"542_CR7","doi-asserted-by":"crossref","unstructured":"Huang, Y., et al.: LayoutLMv3: Pre-training for document ai with unified text and image masking. In: MM \u201922: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10 - 14, 2022 (2022)","DOI":"10.1145\/3503161.3548112"},{"key":"542_CR8","doi-asserted-by":"crossref","unstructured":"Wang, J., et al.: LiLT: a simple yet effective language-independent layout transformer for structured document understanding. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022 (2022)","DOI":"10.18653\/v1\/2022.acl-long.534"},{"key":"542_CR9","unstructured":"Kim, G., et al.: Donut: Document Understanding Transformer without OCR. CoRR (2021)"},{"key":"542_CR10","doi-asserted-by":"crossref","unstructured":"Li, J., et al.: DiT: self-supervised pre-training for document image transformer. 
In: MM \u201922: The 30th ACM International Conference on Multimedia, Lisboa, Portugal, October 10 - 14, 2022 (2022)","DOI":"10.1145\/3503161.3547911"},{"key":"542_CR11","doi-asserted-by":"crossref","unstructured":"Davis, B.L., et al.: End-to-end document recognition and understanding with dessurt. In: Computer Vision - ECCV 2022 Workshops - Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part IV. Lecture Notes in Computer Science (2022)","DOI":"10.1007\/978-3-031-25069-9_19"},{"key":"542_CR12","unstructured":"Hinton, G.E., et al.: Distilling the knowledge in a neural network. CoRR (2015)"},{"key":"542_CR13","unstructured":"Xu, X., et al.: A survey on knowledge distillation of large language models. CoRR (2024)"},{"key":"542_CR14","doi-asserted-by":"crossref","unstructured":"Jiao, X., et al.: TinyBERT: Distilling BERT for natural language understanding. In: Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020. Findings of ACL (2020)","DOI":"10.18653\/v1\/2020.findings-emnlp.372"},{"key":"542_CR15","unstructured":"Devlin, J., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers) (2019)"},{"key":"542_CR16","unstructured":"Liu, Y., et al.: RoBERTa: a robustly optimized bert pretraining approach. CoRR (2019)"},{"key":"542_CR17","unstructured":"Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. (2020)"},{"key":"542_CR18","doi-asserted-by":"crossref","unstructured":"Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. 
In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020 (2020)","DOI":"10.18653\/v1\/2020.acl-main.703"},{"key":"542_CR19","unstructured":"Brown, T.B., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Virtual (2020)"},{"key":"542_CR20","unstructured":"Zhang, S., et al.: Instruction tuning for large language models: a survey. CoRR (2023)"},{"key":"542_CR21","unstructured":"Wang, W., et al.: Layout and task aware instruction prompt for zero-shot document image question answering. CoRR (2023)"},{"key":"542_CR22","doi-asserted-by":"crossref","unstructured":"Lamott, M., et al.: LAPDoc: layout-aware prompting for documents. In: Document Analysis and Recognition - ICDAR 2024, Cham (2024)","DOI":"10.1007\/978-3-031-70546-5_9"},{"key":"542_CR23","unstructured":"Niyogi, D., Srihari, S.N.: A rule-based system for document understanding. In: Kehler, T., Rosenschein, S.J. (eds.) Proceedings of the 5th National Conference on Artificial Intelligence. Philadelphia, PA, USA, August 11-15, 1986. Volume 2: Engineering (1986)"},{"key":"542_CR24","doi-asserted-by":"publisher","DOI":"10.1007\/S10032-002-0080-X","author":"M Aiello","year":"2002","unstructured":"Aiello, M., Monz, C., Todoran, L.: Document understanding for a broad class of documents. Int. J. Doc. Anal. Recognit. (2002). https:\/\/doi.org\/10.1007\/S10032-002-0080-X","journal-title":"Int. J. Doc. Anal. Recognit."},{"key":"542_CR25","doi-asserted-by":"publisher","unstructured":"Dengel, A.R.: Making documents work: challenges for document understanding. In: Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings. (2003). 
https:\/\/doi.org\/10.1109\/ICDAR.2003.SPS1","DOI":"10.1109\/ICDAR.2003.SPS1"},{"key":"542_CR26","doi-asserted-by":"crossref","unstructured":"Shehzad, K., et al.: Named entity recognition in semi structured documents using neural tensor networks. In: Document Analysis Systems (2020)","DOI":"10.1007\/978-3-030-57058-3_28"},{"key":"542_CR27","doi-asserted-by":"crossref","unstructured":"Appalaraju, S., et al.: DocFormer: End-to-End transformer for document understanding. In: 2021 IEEE\/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021 (2021)","DOI":"10.1109\/ICCV48922.2021.00103"},{"key":"542_CR28","unstructured":"Han, S., et al.: Deep compression: compressing deep neural network with pruning, trained quantization and huffman coding. In: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings (2016)"},{"key":"542_CR29","unstructured":"Ashkboos, S., et al.: SliceGPT: compress large language models by deleting rows and columns. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 (2024)"},{"key":"542_CR30","unstructured":"Ma, X., et al.: LLM-Pruner: on the structural pruning of large language models. In: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 (2023)"},{"key":"542_CR31","doi-asserted-by":"crossref","unstructured":"Jacob, B., et al.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018 (2018)","DOI":"10.1109\/CVPR.2018.00286"},{"key":"542_CR32","unstructured":"Dettmers, T., et al.: QLoRA: efficient finetuning of quantized LLMs. 
In: Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 (2023)"},{"key":"542_CR33","unstructured":"Hu, E.J., et al.: LoRA: low-rank adaptation of large language models. In: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 (2022)"},{"key":"542_CR34","unstructured":"Romero, A., et al.: Fitnets: hints for thin deep nets. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)"},{"key":"542_CR35","unstructured":"Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings (2017)"},{"key":"542_CR36","doi-asserted-by":"crossref","unstructured":"Sun, S., et al.: Patient knowledge distillation for BERT model compression. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019 (2019)","DOI":"10.18653\/v1\/D19-1441"},{"key":"542_CR37","unstructured":"Sanh, V., et al.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR (2019)"},{"key":"542_CR38","doi-asserted-by":"crossref","unstructured":"Van\u00a0Landeghem, J., et al.: DistilDoc: knowledge distillation forvisually-rich document applications. 
In: Document Analysis and Recognition - ICDAR 2024, Cham (2024)","DOI":"10.1007\/978-3-031-70546-5_12"},{"key":"542_CR39","doi-asserted-by":"publisher","unstructured":"Lamott, M., et al.: Leveraging distillation techniques for document understanding: a case study with FLAN-T5. In: 54. Jahrestagung der Gesellschaft F\u00fcr Informatik, INFORMATIK 2024 - Lock in or Log Out? Wie Digitale Souver\u00e4nit\u00e4t Gelingt, Wiesbaden, Germany, September 24-26, 2024. LNI (2024). https:\/\/doi.org\/10.18420\/inf2024_120","DOI":"10.18420\/inf2024_120"},{"key":"542_CR40","unstructured":"Chung, H.W., et al.: Scaling instruction-finetuned language models. J. Mach. Learn. Res. (2024)"},{"key":"542_CR41","doi-asserted-by":"crossref","unstructured":"Ding, Y., et al.: 3MVRD: multimodal multi-task multi-teacher visually-rich form document understanding. In: Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and Virtual Meeting, August 11-16, 2024 (2024)","DOI":"10.18653\/v1\/2024.findings-acl.903"},{"key":"542_CR42","doi-asserted-by":"crossref","unstructured":"Alberti, C., et al.: Synthetic QA corpora generation with roundtrip consistency. In: Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers (2019)","DOI":"10.18653\/v1\/P19-1620"},{"key":"542_CR43","doi-asserted-by":"crossref","unstructured":"Puri, R., et al.: Training question answering models from synthetic data. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020 (2020)","DOI":"10.18653\/v1\/2020.emnlp-main.468"},{"key":"542_CR44","doi-asserted-by":"crossref","unstructured":"Bartolo, M., et al.: Improving question answering model robustness with synthetic adversarial data generation. 
In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event \/ Punta Cana, Dominican Republic, 7-11 November, 2021 (2021)","DOI":"10.18653\/v1\/2021.emnlp-main.696"},{"key":"542_CR45","doi-asserted-by":"crossref","unstructured":"Luo, H., et al.: Cooperative self-training of machine reading comprehension. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022 (2022)","DOI":"10.18653\/v1\/2022.naacl-main.18"},{"key":"542_CR46","doi-asserted-by":"crossref","unstructured":"Kullback, S., et al.: On information and sufficiency. The annals of mathematical statistics (1951)","DOI":"10.1214\/aoms\/1177729694"},{"key":"542_CR47","doi-asserted-by":"crossref","unstructured":"Borchmann, L., et al.: DUE: End-to-end document understanding benchmark. In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, Virtual (2021)","DOI":"10.1016\/j.tbench.2021.100012"},{"key":"542_CR48","doi-asserted-by":"crossref","unstructured":"Mathew, M., et al.: DocVQA: A dataset for vqa on document images. In: IEEE Winter Conference on Applications of Computer Vision, WACV 2021, Waikoloa, HI, USA, January 3-8, 2021 (2021)","DOI":"10.1109\/WACV48630.2021.00225"},{"key":"542_CR49","doi-asserted-by":"crossref","unstructured":"Mathew, M., et al.: InfographicVQA. In: IEEE\/CVF winter conference on applications of computer vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022 (2022)","DOI":"10.1109\/WACV51458.2022.00264"},{"key":"542_CR50","doi-asserted-by":"crossref","unstructured":"Pasupat, P., et al.: Compositional semantic parsing on semi-structured tables. 
In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers (2015)","DOI":"10.3115\/v1\/P15-1142"},{"key":"542_CR51","doi-asserted-by":"crossref","unstructured":"Huang, Z., et al.: ICDAR2019 competition on scanned receipt OCR and information extraction. In: 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019 (2019)","DOI":"10.1109\/ICDAR.2019.00244"},{"key":"542_CR52","doi-asserted-by":"crossref","unstructured":"Jaume, G., et al.: FUNSD: A dataset for form understanding in noisy scanned documents. In: 2nd International Workshop on Open Services and Tools for Document Analysis, OST@ICDAR 2019, Sydney, Australia, September 22-25, 2019 (2019)","DOI":"10.1109\/ICDARW.2019.10029"}],"container-title":["International Journal on Document Analysis and Recognition 
(IJDAR)"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10032-025-00542-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s10032-025-00542-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s10032-025-00542-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,20]],"date-time":"2025-09-20T08:38:53Z","timestamp":1758357533000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s10032-025-00542-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,6,18]]},"references-count":52,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,9]]}},"alternative-id":["542"],"URL":"https:\/\/doi.org\/10.1007\/s10032-025-00542-w","relation":{},"ISSN":["1433-2833","1433-2825"],"issn-type":[{"type":"print","value":"1433-2833"},{"type":"electronic","value":"1433-2825"}],"subject":[],"published":{"date-parts":[[2025,6,18]]},"assertion":[{"value":"15 November 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 May 2025","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 June 2025","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"18 June 2025","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declaration"}},{"value":"The authors declare no competing 
interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}]}}