{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,6,2]],"date-time":"2026-06-02T16:42:33Z","timestamp":1780418553245,"version":"3.54.1"},"reference-count":39,"publisher":"MIT Press - Journals","license":[{"start":{"date-parts":[[2022,4,7]],"date-time":"2022-04-07T00:00:00Z","timestamp":1649289600000},"content-version":"vor","delay-in-days":96,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2022,4,6]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Accurately extracting structured content from PDFs is a critical first step for NLP over scientific papers. Recent work has improved extraction accuracy by incorporating elementary layout information, for example, each token\u2019s 2D position on the page, into language model pretraining. We introduce new methods that explicitly model VIsual LAyout (VILA) groups, that is, text lines or text blocks, to further improve performance. In our I-VILA approach, we show that simply inserting special tokens denoting layout group boundaries into model inputs can lead to a 1.9% Macro F1 improvement in token classification. In the H-VILA approach, we show that hierarchical encoding of layout-groups can result in up to 47% inference time reduction with less than 0.8% Macro F1 loss. Unlike prior layout-aware approaches, our methods do not require expensive additional pretraining, only fine-tuning, which we show can reduce training cost by up to 95%. Experiments are conducted on a newly curated evaluation suite, S2-VLUE, that unifies existing automatically labeled datasets and includes a new dataset of manual annotations covering diverse papers from 19 scientific disciplines. Pre-trained weights, benchmark datasets, and source code are available at https:\/\/github.com\/allenai\/VILA.<\/jats:p>","DOI":"10.1162\/tacl_a_00466","type":"journal-article","created":{"date-parts":[[2022,4,7]],"date-time":"2022-04-07T14:23:06Z","timestamp":1649341386000},"page":"376-392","update-policy":"https:\/\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":24,"title":["VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups"],"prefix":"10.1162","volume":"10","author":[{"given":"Zejiang","family":"Shen","sequence":"first","affiliation":[{"name":"Allen Institute for AI, USA. shannons@allenai.org"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Kyle","family":"Lo","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, USA. kylel@allenai.org"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Lucy Lu","family":"Wang","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, USA. lucyw@allenai.org"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Bailey","family":"Kuehl","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, USA. baileyk@allenai.org"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Daniel S.","family":"Weld","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, USA"},{"name":"University of Washington, USA. danw@allenai.org"}],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Doug","family":"Downey","sequence":"additional","affiliation":[{"name":"Allen Institute for AI, USA"},{"name":"Northwestern University, USA. dougd@allenai.org"}],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"281","published-online":{"date-parts":[[2022,4,6]]},"reference":[{"key":"2022040714222640000_bib1","article-title":"Grobid","year":"2008\u20132021"},{"key":"2022040714222640000_bib2","article-title":"ICDAR2021 competition on mathematical formula detection","year":"2021"},{"key":"2022040714222640000_bib3","doi-asserted-by":"publisher","first-page":"84","DOI":"10.18653\/v1\/N18-3011","article-title":"Construction of the literature graph in semantic scholar","volume-title":"Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)","author":"Ammar","year":"2018"},{"key":"2022040714222640000_bib4","first-page":"12526","article-title":"Segatron: Segment-aware transformer for language modeling and understanding","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"He","year":"2021"},{"key":"2022040714222640000_bib5","doi-asserted-by":"publisher","first-page":"3615","DOI":"10.18653\/v1\/D19-1371","article-title":"SciBERT: A pretrained language model for scientific text","volume-title":"Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)","author":"Iz","year":"2019"},{"issue":"1","key":"2022040714222640000_bib6","doi-asserted-by":"publisher","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Machine Learning"},{"key":"2022040714222640000_bib7","first-page":"4171","article-title":"BERT: Pre-training of deep bidirectional transformers for language understanding","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)","author":"Devlin","year":"2019"},{"key":"2022040714222640000_bib8","first-page":"2961","article-title":"Mask R-CNN","volume-title":"Proceedings of the IEEE international conference on computer vision","author":"He","year":"2017"},{"issue":"8","key":"2022040714222640000_bib9","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Computation"},{"key":"2022040714222640000_bib10","article-title":"Adam: A method for stochastic optimization","volume-title":"3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings","author":"Kingma","year":"2015"},{"key":"2022040714222640000_bib11","first-page":"282","article-title":"Conditional random fields: Probabilistic models for segmenting and labeling sequence data","volume-title":"Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 \u2013 July 1, 2001","author":"Lafferty","year":"2001"},{"key":"2022040714222640000_bib12","doi-asserted-by":"crossref","first-page":"1551","DOI":"10.18653\/v1\/2020.emnlp-main.120","article-title":"SLM: Learning a discourse language representation with sentence unshuffling","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Lee","year":"2020"},{"key":"2022040714222640000_bib13","first-page":"949","article-title":"DocBank: A benchmark dataset for document layout analysis","volume-title":"Proceedings of the 28th International Conference on Computational Linguistics, COLING","author":"Li","year":"2020"},{"key":"2022040714222640000_bib14","first-page":"5652","article-title":"SelfDoc: Self-supervised document representation learning","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Li","year":"2021"},{"key":"2022040714222640000_bib15","article-title":"RoBERTa: A robustly optimized BERT pretraining approach","volume":"cs.CL\/1907.11692v1","author":"Liu","year":"2019","journal-title":"CoRR"},{"key":"2022040714222640000_bib16","first-page":"15137","article-title":"Robust PDF document conversion using recurrent neural networks","volume-title":"Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021","author":"Livathinos","year":"2021"},{"key":"2022040714222640000_bib17","doi-asserted-by":"crossref","first-page":"4969","DOI":"10.18653\/v1\/2020.acl-main.447","article-title":"S2ORC: The semantic scholar open research corpus","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Lo","year":"2020"},{"key":"2022040714222640000_bib18","article-title":"Decoupled weight decay regularization","volume-title":"7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6\u20139, 2019","author":"Loshchilov","year":"2019"},{"key":"2022040714222640000_bib19","article-title":"Mixed precision training","volume-title":"6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings","author":"Micikevicius","year":"2018"},{"key":"2022040714222640000_bib20","doi-asserted-by":"publisher","first-page":"258","DOI":"10.18653\/v1\/2021.acl-demo.31","article-title":"PAWLS: PDF annotation with labels and structure","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations","author":"Neumann","year":"2021"},{"key":"2022040714222640000_bib21","article-title":"PyTorch: An imperative style, high-performance deep learning library","volume-title":"Advances in Neural Information Processing Systems","author":"Paszke","year":"2019"},{"key":"2022040714222640000_bib22","first-page":"91","article-title":"Faster R-CNN: Towards real-time object detection with region proposal networks","volume":"28","author":"Ren","year":"2015","journal-title":"Advances in Neural Information Processing Systems"},{"key":"2022040714222640000_bib23","doi-asserted-by":"publisher","first-page":"110","DOI":"10.18653\/v1\/2020.nlposs-1.15","article-title":"PySBD: Pragmatic sentence boundary disambiguation","volume-title":"Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)","author":"Sadvilkar","year":"2020"},{"key":"2022040714222640000_bib24","article-title":"Distilbert, a distilled version of BERT: Smaller, faster, cheaper and lighter","author":"Sanh","year":"2019","journal-title":"CoRR"},{"issue":"12","key":"2022040714222640000_bib25","doi-asserted-by":"publisher","first-page":"54","DOI":"10.1145\/3381831","article-title":"Green AI","volume":"63","author":"Schwartz","year":"2020","journal-title":"Communications of the ACM"},{"key":"2022040714222640000_bib26","doi-asserted-by":"publisher","first-page":"131","DOI":"10.1007\/978-3-030-86549-8_9","article-title":"LayoutParser: A unified toolkit for deep learning based document image analysis","volume-title":"Document Analysis and Recognition \u2013 ICDAR 2021","author":"Shen","year":"2021"},{"key":"2022040714222640000_bib27","doi-asserted-by":"publisher","first-page":"223","DOI":"10.1145\/3197026.3197040","article-title":"Extracting scientific figures with distantly supervised neural networks","volume-title":"Proceedings of the 18th ACM\/IEEE on joint conference on digital libraries","author":"Siegel","year":"2018"},{"key":"2022040714222640000_bib28","doi-asserted-by":"publisher","first-page":"774","DOI":"10.1145\/3219819.3219834","article-title":"Corpus conversion service: A machine learning platform to ingest documents at scale","volume-title":"Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining","author":"Staar","year":"2018"},{"issue":"11\/12","key":"2022040714222640000_bib29","doi-asserted-by":"publisher","DOI":"10.1045\/november14-tkaczyk","article-title":"GROTOAP2 \u2014 the methodology of creating a large ground truth dataset of scientific articles","volume":"20","author":"Tkaczyk","year":"2014","journal-title":"D-Lib Magazine"},{"issue":"4","key":"2022040714222640000_bib30","doi-asserted-by":"publisher","first-page":"317","DOI":"10.1007\/s10032-015-0249-8","article-title":"CERMINE: Automatic extraction of structured metadata from scientific literature","volume":"18","author":"Tkaczyk","year":"2015","journal-title":"International Journal on Document Analysis and Recognition (IJDAR)"},{"key":"2022040714222640000_bib31","article-title":"Attention is all you need","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"2022040714222640000_bib32","article-title":"Improving the accessibility of scientific documents: Current state, user needs, and a system solution to enhance scientific PDF accessibility for blind and low vision users","author":"Wang","year":"2021","journal-title":"CoRR"},{"key":"2022040714222640000_bib33","article-title":"CORD-19: The covid-19 open research dataset","author":"Wang","year":"2020","journal-title":"CoRR"},{"key":"2022040714222640000_bib34","doi-asserted-by":"publisher","first-page":"38","DOI":"10.18653\/v1\/2020.emnlp-demos.6","article-title":"Transformers: State-of-the-art natural language processing","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations","author":"Wolf","year":"2020"},{"key":"2022040714222640000_bib35","first-page":"1192","article-title":"Layoutlm: Pre-training of text and layout for document image understanding","volume-title":"KDD \u201920: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23\u201327, 2020","author":"Yiheng","year":"2020"},{"key":"2022040714222640000_bib36","first-page":"2579","article-title":"LayoutLMv2: Multi-modal pre-training for visually-rich document understanding","volume-title":"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL\/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1\u20136, 2021","author":"Yang","year":"2021"},{"key":"2022040714222640000_bib37","doi-asserted-by":"publisher","first-page":"1725","DOI":"10.1145\/3340531.3411908","article-title":"Beyond 512 tokens: Siamese multi-depth transformer-based hierarchical encoder for long-form document matching","volume-title":"Proceedings of the 29th ACM International Conference on Information & Knowledge Management","author":"Yang","year":"2020"},{"key":"2022040714222640000_bib38","doi-asserted-by":"publisher","first-page":"5059","DOI":"10.18653\/v1\/P19-1499","article-title":"HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Zhang","year":"2019"},{"key":"2022040714222640000_bib39","doi-asserted-by":"publisher","first-page":"1015","DOI":"10.1109\/ICDAR.2019.00166","article-title":"PubLayNet: Largest dataset ever for document layout analysis","volume-title":"2019 International Conference on Document Analysis and Recognition (ICDAR)","author":"Zhong","year":"2019"}],"container-title":["Transactions of the Association for Computational Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00466\/2006993\/tacl_a_00466.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00466\/2006993\/tacl_a_00466.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,4,7]],"date-time":"2022-04-07T14:23:34Z","timestamp":1649341414000},"score":1,"resource":{"primary":{"URL":"https:\/\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00466\/110438\/VILA-Improving-Structured-Content-Extraction-from"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022]]},"references-count":39,"URL":"https:\/\/doi.org\/10.1162\/tacl_a_00466","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022]]},"published":{"date-parts":[[2022]]}}}