{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T12:34:19Z","timestamp":1775219659587,"version":"3.50.1"},"reference-count":55,"publisher":"Frontiers Media SA","license":[{"start":{"date-parts":[[2025,9,5]],"date-time":"2025-09-05T00:00:00Z","timestamp":1757030400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["frontiersin.org"],"crossmark-restriction":true},"short-container-title":["Front. Artif. Intell."],"abstract":"<jats:sec><jats:title>Introduction<\/jats:title><jats:p>Automating the extraction of information from Portable Document Format (PDF) documents represents a major advancement in information extraction, with applications in various domains such as healthcare, law, or biochemistry. However, existing solutions face challenges related to accuracy, domain adaptability, and implementation complexity.<\/jats:p><\/jats:sec><jats:sec><jats:title>Methods<\/jats:title><jats:p>A systematic review of the literature was conducted using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology to examine approaches and trends in PDF information extraction and storage approaches.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>The review revealed three dominant methodological categories: rule-based systems, statistical learning models, and neural network-based approaches. Key limitations include the rigidity of rule-based methods, the lack of annotated domain-specific datasets for learning-based approaches, and issues such as hallucinations in large language models.<\/jats:p><\/jats:sec><jats:sec><jats:title>Discussion<\/jats:title><jats:p>To overcome these limitations, a conceptual framework is proposed comprising nine core components: project manager, document manager, document pre-processor, ontology manager, information extractor, annotation engine, question-answering tool, knowledge visualizer, and data exporter. This framework aims to improve the accuracy, adaptability, and usability of PDF information extraction systems.<\/jats:p><\/jats:sec>","DOI":"10.3389\/frai.2025.1466092","type":"journal-article","created":{"date-parts":[[2025,9,5]],"date-time":"2025-09-05T11:05:48Z","timestamp":1757070348000},"update-policy":"https:\/\/doi.org\/10.3389\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["A review on knowledge and information extraction from PDF documents and storage approaches"],"prefix":"10.3389","volume":"8","author":[{"given":"Salvador D.","family":"Atagong","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Henri","family":"Tonnang","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Kennedy","family":"Senagi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mark","family":"Wamalwa","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Komi M.","family":"Agboka","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"John","family":"Odindi","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1965","published-online":{"date-parts":[[2025,9,5]]},"reference":[{"key":"B1","article-title":"PDF articles metadata harvester","author":"Abdillah","year":"2013","journal-title":"arXiv:1301.6591"},{"key":"B2","doi-asserted-by":"publisher","first-page":"102167","DOI":"10.1016\/j.artmed.2021.102167","article-title":"Substituting clinical features using synthetic medical phrases: medical text data augmentation techniques","volume":"120","author":"Abdollahi","year":"2021","journal-title":"Artif. Intell. Med"},{"key":"B3","doi-asserted-by":"publisher","first-page":"10535","DOI":"10.1109\/ACCESS.2023.3240898","article-title":"Systematic literature review of information extraction from textual data: recent methods, applications, trends, and challenges","volume":"11","author":"Abdullah","year":"2023","journal-title":"IEEE Access"},{"key":"B4","doi-asserted-by":"publisher","first-page":"100021","DOI":"10.1016\/j.nlp.2023.100021","article-title":"Ontology-based data interestingness: a state-of-the-art review","volume":"4","author":"Abhilash","year":"2023","journal-title":"Nat. Lang. Proc. J."},{"key":"B5","doi-asserted-by":"publisher","first-page":"103324","DOI":"10.1016\/j.jbi.2019.103324","article-title":"Disease: a biomedical text analytics system for disease symptom extraction and characterization","volume":"100","author":"Abulaish","year":"2019","journal-title":"J. Biomed. Inform"},{"key":"B6","doi-asserted-by":"publisher","first-page":"1180962","DOI":"10.3389\/fphar.2023.1180962","article-title":"Approach to machine learning for extraction of real-world data variables from electronic health records","volume":"14","author":"Adamson","year":"2023","journal-title":"Front. Pharmacol"},{"key":"B7","doi-asserted-by":"publisher","first-page":"129359","DOI":"10.1109\/ACCESS.2020.3009021","article-title":"A systematic approach to map the research articles' sections to imrad","volume":"8","author":"Ahmed","year":"2020","journal-title":"IEEE Access"},{"key":"B8","doi-asserted-by":"publisher","first-page":"e10295","DOI":"10.2196\/preprints.10295","article-title":"Three-dimensional portable document format (3D PDF) in clinical communication and biomedical sciences: systematic review of applications, tools, and protocols","volume":"6","author":"Axel Newe","year":"2018","journal-title":"JMIR Med. Inform"},{"key":"B9","doi-asserted-by":"publisher","first-page":"454","DOI":"10.18653\/v1\/2022.semeval-1.61","article-title":"\u201cYNU-HPCC at SemEval-2022 task 4: finetuning pretrained language models for patronizing and condescending language detection,\u201d","author":"Bai","year":"2022","journal-title":"Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)"},{"key":"B10","doi-asserted-by":"publisher","first-page":"141","DOI":"10.1016\/j.ijmedinf.2019.04.022","article-title":"Natural language processing of German clinical colorectal cancer notes for guideline-based treatment evaluation","volume":"127","author":"Becker","year":"2019","journal-title":"Int. J. Med. Inform"},{"key":"B11","doi-asserted-by":"publisher","first-page":"2215","DOI":"10.1002\/asi.23329","article-title":"Growth rates of modern science: a bibliometric analysis based on the number of publications and cited references","volume":"66","author":"Bornmann","year":"2015","journal-title":"J. Assoc. Inf. Sci. Technol"},{"key":"B12","doi-asserted-by":"crossref","first-page":"227","DOI":"10.1109\/MIPR62202.2024.00042","article-title":"\u201cRetrieval augmented structured generation: Business document information extraction as tool use,\u201d","volume-title":"2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR)","author":"Cesista","year":"2024"},{"key":"B13","doi-asserted-by":"publisher","first-page":"20","DOI":"10.1186\/s12859-021-04534-5","article-title":"Biomedical relation extraction via knowledge-enhanced reading comprehension","volume":"23","author":"Chen","year":"2022","journal-title":"BMC Bioinform"},{"key":"B14","doi-asserted-by":"publisher","first-page":"939","DOI":"10.3390\/electronics12040939","article-title":"A framework for understanding unstructured financial documents using rpa and multimodal approach","volume":"12","author":"Cho","year":"2023","journal-title":"Electronics"},{"key":"B15","doi-asserted-by":"publisher","first-page":"603","DOI":"10.1007\/978-81-322-3972-7_19","article-title":"\u201cNatural language processing,\u201d","author":"Chowdhary","year":"2020","journal-title":"Fundamentals of Artificial Intelligence"},{"key":"B16","first-page":"4171","article-title":"\u201cBERT: pre-training of deep bidirectional transformers for language understanding,\u201d","author":"Devlin","year":"2019","journal-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)"},{"key":"B17","doi-asserted-by":"publisher","first-page":"103439","DOI":"10.1016\/j.compind.2021.103439","article-title":"Transformation from human-readable documents and archives in arc welding domain to machine-interpretable data","volume":"128","author":"Dong","year":"2021","journal-title":"Comput. Ind"},{"key":"B18","doi-asserted-by":"crossref","first-page":"1719","DOI":"10.1109\/ICPR56361.2022.9956590","article-title":"\u201cGraph neural networks and representation embedding for table extraction in pdf documents,\u201d","volume-title":"2022 26th International Conference on Pattern Recognition (ICPR)","author":"Gemelli","year":"2022"},{"key":"B19","doi-asserted-by":"publisher","first-page":"103089","DOI":"10.1109\/ACCESS.2022.3209066","article-title":"Relationship extraction and processing for knowledge graph of welding manufacturing","volume":"10","author":"Guan","year":"2022","journal-title":"IEEE Access"},{"key":"B20","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41524-022-00784-w","article-title":"MatSciBERT: a materials domain language model for text mining and information extraction","volume":"8","author":"Gupta","year":"2022","journal-title":"NPJ Comput. Mater"},{"key":"B21","doi-asserted-by":"publisher","first-page":"2370","DOI":"10.18653\/v1\/2021.findings-emnlp.204","article-title":"\u201cREBEL: relation extraction by end-to-end Language generation,\u201d","author":"Huguet Cabot","year":"2021","journal-title":"Findings of the Association for Computational Linguistics: EMNLP 2021"},{"key":"B22","doi-asserted-by":"publisher","first-page":"674730","DOI":"10.3389\/fvets.2021.674730","article-title":"Large-scale data mining of rapid residue detection assay data from html and pdf documents: improving data access and visualization for veterinarians","volume":"8","author":"Jaberi-Douraki","year":"2021","journal-title":"Front. Veter. Sci"},{"key":"B23","doi-asserted-by":"publisher","first-page":"2353","DOI":"10.1038\/s41598-023-29323-3","article-title":"Information extraction from German radiological reports for general clinical text and language understanding","volume":"13","author":"Jantscher","year":"2023","journal-title":"Sci. Rep"},{"key":"B24","author":"Johnson","year":"2021","journal-title":"Duff Johnson - page 13- PDF Association"},{"key":"B25","doi-asserted-by":"publisher","first-page":"100048","DOI":"10.1016\/j.nlp.2023.100048","article-title":"A survey of GPT-3 family large language models including ChatGPT and GPT-4","volume":"6","author":"Kalyan","year":"2024","journal-title":"Nat. Lang. Proc. J"},{"key":"B26","doi-asserted-by":"publisher","first-page":"17706","DOI":"10.1109\/ACCESS.2024.3522141","article-title":"Computer vision-based framework for data extraction from heterogeneous financial tables: a comprehensive approach to unlocking financial insights","volume":"13","author":"Khandokar","year":"2024","journal-title":"IEEE Access"},{"key":"B27","doi-asserted-by":"publisher","first-page":"102155","DOI":"10.1016\/j.isci.2021.102155","article-title":"Opportunities and challenges of text mining in materials research","volume":"24","author":"Kononova","year":"2021","journal-title":"Iscience"},{"key":"B28","doi-asserted-by":"publisher","first-page":"4381","DOI":"10.1093\/bioinformatics\/btz228","article-title":"Figure and caption extraction from biomedical documents","volume":"35","author":"Li","year":"2019","journal-title":"Bioinformatics"},{"key":"B29","doi-asserted-by":"publisher","DOI":"10.7326\/0003-4819-151-4-200908180-00136","article-title":"The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration","author":"Liberati","year":"2009","journal-title":"Ann. Internal Med"},{"key":"B30","doi-asserted-by":"crossref","first-page":"289","DOI":"10.1007\/978-981-10-5508-9_28","article-title":"\u201cInformation extraction approaches: a survey,\u201d","volume-title":"Information and Communication Technology","author":"Mannai","year":"2018"},{"key":"B31","doi-asserted-by":"publisher","first-page":"e40","DOI":"10.1093\/jamia\/ocw097","article-title":"Congestive heart failure information extraction framework for automated treatment performance measures assessment","volume":"24","author":"Meystre","year":"2017","journal-title":"J. Am. Med. Inform. Assoc"},{"key":"B32","doi-asserted-by":"publisher","first-page":"254","DOI":"10.1016\/j.lisr.2015.02.002","article-title":"The Portable Document Format (PDF) accessibility practice of four journal publishers","volume":"37","author":"Nganji","year":"2015","journal-title":"Libr. Inf. Sci. Res"},{"key":"B33","doi-asserted-by":"publisher","first-page":"e10710","DOI":"10.1016\/j.heliyon.2022.e10710","article-title":"Automating the extraction of information from a historical text and building a linked data model for the domain of ecology and conservation science","volume":"8","author":"Nundloll","year":"2022","journal-title":"Heliyon"},{"key":"B34","article-title":"GPT-4 Technical Report","author":"OpenA","year":"2024","journal-title":"arXiv:2303.08774"},{"key":"B35","doi-asserted-by":"crossref","first-page":"329","DOI":"10.1109\/ICDAR.2019.00060","article-title":"\u201cAttend, copy, parse end-to-end information extraction from documents,\u201d","volume-title":"2019 International Conference on Document Analysis and Recognition (ICDAR)","author":"Palm","year":"2019"},{"key":"B36","doi-asserted-by":"publisher","first-page":"5630","DOI":"10.3390\/app10165630","article-title":"A methodology for open information extraction and representation from large scientific corpora: the cord-19 data exploration use case","volume":"10","author":"Papadopoulos","year":"2020","journal-title":"Appl. Sci"},{"key":"B37","doi-asserted-by":"crossref","first-page":"167","DOI":"10.1109\/ICSC56153.2023.00035","article-title":"\u201cGenealogical relationship extraction from unstructured text using fine-tuned transformer models,\u201d","volume-title":"2023 IEEE 17th International Conference on Semantic Computing (ICSC)","author":"Parrolivelli","year":"2023"},{"key":"B38","doi-asserted-by":"crossref","first-page":"9","DOI":"10.1145\/3587259.3627572","article-title":"\u201cProcedural text mining with large language models,\u201d","volume-title":"Proceedings of the 12th Knowledge Capture Conference 2023, K-CAP '23","author":"Rula","year":"2023"},{"key":"B39","article-title":"PDFTriage: question answering over long, structured documents","author":"Saad-Falcon","year":"2023","journal-title":"arXiv:2309.08872 [cs"},{"key":"B40","doi-asserted-by":"publisher","first-page":"140","DOI":"10.1111\/spsr.12590","article-title":"Processing large-scale archival records: the case of the swiss parliamentary records","volume":"30","author":"Salamanca","year":"2024","journal-title":"Swiss Polit. Sci. Rev"},{"key":"B41","doi-asserted-by":"publisher","first-page":"123038","DOI":"10.1016\/j.eswa.2023.123038","article-title":"Cnosso, a novel method for business document automation based on open information extraction","volume":"245","author":"Scannapieco","year":"2024","journal-title":"Expert Syst. Appl"},{"key":"B42","doi-asserted-by":"publisher","first-page":"273","DOI":"10.1007\/s10844-023-00814-z","article-title":"Oie4pa: open information extraction for the public administration","volume":"62","author":"Siciliani","year":"2024","journal-title":"J. Intell. Inf. Syst"},{"key":"B43","doi-asserted-by":"publisher","first-page":"4634","DOI":"10.1109\/CVPR52688.2022.00459","article-title":"\u201cPubtables-1m: towards comprehensive table extraction from unstructured documents,\u201d","author":"Smock","year":"2022","journal-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition"},{"key":"B44","doi-asserted-by":"publisher","first-page":"106367","DOI":"10.1016\/j.oregeorev.2024.106367","article-title":"Deep learning-based mineral exploration named entity recognition: a case study of granitic pegmatite-type lithium deposits","volume":"175","author":"Tao","year":"2024","journal-title":"Ore Geol. Rev"},{"key":"B45","article-title":"Gemini: a family of highly capable multimodal models","author":"Team","year":"2023","journal-title":"arXiv preprint arXiv:2312.11805"},{"key":"B46","doi-asserted-by":"publisher","first-page":"1941","DOI":"10.3390\/electronics13101941","article-title":"Unstructured document information extraction method with multi-faceted domain knowledge graph assistance for m2m customs risk prevention and screening application","volume":"13","author":"Tian","year":"2024","journal-title":"Electronics"},{"key":"B47","article-title":"LLaMA: open and efficient foundation language models","author":"Touvron","year":"2023","journal-title":"arXiv:2302.13971"},{"key":"B48","doi-asserted-by":"publisher","first-page":"407","DOI":"10.1186\/s12888-022-04058-z","article-title":"Information extraction from free text for aiding transdiagnostic psychiatry: constructing NLP pipelines tailored to clinicians' needs","volume":"22","author":"Turner","year":"2022","journal-title":"BMC Psychiatry"},{"key":"B49","article-title":"\u201cAttention is all you need,\u201d","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani","year":"2017"},{"key":"B50","article-title":"Wukong: a large multimodal model for efficient long pdf reading with end-to-end sparse sampling","author":"Xie","year":"2024","journal-title":"arXiv preprint arXiv:2410.05970"},{"key":"B51","doi-asserted-by":"publisher","first-page":"e2300269","DOI":"10.1200\/CCI.23.00269","article-title":"Extraction and imputation of eastern cooperative oncology group performance status from unstructured oncology notes using language models","volume":"8","author":"Xu","year":"2024","journal-title":"JCO Clin. Cancer Inform"},{"key":"B52","doi-asserted-by":"publisher","first-page":"1161","DOI":"10.1007\/s10115-022-01665-w","article-title":"A survey on extraction of causal relations from natural language text","volume":"64","author":"Yang","year":"2022","journal-title":"Knowl. Inf. Syst"},{"key":"B53","doi-asserted-by":"publisher","first-page":"103276","DOI":"10.1016\/j.jbi.2019.103276","article-title":"Ontology-based clinical information extraction from physician's free-text notes","volume":"98","author":"Yehia","year":"2019","journal-title":"J. Biomed. Inform"},{"key":"B54","doi-asserted-by":"publisher","first-page":"521","DOI":"10.1055\/s-0042-1748144","article-title":"Transforming thyroid cancer diagnosis and staging information from unstructured reports to the observational medical outcome partnership common data model","volume":"13","author":"Yoo","year":"2022","journal-title":"Appl. Clin. Inform"},{"key":"B55","doi-asserted-by":"publisher","first-page":"1633","DOI":"10.1021\/acs.jcim.1c01198","article-title":"Pdfdataextractor: a tool for reading scientific text and interpreting metadata from the typeset literature in the portable document format","volume":"62","author":"Zhu","year":"2022","journal-title":"J. Chem. Inf. Model"}],"container-title":["Frontiers in Artificial Intelligence"],"original-title":[],"link":[{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1466092\/full","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,5]],"date-time":"2025-09-05T11:05:50Z","timestamp":1757070350000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.frontiersin.org\/articles\/10.3389\/frai.2025.1466092\/full"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,9,5]]},"references-count":55,"alternative-id":["10.3389\/frai.2025.1466092"],"URL":"https:\/\/doi.org\/10.3389\/frai.2025.1466092","relation":{},"ISSN":["2624-8212"],"issn-type":[{"value":"2624-8212","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,9,5]]},"article-number":"1466092"}}