{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T04:25:33Z","timestamp":1772166333855,"version":"3.50.1"},"reference-count":32,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,3,20]],"date-time":"2025-03-20T00:00:00Z","timestamp":1742428800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,3,20]],"date-time":"2025-03-20T00:00:00Z","timestamp":1742428800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"crossref","award":["GM144308"],"award-info":[{"award-number":["GM144308"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"crossref","award":["GM144308"],"award-info":[{"award-number":["GM144308"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Chan Zuckerberg Initiative DAF","award":["2022-250218"],"award-info":[{"award-number":["2022-250218"]}]},{"name":"Chan Zuckerberg Initiative DAF","award":["2022-250218"],"award-info":[{"award-number":["2022-250218"]}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BioData Mining"],"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:sec>\n                    <jats:title>Background<\/jats:title>\n                    <jats:p>Tables are useful information artifacts that allow easy detection of missing data and have been deployed by several publishers to improve the amount of information present for key resources and reagents such as antibodies, cell lines, and other tools that constitute the inputs to a study. 
STAR*Methods key resource tables have increased the \u201cfindability\u201d of these key resources, improving transparency of the paper by warning authors (before publication) about any problems, such as key resources that cannot be uniquely identified or those that are known to be problematic, but they have not been commonly available outside of the Cell Press journal family. We believe that processing preprints and adding these \u2019resource table candidates\u2019 automatically will improve the availability of structured and linked information about research resources in a broader swath of the scientific literature. However, if the authors have already added a key resource table, that table must be detected, and each entity must be correctly identified and faithfully restructured into a standard format.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Methods<\/jats:title>\n                    <jats:p>We introduce four end-to-end table extraction pipelines to extract and faithfully reconstruct key resource tables from biomedical papers in PDF format. The pipelines employ machine learning approaches for key resource table page identification, \u201cTable Transformer\u201d models for table detection, and table structure recognition. We also introduce a character-level generative pre-trained transformer (GPT) language model for scientific tables pre-trained on over 11 million scientific tables. We fine-tuned our table-specific language model with synthetic training data generated with a novel approach to alleviate row over-segmentation significantly improving key resource extraction performance.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Results<\/jats:title>\n                    <jats:p>The extraction of key resource tables in PDF files by the popular GROBID tool resulted in a Grid Table Similarity (GriTS) score of 0.12. 
All of our pipelines have outperformed GROBID by a large margin. Our best pipeline with table-specific language model-based row merger achieved a GriTS score of 0.90.<\/jats:p>\n                  <\/jats:sec>\n                  <jats:sec>\n                    <jats:title>Conclusions<\/jats:title>\n                    <jats:p>Our pipelines allow the detection and extraction of key resources from tables with much higher accuracy, enabling the deployment of automated research resource extraction tools on BioRxiv to help authors correct unidentifiable key resources detected in their articles and improve the reproducibility of their findings. The code, table-specific language model, annotated training and evaluation data are publicly available.<\/jats:p>\n                  <\/jats:sec>","DOI":"10.1186\/s13040-025-00438-9","type":"journal-article","created":{"date-parts":[[2025,3,20]],"date-time":"2025-03-20T08:00:00Z","timestamp":1742457600000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Automatic detection and extraction of key resources from tables in biomedical papers"],"prefix":"10.1186","volume":"18","author":[{"given":"Ibrahim Burak","family":"Ozyurt","sequence":"first","affiliation":[]},{"given":"Anita","family":"Bandrowski","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,3,20]]},"reference":[{"issue":"5","key":"438_CR1","doi-asserted-by":"publisher","first-page":"1059","DOI":"10.1016\/j.cell.2016.08.021","volume":"166","author":"E Marcus","year":"2016","unstructured":"Marcus E. A STAR Is Born. Cell. 2016;166(5):1059\u201360. https:\/\/doi.org\/10.1016\/j.cell.2016.08.021.","journal-title":"Cell."},{"key":"438_CR2","doi-asserted-by":"publisher","unstructured":"Piekniewska A, Anderson N, Roelandse M, Lloyd KCK, Korf I, Voss SR, et\u00a0al. Do organisms need an impact factor? 
Citations of key biological resources including model organisms reveal usage patterns and impact. bioRxiv. 2024. https:\/\/doi.org\/10.1101\/2024.01.15.575636.","DOI":"10.1101\/2024.01.15.575636"},{"key":"438_CR3","doi-asserted-by":"publisher","unstructured":"Monya Baker. Reproducibility crisis: Blame it on the antibodies. Nature. 2015;521. https:\/\/doi.org\/10.1038\/521274a.","DOI":"10.1038\/521274a"},{"key":"438_CR4","doi-asserted-by":"publisher","unstructured":"Menke J, Roelandse M, Ozyurt B, Martone M, Bandrowski A. The Rigor and Transparency Index Quality Metric for Assessing Biological and Medical Science Methods. iScience. 2020;23(11):101698. https:\/\/doi.org\/10.1016\/j.isci.2020.101698.","DOI":"10.1016\/j.isci.2020.101698"},{"key":"438_CR5","doi-asserted-by":"publisher","unstructured":"LP Freedman R G Venugopalan. Reproducibility2020: Progress and priorities. F1000Res. 2017;6:604. https:\/\/doi.org\/10.12688\/f1000research.11334.1.","DOI":"10.12688\/f1000research.11334.1"},{"issue":"3","key":"438_CR6","doi-asserted-by":"publisher","first-page":"434","DOI":"10.1016\/j.neuron.2016.04.030","volume":"90","author":"A Bandrowski","year":"2016","unstructured":"Bandrowski A, Martone M. RRIDs: A Simple Step toward Improving Reproducibility through Rigor and Transparency of Experimental Methods. Neuron. 2016;90(3):434\u20136. https:\/\/doi.org\/10.1016\/j.neuron.2016.04.030.","journal-title":"Neuron."},{"key":"438_CR7","doi-asserted-by":"publisher","unstructured":"Bhandari\u00a0Neupane J, Neupane RP, Luo Y, Yoshida WY, Sun R, Williams PG. Characterization of Leptazolines A\u2013D, Polar Oxazolines from the Cyanobacterium Leptolyngbya sp., Reveals a Glitch with the \u201cWilloughby\u2013Hoye\u201d Scripts for Calculating NMR Chemical Shifts. Org Lett. 2019;21(20):8449\u20138453. PMID: 31591889. 
https:\/\/doi.org\/10.1021\/acs.orglett.9b03216.","DOI":"10.1021\/acs.orglett.9b03216"},{"key":"438_CR8","volume-title":"Tabular abstraction, editing, and formatting","author":"X Wang","year":"1996","unstructured":"Wang X. Tabular abstraction, editing, and formatting. Waterloo: University of Waterloo; 1996."},{"key":"438_CR9","doi-asserted-by":"crossref","unstructured":"Gatterbauer W, Bohunsky P, Herzog M, Kr\u00fcpl B, Pollak B. Towards domain-independent information extraction from web tables. In: Proceedings of the 16th international conference on World Wide Web.\u00a0New York: Association for Computing Machinery; 2007. p. 71\u201380.","DOI":"10.1145\/1242572.1242583"},{"key":"438_CR10","doi-asserted-by":"publisher","unstructured":"Oro E, Ruffolo M. PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents. In: 2009 10th International Conference on Document Analysis and Recognition. 2009. pp. 906\u2013910. https:\/\/doi.org\/10.1109\/ICDAR.2009.12.","DOI":"10.1109\/ICDAR.2009.12"},{"key":"438_CR11","doi-asserted-by":"crossref","unstructured":"Prasad D, Gadpal A, Kapadni K, Visave M, Sultanpure KA. CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents. In: 2020 IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).\u00a0New York: Institute of Electrical and Electronics Engineers; 2020. p. 2439\u20132447.","DOI":"10.1109\/CVPRW50498.2020.00294"},{"key":"438_CR12","doi-asserted-by":"publisher","unstructured":"Schreiber S, Agne S, Wolf I, Dengel A, Ahmed S. DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol.\u00a001. 2017. pp. 1162\u20131167. https:\/\/doi.org\/10.1109\/ICDAR.2017.192.","DOI":"10.1109\/ICDAR.2017.192"},{"key":"438_CR13","doi-asserted-by":"crossref","unstructured":"Zhong X, ShafieiBavani E, Yepes AJ. 
Image-based table recognition: data, model, and evaluation. 2020. arXiv:1911.10683.","DOI":"10.1007\/978-3-030-58589-1_34"},{"key":"438_CR14","doi-asserted-by":"crossref","unstructured":"Smock B, Pesala R, Abraham R. PubTables-1M: Towards comprehensive table extraction from unstructured documents. In: Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition (CVPR).\u00a0New York: Institute of Electrical and Electronics Engineers; 2022. p. 4634\u20134642.","DOI":"10.1109\/CVPR52688.2022.00459"},{"issue":"6","key":"438_CR15","doi-asserted-by":"publisher","first-page":"1624","DOI":"10.1093\/bioinformatics\/btab843","volume":"38","author":"T Adams","year":"2021","unstructured":"Adams T, Namysl M, Kodamullil AT, Behnke S, Jacobs M. Benchmarking table recognition performance on biomedical literature on neurological disorders. Bioinformatics. 2021;38(6):1624\u201330. https:\/\/doi.org\/10.1093\/bioinformatics\/btab843.","journal-title":"Bioinformatics."},{"key":"438_CR16","doi-asserted-by":"publisher","unstructured":"Jimeno\u00a0Yepes A, Verspoor K. Literature mining of genetic variants for curation: quantifying the importance of supplementary material. Database. 2014;2014. https:\/\/doi.org\/10.1093\/database\/bau003.","DOI":"10.1093\/database\/bau003"},{"issue":"1","key":"438_CR17","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1007\/s10032-019-00317-0","volume":"22","author":"N Milosevic","year":"2019","unstructured":"Milosevic N, Gregson C, Hernandez R, Nenadic G. A framework for information extraction from tables in biomedical literature. Int J Doc Anal Recognit. 2019;22(1):55\u201378. https:\/\/doi.org\/10.1007\/s10032-019-00317-0.","journal-title":"Int J Doc Anal Recognit."},{"key":"438_CR18","doi-asserted-by":"publisher","first-page":"128","DOI":"10.1016\/j.jbi.2014.10.002","volume":"53","author":"R Xu","year":"2015","unstructured":"Xu R, Wang Q. 
Combining automatic table classification and relationship extraction in extracting anticancer drug\u2013side effect pairs from full-text articles. J Biomed Inform. 2015;53:128\u201335. https:\/\/doi.org\/10.1016\/j.jbi.2014.10.002.","journal-title":"J Biomed Inform."},{"key":"438_CR19","unstructured":"GROBID. GitHub. 2008\u20132024.\u00a0Retrieved June 2024.\u00a0https:\/\/github.com\/kermitt2\/grobid."},{"key":"438_CR20","doi-asserted-by":"publisher","unstructured":"Krithara A, Mork JG, Nentidis A, Paliouras G. The road from manual to automatic semantic indexing of biomedical literature: a 10 years journey. Front Res Metr Anal. 2023;8:1250930. https:\/\/doi.org\/10.3389\/frma.2023.1250930.","DOI":"10.3389\/frma.2023.1250930"},{"key":"438_CR21","doi-asserted-by":"publisher","first-page":"241","DOI":"10.1016\/S0893-6080(05)80023-1","volume":"5","author":"D Wolpert","year":"1992","unstructured":"Wolpert D. Stacked generalization. Neural Netw. 1992;5:241\u201359. https:\/\/doi.org\/10.1016\/S0893-6080(05)80023-1.","journal-title":"Neural Netw."},{"key":"438_CR22","doi-asserted-by":"crossref","unstructured":"Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Comput. 1997;9(8):1735\u201380.","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"438_CR23","doi-asserted-by":"crossref","unstructured":"Pennington J, Socher R, Manning CD. Glove: Global vectors for word representation. In: EMNLP, vol.\u00a014.\u00a0Association for Computing Machinery:\u00a0New York; 2014. p. 1532\u20131543.","DOI":"10.3115\/v1\/D14-1162"},{"issue":"3","key":"438_CR24","doi-asserted-by":"publisher","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","volume":"48","author":"SB Needleman","year":"1970","unstructured":"Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 
1970;48(3):443\u201353.","journal-title":"J Mol Biol."},{"key":"438_CR25","doi-asserted-by":"publisher","unstructured":"Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. End-to-End Object Detection with Transformers. In: Computer Vision \u2013 ECCV 2020: 16th European Conference, Glasgow, UK, August 23\u201328, 2020, Proceedings, Part I. Berlin, Heidelberg: Springer-Verlag; 2020. pp. 213\u2013229. https:\/\/doi.org\/10.1007\/978-3-030-58452-8_13.","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"438_CR26","unstructured":"Smith R. Tesseract OCR Engine. 2007.\u00a0Retrieved September 2024.\u00a0https:\/\/web.archive.org\/web\/20160819190257\/tesseract-ocr.googlecode.com\/files\/TesseractOSCON.pdf."},{"key":"438_CR27","unstructured":"Jelinek F, Mercer RL. Interpolated estimation of Markov source parameters from sparse data. In: Proceedings of the Workshop on Pattern Recognition in Practice. Amsterdam: North-Holland; 1980. pp. 381\u2013397."},{"key":"438_CR28","unstructured":"Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et\u00a0al. Attention is all you need. In: Advances in neural information processing systems. 2017. pp. 5998\u20136008. arXiv:1706.03762."},{"key":"438_CR29","unstructured":"Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. 2019. Retrieved September 2024.\u00a0https:\/\/api.semanticscholar.org\/CorpusID:160025533."},{"key":"438_CR30","doi-asserted-by":"publisher","unstructured":"Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Burstein J, Doran C, Solorio T, editors. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis: Association for Computational Linguistics. 2019. pp. 
4171\u20134186. https:\/\/doi.org\/10.18653\/v1\/N19-1423.","DOI":"10.18653\/v1\/N19-1423"},{"key":"438_CR31","doi-asserted-by":"publisher","unstructured":"Smock B, Pesala R, Abraham R. GriTS: Grid Table Similarity Metric for Table Structure Recognition. In: Document Analysis and Recognition - ICDAR 2023. 2023. pp. 535\u2013549. https:\/\/doi.org\/10.1007\/978-3-031-41734-4_33.","DOI":"10.1007\/978-3-031-41734-4_33"},{"key":"438_CR32","doi-asserted-by":"crossref","unstructured":"Guan S, Greene D. Advancing post-OCR correction: a comparative study of synthetic data. 2024. arXiv:2408.02253.","DOI":"10.18653\/v1\/2024.findings-acl.361"}],"container-title":["BioData Mining"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13040-025-00438-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13040-025-00438-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13040-025-00438-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,20]],"date-time":"2025-03-20T08:00:10Z","timestamp":1742457610000},"score":1,"resource":{"primary":{"URL":"https:\/\/biodatamining.biomedcentral.com\/articles\/10.1186\/s13040-025-00438-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,3,20]]},"references-count":32,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["438"],"URL":"https:\/\/doi.org\/10.1186\/s13040-025-00438-9","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2024.10.15.618379","asserted-by":"object"}]},"ISSN":["1756-0381"],"issn-type":[{"value":"1756-0381","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,3,20]]},"assertion":[{"valu
e":"21 November 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"5 March 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"20 March 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"23"}}