{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,7,6]],"date-time":"2026-07-06T14:04:51Z","timestamp":1783346691360,"version":"3.54.6"},"reference-count":43,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,12,1]],"date-time":"2021-12-01T00:00:00Z","timestamp":1638316800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,12,11]],"date-time":"2021-12-11T00:00:00Z","timestamp":1639180800000},"content-version":"vor","delay-in-days":10,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100000923","name":"Australian Research Council","doi-asserted-by":"crossref","award":["LP160101469"],"award-info":[{"award-number":["LP160101469"]}],"id":[{"id":"10.13039\/501100000923","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Cheminform"],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called\n                    <jats:sc>ChemTables<\/jats:sc>\n                    , which consists of 788 chemical patent tables with labels of their content type. We introduce this data set in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT on\n                    <jats:sc>ChemTables<\/jats:sc>\n                    . The best performing model, Table-BERT, achieves a performance of 88.66 micro-averaged\n                    <jats:inline-formula>\n                      <jats:alternatives>\n                        <jats:tex-math>$$F_1$$<\/jats:tex-math>\n                        <mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                          <mml:msub>\n                            <mml:mi>F<\/mml:mi>\n                            <mml:mn>1<\/mml:mn>\n                          <\/mml:msub>\n                        <\/mml:math>\n                      <\/jats:alternatives>\n                    <\/jats:inline-formula>\n                    score on the table classification task. The\n                    <jats:sc>ChemTables<\/jats:sc>\n                    dataset is publicly available at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/doi.org\/10.17632\/g7tjh7tbrj.3\">https:\/\/doi.org\/10.17632\/g7tjh7tbrj.3<\/jats:ext-link>\n                    , subject to the CC BY NC 3.0 license. Code\/models evaluated in this work are in a Github repository\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/zenanz\/ChemTables\">https:\/\/github.com\/zenanz\/ChemTables<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1186\/s13321-021-00568-2","type":"journal-article","created":{"date-parts":[[2021,12,11]],"date-time":"2021-12-11T03:02:31Z","timestamp":1639191751000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["ChemTables: a dataset for semantic classification on tables in chemical patents"],"prefix":"10.1186","volume":"13","author":[{"given":"Zenan","family":"Zhai","sequence":"first","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Christian","family":"Druckenbrodt","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Camilo","family":"Thorne","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Saber A.","family":"Akhondi","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Dat Quoc","family":"Nguyen","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"given":"Trevor","family":"Cohn","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8661-1544","authenticated-orcid":false,"given":"Karin","family":"Verspoor","sequence":"additional","affiliation":[],"role":[{"vocabulary":"crossref","role":"author"}]}],"member":"297","published-online":{"date-parts":[[2021,12,11]]},"reference":[{"issue":"1","key":"568_CR1","doi-asserted-by":"publisher","first-page":"49","DOI":"10.1186\/s13321-015-0097-z","volume":"7","author":"S Senger","year":"2015","unstructured":"Senger S, Bartek L, Papadatos G, Gaulton A (2015) Managing expectations: Assessment of chemistry databases generated by automated extraction of chemical structures from patents. J Cheminformat 7(1):49","journal-title":"J Cheminformat"},{"key":"568_CR2","doi-asserted-by":"publisher","first-page":"001","DOI":"10.1093\/database\/baz001","volume":"2019","author":"SA Akhondi","year":"2019","unstructured":"Akhondi SA, Rey H, Schw\u00f6rer M, Maier M, Toomey JP, Nau H, Ilchmann G, Sheehan M, Irmer M, Bobach C, Doornenbal MA, Gregory M, Kors JA (2019) Automatic identification of relevant chemical compounds from patents. Database 2019:001","journal-title":"Database"},{"issue":"3","key":"568_CR3","doi-asserted-by":"publisher","first-page":"739","DOI":"10.1021\/ci100384d","volume":"51","author":"DM Lowe","year":"2011","unstructured":"Lowe DM, Corbett PT, Murray-Rust P, Glen RC (2011) Chemical name to structure: OPSIN, an open source solution. J Chem Informat Model 51(3):739\u2013753","journal-title":"J Chem Informat Model"},{"key":"568_CR4","unstructured":"MarvinSketch. https:\/\/chemaxon.com\/products\/marvin. Accessed 08 Sep 2020"},{"key":"568_CR5","doi-asserted-by":"crossref","unstructured":"Milosevic N, Gregson C, Hernandez R, Nenadic G (2016) Disentangling the structure of tables in scientific literature. In: International Conference on Applications of Natural Language to Information Systems, pp. 162\u2013174 . Springer","DOI":"10.1007\/978-3-319-41754-7_14"},{"issue":"23\u201324","key":"568_CR6","doi-asserted-by":"publisher","first-page":"1019","DOI":"10.1016\/j.drudis.2011.10.005","volume":"16","author":"S Muresan","year":"2011","unstructured":"Muresan S, Petrov P, Southan C, Kjellberg MJ, Kogej T, Tyrchan C, Varkonyi P, Xie PH (2011) Making every SAR point count: the development of Chemistry Connect for the large-scale integration of structure and bioactivity data. Drug Discov Today 16(23\u201324):1019\u20131030","journal-title":"Drug Discov Today"},{"issue":"10","key":"568_CR7","doi-asserted-by":"publisher","first-page":"1894","DOI":"10.1021\/acs.jcim.6b00207","volume":"56","author":"MC Swain","year":"2016","unstructured":"Swain MC, Cole JM (2016) Chemdataextractor: a toolkit for automated extraction of chemical information from the scientific literature. J Chem Inform Model 56(10):1894\u20131904","journal-title":"J Chem Inform Model"},{"key":"568_CR8","unstructured":"Unlocking chemical information from tables and legacy articles. https:\/\/www.nextmovesoftware.com\/talks\/Lowe_UnlockingLegacyArticles_ACS_201508.pdf. Accessed: 08 Sep 2020"},{"issue":"9","key":"568_CR9","doi-asserted-by":"publisher","first-page":"107477","DOI":"10.1371\/journal.pone.0107477","volume":"9","author":"SA Akhondi","year":"2014","unstructured":"Akhondi SA, Klenner AG, Tyrchan C, Manchala AK, Boppana K, Lowe D, Zimmermann M, Jagarlapudi SA, Sayle R, Kors JA et al (2014) Annotated chemical patent corpus: a gold standard for text mining. PLoS One 9(9):107477","journal-title":"PLoS One"},{"key":"568_CR10","unstructured":"Krallinger M, Rabal O, Louren\u00e7o A, Perez MP, Rodriguez GP, Vazquez M, Leitner F, Oyarzabal J, Valencia A (2015) Overview of the CHEMDNER patents task. In: Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, pp. 63\u201375"},{"key":"568_CR11","unstructured":"Wei C-H, Peng Y, Leaman R, Davis AP, Mattingly CJ, Li J, Wiegers TC, Lu Z (2015) Overview of the BioCreative V chemical disease relation (CDR) task. In: Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, vol. 14"},{"issue":"14","key":"568_CR12","doi-asserted-by":"publisher","first-page":"37","DOI":"10.1093\/bioinformatics\/btx228","volume":"33","author":"M Habibi","year":"2017","unstructured":"Habibi M, Weber L, Neves M, Wiegandt DL, Leser U (2017) Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33(14):37\u201348","journal-title":"Bioinformatics"},{"key":"568_CR13","doi-asserted-by":"crossref","unstructured":"Zhai Z, Nguyen DQ, Akhondi S, Thorne C, Druckenbrodt C, Cohn T, Gregory M, Verspoor K (2019) Improving chemical named entity recognition in patents with contextualized word embeddings. In: Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 328\u2013338","DOI":"10.18653\/v1\/W19-5035"},{"key":"568_CR14","unstructured":"He J, Nguyen DQ, Akhondi SA, Druckenbrodt C, Thorne C, Hoessel R, Afzal Z, Zhai Z, Fang B, Yoshikawa H et al (2020) Overview of ChEMU 2020: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In: International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 237\u2013254 . Springer"},{"key":"568_CR15","doi-asserted-by":"publisher","unstructured":"Zhai Z, Druckenbrodt C, Eustratiadis P, Thorne C, Akhondi SA, Nguyen DQ, Cohn T, Verspoor K (2020) ChemTables: dataset for table classification in chemical patents. Mendeley Data. https:\/\/doi.org\/10.17632\/g7tjh7tbrj.1","DOI":"10.17632\/g7tjh7tbrj.1"},{"key":"568_CR16","doi-asserted-by":"crossref","unstructured":"Lehmberg O, Ritze D, Meusel R, Bizer C (2016) A large public corpus of web tables containing time and context metadata. In: Proceedings of the 25th International Conference Companion on World Wide Web, pp. 75\u201376. International World Wide Web Conferences Steering Committee","DOI":"10.1145\/2872518.2889386"},{"key":"568_CR17","doi-asserted-by":"crossref","unstructured":"Nishida K, Sadamitsu K, Higashinaka R, Matsuo Y (2017) Understanding the semantic structures of tables with a hybrid deep neural network architecture. In: Thirty-First AAAI Conference on Artificial Intelligence","DOI":"10.1609\/aaai.v31i1.10484"},{"key":"568_CR18","doi-asserted-by":"crossref","unstructured":"He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770\u2013778","DOI":"10.1109\/CVPR.2016.90"},{"key":"568_CR19","doi-asserted-by":"crossref","unstructured":"Chen W, Wang H, Chen J, Zhang Y, Wang H, Li S, Zhou X, Wang WY (2020) TabFact: a large-scale dataset for table-based fact verification. In: International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia","DOI":"10.18653\/v1\/2021.findings-emnlp.338"},{"key":"568_CR20","doi-asserted-by":"crossref","unstructured":"Crestan E, Pantel P (2011) Web-scale table census and classification. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 545\u2013554. ACM","DOI":"10.1145\/1935826.1935904"},{"issue":"8","key":"568_CR21","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","volume":"9","author":"S Hochreiter","year":"1997","unstructured":"Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735\u20131780","journal-title":"Neural Comput"},{"key":"568_CR22","unstructured":"April 2016 Common Crawl Archive. https:\/\/commoncrawl.org\/2016\/05\/april-2016-crawl-archive-now-available\/. Accessed: 08 Sep 2020"},{"key":"568_CR23","unstructured":"Cafarella MJ, Halevy AY, Zhang Y, Wang DZ, Wu E (2008) Uncovering the relational web. In: WebDB"},{"key":"568_CR24","doi-asserted-by":"crossref","unstructured":"Eberius J, Braunschweig K, Hentsch M, Thiele M, Ahmadov A, Lehner W (2015) Building the Dresden web table corpus: a classification approach. In: 2015 IEEE\/ACM 2nd International Symposium on Big Data Computing (BDC), pp. 41\u201350. IEEE","DOI":"10.1109\/BDC.2015.30"},{"key":"568_CR25","doi-asserted-by":"crossref","unstructured":"Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E (2016) Hierarchical attention networks for document classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480\u20131489","DOI":"10.18653\/v1\/N16-1174"},{"key":"568_CR26","unstructured":"Ghasemi-Gol M, Szekely P (2018) TabVec: table vectors for classification of web tables. arXiv preprint arXiv:1802.06290"},{"issue":"2","key":"568_CR27","doi-asserted-by":"publisher","first-page":"139","DOI":"10.1007\/s12559-009-9009-8","volume":"1","author":"P Kanerva","year":"2009","unstructured":"Kanerva P (2009) Hyperdimensional computing: an introduction to computing in distributed representation with high-dimensional random vectors. Cogn Comput 1(2):139\u2013159","journal-title":"Cogn Comput"},{"key":"568_CR28","doi-asserted-by":"crossref","unstructured":"Zhang L, Zhang S, Balog K (2019) Table2Vec: Neural word and entity embeddings for table population and retrieval. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1029\u20131032","DOI":"10.1145\/3331184.3331333"},{"key":"568_CR29","unstructured":"Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed Representations of Words and Phrases and their Compositionality. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems, pp. 3111\u20133119"},{"key":"568_CR30","doi-asserted-by":"crossref","unstructured":"Pasupat P, Liang P (2015) Compositional Semantic Parsing on Semi-Structured Tables. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1470\u20131480","DOI":"10.3115\/v1\/P15-1142"},{"key":"568_CR31","doi-asserted-by":"crossref","unstructured":"Haug T, Ganea O-E, Grnarova P (2018) Neural multi-step reasoning for question answering on semi-structured tables. In: European Conference on Information Retrieval, pp. 611\u2013617 . Springer","DOI":"10.1007\/978-3-319-76941-7_52"},{"key":"568_CR32","doi-asserted-by":"crossref","unstructured":"Krishnamurthy J, Dasigi P, Gardner M (2017) Neural semantic parsing with type constraints for semi-structured tables. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1516\u20131526","DOI":"10.18653\/v1\/D17-1160"},{"key":"568_CR33","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171\u20134186"},{"key":"568_CR34","doi-asserted-by":"crossref","unstructured":"Liang C, Berant J, Le Q, Forbus KD, Lao N (2017) Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 23\u201333","DOI":"10.18653\/v1\/P17-1003"},{"key":"568_CR35","doi-asserted-by":"crossref","unstructured":"Ibrahim Y, Weikum G (2019) ExQuisiTe: Explaining Quantities in Text. In: The World Wide Web Conference, pp. 3541\u20133544 . ACM","DOI":"10.1145\/3308558.3314134"},{"key":"568_CR36","doi-asserted-by":"crossref","unstructured":"Ibrahim Y, Riedewald M, Weikum G, Zeinalipour-Yazti D (2019) Bridging Quantities in Tables and Text. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1010\u20131021. IEEE","DOI":"10.1109\/ICDE.2019.00094"},{"key":"568_CR37","unstructured":"Shmanina T, Zukerman I, Cheam AL, Bochynek T, Cavedon L (2016) A Corpus of tables in full-text biomedical research publications. In: Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016), pp. 70\u201379"},{"key":"568_CR38","unstructured":"Elsevier: Reaxys features and capabilities. https:\/\/www.elsevier.com\/solutions\/reaxys\/how-reaxys-works. Accessed: 08 Sep 2020"},{"issue":"12","key":"568_CR39","doi-asserted-by":"publisher","first-page":"2897","DOI":"10.1021\/ci900437n","volume":"49","author":"J Goodman","year":"2009","unstructured":"Goodman J (2009) Computer software review: Reaxys. J Chem Inf Model 49(12):2897\u20132898. https:\/\/doi.org\/10.1021\/ci900437n","journal-title":"J Chem Inf Model"},{"key":"568_CR40","doi-asserted-by":"crossref","unstructured":"Lawson AJ, Swienty-Busch J, G\u00e9oui T, Evans D (2014) The making of Reaxys \u2013 Towards unobstructed access to relevant chemistry information. In: The Future of the History of Chemical Information, pp. 127\u2013148. American Chemical Society Publications, Washington, D.C","DOI":"10.1021\/bk-2014-1164.ch008"},{"issue":"1","key":"568_CR41","doi-asserted-by":"publisher","first-page":"41","DOI":"10.1186\/1758-2946-3-41","volume":"3","author":"DM Jessop","year":"2011","unstructured":"Jessop DM, Adams SE, Willighagen EL, Hawizy L, Murray-Rust P (2011) OSCAR4: a flexible architecture for chemical text-mining. J Cheminformat 3(1):41","journal-title":"J Cheminformat"},{"issue":"1","key":"568_CR42","doi-asserted-by":"publisher","first-page":"95","DOI":"10.1162\/089120104773633402","volume":"30","author":"B Di Eugenio","year":"2004","unstructured":"Di Eugenio B, Glass M (2004) The Kappa statistic: a second look. Comput Linguist 30(1):95\u2013101","journal-title":"Comput Linguist"},{"key":"568_CR43","doi-asserted-by":"crossref","unstructured":"Ma X, Hovy E (2016) End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1064\u20131074","DOI":"10.18653\/v1\/P16-1101"}],"container-title":["Journal of Cheminformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-021-00568-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s13321-021-00568-2\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s13321-021-00568-2.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,17]],"date-time":"2023-01-17T19:07:41Z","timestamp":1673982461000},"score":1,"resource":{"primary":{"URL":"https:\/\/jcheminf.biomedcentral.com\/articles\/10.1186\/s13321-021-00568-2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,12]]},"references-count":43,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["568"],"URL":"https:\/\/doi.org\/10.1186\/s13321-021-00568-2","relation":{"has-preprint":[{"id-type":"doi","id":"10.21203\/rs.3.rs-127219\/v2","asserted-by":"object"},{"id-type":"doi","id":"10.21203\/rs.3.rs-127219\/v1","asserted-by":"object"}]},"ISSN":["1758-2946"],"issn-type":[{"value":"1758-2946","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,12]]},"assertion":[{"value":"7 May 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"6 November 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 December 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"CD CT and SA work for Elsevier.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"97"}}