{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,6,11]],"date-time":"2025-06-11T06:43:09Z","timestamp":1749624189540,"version":"3.37.3"},"reference-count":20,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,5,27]],"date-time":"2020-05-27T00:00:00Z","timestamp":1590537600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,5,27]],"date-time":"2020-05-27T00:00:00Z","timestamp":1590537600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100000038","name":"U.S. Food and Drug Administration","doi-asserted-by":"publisher","award":["U01FD004979"],"award-info":[{"award-number":["U01FD004979"]}],"id":[{"id":"10.13039\/100000038","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["LM012354","LM05652","GM102365","N000141712266"],"award-info":[{"award-number":["LM012354","LM05652","GM102365","N000141712266"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"name":"National Science Foundation","award":["ACI-1429830","DGE-114747"],"award-info":[{"award-number":["ACI-1429830","DGE-114747"]}]},{"DOI":"10.13039\/100000185","name":"Defense Advanced Research Projects Agency","doi-asserted-by":"crossref","award":["FA87501720095"],"award-info":[{"award-number":["FA87501720095"]}],"id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000185","name":"Defense Advanced Research Projects Agency","doi-asserted-by":"publisher","award":["FA86501827865"],"award-info":[{"award-number":["FA86501827865"]}],"id":[{"id":"10.13039\/100000185","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science 
Foundation","doi-asserted-by":"publisher","award":["CCF1763315","CCF1563078"],"award-info":[{"award-number":["CCF1763315","CCF1563078"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000006","name":"Office of Naval Research","doi-asserted-by":"publisher","award":["N000141712266"],"award-info":[{"award-number":["N000141712266"]}],"id":[{"id":"10.13039\/100000006","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n<jats:title>Background<\/jats:title>\n<jats:p>Enzymatic and chemical reactions are key for understanding biological processes in cells. Curated databases of chemical reactions exist but these databases struggle to keep up with the exponential growth of the biomedical literature. Conventional text mining pipelines provide tools to automatically extract entities and relationships from the scientific literature, and partially replace expert curation, but such machine learning frameworks often require a large amount of labeled training data and thus lack scalability for both larger document corpora and new relationship types.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Results<\/jats:title>\n<jats:p>We developed an application of Snorkel, a weakly supervised learning framework, for extracting chemical reaction relationships from biomedical literature abstracts. For this work, we defined a chemical reaction relationship as the transformation of chemical A to chemical B. 
We built and evaluated our system on small annotated sets of chemical reaction relationships from two corpora: curated bacteria-related abstracts from the MetaCyc database (MetaCyc_Corpus) and a more general set of abstracts annotated with MeSH (Medical Subject Headings) term Bacteria (Bacteria_Corpus; a superset of MetaCyc_Corpus). For the MetaCyc_Corpus, we obtained 84% precision and 41% recall (55% F1 score). Extending to the more general Bacteria_Corpus decreased precision to 62% with only a four-point drop in recall to 37% (46% F1 score). Overall, the Bacteria_Corpus contained two orders of magnitude more candidate chemical reaction relationships (nine million candidates vs 68,000 candidates) and had a larger class imbalance (2.5% positives vs 5% positives) as compared to the MetaCyc_Corpus. In total, we extracted 6871 chemical reaction relationships from nine million candidates in the Bacteria_Corpus.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Conclusions<\/jats:title>\n<jats:p>With this work, we built a database of chemical reaction relationships from almost 900,000 scientific abstracts without a large training set of labeled annotations. 
Further, we showed the generalizability of our initial application built on MetaCyc documents enriched with chemical reactions to a general set of articles related to bacteria.<\/jats:p>\n<\/jats:sec>","DOI":"10.1186\/s12859-020-03542-1","type":"journal-article","created":{"date-parts":[[2020,5,27]],"date-time":"2020-05-27T11:03:40Z","timestamp":1590577420000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":16,"title":["Extracting chemical reactions from text using Snorkel"],"prefix":"10.1186","volume":"21","author":[{"given":"Emily K.","family":"Mallory","sequence":"first","affiliation":[]},{"given":"Matthieu","family":"de Rochemonteix","sequence":"additional","affiliation":[]},{"given":"Alex","family":"Ratner","sequence":"additional","affiliation":[]},{"given":"Ambika","family":"Acharya","sequence":"additional","affiliation":[]},{"given":"Chris","family":"Re","sequence":"additional","affiliation":[]},{"given":"Roselie A.","family":"Bright","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3859-2905","authenticated-orcid":false,"given":"Russ B.","family":"Altman","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,5,27]]},"reference":[{"issue":"5","key":"3542_CR1","doi-asserted-by":"publisher","first-page":"273","DOI":"10.1038\/nrmicro.2016.17","volume":"14","author":"P Spanogiannopoulos","year":"2016","unstructured":"Spanogiannopoulos P, Bess EN, Carmody RN, Turnbaugh PJ. The microbial pharmacists within us: a metagenomic view of xenobiotic metabolism. Nat Rev Microbiol. 2016;14(5):273\u201387.","journal-title":"Nat Rev Microbiol"},{"issue":"D1","key":"3542_CR2","doi-asserted-by":"publisher","first-page":"D471","DOI":"10.1093\/nar\/gkv1164","volume":"44","author":"R Caspi","year":"2016","unstructured":"Caspi R, Billington R, Ferrer L, Foerster H, Fulcher CA, Keseler IM, et al. 
The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway\/genome databases. Nucleic Acids Res. 2016;44(D1):D471\u201380.","journal-title":"Nucleic Acids Res"},{"issue":"1","key":"3542_CR3","doi-asserted-by":"publisher","first-page":"27","DOI":"10.1093\/nar\/28.1.27","volume":"28","author":"M Kanehisa","year":"2000","unstructured":"Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27\u201330.","journal-title":"Nucleic Acids Res"},{"key":"3542_CR4","doi-asserted-by":"publisher","first-page":"47","DOI":"10.1016\/j.ymeth.2014.10.026","volume":"74","author":"N Papanikolaou","year":"2015","unstructured":"Papanikolaou N, Pavlopoulos GA, Theodosiou T, Iliopoulos I. Protein-protein interaction predictions using text mining methods. Methods. 2015;74:47\u201353.","journal-title":"Methods."},{"issue":"1","key":"3542_CR5","doi-asserted-by":"publisher","first-page":"132","DOI":"10.1093\/bib\/bbv024","volume":"17","author":"CC Huang","year":"2016","unstructured":"Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform. 2016;17(1):132\u201344.","journal-title":"Brief Bioinform"},{"issue":"Web Server issu","key":"3542_CR6","doi-asserted-by":"publisher","first-page":"W518","DOI":"10.1093\/nar\/gkt441","volume":"41","author":"CH Wei","year":"2013","unstructured":"Wei CH, Kao HY, Lu Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res. 2013;41(Web Server issue):W518\u201322.","journal-title":"Nucleic Acids Res"},{"issue":"Suppl 1 Text mi","key":"3542_CR7","doi-asserted-by":"publisher","first-page":"S3","DOI":"10.1186\/1758-2946-7-S1-S3","volume":"7","author":"R Leaman","year":"2015","unstructured":"Leaman R, Wei CH, Lu Z. tmChem: a high performance approach for chemical named entity recognition and normalization. J Cheminform. 
2015;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S3.","journal-title":"J Cheminform"},{"key":"3542_CR8","volume-title":"Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems","author":"T Mikolov","year":"2013","unstructured":"Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems; 2013."},{"key":"3542_CR9","volume-title":"Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805","author":"J Devlin","year":"2018","unstructured":"Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805; 2018."},{"key":"3542_CR10","volume-title":"BioBERT: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:190108746","author":"J Lee","year":"2019","unstructured":"Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:190108746. 2019."},{"key":"3542_CR11","volume-title":"Overview of the BioCreative VI chemical-protein interaction track. Proceedings of the BioCreative VI challenge evaluation workshop","author":"M Krallinger","year":"2017","unstructured":"Krallinger M, Rabal O, Akhondi SA, P\u00e9rez MP, Santamar\u00eda J, Rodr\u00edguez GP, et al. Overview of the BioCreative VI chemical-protein interaction track. Proceedings of the BioCreative VI challenge evaluation workshop, vol. 2017; 2017."},{"issue":"Suppl 1 Text mi","key":"3542_CR12","doi-asserted-by":"publisher","first-page":"S2","DOI":"10.1186\/1758-2946-7-S1-S2","volume":"7","author":"M Krallinger","year":"2015","unstructured":"Krallinger M, Rabal O, Leitner F, Vazquez M, Salgado D, Lu Z, et al. 
The CHEMDNER corpus of chemicals and drugs and its annotation principles. J Cheminform. 2015;7(Suppl 1 Text mining for chemistry and the CHEMDNER track):S2.","journal-title":"J Cheminform"},{"key":"3542_CR13","first-page":"3567","volume":"29","author":"A Ratner","year":"2016","unstructured":"Ratner A, De Sa C, Wu S, Selsam D, Re C. Data programming: creating large training sets, Quickly. Adv Neural Inf Process Syst. 2016;29:3567\u201375.","journal-title":"Adv Neural Inf Process Syst"},{"issue":"3","key":"3542_CR14","doi-asserted-by":"publisher","first-page":"269","DOI":"10.14778\/3157794.3157797","volume":"11","author":"A Ratner","year":"2017","unstructured":"Ratner A, Bach SH, Ehrenberg H, Fries J, Wu S, Re C. Snorkel: rapid training data creation with weak supervision. Proceedings VLDB Endowment. 2017;11(3):269\u201382.","journal-title":"Proceedings VLDB Endowment"},{"key":"3542_CR15","volume-title":"spacy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. To appear","author":"M Honnibal","year":"2017","unstructured":"Honnibal M, Montani I. spacy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. To appear; 2017."},{"key":"3542_CR16","volume-title":"Training complex models with multi-task weak supervision. arXiv preprint arXiv:181002840","author":"A Ratner","year":"2018","unstructured":"Ratner A, Hancock B, Dunnmon J, Sala F, Pandey S, R\u00e9 C. Training complex models with multi-task weak supervision. arXiv preprint arXiv:181002840; 2018."},{"issue":"1","key":"3542_CR17","doi-asserted-by":"publisher","first-page":"101","DOI":"10.1016\/j.jbiotec.2013.07.033","volume":"168","author":"A Hildebrand","year":"2013","unstructured":"Hildebrand A, Schlacta T, Warmack R, Kasuga T, Fan Z. Engineering Escherichia coli for improved ethanol production from gluconate. J Biotechnol. 
2013;168(1):101\u20136.","journal-title":"J Biotechnol"},{"issue":"2","key":"3542_CR18","doi-asserted-by":"publisher","first-page":"175","DOI":"10.1016\/0168-1656(94)90109-0","volume":"33","author":"N Layh","year":"1994","unstructured":"Layh N, Stolz A, Bohme J, Effenberger F, Knackmuss HJ. Enantioselective hydrolysis of racemic naproxen nitrile and naproxen amide to S-naproxen by new bacterial isolates. J Biotechnol. 1994;33(2):175\u201382.","journal-title":"J Biotechnol"},{"issue":"3","key":"3542_CR19","doi-asserted-by":"publisher","first-page":"996","DOI":"10.1006\/bbrc.1995.1596","volume":"209","author":"YC Lee","year":"1995","unstructured":"Lee YC, Shlyankevich M, Jeong HK, Douglas JS, Surh YJ. Bioactivation of 5-hydroxymethyl-2-furaldehyde to an electrophilic and mutagenic allylic sulfuric acid ester. Biochem Biophys Res Commun. 1995;209(3):996\u20131002.","journal-title":"Biochem Biophys Res Commun"},{"issue":"5","key":"3542_CR20","doi-asserted-by":"publisher","first-page":"1291","DOI":"10.1111\/j.1742-4658.2005.04567.x","volume":"272","author":"A Riemenschneider","year":"2005","unstructured":"Riemenschneider A, Wegele R, Schmidt A, Papenbrock J. Isolation and characterization of a D-cysteine desulfhydrase protein from Arabidopsis thaliana. FEBS J. 
2005;272(5):1291\u2013304.","journal-title":"FEBS J"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-020-03542-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-020-03542-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-020-03542-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,5,26]],"date-time":"2021-05-26T23:11:53Z","timestamp":1622070713000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-020-03542-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,5,27]]},"references-count":20,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["3542"],"URL":"https:\/\/doi.org\/10.1186\/s12859-020-03542-1","relation":{},"ISSN":["1471-2105"],"issn-type":[{"type":"electronic","value":"1471-2105"}],"subject":[],"published":{"date-parts":[[2020,5,27]]},"assertion":[{"value":"3 November 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"11 May 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"27 May 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Not applicable.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for 
publication"}},{"value":"The authors declare that they have no competing interests.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"217"}}