{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,14]],"date-time":"2026-02-14T05:50:01Z","timestamp":1771048201470,"version":"3.50.1"},"reference-count":56,"publisher":"Oxford University Press (OUP)","issue":"Supplement_1","license":[{"start":{"date-parts":[[2022,6,27]],"date-time":"2022-06-27T00:00:00Z","timestamp":1656288000000},"content-version":"vor","delay-in-days":3,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001659","name":"Deutsche Forschungsgemeinschaft","doi-asserted-by":"publisher","award":["BO 1910\/20"],"award-info":[{"award-number":["BO 1910\/20"]}],"id":[{"id":"10.13039\/501100001659","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001659","name":"Deutsche Forschungsgemeinschaft","doi-asserted-by":"publisher","award":["1910\/23"],"award-info":[{"award-number":["1910\/23"]}],"id":[{"id":"10.13039\/501100001659","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2022,6,24]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec><jats:title>Motivation<\/jats:title><jats:p>Untargeted metabolomics experiments rely on spectral libraries for structure annotation, but these libraries are vastly incomplete; in silico methods search in structure databases, allowing us to overcome this limitation. The best-performing in silico methods use machine learning to predict a molecular fingerprint from tandem mass spectra, then use the predicted fingerprint to search in a molecular structure database. Predicted molecular fingerprints are also of great interest for compound class annotation, de novo structure elucidation, and other tasks. So far, kernel support vector machines are the best tool for fingerprint prediction. However, they cannot be trained on all publicly available reference spectra because their training time scales cubically with the number of training data.<\/jats:p><\/jats:sec><jats:sec><jats:title>Results<\/jats:title><jats:p>We use the Nystr\u00f6m approximation to transform the kernel into a linear feature map. We evaluate two methods that use this feature map as input: a linear support vector machine and a deep neural network (DNN). For evaluation, we use a cross-validated dataset of 156 017 compounds and three independent datasets with 1734 compounds. We show that the combination of kernel method and DNN outperforms the kernel support vector machine, which is the current gold standard, as well as a DNN on tandem mass spectra on all evaluation datasets.<\/jats:p><\/jats:sec><jats:sec><jats:title>Availability and implementation<\/jats:title><jats:p>The deep kernel learning method for fingerprint prediction is part of the SIRIUS software, available at https:\/\/bio.informatik.uni-jena.de\/software\/sirius.<\/jats:p><\/jats:sec>","DOI":"10.1093\/bioinformatics\/btac260","type":"journal-article","created":{"date-parts":[[2022,4,14]],"date-time":"2022-04-14T11:10:15Z","timestamp":1649934615000},"page":"i342-i349","source":"Crossref","is-referenced-by-count":12,"title":["Deep kernel learning improves molecular fingerprint prediction from tandem mass spectra"],"prefix":"10.1093","volume":"38","author":[{"given":"Kai","family":"D\u00fchrkop","sequence":"first","affiliation":[{"name":"Department of Bioinformatics, Friedrich Schiller University , Jena 07743, Germany"}]}],"member":"286","published-online":{"date-parts":[[2022,6,27]]},"reference":[{"key":"2023041407560292100_","first-page":"265","author":"Abadi","year":"2016"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1007\/s11306-014-0676-4","article-title":"Competitive fragmentation modeling of ESI-MS\/MS spectra for putative metabolite identification","volume":"11","author":"Allen","year":"2015","journal-title":"Metabolomics"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"I49","DOI":"10.1093\/bioinformatics\/btn270","article-title":"Towards de novo identification of metabolites by analyzing tandem mass spectra","volume":"24","author":"B\u00f6cker","year":"2008","journal-title":"Bioinformatics"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"i28","DOI":"10.1093\/bioinformatics\/btw246","article-title":"Fast metabolite identification with input output kernel regression","volume":"32","author":"Brouard","year":"2016","journal-title":"Bioinformatics"},{"key":"2023041407560292100_","first-page":"407","volume-title":"Proceedings of Machine Learning Research, Seoul, Korea, PMLR,","author":"Brouard","year":"2017"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"160","DOI":"10.3390\/metabo9080160","article-title":"Improved small molecule identification through learning combinations of kernel regression models","volume":"9","author":"Brouard","year":"2019","journal-title":"Metabolites"},{"key":"2023041407560292100_","author":"Chen","year":"2019"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"13","DOI":"10.1186\/s13040-021-00244-z","article-title":"The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation","volume":"14","author":"Chicco","year":"2021","journal-title":"BioData Min"},{"key":"2023041407560292100_","first-page":"795","article-title":"Algorithms for learning kernels based on centered alignment","volume":"13","author":"Cortes","year":"2012","journal-title":"J. Mach. Learn. Res"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"1128","DOI":"10.3389\/fgene.2020.567757","article-title":"Approximate genome-based kernel models for large data sets including main effects and interactions","volume":"11","author":"Cuevas","year":"2020","journal-title":"Front. Genet"},{"key":"2023041407560292100_","author":"D\u00fchrkop","year":"2018"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"12580","DOI":"10.1073\/pnas.1509788112","article-title":"Searching molecular structure databases with tandem mass spectra using CSI: fingerID","volume":"112","author":"D\u00fchrkop","year":"2015","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"299","DOI":"10.1038\/s41592-019-0344-8","article-title":"SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information","volume":"16","author":"D\u00fchrkop","year":"2019","journal-title":"Nat. Methods"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"462","DOI":"10.1038\/s41587-020-0740-8","article-title":"Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra","volume":"39","author":"D\u00fchrkop","year":"2021","journal-title":"Nat. Biotechnol"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"104","DOI":"10.1007\/s11306-020-01726-7","article-title":"MetFID: artificial neural network-based compound fingerprint prediction for metabolite annotation","volume":"16","author":"Fan","year":"2020","journal-title":"Metabolomics"},{"key":"2023041407560292100_","first-page":"D440","article-title":"MetaboLights: a resource evolving in response to the needs of its scientific community","volume":"48","author":"Haug","year":"2019","journal-title":"Nucleic Acids Res"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"2333","DOI":"10.1093\/bioinformatics\/bts437","article-title":"Metabolite identification and molecular fingerprint prediction via machine learning","volume":"28","author":"Heinonen","year":"2012","journal-title":"Bioinformatics"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1038\/s41587-021-01045-9","article-title":"High-confidence structural annotation of metabolites absent from spectral libraries","volume":"40","author":"Hoffmann","year":"2022","journal-title":"Nat. Biotechnol"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"703","DOI":"10.1002\/jms.1777","article-title":"MassBank: a public repository for sharing mass spectral data for life sciences","volume":"45","author":"Horai","year":"2010","journal-title":"J. Mass Spectrom"},{"key":"2023041407560292100_","author":"Ioffe","year":"2015"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"8649","DOI":"10.1021\/acs.analchem.0c01450","article-title":"Predicting a molecular fingerprint from an electron ionization mass spectrum with deep neural networks","volume":"92","author":"Ji","year":"2020","journal-title":"Anal. Chem"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"D457","DOI":"10.1093\/nar\/gkv1070","article-title":"KEGG as a reference resource for gene and protein annotation","volume":"44","author":"Kanehisa","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"D1202","DOI":"10.1093\/nar\/gkv951","article-title":"PubChem substance and compound databases","volume":"44","author":"Kim","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2023041407560292100_","author":"Kingma","year":"2015"},{"key":"2023041407560292100_","author":"Kingma","year":"2014"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"2518","DOI":"10.1093\/bioinformatics\/btn479","article-title":"Chemical substructures that enrich for biological activity","volume":"24","author":"Klekota","year":"2008","journal-title":"Bioinformatics"},{"key":"2023041407560292100_","first-page":"1061","volume-title":"Proceedings of Machine Learning Research, Volume 89 of Proceedings of Machine Learning Research","author":"Laforgue","year":"2019"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"2096","DOI":"10.1093\/bioinformatics\/bty080","article-title":"Chemdistiller: an engine for metabolite annotation in mass spectrometry","volume":"34","author":"Laponogov","year":"2018","journal-title":"Bioinformatics"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.5391\/IJFIS.2017.17.1.1","article-title":"Deep neural network self-training based on unsupervised learning and dropout","volume":"17","author":"Lee","year":"2017","journal-title":"Int. J. Fuzzy Log Intell. Syst"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"196","DOI":"10.1016\/j.eswa.2019.01.063","article-title":"Shallow neural network with kernel approximation for prediction problems in highly demanding data networks","volume":"124","author":"Lopez-Martin","year":"2019","journal-title":"Expert Syst. Appl"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"i333","DOI":"10.1093\/bioinformatics\/bty245","article-title":"Bayesian networks for mass spectrometric metabolite identification via molecular fingerprints","volume":"34","author":"Ludwig","year":"2018","journal-title":"Bioinformatics"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"442","DOI":"10.1016\/0005-2795(75)90109-9","article-title":"Comparison of the predicted and observed secondary structure of T4 phage lysozyme","volume":"405","author":"Matthews","year":"1975","journal-title":"Biochim. Biophys. Acta"},{"key":"2023041407560292100_","first-page":"14410","volume-title":"Advances in Neural Information Processing Systems","author":"Meanti","year":"2020"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"905","DOI":"10.1038\/s41592-020-0933-6","article-title":"Feature-based molecular networking in the GNPS analysis environment","volume":"17","author":"Nothias","year":"2020","journal-title":"Nat. Methods"},{"key":"2023041407560292100_","author":"Ober","year":"2021"},{"key":"2023041407560292100_","volume-title":"Advances in Large Margin Classifiers, Chapter 5","author":"Platt","year":"2000"},{"key":"2023041407560292100_","first-page":"529","author":"Powers","year":"2003"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"742","DOI":"10.1021\/ci100050t","article-title":"Extended-connectivity fingerprints","volume":"50","author":"Rogers","year":"2010","journal-title":"J. Chem. Inf. Model"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1186\/s13321-017-0207-1","article-title":"Critical assessment of small molecule identification 2016: automated methods","volume":"9","author":"Schymanski","year":"2017","journal-title":"J. Cheminform"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"12423","DOI":"10.1038\/ncomms12423","article-title":"The WEIZMASS spectral library for high-confidence metabolite identification","volume":"7","author":"Shahaf","year":"2016","journal-title":"Nat. Commun"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"i157","DOI":"10.1093\/bioinformatics\/btu275","article-title":"Metabolite identification through multiple kernel learning on fragmentation trees","volume":"30","author":"Shen","year":"2014","journal-title":"Bioinformatics"},{"key":"2023041407560292100_","first-page":"1929","article-title":"Dropout: a simple way to prevent neural networks from overfitting","volume":"15","author":"Srivastava","year":"2014","journal-title":"J Mach. Learn. Res"},{"key":"2023041407560292100_","author":"Stravs","year":"2021"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"D463","DOI":"10.1093\/nar\/gkv1042","article-title":"Metabolomics workbench: an international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools","volume":"44","author":"Sud","year":"2016","journal-title":"Nucleic Acids Res"},{"key":"2023041407560292100_","author":"Tanimoto","year":"1958"},{"key":"2023041407560292100_","author":"Tossou","year":"2020"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"146","DOI":"10.1038\/s41589-020-00677-3","article-title":"Chemically-informed analyses of metabolomics mass spectrometry data with qemistree","volume":"17","author":"Tripathi","year":"2021","journal-title":"Nat. Chem. Biol"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/s11306-016-1036-3","article-title":"Improved metabolite identification with MIDAS and MAGMa through MS\/MS spectral dataset-driven parameter optimization","volume":"12","author":"Verdegem","year":"2016","journal-title":"Metabolomics"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"828","DOI":"10.1038\/nbt.3597","article-title":"Sharing and community curation of mass spectrometry data with global natural products social molecular networking","volume":"34","author":"Wang","year":"2016","journal-title":"Nat. Biotechnol"},{"key":"2023041407560292100_","volume-title":"Advances in Neural Information Processing Systems","author":"Williams","year":"2001"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1186\/s13321-017-0220-4","article-title":"The chemistry development kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching","volume":"9","author":"Willighagen","year":"2017","journal-title":"J. Cheminform"},{"key":"2023041407560292100_","first-page":"25942602","author":"Wilson","year":"2016"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"D608","DOI":"10.1093\/nar\/gkx1089","article-title":"HMDB 4.0: the human metabolome database for 2018","volume":"46","author":"Wishart","year":"2018","journal-title":"Nucleic Acids Res"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"148","DOI":"10.1186\/1471-2105-11-148","article-title":"In silico fragmentation for computer assisted identification of metabolite mass spectra","volume":"11","author":"Wolf","year":"2010","journal-title":"BMC Bioinformatics"},{"key":"2023041407560292100_","first-page":"1425","volume-title":"Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, Volume 22 of Proceedings of Machine Learning Research","author":"Zhang","year":"2012"},{"key":"2023041407560292100_","doi-asserted-by":"crossref","first-page":"71","DOI":"10.1016\/j.patrec.2020.03.030","article-title":"On the performance of Matthews correlation coefficient (MCC) for imbalanced dataset","volume":"136","author":"Zhu","year":"2020","journal-title":"Patt. Recog. Lett"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/Supplement_1\/i342\/49886625\/btac260.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/38\/Supplement_1\/i342\/49886625\/btac260.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,11,20]],"date-time":"2023-11-20T00:28:45Z","timestamp":1700440125000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/38\/Supplement_1\/i342\/6617526"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2022,6,24]]},"references-count":56,"journal-issue":{"issue":"Supplement_1","published-print":{"date-parts":[[2022,6,24]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btac260","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2022,7,1]]},"published":{"date-parts":[[2022,6,24]]}}}