{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,13]],"date-time":"2026-04-13T19:53:17Z","timestamp":1776109997722,"version":"3.50.1"},"reference-count":27,"publisher":"Oxford University Press (OUP)","issue":"15","license":[{"start":{"date-parts":[[2018,3,23]],"date-time":"2018-03-23T00:00:00Z","timestamp":1521763200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/about_us\/legal\/notices"}],"funder":[{"name":"U.S. Army Research Office Institute for Collaborative Biotechnologies","award":["W911F-09-0001"],"award-info":[{"award-number":["W911F-09-0001"]}]},{"DOI":"10.13039\/100012614","name":"Donna and Benjamin M. Rosen Bioengineering Center","doi-asserted-by":"crossref","id":[{"id":"10.13039\/100012614","id-type":"DOI","asserted-by":"crossref"}]},{"DOI":"10.13039\/100000002","name":"National Institutes of Health","doi-asserted-by":"publisher","award":["F31MH102913"],"award-info":[{"award-number":["F31MH102913"]}],"id":[{"id":"10.13039\/100000002","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000001","name":"National Science Foundation","doi-asserted-by":"publisher","award":["GRF2017227007"],"award-info":[{"award-number":["GRF2017227007"]}],"id":[{"id":"10.13039\/100000001","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2018,8,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>Machine-learning models trained on protein sequences and their measured functions can infer biological properties of unseen sequences without requiring an understanding of the underlying physical or biological mechanisms. Such models enable the prediction and discovery of sequences with optimal properties. 
Machine-learning models generally require that their inputs be vectors, and the conversion from a protein sequence to a vector representation affects the model\u2019s ability to learn. We propose to learn embedded representations of protein sequences that take advantage of the vast quantity of unmeasured protein sequence data available. These embeddings are low-dimensional and can greatly simplify downstream modeling.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>The predictive power of Gaussian process models trained using embeddings is comparable to those trained on existing representations, which suggests that embeddings enable accurate predictions despite having orders of magnitude fewer dimensions. Moreover, embeddings are simpler to obtain because they do not require alignments, structural data, or selection of informative amino-acid properties. Visualizing the embedding vectors shows meaningful relationships between the embedded proteins are captured.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>The embedding vectors and code to reproduce the results are available at https:\/\/github.com\/fhalab\/embeddings_reproduction\/.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/bty178","type":"journal-article","created":{"date-parts":[[2018,3,22]],"date-time":"2018-03-22T20:11:04Z","timestamp":1521749464000},"page":"2642-2648","source":"Crossref","is-referenced-by-count":282,"title":["Learned protein embeddings for machine 
learning"],"prefix":"10.1093","volume":"34","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9045-6826","authenticated-orcid":false,"given":"Kevin K","family":"Yang","sequence":"first","affiliation":[{"name":"Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA"}]},{"given":"Zachary","family":"Wu","sequence":"additional","affiliation":[{"name":"Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA"}]},{"given":"Claire N","family":"Bedbrook","sequence":"additional","affiliation":[{"name":"Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA"}]},{"given":"Frances H","family":"Arnold","sequence":"additional","affiliation":[{"name":"Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA"},{"name":"Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA"}]}],"member":"286","published-online":{"date-parts":[[2018,3,23]]},"reference":[{"key":"2023012810012315800_bty178-B1","doi-asserted-by":"crossref","first-page":"1650011.","DOI":"10.1142\/S0219720016500116","article-title":"Issues in performance evaluation for host-pathogen protein interaction prediction","volume":"14","author":"Abbasi","year":"2016","journal-title":"J. Bioinform. Comput. Biol"},{"key":"2023012810012315800_bty178-B2","doi-asserted-by":"crossref","first-page":"831","DOI":"10.1038\/nbt.3300","article-title":"Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning","volume":"33","author":"Alipanahi","year":"2015","journal-title":"Nat. 
Biotechnol"},{"key":"2023012810012315800_bty178-B3","doi-asserted-by":"crossref","first-page":"e0141287.","DOI":"10.1371\/journal.pone.0141287","article-title":"Continuous distributed representation of biological sequences for deep proteomics and genomics","volume":"10","author":"Asgari","year":"2015","journal-title":"PLoS One"},{"key":"2023012810012315800_bty178-B4","doi-asserted-by":"crossref","first-page":"E2624","DOI":"10.1073\/pnas.1700269114","article-title":"Structure-guided SCHEMA recombination generates diverse chimeric channelrhodopsins","volume":"114","author":"Bedbrook","year":"2017","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023012810012315800_bty178-B5","doi-asserted-by":"crossref","first-page":"e1005786","DOI":"10.1371\/journal.pcbi.1005786","article-title":"Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization","volume":"13","author":"Bedbrook","year":"2017","journal-title":"PLOS Comput. Biol"},{"key":"2023012810012315800_bty178-B6","doi-asserted-by":"crossref","first-page":"21844","DOI":"10.1038\/srep21844","article-title":"Periscope: quantitative prediction of soluble protein expression in the periplasm of Escherichia coli","volume":"6","author":"Chang","year":"2016","journal-title":"Sci. Rep"},{"key":"2023012810012315800_bty178-B7","doi-asserted-by":"crossref","first-page":"78","DOI":"10.1145\/2347736.2347755","article-title":"A few useful things to know about machine learning","volume":"55","author":"Domingos","year":"2012","journal-title":"Commun. ACM"},{"key":"2023012810012315800_bty178-B100","doi-asserted-by":"crossref","first-page":"205","DOI":"10.1016\/j.jmb.2014.06.015","article-title":"Directed evolution of Gloeobacter violaceus rhodopsin spectral properties","volume":"427","author":"Engqvist","year":"2015","journal-title":"J. Mol. 
Biol."},{"key":"2023012810012315800_bty178-B8","doi-asserted-by":"crossref","first-page":"338","DOI":"10.1038\/nbt1286","article-title":"Improving catalytic function by ProSAR-driven enzyme evolution","volume":"25","author":"Fox","year":"2007","journal-title":"Nat. Biotechnol"},{"key":"2023012810012315800_bty178-B9","first-page":"202","volume-title":"Nucleic Acids Res","author":"Kawashima","year":"2008"},{"key":"2023012810012315800_bty178-B10","volume-title":"arXiv preprint","author":"Kimothi","year":"2016"},{"key":"2023012810012315800_bty178-B11","first-page":"1188","article-title":"Distributed representations of sentences and documents","volume":"32","author":"Le","year":"2014","journal-title":"Int. Conf. Mach. Learn. ICML 2014"},{"key":"2023012810012315800_bty178-B12","doi-asserted-by":"crossref","first-page":"467","DOI":"10.1093\/bioinformatics\/btg431","article-title":"Mismatch string kernels for discriminative protein classification","volume":"20","author":"Leslie","year":"2004","journal-title":"Bioinformatics"},{"key":"2023012810012315800_bty178-B13","doi-asserted-by":"crossref","first-page":"1051","DOI":"10.1038\/nbt1333","article-title":"A diverse family of thermostable cytochrome P450s created by recombination of stabilizing fragments","volume":"25","author":"Li","year":"2007","journal-title":"Nat. Biotechnol"},{"key":"2023012810012315800_bty178-B14","first-page":"2579","article-title":"Visualizing data using t-SNE","volume":"9","author":"Maaten","year":"2008","journal-title":"J. Mach. Learn. 
Res"},{"key":"2023012810012315800_bty178-B15","volume-title":"bioRxiv preprint","author":"Mazzaferro","year":"2017"},{"key":"2023012810012315800_bty178-B17","first-page":"3111","volume-title":"Advances in Neural Information Processing Systems","author":"Mikolov","year":"2013"},{"key":"2023012810012315800_bty178-B18","volume-title":"arXiv preprint","author":"Mikolov","year":"2013"},{"key":"2023012810012315800_bty178-B19","volume-title":"arXiv preprint","author":"Ng","year":"2017"},{"key":"2023012810012315800_bty178-B20","doi-asserted-by":"crossref","first-page":"3429","DOI":"10.1093\/bioinformatics\/btv345","article-title":"ProFET: Feature engineering captures high-level protein functions","volume":"31","author":"Ofer","year":"2015","journal-title":"Bioinformatics"},{"key":"2023012810012315800_bty178-B21","volume-title":"Gaussian Processes for Machine Learning","author":"Rasmussen","year":"2006"},{"key":"2023012810012315800_bty178-B22","first-page":"45","volume-title":"Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks","author":"Rurek","year":"2010"},{"key":"2023012810012315800_bty178-B23","doi-asserted-by":"crossref","first-page":"E193","DOI":"10.1073\/pnas.1215251110","article-title":"Navigating the protein fitness landscape with Gaussian processes","volume":"110","author":"Romero","year":"2013","journal-title":"Proc. Natl. Acad. Sci. 
USA"},{"key":"2023012810012315800_bty178-B24","doi-asserted-by":"crossref","DOI":"10.1074\/jbc.RA117.001052","article-title":"A statistical model for improved membrane protein expression using sequence-derived features","volume-title":"J Biol Chem.","author":"Saladi","year":"2018"},{"key":"2023012810012315800_bty178-B25","doi-asserted-by":"crossref","first-page":"158","DOI":"10.1093\/nar\/gkw1099","article-title":"UniProt: the universal protein knowledgebase","volume":"45","author":"The UniProt Consortium","year":"2017","journal-title":"Nucleic Acids Res"},{"key":"2023012810012315800_bty178-B26","author":"Young","year":"2016"},{"key":"2023012810012315800_bty178-B27","doi-asserted-by":"crossref","first-page":"1085","DOI":"10.1007\/s10822-017-0090-x","article-title":"Learning epistatic interactions from sequence-activity data to predict enantioselectivity","volume":"31","author":"Zaugg","year":"2017","journal-title":"J. Comput. Aided Mol. Des"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/15\/2642\/48935464\/bioinformatics_34_15_2642.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/34\/15\/2642\/48935464\/bioinformatics_34_15_2642.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,28]],"date-time":"2023-01-28T10:04:01Z","timestamp":1674900241000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/34\/15\/2642\/4951834"}},"subtitle":[],"editor":[{"given":"Jonathan","family":"Wren","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2018,3,23]]},"references-count":27,"journal-issue":{"issue":"15","published-print":{"date-parts":[[2018,8,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bty178","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2018,8,1]]},"published":{"date-parts":[[2018,3,23]]}}}