{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,4]],"date-time":"2026-04-04T19:15:14Z","timestamp":1775330114709,"version":"3.50.1"},"reference-count":43,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2021,1,19]],"date-time":"2021-01-19T00:00:00Z","timestamp":1611014400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100010570","name":"Nieders\u00e4chsisches Ministerium f\u00fcr Wissenschaft und Kultur","doi-asserted-by":"publisher","award":["ZN3429"],"award-info":[{"award-number":["ZN3429"]}],"id":[{"id":"10.13039\/501100010570","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>Predicting biological properties of unseen proteins is shown to be improved by the use of protein sequence embeddings. However, these sequence embeddings have the caveat that biological metadata do not exist for each amino acid, in order to measure the quality of each unique learned embedding vector separately. Therefore, current sequence embedding cannot be intrinsically evaluated on the degree of their captured biological information in a quantitative manner. We address this drawback by our approach, dom2vec, by learning vector representation for protein domains and not for each amino acid base, as biological metadata do exist for each domain separately. To perform a reliable quantitative intrinsic evaluation in terms of biology knowledge, we selected the metadata related to the most distinctive biological characteristics of a domain, which are its structure, enzymatic, and molecular function. 
Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment\u2014therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the dom2vec applicability on protein prediction tasks, by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.<\/jats:p>","DOI":"10.3390\/a14010028","type":"journal-article","created":{"date-parts":[[2021,1,19]],"date-time":"2021-01-19T11:39:55Z","timestamp":1611056395000},"page":"28","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["Capturing Protein Domain Structure and Function Using Self-Supervision on Domain Architectures"],"prefix":"10.3390","volume":"14","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-8503-1962","authenticated-orcid":false,"given":"Damianos P.","family":"Melidis","sequence":"first","affiliation":[{"name":"L3S Research Center, Leibniz University Hannover, 30167 Hannover, Germany"}]},{"given":"Wolfgang","family":"Nejdl","sequence":"additional","affiliation":[{"name":"L3S Research Center, Leibniz University Hannover, 30167 Hannover, Germany"},{"name":"Knowledge-Based Systems Laboratory, Leibniz University Hannover, 30167 Hannover, Germany"}]}],"member":"1968","published-online":{"date-parts":[[2021,1,19]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"444","DOI":"10.1016\/j.tibs.2008.05.008","article-title":"Arrangements in the modular evolution of proteins","volume":"33","author":"Moore","year":"2008","journal-title":"Trends Biochem. Sci."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Forslund, K., and Sonnhammer, E.L. (2012). 
Evolution of protein domain architectures. Evolutionary Genomics, Springer.","DOI":"10.1007\/978-1-61779-585-5_8"},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"45765","DOI":"10.1074\/jbc.M204161200","article-title":"Using functional domain composition and support vector machines for prediction of protein subcellular location","volume":"277","author":"Chou","year":"2002","journal-title":"J. Biol. Chem."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1681","DOI":"10.1093\/bioinformatics\/btn312","article-title":"Predicting protein function from domain content","volume":"24","author":"Forslund","year":"2008","journal-title":"Bioinformatics"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"2264","DOI":"10.1093\/bioinformatics\/btw114","article-title":"UniProt-DAAC: Domain architecture alignment and classification, a new method for automatic functional annotation in UniProtKB","volume":"32","author":"MacDougall","year":"2016","journal-title":"Bioinformatics"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"50","DOI":"10.1016\/j.gde.2015.08.010","article-title":"The language of the protein universe","volume":"35","author":"Scaiewicz","year":"2015","journal-title":"Curr. Opin. Genet. Dev."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"3636","DOI":"10.1073\/pnas.1814684116","article-title":"Grammar of protein domain architectures","volume":"116","author":"Yu","year":"2019","journal-title":"Proc. Natl. Acad. Sci. 
USA"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"320","DOI":"10.1093\/nar\/26.1.320","article-title":"Pfam: Multiple sequence alignments and HMM-profiles of protein domains","volume":"26","author":"Sonnhammer","year":"1998","journal-title":"Nucleic Acids Res."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"274","DOI":"10.1093\/bioinformatics\/btt379","article-title":"Rapid similarity search of proteins using alignments of domain arrangements","volume":"30","author":"Terrapon","year":"2013","journal-title":"Bioinformatics"},{"key":"ref_10","first-page":"D200","article-title":"CDD\/SPARCLE: Functional classification of proteins via subfamily domain architectures","volume":"45","author":"Bo","year":"2016","journal-title":"Nucleic Acids Res."},{"key":"ref_11","first-page":"2493","article-title":"Natural language processing (almost) from scratch","volume":"12","author":"Collobert","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_12","unstructured":"Mikolov, T., Chen, K., Corrado, G.S., and Dean, J. (2013, January 2\u20134). Efficient Estimation of Word Representations in Vector Space. Proceedings of the 1st International Conference on Learning Representations, Scottsdale, AZ, USA."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Pennington, J., Socher, R., and Manning, C. (2014, January 25\u201329). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar.","DOI":"10.3115\/v1\/D14-1162"},{"key":"ref_14","unstructured":"Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. (2013, January 5\u201310). Distributed representations of words and phrases and their compositionality. Proceedings of the 27th Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA."},{"key":"ref_15","unstructured":"Drozd, A., Gladkova, A., and Matsuoka, S. (2016, January 5\u201310). 
Word embeddings, analogies, and machine learning: Beyond king-man+woman=queen. Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Attardi, G., Cozza, V., and Sartiano, D. (2015, January 3\u20134). Detecting the scope of negations in clinical notes. Proceedings of the Second Italian Conference on Computational Linguistics CLiC-it 2015, Trento, Italy.","DOI":"10.4000\/books.aaccademia.1286"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Asgari, E., and Mofrad, M.R.K. (2015). Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS ONE, 10.","DOI":"10.1371\/journal.pone.0141287"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"2642","DOI":"10.1093\/bioinformatics\/bty178","article-title":"Learned protein embeddings for machine learning","volume":"34","author":"Yang","year":"2018","journal-title":"Bioinformatics"},{"key":"ref_19","unstructured":"Bepler, T., and Berger, B. (2019, January 6\u20139). Learning Protein Sequence Embeddings using Information from Structure. Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41598-019-38746-w","article-title":"Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX)","volume":"9","author":"Asgari","year":"2019","journal-title":"Sci. Rep."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Heinzinger, M., Elnaggar, A., Wang, Y., Dallago, C., Nechaev, D., Matthes, F., and Rost, B. (2019). Modeling aspects of the language of life through transfer-learning protein sequences. 
BMC Bioinform., 20.","DOI":"10.1186\/s12859-019-3220-8"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"1315","DOI":"10.1038\/s41592-019-0598-1","article-title":"Unified rational protein engineering with sequence-based deep representation learning","volume":"16","author":"Alley","year":"2019","journal-title":"Nat. Methods"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"616","DOI":"10.1002\/prot.25842","article-title":"Learning a functional grammar of protein domains using natural language word embedding techniques","volume":"88","author":"Buchan","year":"2020","journal-title":"Proteins: Struct. Funct. Bioinform."},{"key":"ref_24","doi-asserted-by":"crossref","unstructured":"Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018, January 1\u20136). Deep contextualized word representations. Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA.","DOI":"10.18653\/v1\/N18-1202"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/gb-2009-10-2-207","article-title":"Protein function annotation by homology-based inference","volume":"10","author":"Loewenstein","year":"2009","journal-title":"Genome Biol."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Schnabel, T., Labutov, I., Mimno, D., and Joachims, T. (2015, January 17\u201321). Evaluation methods for unsupervised word embeddings. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal.","DOI":"10.18653\/v1\/D15-1036"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"645","DOI":"10.1016\/j.engappai.2019.07.010","article-title":"A reproducible survey on word embeddings and ontology-based methods for word similarity: Linear combinations outperform the state of the art","volume":"85","author":"Goikoetxea","year":"2019","journal-title":"Eng. 
Appl. Artif. Intell."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"The UniProt Consortium (2017). UniProt: The universal protein knowledgebase. Nucleic Acids Res., 45, D158\u2013D169.","DOI":"10.1093\/nar\/gkw1099"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"D304","DOI":"10.1093\/nar\/gkt1240","article-title":"SCOPe: Structural Classification of Proteins-extended, integrating SCOP and ASTRAL data and classification of new structures","volume":"42","author":"Fox","year":"2013","journal-title":"Nucleic Acids Res."},{"key":"ref_30","first-page":"2579","article-title":"Visualizing data using t-SNE","volume":"9","author":"Maaten","year":"2008","journal-title":"J. Mach. Learn. Res."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"442","DOI":"10.1016\/0005-2795(75)90109-9","article-title":"Comparison of the predicted and observed secondary structure of T4 phage lysozyme","volume":"405","author":"Matthews","year":"1975","journal-title":"Biochim. Et Biophys. Acta (BBA)-Protein Struct."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"D434","DOI":"10.1093\/nar\/gkh119","article-title":"IntEnz, the integrated relational enzyme database","volume":"32","author":"Fleischmann","year":"2004","journal-title":"Nucleic Acids Res."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"559","DOI":"10.1080\/14786440109462720","article-title":"LIII. On lines and planes of closest fit to systems of points in space","volume":"2","author":"Pearson","year":"1901","journal-title":"Lond. Edinb. Dublin Philos. Mag. J. Sci."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"1005","DOI":"10.1006\/jmbi.2000.3903","article-title":"Predicting subcellular localization of proteins based on their N-terminal amino acid sequence","volume":"300","author":"Emanuelsson","year":"2000","journal-title":"J. Mol. 
Biol."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"e90","DOI":"10.7717\/peerj-cs.90","article-title":"Machine learning can differentiate venom toxins from other proteins having non-toxic physiological functions","volume":"2","author":"Gacesa","year":"2016","journal-title":"PeerJ Comput. Sci."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"760","DOI":"10.1093\/bioinformatics\/btx680","article-title":"DEEPre: Sequence-based enzyme EC number prediction by deep learning","volume":"34","author":"Li","year":"2017","journal-title":"Bioinformatics"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Luong, T., Sutskever, I., Le, Q., Vinyals, O., and Zaremba, W. (2015, January 26\u201331). Addressing the Rare Word Problem in Neural Machine Translation. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, Beijing, China.","DOI":"10.3115\/v1\/P15-1002"},{"key":"ref_38","doi-asserted-by":"crossref","unstructured":"Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2017, January 3\u20137). Bag of tricks for efficient text classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain.","DOI":"10.18653\/v1\/E17-2068"},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"2278","DOI":"10.1109\/5.726791","article-title":"Gradient-based learning applied to document recognition","volume":"86","author":"LeCun","year":"1998","journal-title":"Proc. 
IEEE"},{"key":"ref_40","doi-asserted-by":"crossref","first-page":"1735","DOI":"10.1162\/neco.1997.9.8.1735","article-title":"Long short-term memory","volume":"9","author":"Hochreiter","year":"1997","journal-title":"Neural Comput."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"D351","DOI":"10.1093\/nar\/gky1100","article-title":"InterPro in 2019: Improving coverage, classification and access to protein sequence annotations","volume":"47","author":"Mitchell","year":"2019","journal-title":"Nucleic Acids Res."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"1236","DOI":"10.1093\/bioinformatics\/btu031","article-title":"InterProScan 5: Genome-scale protein function classification","volume":"30","author":"Jones","year":"2014","journal-title":"Bioinformatics"},{"key":"ref_43","unstructured":"Kingma, D.P., and Ba, J.L. (2015, January 7\u20139). Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, 
USA."}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/14\/1\/28\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T05:12:48Z","timestamp":1760159568000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/14\/1\/28"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,1,19]]},"references-count":43,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2021,1]]}},"alternative-id":["a14010028"],"URL":"https:\/\/doi.org\/10.3390\/a14010028","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2020.03.17.995498","asserted-by":"object"},{"id-type":"doi","id":"10.21203\/rs.3.rs-58816\/v1","asserted-by":"object"}]},"ISSN":["1999-4893"],"issn-type":[{"value":"1999-4893","type":"electronic"}],"subject":[],"published":{"date-parts":[[2021,1,19]]}}}