{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,10]],"date-time":"2026-03-10T13:03:46Z","timestamp":1773147826520,"version":"3.50.1"},"reference-count":35,"publisher":"Oxford University Press (OUP)","issue":"3","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2006,2,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: Remote homology detection between protein sequences is a central problem in computational biology. The discriminative method such as the support vector machine (SVM) is one of the most effective methods. Many of the SVM-based methods focus on finding useful representations of protein sequence, using either explicit feature vector representations or kernel functions. Such representations may suffer from the peaking phenomenon in many machine-learning methods because the features are usually very large and noise data may be introduced. Based on these observations, this research focuses on feature extraction and efficient representation of protein vectors for SVM protein classification.<\/jats:p>\n               <jats:p>Results: In this study, a latent semantic analysis (LSA) model, which is an efficient feature extraction technique from natural language processing, has been introduced in protein remote homology detection. Several basic building blocks of protein sequences have been investigated as the \u2018words\u2019 of \u2018protein sequence language\u2019, including N-grams, patterns and motifs. Each protein sequence is taken as a \u2018document\u2019 that is composed of bags-of-word. The word-document matrix is constructed first. The LSA is performed on the matrix to produce the latent semantic representation vectors of protein sequences, leading to noise-removal and smart description of protein sequences. The latent semantic representation vectors are then evaluated by SVM. 
The method is tested on the SCOP 1.53 database. The results show that the LSA model significantly improves the performance of remote homology detection in comparison with the basic formalisms. Furthermore, the performance of this method is comparable with that of the complex kernel methods such as SVM-LA and better than that of other sequence-based methods such as PSI-BLAST and SVM-pairwise.<\/jats:p>\n               <jats:p>Availability: The source codes are freely available at or upon request from the authors.<\/jats:p>\n               <jats:p>Contact: \u00a0qwdong@insun.hit.edu.cn<\/jats:p>","DOI":"10.1093\/bioinformatics\/bti801","type":"journal-article","created":{"date-parts":[[2005,11,30]],"date-time":"2005-11-30T03:18:43Z","timestamp":1133320723000},"page":"285-290","source":"Crossref","is-referenced-by-count":92,"title":["Application of latent semantic analysis to protein remote homology detection"],"prefix":"10.1093","volume":"22","author":[{"given":"Qi-wen","family":"Dong","sequence":"first","affiliation":[{"name":"School of Computer Science and Technology, Harbin Institute of Technology \u00a0 Harbin, China"}]},{"given":"Xiao-long","family":"Wang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Harbin Institute of Technology \u00a0 Harbin, China"}]},{"given":"Lei","family":"Lin","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Harbin Institute of Technology \u00a0 Harbin, China"}]}],"member":"286","published-online":{"date-parts":[[2005,11,29]]},"reference":[{"key":"2023012408342256600_b1","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J. Mol. 
Biol."},{"key":"2023012408342256600_b2","doi-asserted-by":"crossref","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","article-title":"Gapped Blast and Psi-blast: a new generation of protein database search programs","volume":"25","author":"Altschul","year":"1997","journal-title":"Nucleic Acids Res."},{"key":"2023012408342256600_b3","doi-asserted-by":"crossref","first-page":"2821","DOI":"10.1093\/bioinformatics\/bti432","article-title":"Use of multiple profiles corresponding to a sequence alignment enables effective detection of remote homologues","volume":"21","author":"Anand","year":"2005","journal-title":"Bioinformatics"},{"key":"2023012408342256600_b4","doi-asserted-by":"crossref","first-page":"D226","DOI":"10.1093\/nar\/gkh039","article-title":"SCOP database in 2004: refinements integrate structure and sequence family data","volume":"32","author":"Andreeva","year":"2004","journal-title":"Nucleic Acids Res."},{"key":"2023012408342256600_b5","first-page":"28","article-title":"Fitting a mixture model by expectation maximization to discover motifs in biopolymers","author":"Bailey","year":"1994"},{"key":"2023012408342256600_b6","first-page":"10","article-title":"Classifying proteins by family using the product of correlated p-values","author":"Bailey","year":"1999"},{"key":"2023012408342256600_b7","doi-asserted-by":"crossref","first-page":"1279","DOI":"10.1109\/5.880084","article-title":"Exploiting latent semantic information in statistical language modeling","volume":"88","author":"Bellegarda","year":"2000","journal-title":"Proc. 
IEEE"},{"key":"2023012408342256600_b8","doi-asserted-by":"crossref","first-page":"i26","DOI":"10.1093\/bioinformatics\/btg1002","article-title":"Remote homology detection: a motif based approach","volume":"19","author":"Ben-Hur","year":"2003","journal-title":"Bioinformatics"},{"key":"2023012408342256600_b9","doi-asserted-by":"crossref","first-page":"189","DOI":"10.1093\/nar\/gkh034","article-title":"The ASTRAL Compendium in 2004","volume":"32","author":"Chandonia","year":"2004","journal-title":"Nucleic Acids Res."},{"key":"2023012408342256600_b10","doi-asserted-by":"crossref","first-page":"955","DOI":"10.1002\/prot.20373","article-title":"Protein classification based on text document classification techniques","volume":"58","author":"Cheng","year":"2005","journal-title":"Proteins"},{"key":"2023012408342256600_b11","doi-asserted-by":"crossref","first-page":"4516","DOI":"10.1073\/pnas.0737502100","article-title":"Enhanced protein domain discovery by using language modeling techniques from speech recognition","volume":"100","author":"Coin","year":"2003","journal-title":"Proc. Natl Acad. Sci. 
USA"},{"key":"2023012408342256600_b12","first-page":"3363","article-title":"A pattern-based SVM for protein remote homology detection","author":"Dong","year":"2005"},{"key":"2023012408342256600_b13","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1007\/978-3-540-32263-4_2","article-title":"Computational Biology and Language","volume":"3345","author":"Ganapathiraju","year":"2005","journal-title":"Ambient Intelligence for Scientific Discovery, LNAI"},{"key":"2023012408342256600_b14","doi-asserted-by":"crossref","first-page":"78","DOI":"10.1109\/MSP.2004.1296545","article-title":"Characterization of protein secondary structure, Application of latent semantic analysis using different vocabularies","volume":"21","author":"Ganapathiraju","year":"2004","journal-title":"IEEE Signal Processing Magazine"},{"key":"2023012408342256600_b15","doi-asserted-by":"crossref","DOI":"10.3115\/1289189.1289259","article-title":"Comparative N-gram analysis of whole-genome protein sequences","author":"Ganapathiraju","year":"2002"},{"key":"2023012408342256600_b16","doi-asserted-by":"crossref","first-page":"25","DOI":"10.1016\/S0097-8485(96)80004-0","article-title":"Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching","volume":"20","author":"Gribskov","year":"1996","journal-title":"Comput. 
Chem."},{"key":"2023012408342256600_b17","doi-asserted-by":"crossref","first-page":"2667","DOI":"10.1093\/bioinformatics\/bti384","article-title":"Fold recognition by combining profile\u2013profile alignment and support vector machine","volume":"21","author":"Han","year":"2005","journal-title":"Bioinformatics"},{"key":"2023012408342256600_b18","doi-asserted-by":"crossref","first-page":"518","DOI":"10.1002\/prot.20221","article-title":"Remote homolog detection using local sequence-structure correlations","volume":"57","author":"Hou","year":"2004","journal-title":"Proteins"},{"key":"2023012408342256600_b19","doi-asserted-by":"crossref","first-page":"2294","DOI":"10.1093\/bioinformatics\/btg317","article-title":"Efficient remote homology detection using local structure","volume":"19","author":"Hou","year":"2003","journal-title":"Bioinformatics"},{"key":"2023012408342256600_b20","doi-asserted-by":"crossref","first-page":"95","DOI":"10.1089\/10665270050081405","article-title":"A discriminative framework for detecting remote protein homologies","volume":"7","author":"Jaakkola","year":"2000","journal-title":"J. Comput. Biol."},{"key":"2023012408342256600_b21","doi-asserted-by":"crossref","first-page":"846","DOI":"10.1093\/bioinformatics\/14.10.846","article-title":"Hidden Markov models for detecting remote protein homologies","volume":"14","author":"Karplus","year":"1998","journal-title":"Bioinformatics"},{"key":"2023012408342256600_b22","doi-asserted-by":"crossref","first-page":"1501","DOI":"10.1006\/jmbi.1994.1104","article-title":"Hidden Markov models in computational biology: applications to protein modeling","volume":"235","author":"Krogh","year":"1994","journal-title":"J. Mol. 
Biol."},{"key":"2023012408342256600_b23","doi-asserted-by":"crossref","first-page":"259","DOI":"10.1080\/01638539809545028","article-title":"Introduction to latent semantic analysis","volume":"25","author":"Landauer","year":"1998","journal-title":"Discourse Process"},{"key":"2023012408342256600_b24","doi-asserted-by":"crossref","first-page":"467","DOI":"10.1093\/bioinformatics\/btg431","article-title":"Mismatch string kernels for discriminative protein classification","volume":"20","author":"Leslie","year":"2004","journal-title":"Bioinformatics"},{"key":"2023012408342256600_b25","first-page":"564","article-title":"The spectrum kernel: a string kernel for SVM protein classification","author":"Leslie","year":"2002"},{"key":"2023012408342256600_b26","doi-asserted-by":"crossref","first-page":"857","DOI":"10.1089\/106652703322756113","article-title":"Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships","volume":"10","author":"Li","year":"2003","journal-title":"J. Comput. 
Biol."},{"key":"2023012408342256600_b27","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1016\/0076-6879(90)83007-V","article-title":"Rapid and sensitive sequence comparison with FASTP and FASTA","volume":"183","author":"Pearson","year":"1990","journal-title":"Methods Enzymol."},{"key":"2023012408342256600_b28","article-title":"A basis for repeated motifs in pattern discovery and text mining","author":"Pisanti","year":"2002"},{"key":"2023012408342256600_b29","doi-asserted-by":"crossref","first-page":"2175","DOI":"10.1093\/bioinformatics\/bth181","article-title":"Performance of an iterated T-HMM for homology detection","volume":"20","author":"Qian","year":"2004","journal-title":"Bioinformatics"},{"key":"2023012408342256600_b30","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1093\/bioinformatics\/14.1.55","article-title":"Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm","volume":"14","author":"Rigoutsos","year":"1998","journal-title":"Bioinformatics"},{"key":"2023012408342256600_b31","doi-asserted-by":"crossref","first-page":"1682","DOI":"10.1093\/bioinformatics\/bth141","article-title":"Protein homology detection using string alignment kernels","volume":"20","author":"Saigo","year":"2004","journal-title":"Bioinformatics"},{"key":"2023012408342256600_b32","first-page":"396","article-title":"Comparison of SVM-based methods for remote homology detection","volume":"13","author":"Saigo","year":"2002","journal-title":"Genome Inform."},{"key":"2023012408342256600_b33","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","article-title":"Identification of common molecular subsequences","volume":"147","author":"Smith","year":"1981","journal-title":"J. Mol. 
Biol."},{"key":"2023012408342256600_b34","doi-asserted-by":"crossref","first-page":"4673","DOI":"10.1093\/nar\/22.22.4673","article-title":"CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice","volume":"22","author":"Thompson","year":"1994","journal-title":"Nucleic Acids Res."},{"key":"2023012408342256600_b35","volume-title":"Statistical Learning Theory","author":"Vapnik","year":"1998"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/22\/3\/285\/48839521\/bioinformatics_22_3_285.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/22\/3\/285\/48839521\/bioinformatics_22_3_285.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,24]],"date-time":"2023-01-24T08:53:20Z","timestamp":1674550400000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/22\/3\/285\/220519"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2005,11,29]]},"references-count":35,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2006,2,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bti801","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2006,2,1]]},"published":{"date-parts":[[2005,11,29]]}}}