{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,11]],"date-time":"2025-12-11T07:27:35Z","timestamp":1765438055644},"reference-count":34,"publisher":"Springer Science and Business Media LLC","issue":"1","content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2007,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>The post-genomic era is characterised by a torrent of biological information flooding the public databases. As a direct consequence, similarity searches starting with a single query sequence frequently lead to the identification of hundreds, or even thousands of potential homologues. The huge volume of data renders the subsequent structural, functional and evolutionary analyses very difficult. It is therefore essential to develop new strategies for efficient sampling of this large sequence space, in order to reduce the number of sequences to be processed. At the same time, it is important to retain the most pertinent sequences for structural and functional studies.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>An exhaustive analysis on a large scale test set (284 protein families) was performed to compare the efficiency of four different sampling methods aimed at selecting the most pertinent sequences. These four methods sample the proteins detected by BlastP searches and can be divided into two categories: two customisable methods where the user defines either the maximal number or the percentage of sequences to be selected; two automatic methods in which the number of sequences selected is determined by the program. We focused our analysis on the potential information content of the sampled sets of sequences using multiple alignment of complete sequences as the main validation tool. The study considered two criteria: the total number of sequences in BlastP and their associated E-values. The subsequent analyses investigated the influence of the sampling methods on the E-value distributions, the sequence coverage, the final multiple alignment quality and the active site characterisation at various residue conservation thresholds as a function of these criteria.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusion<\/jats:title>\n            <jats:p>The comparative analysis of the four sampling methods allows us to propose a suitable sampling strategy that significantly reduces the number of homologous sequences required for alignment, while at the same time maintaining the relevant information concerning the active site residues.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-8-62","type":"journal-article","created":{"date-parts":[[2007,2,23]],"date-time":"2007-02-23T20:34:51Z","timestamp":1172262891000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Blast sampling for structural and functional analyses"],"prefix":"10.1186","volume":"8","author":[{"given":"Anne","family":"Friedrich","sequence":"first","affiliation":[]},{"given":"Raymond","family":"Ripp","sequence":"additional","affiliation":[]},{"given":"Nicolas","family":"Garnier","sequence":"additional","affiliation":[]},{"given":"Emmanuel","family":"Bettler","sequence":"additional","affiliation":[]},{"given":"Gilbert","family":"Del\u00e9age","sequence":"additional","affiliation":[]},{"given":"Olivier","family":"Poch","sequence":"additional","affiliation":[]},{"given":"Luc","family":"Moulinier","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2007,2,23]]},"reference":[{"issue":"4","key":"1434_CR1","doi-asserted-by":"publisher","first-page":"332","DOI":"10.1038\/ng0893-332","volume":"4","author":"MS Boguski","year":"1993","unstructured":"Boguski MS, Lowe TM, Tolstoshev CM: dbEST--database for \"expressed sequence tags\". Nat Genet 1993, 4(4):332\u2013333. 10.1038\/ng0893-332","journal-title":"Nat Genet"},{"issue":"1","key":"1434_CR2","doi-asserted-by":"publisher","first-page":"126","DOI":"10.1093\/nar\/29.1.126","volume":"29","author":"A Bernal","year":"2001","unstructured":"Bernal A, Ear U, Kyrpides N: Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res 2001, 29(1):126\u2013127. 10.1093\/nar\/29.1.126","journal-title":"Nucleic Acids Res"},{"key":"1434_CR3","unstructured":"Genome OnLine Database[http:\/\/www.genomesonline.org\/]"},{"issue":"3","key":"1434_CR4","doi-asserted-by":"publisher","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","volume":"215","author":"SF Altschul","year":"1990","unstructured":"Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403\u2013410.","journal-title":"J Mol Biol"},{"issue":"1-2","key":"1434_CR5","doi-asserted-by":"publisher","first-page":"17","DOI":"10.1016\/S0378-1119(01)00461-9","volume":"270","author":"O Lecompte","year":"2001","unstructured":"Lecompte O, Thompson JD, Plewniak F, Thierry J, Poch O: Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene 2001, 270(1\u20132):17\u201330. 10.1016\/S0378-1119(01)00461-9","journal-title":"Gene"},{"issue":"13","key":"1434_CR6","doi-asserted-by":"publisher","first-page":"2682","DOI":"10.1093\/nar\/27.13.2682","volume":"27","author":"JD Thompson","year":"1999","unstructured":"Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 1999, 27(13):2682\u20132690. 10.1093\/nar\/27.13.2682","journal-title":"Nucleic Acids Res"},{"issue":"Database issue","key":"1434_CR7","doi-asserted-by":"publisher","first-page":"D187","DOI":"10.1093\/nar\/gkj161","volume":"34","author":"CH Wu","year":"2006","unstructured":"Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Mazumder R, O'Donovan C, Redaschi N, Suzek B: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, 34(Database issue):D187\u201391. 10.1093\/nar\/gkj161","journal-title":"Nucleic Acids Res"},{"issue":"13","key":"1434_CR8","doi-asserted-by":"publisher","first-page":"3789","DOI":"10.1093\/nar\/gkg620","volume":"31","author":"S Mika","year":"2003","unstructured":"Mika S, Rost B: UniqueProt: Creating representative protein sequence sets. Nucleic Acids Res 2003, 31(13):3789\u20133791. 10.1093\/nar\/gkg620","journal-title":"Nucleic Acids Res"},{"issue":"Web Server issu","key":"1434_CR9","doi-asserted-by":"publisher","first-page":"W26","DOI":"10.1093\/nar\/gkh459","volume":"32","author":"JB Spalding","year":"2004","unstructured":"Spalding JB, Lammers PJ: BLAST Filter and GraphAlign: rule-based formation and analysis of sets of related DNA and protein sequences. Nucleic Acids Res 2004, 32(Web Server issue):W26\u201332. 10.1093\/nar\/gkh459","journal-title":"Nucleic Acids Res"},{"issue":"2","key":"1434_CR10","doi-asserted-by":"publisher","first-page":"149","DOI":"10.1093\/bioinformatics\/bti791","volume":"22","author":"I Mihalek","year":"2006","unstructured":"Mihalek I, Res I, Lichtarge O: A structure and evolution-guided Monte Carlo sequence selection strategy for multiple alignment-based analysis of proteins. Bioinformatics 2006, 22(2):149\u2013156. 10.1093\/bioinformatics\/bti791","journal-title":"Bioinformatics"},{"issue":"1","key":"1434_CR11","doi-asserted-by":"publisher","first-page":"235","DOI":"10.1093\/nar\/28.1.235","volume":"28","author":"HM Berman","year":"2000","unstructured":"Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28(1):235\u2013242. 10.1093\/nar\/28.1.235","journal-title":"Nucleic Acids Res"},{"issue":"2","key":"1434_CR12","doi-asserted-by":"publisher","first-page":"197","DOI":"10.1002\/prot.10029","volume":"46","author":"D Przybylski","year":"2002","unstructured":"Przybylski D, Rost B: Alignments grow, secondary structure prediction improves. Proteins 2002, 46(2):197\u2013205. 10.1002\/prot.10029","journal-title":"Proteins"},{"issue":"4","key":"1434_CR13","doi-asserted-by":"publisher","first-page":"937","DOI":"10.1006\/jmbi.2001.5187","volume":"314","author":"JD Thompson","year":"2001","unstructured":"Thompson JD, Plewniak F, Ripp R, Thierry JC, Poch O: Towards a reliable objective function for multiple sequence alignments. J Mol Biol 2001, 314(4):937\u2013951. 10.1006\/jmbi.2001.5187","journal-title":"J Mol Biol"},{"issue":"1","key":"1434_CR14","doi-asserted-by":"publisher","first-page":"276","DOI":"10.1093\/nar\/30.1.276","volume":"30","author":"A Bateman","year":"2002","unstructured":"Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res 2002, 30(1):276\u2013280. 10.1093\/nar\/30.1.276","journal-title":"Nucleic Acids Res"},{"issue":"13","key":"1434_CR15","doi-asserted-by":"publisher","first-page":"3829","DOI":"10.1093\/nar\/gkg518","volume":"31","author":"F Plewniak","year":"2003","unstructured":"Plewniak F, Bianchetti L, Brelivet Y, Carles A, Chalmel F, Lecompte O, Mochel T, Moulinier L, Muller A, Muller J, Prigent V, Ripp R, Thierry JC, Thompson JD, Wicker N, Poch O: PipeAlign: A new toolkit for protein family analysis. Nucleic Acids Res 2003, 31(13):3829\u20133832. 10.1093\/nar\/gkg518","journal-title":"Nucleic Acids Res"},{"key":"1434_CR16","doi-asserted-by":"publisher","first-page":"471","DOI":"10.1186\/1471-2105-7-471","volume":"7","author":"PA Nuin","year":"2006","unstructured":"Nuin PA, Wang Z, Tillier ER: The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics 2006, 7: 471. 10.1186\/1471-2105-7-471","journal-title":"BMC Bioinformatics"},{"issue":"1","key":"1434_CR17","doi-asserted-by":"publisher","first-page":"127","DOI":"10.1002\/prot.20527","volume":"61","author":"JD Thompson","year":"2005","unstructured":"Thompson JD, Koehl P, Ripp R, Poch O: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins 2005, 61(1):127\u2013136. 10.1002\/prot.20527","journal-title":"Proteins"},{"issue":"1","key":"1434_CR18","doi-asserted-by":"publisher","first-page":"177","DOI":"10.1006\/jmbi.1999.2911","volume":"291","author":"LA Mirny","year":"1999","unstructured":"Mirny LA, Shakhnovich EI: Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J Mol Biol 1999, 291(1):177\u2013196. 10.1006\/jmbi.1999.2911","journal-title":"J Mol Biol"},{"issue":"1","key":"1434_CR19","doi-asserted-by":"publisher","first-page":"105","DOI":"10.1016\/S0022-2836(02)01036-7","volume":"324","author":"GJ Bartlett","year":"2002","unstructured":"Bartlett GJ, Porter CT, Borkakoti N, Thornton JM: Analysis of catalytic residues in enzyme active sites. J Mol Biol 2002, 324(1):105\u2013121. 10.1016\/S0022-2836(02)01036-7","journal-title":"J Mol Biol"},{"key":"1434_CR20","doi-asserted-by":"publisher","first-page":"271","DOI":"10.1023\/A:1017181826899","volume":"30","author":"R Kohavi","year":"1998","unstructured":"Kohavi R, Provost F: Glossary of Terms. Machine Learning 1998, 30: 271\u2013274. 10.1023\/A:1017181826899","journal-title":"Machine Learning"},{"key":"1434_CR21","doi-asserted-by":"publisher","first-page":"195","DOI":"10.1023\/A:1007452223027","volume":"30","author":"M Kubat","year":"1998","unstructured":"Kubat M, Holte RC, Matwin S: Machine learning for the detection of oil spills in satellite radar images. Machine Learning 1998, 30: 195\u2013215. 10.1023\/A:1007452223027","journal-title":"Machine Learning"},{"issue":"2","key":"1434_CR22","doi-asserted-by":"publisher","first-page":"395","DOI":"10.1006\/jmbi.2001.4870","volume":"311","author":"P Aloy","year":"2001","unstructured":"Aloy P, Querol E, Aviles FX, Sternberg MJ: Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J Mol Biol 2001, 311(2):395\u2013408. 10.1006\/jmbi.2001.4870","journal-title":"J Mol Biol"},{"issue":"1","key":"1434_CR23","doi-asserted-by":"publisher","first-page":"452","DOI":"10.1093\/nar\/gkg062","volume":"31","author":"FM Pearl","year":"2003","unstructured":"Pearl FM, Bennett CF, Bray JE, Harrison AP, Martin N, Shepherd A, Sillitoe I, Thornton J, Orengo CA: The CATH database: an extended protein family resource for structural and functional genomics. Nucleic Acids Res 2003, 31(1):452\u2013455. 10.1093\/nar\/gkg062","journal-title":"Nucleic Acids Res"},{"issue":"12","key":"1434_CR24","doi-asserted-by":"crossref","first-page":"1192","DOI":"10.1096\/fasebj.7.12.8375619","volume":"7","author":"EC Webb","year":"1993","unstructured":"Webb EC: Enzyme nomenclature: a personal retrospective. Faseb J 1993, 7(12):1192\u20131194.","journal-title":"Faseb J"},{"issue":"Database issue","key":"1434_CR25","doi-asserted-by":"publisher","first-page":"D115","DOI":"10.1093\/nar\/gkh131","volume":"32","author":"R Apweiler","year":"2004","unstructured":"Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 2004, 32(Database issue):D115\u20139. 10.1093\/nar\/gkh131","journal-title":"Nucleic Acids Res"},{"issue":"17","key":"1434_CR26","doi-asserted-by":"publisher","first-page":"3389","DOI":"10.1093\/nar\/25.17.3389","volume":"25","author":"SF Altschul","year":"1997","unstructured":"Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389\u20133402. 10.1093\/nar\/25.17.3389","journal-title":"Nucleic Acids Res"},{"issue":"4","key":"1434_CR27","first-page":"327","volume":"12","author":"K Sjolander","year":"1996","unstructured":"Sjolander K, Karplus K, Brown M, Hughey R, Krogh A, Mian IS, Haussler D: Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput Appl Biosci 1996, 12(4):327\u2013345.","journal-title":"Comput Appl Biosci"},{"issue":"8","key":"1434_CR28","doi-asserted-by":"publisher","first-page":"1435","DOI":"10.1093\/oxfordjournals.molbev.a003929","volume":"18","author":"N Wicker","year":"2001","unstructured":"Wicker N, Perrin GR, Thierry JC, Poch O: Secator: a program for inferring protein subfamilies from phylogenetic trees. Mol Biol Evol 2001, 18(8):1435\u20131441.","journal-title":"Mol Biol Evol"},{"issue":"9","key":"1434_CR29","doi-asserted-by":"publisher","first-page":"750","DOI":"10.1093\/bioinformatics\/16.9.750","volume":"16","author":"F Plewniak","year":"2000","unstructured":"Plewniak F, Thompson JD, Poch O: Ballast: blast post-processing based on locally conserved segments. Bioinformatics 2000, 16(9):750\u2013759. 10.1093\/bioinformatics\/16.9.750","journal-title":"Bioinformatics"},{"issue":"15","key":"1434_CR30","doi-asserted-by":"publisher","first-page":"2919","DOI":"10.1093\/nar\/28.15.2919","volume":"28","author":"JD Thompson","year":"2000","unstructured":"Thompson JD, Plewniak F, Thierry J, Poch O: DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res 2000, 28(15):2919\u20132926. 10.1093\/nar\/28.15.2919","journal-title":"Nucleic Acids Res"},{"issue":"9","key":"1434_CR31","doi-asserted-by":"publisher","first-page":"1155","DOI":"10.1093\/bioinformatics\/btg133","volume":"19","author":"JD Thompson","year":"2003","unstructured":"Thompson JD, Thierry JC, Poch O: RASCAL: rapid scanning and correction of multiple sequence alignments. Bioinformatics 2003, 19(9):1155\u20131161. 10.1093\/bioinformatics\/btg133","journal-title":"Bioinformatics"},{"issue":"4","key":"1434_CR32","doi-asserted-by":"publisher","first-page":"506","DOI":"10.1093\/bioinformatics\/btg016","volume":"19","author":"M Errami","year":"2003","unstructured":"Errami M, Geourjon C, Deleage G: Detection of unrelated proteins in sequences multiple alignments by using predicted secondary structures. Bioinformatics 2003, 19(4):506\u2013512. 10.1093\/bioinformatics\/btg016","journal-title":"Bioinformatics"},{"issue":"1","key":"1434_CR33","doi-asserted-by":"publisher","first-page":"29","DOI":"10.1148\/radiology.143.1.7063747","volume":"143","author":"JA Hanley","year":"1982","unstructured":"Hanley JA, McNeil BJ: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143(1):29\u201336.","journal-title":"Radiology"},{"issue":"12","key":"1434_CR34","doi-asserted-by":"publisher","first-page":"1038","DOI":"10.1097\/01.opx.0000192350.01045.6f","volume":"82","author":"MD Twa","year":"2005","unstructured":"Twa MD, Parthasarathy S, Roberts C, Mahmoud AM, Raasch TW, Bullimore MA: Automated decision tree classification of corneal shape. Optom Vis Sci 2005, 82(12):1038\u20131046. 10.1097\/01.opx.0000192350.01045.6f","journal-title":"Optom Vis Sci"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-8-62.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T01:37:08Z","timestamp":1630460228000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-8-62"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2007,2,23]]},"references-count":34,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2007,12]]}},"alternative-id":["1434"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-8-62","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2007,2,23]]},"assertion":[{"value":"8 August 2006","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 February 2007","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 February 2007","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"62"}}