{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,29]],"date-time":"2026-04-29T08:33:17Z","timestamp":1777451597689,"version":"3.51.4"},"reference-count":21,"publisher":"Oxford University Press (OUP)","issue":"18","license":[{"start":{"date-parts":[[2016,10,2]],"date-time":"2016-10-02T00:00:00Z","timestamp":1475366400000},"content-version":"vor","delay-in-days":1490,"URL":"http:\/\/creativecommons.org\/licenses\/by\/3.0"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2012,9,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: The genomic era in molecular biology has brought on a rapidly widening gap between the amount of sequence data and first-hand experimental characterization of proteins. Fortunately, the theory of evolution provides a simple solution: functional and structural information can be transferred between homologous proteins. Sequence similarity searching followed by k-nearest neighbor classification is the most widely used tool to predict the function or structure of anonymous gene products that come out of genome sequencing projects.<\/jats:p>\n               <jats:p>Results: We present a novel word filter, suffix array neighborhood search (SANS), to identify protein sequence similarities in the range of 50\u2013100% identity with sensitivity comparable to BLAST and 10 times the speed of USEARCH. In contrast to these previous approaches, the complexity of the search is proportional only to the length of the query sequence and independent of database size, enabling fast searching and functional annotation into the future despite rapidly expanding databases.<\/jats:p>\n               <jats:p>Availability and implementation: The software is freely available to non-commercial users from our website http:\/\/ekhidna.biocenter.helsinki.fi\/downloads\/sans.<\/jats:p>\n               <jats:p>Contact: \u00a0liisa.holm@helsinki.fi.<\/jats:p>","DOI":"10.1093\/bioinformatics\/bts417","type":"journal-article","created":{"date-parts":[[2012,9,7]],"date-time":"2012-09-07T20:35:22Z","timestamp":1347050122000},"page":"i438-i443","source":"Crossref","is-referenced-by-count":22,"title":["SANS: high-throughput retrieval of protein sequences allowing 50% mismatches"],"prefix":"10.1093","volume":"28","author":[{"given":"J. Patrik","family":"Koskinen","sequence":"first","affiliation":[{"name":"1 Department of Biosciences, Division of Genetics"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Liisa","family":"Holm","sequence":"additional","affiliation":[{"name":"1 Department of Biosciences, Division of Genetics"},{"name":"2 Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"286","published-online":{"date-parts":[[2012,9,3]]},"reference":[{"key":"2023012513033517400_B1","doi-asserted-by":"crossref","first-page":"403","DOI":"10.1016\/S0022-2836(05)80360-2","article-title":"Basic local alignment search tool","volume":"215","author":"Altschul","year":"1990","journal-title":"J. Mol. Biol."},{"key":"2023012513033517400_B2","first-page":"15","article-title":"Modeling protein families using probabilistic suffix trees","volume-title":"The Proceedings of RECOMB 1999","author":"Bejerano","year":"1999"},{"key":"2023012513033517400_B3","doi-asserted-by":"crossref","first-page":"77","DOI":"10.1145\/299432.299460","article-title":"q-gram based database searching using a suffix array (QUASAR)","volume-title":"RECOMB'99 Proceedings of the third annual international conference on Computational molecular biology","author":"Burkhard","year":"1999"},{"key":"2023012513033517400_B4","first-page":"56","article-title":"FLASH: A fast look-up algorithm for string homology","volume-title":"Proceedings of the first International Conference on Intelligent Systems for Molecular Biology","author":"Califano","year":"1993"},{"key":"2023012513033517400_B5","doi-asserted-by":"crossref","first-page":"98","DOI":"10.1002\/1097-0134(20001001)41:1<98::AID-PROT120>3.0.CO;2-S","article-title":"Practical limits of function prediction","volume":"41","author":"Devos","year":"2000","journal-title":"Proteins"},{"key":"2023012513033517400_B6","doi-asserted-by":"crossref","first-page":"2460","DOI":"10.1093\/bioinformatics\/btq461","article-title":"Search and clustering orders of magnitude faster than BLAST","volume":"26","author":"Edgar","year":"2010","journal-title":"Bioinformatics"},{"key":"2023012513033517400_B7","doi-asserted-by":"crossref","first-page":"225","DOI":"10.1093\/bib\/bbl004","article-title":"Automated protein function prediction\u2013the genomic challenge","volume":"7","author":"Friedberg","year":"2006","journal-title":"Brief. Bioinform."},{"key":"2023012513033517400_B8","doi-asserted-by":"crossref","first-page":"1443","DOI":"10.1126\/science.1604319","article-title":"Exhaustive matching of the entire protein sequence database","volume":"256","author":"Gonnet","year":"1992","journal-title":"Science"},{"key":"2023012513033517400_B9","doi-asserted-by":"crossref","first-page":"423","DOI":"10.1093\/bioinformatics\/14.5.423","article-title":"Removing near-neighbour redundancy from large protein data sets","volume":"14","author":"Holm","year":"1998","journal-title":"Bioinformatics"},{"key":"2023012513033517400_B10","doi-asserted-by":"crossref","first-page":"2969","DOI":"10.1093\/bioinformatics\/btm422","article-title":"Multiple spaced seeds for homology search","volume":"22","author":"Ilie","year":"2007","journal-title":"Bioinformatics"},{"key":"2023012513033517400_B11","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1186\/1471-2105-13-33","article-title":"BLANNOTATOR: enhanced homology-based function prediction of bacterial proteins","volume":"13","author":"Kankainen","year":"2012","journal-title":"BMC Bioinformatics"},{"key":"2023012513033517400_B12","doi-asserted-by":"crossref","first-page":"995","DOI":"10.1038\/nrm2281","article-title":"Predicting protein function from sequence and structure","volume":"8","author":"Lee","year":"2007","journal-title":"Nat. Rev. Mol. Cell Biol."},{"key":"2023012513033517400_B13","doi-asserted-by":"crossref","first-page":"1658","DOI":"10.1093\/bioinformatics\/btl158","article-title":"Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences","volume":"22","author":"Li","year":"2006","journal-title":"Bioinformatics"},{"key":"2023012513033517400_B14","doi-asserted-by":"crossref","first-page":"440","DOI":"10.1093\/bioinformatics\/18.3.440","article-title":"PatternHunter: faster and more sensitive HomologySearch","volume":"18","author":"Ma","year":"2002","journal-title":"Bioinformatics"},{"key":"2023012513033517400_B15","doi-asserted-by":"crossref","first-page":"302","DOI":"10.1093\/bioinformatics\/btn643","article-title":"All hits all the time: parameter-free calculation of spaced seed sensitivity","volume":"25","author":"Mak","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012513033517400_B16","first-page":"193","volume-title":"Linear Suffix Array Construction by Almost Pure Induced-Sorting","author":"Nong","year":"2009"},{"key":"2023012513033517400_B17","doi-asserted-by":"crossref","first-page":"458","DOI":"10.1093\/bioinformatics\/16.5.458","article-title":"RSDB: representative protein sequence databases have high information content","volume":"16","author":"Park","year":"2000","journal-title":"Bioinformatics"},{"key":"2023012513033517400_B18","doi-asserted-by":"crossref","first-page":"635","DOI":"10.1016\/0888-7543(91)90071-L","article-title":"Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms","volume":"11","author":"Pearson","year":"1991","journal-title":"Genomics"},{"key":"2023012513033517400_B19","doi-asserted-by":"crossref","first-page":"D290","DOI":"10.1093\/nar\/gkr1065","article-title":"The Pfam protein families database","volume":"40","author":"Punta","year":"2012","journal-title":"Nucleic Acids Res."},{"key":"2023012513033517400_B20","doi-asserted-by":"crossref","first-page":"W116","DOI":"10.1093\/nar\/gki442","article-title":"InterProScan: protein domains identifier","volume":"33","author":"Quevillon","year":"2005","journal-title":"Nucleic Acids Res."},{"key":"2023012513033517400_B21","doi-asserted-by":"crossref","first-page":"595","DOI":"10.1016\/S0022-2836(02)00016-5","article-title":"Enzyme function less conserved than anticipated","volume":"318","author":"Rost","year":"2002","journal-title":"J. Mol. Biol."}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/18\/i438\/48884592\/bioinformatics_28_18_i438.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/28\/18\/i438\/48884592\/bioinformatics_28_18_i438.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,25]],"date-time":"2023-01-25T18:53:51Z","timestamp":1674672831000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/28\/18\/i438\/251493"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,9,3]]},"references-count":21,"journal-issue":{"issue":"18","published-print":{"date-parts":[[2012,9,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/bts417","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2012,9,15]]},"published":{"date-parts":[[2012,9,3]]}}}