{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,4,28]],"date-time":"2024-04-28T17:53:41Z","timestamp":1714326821195},"reference-count":27,"publisher":"Oxford University Press (OUP)","issue":"24","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2011,12,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: Second-generation sequencing technology has reinvigorated research using expression data, and clustering such data remains a significant challenge, with much larger datasets and with different error profiles. Algorithms that rely on all-versus-all comparison of sequences are not practical for large datasets.<\/jats:p>\n               <jats:p>Results: We introduce a new filter for string similarity which has the potential to eliminate the need for all-versus-all comparison in clustering of expression data and other similar tasks. Our filter is based on multiple long exact matches between the two strings, with the additional constraint that these matches must be sufficiently far apart. We give details of its efficient implementation using modified suffix arrays. We demonstrate its efficiency by presenting our new expression clustering tool, wcd-express, which uses this heuristic. We compare it to other current tools and show that it is very competitive both with respect to quality and run time.<\/jats:p>\n               <jats:p>Availability: Source code and binaries available under GPL at http:\/\/code.google.com\/p\/wcdest. Runs on Linux and MacOS X.<\/jats:p>\n               <jats:p>Contact: \u00a0scott.hazelhurst@wits.ac.za; zsuzsa@cebitec.uni-bielefeld.de<\/jats:p>\n               <jats:p>Supplementary Information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btr560","type":"journal-article","created":{"date-parts":[[2011,10,10]],"date-time":"2011-10-10T14:08:29Z","timestamp":1318255709000},"page":"3348-3355","source":"Crossref","is-referenced-by-count":13,"title":["KABOOM! A new suffix array based algorithm for clustering expression data"],"prefix":"10.1093","volume":"27","author":[{"given":"Scott","family":"Hazelhurst","sequence":"first","affiliation":[{"name":"1 Wits Bioinformatics, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa and 2Universit\u00e0 degli Studi di Salerno, Dipartimento di Informatica, Via Ponte don Melillo, 1, 84084 Fisciano, Italy"}]},{"given":"Zsuzsanna","family":"Lipt\u00e1k","sequence":"additional","affiliation":[{"name":"1 Wits Bioinformatics, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa and 2Universit\u00e0 degli Studi di Salerno, Dipartimento di Informatica, Via Ponte don Melillo, 1, 84084 Fisciano, Italy"}]}],"member":"286","published-online":{"date-parts":[[2011,10,8]]},"reference":[{"key":"2023012511314538400_B1","first-page":"77","article-title":"q-gram based database searching using a suffix array (QUASAR)","volume-title":"Proceedings of the Third Annual International Conference on Research in Computational Molecular Biology (RECOMB)","author":"Burkhardt","year":"1999"},{"key":"2023012511314538400_B2","first-page":"1542","article-title":"Algorithms for clustering EST sequences: the wcd tool","volume":"24","author":"Hazelhurst","year":"2008","journal-title":"South African Comput. J."},{"key":"2023012511314538400_B3","volume-title":"ESTsim: a tool for creating benchmarks for EST clustering algorithms.","author":"Hazelhurst","year":"2003"},{"key":"2023012511314538400_B4","doi-asserted-by":"crossref","first-page":"1542","DOI":"10.1093\/bioinformatics\/btn203","article-title":"An overview of the wcd EST clustering tool","volume":"24","author":"Hazelhurst","year":"2008","journal-title":"Bioinformatics"},{"key":"2023012511314538400_B5","doi-asserted-by":"crossref","first-page":"1084","DOI":"10.1093\/bioinformatics\/btp112","article-title":"mkESA: enhanced suffix array construction tool","volume":"25","author":"Homann","year":"2009","journal-title":"Bioinformatics"},{"key":"2023012511314538400_B6","doi-asserted-by":"crossref","first-page":"868","DOI":"10.1101\/gr.9.9.868","article-title":"CAP3: a DNA sequence assembly program","volume":"9","author":"Huang","year":"1999","journal-title":"Genome Res."},{"key":"2023012511314538400_B7","doi-asserted-by":"crossref","first-page":"264","DOI":"10.1145\/331499.331504","article-title":"Data clustering: a review","volume":"31","author":"Jain","year":"1999","journal-title":"ACM Comput. Surv."},{"key":"2023012511314538400_B8","doi-asserted-by":"crossref","DOI":"10.1109\/IPDPS.2002.1016587","article-title":"Parallel EST clustering","volume-title":"Proceedings of IEEE Conference High Performance Computational Biology.","author":"Kalyanaraman","year":"2002"},{"key":"2023012511314538400_B9","first-page":"707","article-title":"Binary codes capable of correcting deletions, insertions, and reversals","volume":"10","author":"Levenshtein","year":"1966","journal-title":"Soviet Phys. Doklady"},{"key":"2023012511314538400_B10","doi-asserted-by":"crossref","first-page":"1221","DOI":"10.1093\/bioinformatics\/btg138","article-title":"Fast sequence clustering using a suffix array algorithm","volume":"19","author":"Malde","year":"2003","journal-title":"Bioinformatics"},{"key":"2023012511314538400_B11","doi-asserted-by":"crossref","first-page":"935","DOI":"10.1137\/0222058","article-title":"Suffix arrays: a new method for on-line string searches","volume":"22","author":"Manber","year":"1993","journal-title":"SIAM J. Comput."},{"key":"2023012511314538400_B12","doi-asserted-by":"crossref","first-page":"33","DOI":"10.1007\/s00453-004-1094-1","article-title":"Engineering a lightweight suffix array construction algorithm","volume":"40","author":"Manzini","year":"2004","journal-title":"Algorithmica"},{"key":"2023012511314538400_B13","doi-asserted-by":"crossref","first-page":"1143","DOI":"10.1101\/gr.9.11.1143","article-title":"A comprehensive approach to clustering of expressed human gene sequence: the sequence tag alignment and consensus knowledge base","volume":"9","author":"Miller","year":"1999","journal-title":"Genome Res."},{"key":"2023012511314538400_B14","doi-asserted-by":"crossref","first-page":"651","DOI":"10.1093\/bioinformatics\/btg034","article-title":"TIGR gene indices clustering tools (TGICL): a software system for fast clustering of large EST datasets","volume":"19","author":"Pertea","year":"2003","journal-title":"Bioinformatics"},{"issue":"Suppl. 6","key":"2023012511314538400_B15","doi-asserted-by":"crossref","first-page":"S10","DOI":"10.1186\/1471-2105-10-S6-S10","article-title":"EasyCluster: a fast and efficient gene-oriented clustering tool for large-scale transcriptome data","volume":"10","author":"Picardi","year":"2009","journal-title":"BMC Bioinformatics"},{"key":"2023012511314538400_B16","doi-asserted-by":"crossref","first-page":"142","DOI":"10.1016\/j.tig.2007.12.006","article-title":"Bioinformatics challenges of new sequencing technology","volume":"24","author":"Pop","year":"2008","journal-title":"Trends Genetics"},{"key":"2023012511314538400_B17","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/1242471.1242472","article-title":"A taxonomy of suffix array construction algorithms","volume":"39","author":"Puglisi","year":"2007","journal-title":"ACM Comput. Surv."},{"key":"2023012511314538400_B18","doi-asserted-by":"crossref","first-page":"375","DOI":"10.1007\/3-540-45123-4_31","article-title":"Exact and efficient computation of the expected number of missing and common words in random texts","volume-title":"Proceedings of the 11th Annual Symposium Combinatorial Pattern Matching (CPM 2000)","author":"Rahmann","year":"2000"},{"key":"2023012511314538400_B19","doi-asserted-by":"crossref","first-page":"W737","DOI":"10.1093\/nar\/gkq470","article-title":"PEACE: parallel environment for assembly and clustering of gene expression","volume":"38","author":"Rao","year":"2010","journal-title":"Nucleic Acids Res."},{"key":"2023012511314538400_B20","doi-asserted-by":"crossref","first-page":"1615","DOI":"10.1089\/cmb.2009.0198","article-title":"Alignment-free sequence comparison (I): Statistics and power","volume":"16","author":"Reinert","year":"2009","journal-title":"J. Comput. Biol."},{"key":"2023012511314538400_B21","doi-asserted-by":"crossref","first-page":"e3373","DOI":"10.1371\/journal.pone.0003373","article-title":"MetaSim \u2013 a sequencing simulator for genomics and metagenomics","volume":"3","author":"Richter","year":"2008","journal-title":"PLoS One"},{"key":"2023012511314538400_B22","doi-asserted-by":"crossref","first-page":"455","DOI":"10.1093\/bib\/bbq066","article-title":"Editorial: next generation sequencing","volume":"11","author":"Robison","year":"2010","journal-title":"Brief. Bioinformatics"},{"key":"2023012511314538400_B23","doi-asserted-by":"crossref","first-page":"3","DOI":"10.1007\/s11390-010-9300-x","article-title":"New generations: Sequencing machines and their computational challenges","volume":"25","author":"Schwartz","year":"2009","journal-title":"J. Comput. Sci. Technol."},{"key":"2023012511314538400_B24","article-title":"Algorithms for the Analysis of Expressed Sequence Tags","volume-title":"PhD Thesis","author":"Slater","year":"2000"},{"key":"2023012511314538400_B25","first-page":"109","article-title":"Computation of d2: a measure of sequence dissimilarity","volume-title":"Computers and DNA.","author":"Torney","year":"1990"},{"key":"2023012511314538400_B26","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1016\/0304-3975(92)90143-4","article-title":"Approximate string-matching with q-grams and maximal matches","volume":"92","author":"Ukkonen","year":"1992","journal-title":"Theor. Comput. Sci."},{"key":"2023012511314538400_B27","first-page":"301","article-title":"A method for evaluating the quality of string dissimilarity measures and clustering algorithms for EST clustering","volume-title":"Proceedings of the 4th IEEE International Symposium BioInformatics and BioEngineering (BIBE 2004).","author":"Zimmermann","year":"2004"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/27\/24\/3348\/48863076\/bioinformatics_27_24_3348.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/27\/24\/3348\/48863076\/bioinformatics_27_24_3348.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,25]],"date-time":"2023-01-25T11:32:42Z","timestamp":1674646362000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/27\/24\/3348\/304992"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2011,10,8]]},"references-count":27,"journal-issue":{"issue":"24","published-print":{"date-parts":[[2011,12,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btr560","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2011,12,15]]},"published":{"date-parts":[[2011,10,8]]}}}