{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,1,31]],"date-time":"2023-01-31T22:10:42Z","timestamp":1675203042631},"reference-count":25,"publisher":"Oxford University Press (OUP)","issue":"18","license":[{"start":{"date-parts":[[2016,10,2]],"date-time":"2016-10-02T00:00:00Z","timestamp":1475366400000},"content-version":"vor","delay-in-days":2650,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/2.0\/uk\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2009,9,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerable to the effects of chimeras. To avoid chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence\u2013similarity thresholds. Such thresholds reduce the sensitivity of the clustering algorithm.<\/jats:p>\n               <jats:p>Results: We introduce the concept of k-link clustering for EST data. We evaluate how clustering error rates vary over a range of linkage thresholds. Using k-link, we show that Type II error decreases in response to increasing the number of shared ESTs (i.e. links) required. We observe a base level of Type II error likely caused by the presence of unmasked low-complexity or repetitive sequence. We find that Type I error increases gradually with increased linkage. 
To minimize the Type I error introduced by increased linkage requirements, we propose an extension to k-link which modifies the required number of links with respect to the size of clusters being compared.<\/jats:p>\n               <jats:p>Availability: k-link is written in C++ and is available under the terms of the GNU General Public License (GPL) from http:\/\/www.bioinformatics.csiro.au\/products.shtml.<\/jats:p>\n               <jats:p>Contact: \u00a0lauren.bragg@csiro.au<\/jats:p>\n               <jats:p>Supplementary information: \u00a0Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btp410","type":"journal-article","created":{"date-parts":[[2009,7,2]],"date-time":"2009-07-02T00:40:46Z","timestamp":1246495246000},"page":"2302-2308","source":"Crossref","is-referenced-by-count":4,"title":["<i>k<\/i>-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage"],"prefix":"10.1093","volume":"25","author":[{"given":"Lauren M.","family":"Bragg","sequence":"first","affiliation":[{"name":"1 CSIRO Mathematical and Information Sciences, North Ryde, NSW 2113 and 2Preventative Health National Research Flagship, Locked Bag 17, North Ryde, NSW 1670, Australia"}]},{"given":"Glenn","family":"Stone","sequence":"additional","affiliation":[{"name":"1 CSIRO Mathematical and Information Sciences, North Ryde, NSW 2113 and 2Preventative Health National Research Flagship, Locked Bag 17, North Ryde, NSW 1670, 
Australia"}]}],"member":"286","published-online":{"date-parts":[[2009,7,1]]},"reference":[{"key":"2023013112120032400_B1","doi-asserted-by":"crossref","first-page":"421","DOI":"10.1093\/bioinformatics\/btf881","article-title":"Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP","volume":"19","author":"Barker","year":"2003","journal-title":"Bioinformatics"},{"key":"2023013112120032400_B2","doi-asserted-by":"crossref","first-page":"369","DOI":"10.1038\/ng0895-369","article-title":"Establishing a human transcript map","volume":"10","author":"Boguski","year":"1995","journal-title":"Nat. Genet."},{"key":"2023013112120032400_B3","doi-asserted-by":"crossref","first-page":"1135","DOI":"10.1101\/gr.9.11.1135","article-title":"d2_cluster: a validated method for clustering EST and full-length cDNA sequences","volume":"9","author":"Burke","year":"1999","journal-title":"Genome Res."},{"key":"2023013112120032400_B4","doi-asserted-by":"crossref","first-page":"1542","DOI":"10.1093\/bioinformatics\/btn203","article-title":"An overview of the wcd EST clustering tool","volume":"24","author":"Hazelhurst","year":"2008","journal-title":"Bioinformatics"},{"key":"2023013112120032400_B5","doi-asserted-by":"crossref","first-page":"199","DOI":"10.1089\/cmb.1994.1.199","article-title":"Biological evaluation of d2, an algorithm for high-performance sequence comparison","volume":"1","author":"Hide","year":"1994","journal-title":"J. Comput. 
Biol."},{"key":"2023013112120032400_B6","doi-asserted-by":"crossref","first-page":"807","DOI":"10.1101\/gr.6.9.807","article-title":"Generation and analysis of 280 000 human expressed sequence tag","volume":"6","author":"Hillier","year":"1996","journal-title":"Genome Res."},{"key":"2023013112120032400_B7","doi-asserted-by":"crossref","first-page":"868","DOI":"10.1101\/gr.9.9.868","article-title":"CAP3: a DNA sequence assembly program","volume":"9","author":"Huang","year":"1999","journal-title":"Genome Res."},{"key":"2023013112120032400_B8","first-page":"547","article-title":"\u00c9tude comparative de la distribution florale dans une portion des Alpes et des Jura","volume":"37","author":"Jaccard","year":"1901","journal-title":"Bull. Soc. Vaudoise Sci. Nat"},{"key":"2023013112120032400_B9","first-page":"656","article-title":"BLAT - the BLAST-like alignment tool","volume":"12","author":"Kent","year":"2002","journal-title":"Genome Res."},{"key":"2023013112120032400_B10","doi-asserted-by":"crossref","first-page":"996","DOI":"10.1101\/gr.229102","article-title":"The human genome browser at UCSC","volume":"12","author":"Kent","year":"2002","journal-title":"Genome Res."},{"key":"2023013112120032400_B11","doi-asserted-by":"crossref","first-page":"566","DOI":"10.1101\/gr.3030405","article-title":"ECgene: genome-based EST clustering and gene modeling for alternative splicing","volume":"15","author":"Kim","year":"2005","journal-title":"Genome Res."},{"key":"2023013112120032400_B12","doi-asserted-by":"crossref","first-page":"425","DOI":"10.1007\/BF02293606","article-title":"Merging groups to maximize object partition comparison","volume":"45","author":"Klastorin","year":"1980","journal-title":"Psychometrika"},{"key":"2023013112120032400_B13","volume-title":"BLAST: An Essential Guide to the Basic Alignment Search 
Tool","author":"Korf","year":"2003"},{"key":"2023013112120032400_B14","doi-asserted-by":"crossref","first-page":"1123","DOI":"10.1214\/aos\/1176345593","article-title":"A representation for multinomial cumulative distribution functions","volume":"9","author":"Levin","year":"1981","journal-title":"Ann. Stat."},{"key":"2023013112120032400_B15","doi-asserted-by":"crossref","first-page":"2232","DOI":"10.1093\/bioinformatics\/btl368","article-title":"RBR: library-less repeat detection for ESTs","volume":"22","author":"Malde","year":"2006","journal-title":"Bioinformatics"},{"key":"2023013112120032400_B16","doi-asserted-by":"crossref","first-page":"1471","DOI":"10.1186\/1471-2164-9-23","article-title":"Repeats and EST analysis for new organisms","volume":"9","author":"Malde","year":"2008","journal-title":"BMC Genomics"},{"key":"2023013112120032400_B17","doi-asserted-by":"crossref","first-page":"873","DOI":"10.1016\/j.jmva.2006.11.013","article-title":"Comparing clusterings an information based distance","volume":"98","author":"Meila","year":"2007","journal-title":"J. Multivar. Anal."},{"key":"2023013112120032400_B18","doi-asserted-by":"crossref","first-page":"551","DOI":"10.1038\/nature07723","article-title":"The Sorghum bicolor genome and the diversification of grasses","volume":"457","author":"Patterson","year":"2009","journal-title":"Nature"},{"key":"2023013112120032400_B19","doi-asserted-by":"crossref","first-page":"31","DOI":"10.1186\/1471-2105-3-31","article-title":"Making sense of EST sequences by CLOBBing them","volume":"3","author":"Parkinson","year":"2002","journal-title":"BMC Bioinformatics"},{"key":"2023013112120032400_B20","doi-asserted-by":"crossref","first-page":"651","DOI":"10.1093\/bioinformatics\/btg034","article-title":"TIGR gene indices clustering tools (TGICL): a software system for fast clustering of large EST datasets","volume":"19","author":"Pertea","year":"2003","journal-title":"Bioinformatics"},{"issue":"Suppl. 
2","key":"2023013112120032400_B21","doi-asserted-by":"crossref","first-page":"S3","DOI":"10.1186\/1471-2105-6-S2-S3","article-title":"CLU: a new algorithm for EST clustering","volume":"6","author":"Ptitsyn","year":"2005","journal-title":"BMC Bioinformatics"},{"key":"2023013112120032400_B22","doi-asserted-by":"crossref","first-page":"159","DOI":"10.1093\/nar\/29.1.159","article-title":"The TIGR gene indices: analysis of gene transcript sequences in highly sampled eukaryotic species","volume":"29","author":"Quackenbush","year":"2001","journal-title":"Nucleic Acids Res."},{"key":"2023013112120032400_B23","doi-asserted-by":"crossref","first-page":"1067","DOI":"10.1093\/nar\/gkg170","article-title":"A novel algorithm for computational identification of contaminated EST libraries","volume":"31","author":"Sorek","year":"2003","journal-title":"Nucleic Acids Res."},{"issue":"Article 438","key":"2023013112120032400_B24","article-title":"QualitySNP: a pipeline for detecting Single nucleotide polymorphisms and insertions\/deletions in EST data from diploids and polyploidy species","volume":"7","author":"Tang","year":"2006","journal-title":"BMC Bioinformatics"},{"key":"2023013112120032400_B25","doi-asserted-by":"crossref","first-page":"2973","DOI":"10.1093\/bioinformatics\/bth342","article-title":"EST clustering error evaluation and 
correction","volume":"20","author":"Wang","year":"2004","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/25\/18\/2302\/48994262\/bioinformatics_25_18_2302.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/25\/18\/2302\/48994262\/bioinformatics_25_18_2302.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,31]],"date-time":"2023-01-31T21:35:28Z","timestamp":1675200928000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/25\/18\/2302\/196737"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2009,7,1]]},"references-count":25,"journal-issue":{"issue":"18","published-print":{"date-parts":[[2009,9,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btp410","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2009,9,15]]},"published":{"date-parts":[[2009,7,1]]}}}