{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,13]],"date-time":"2026-05-13T03:13:47Z","timestamp":1778642027963,"version":"3.51.4"},"reference-count":8,"publisher":"Springer Science and Business Media LLC","issue":"S2","license":[{"start":{"date-parts":[[2012,3,13]],"date-time":"2012-03-13T00:00:00Z","timestamp":1331596800000},"content-version":"tdm","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/2.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"published-print":{"date-parts":[[2012,12]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:sec>\n            <jats:title>Background<\/jats:title>\n            <jats:p>Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as <jats:italic>16S rRNA<\/jats:italic>, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of identifying potential gene clusters and families, but such analysis represents a daunting computational task. The aim of this work is the development of an efficient pipeline for the clustering of large sequence read sets.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Methods<\/jats:title>\n            <jats:p>Pairwise alignment techniques are used here to calculate genetic distances between sequence pairs. These methods are pleasingly parallel and have been shown to more accurately reflect accurate genetic distances in highly variable regions of <jats:italic>rRNA<\/jats:italic> genes than do traditional multiple sequence alignment (MSA) approaches. By utilizing Needleman-Wunsch (NW) pairwise alignment in conjunction with novel implementations of interpolative multidimensional scaling (MDS), we have developed an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Results<\/jats:title>\n            <jats:p>This study demonstrates the use of interpolative MDS to obtain clustering results that are qualitatively similar to those obtained through full MDS, but with substantial cost savings. In particular, the wall clock time required to cluster a set of 100,000 sequences has been reduced from seven hours to less than one hour through the use of interpolative MDS.<\/jats:p>\n          <\/jats:sec>\n          <jats:sec>\n            <jats:title>Conclusions<\/jats:title>\n            <jats:p>Although work remains to be done in selecting the optimal training set size for interpolative MDS, substantial computational cost savings will allow us to cluster much larger sequence sets in the future.<\/jats:p>\n          <\/jats:sec>","DOI":"10.1186\/1471-2105-13-s2-s9","type":"journal-article","created":{"date-parts":[[2019,12,11]],"date-time":"2019-12-11T01:59:24Z","timestamp":1576029564000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":6,"title":["Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets"],"prefix":"10.1186","volume":"13","author":[{"given":"Adam","family":"Hughes","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Yang","family":"Ruan","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Saliya","family":"Ekanayake","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Seung-Hee","family":"Bae","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Qunfeng","family":"Dong","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mina","family":"Rho","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Judy","family":"Qiu","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Geoffrey","family":"Fox","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2012,3,13]]},"reference":[{"issue":"10","key":"5069_CR1","doi-asserted-by":"publisher","first-page":"e76","DOI":"10.1093\/nar\/gkp285","volume":"37","author":"Y Sun","year":"2009","unstructured":"Sun Y, Cai Y, Liu L, Farrell ML, McKendree W, Farmerie W: ESPRIT: estimating species richness using large collections of 16S rRNA pyrosequences. Nucl Acids Res. 2009, 37 (10): e76-10.1093\/nar\/gkp285.","journal-title":"Nucl Acids Res"},{"key":"5069_CR2","doi-asserted-by":"publisher","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","volume":"48","author":"SB Needleman","year":"1970","unstructured":"Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48: 443-453. 10.1016\/0022-2836(70)90057-4.","journal-title":"J Mol Biol"},{"key":"5069_CR3","volume-title":"Invited talk at the 2008 High Performance Computing & Simulation Conference (HPCS 2008) In Conjunction With The 22nd European Conference on Modelling and Simulation (ECMS 2008): 3-6 June 2008; Nicosia, Cyprus","author":"X Qiu","year":"2008","unstructured":"Qiu X, Fox GC, Yuan H, Bae S, Chrysanthakopoulos G, Nielsen HF: Parallel clustering and dimensional scaling on multicore systems. Invited talk at the 2008 High Performance Computing & Simulation Conference (HPCS 2008) In Conjunction With The 22nd European Conference on Modelling and Simulation (ECMS 2008): 3-6 June 2008; Nicosia, Cyprus. 2008, High Performance Computing & Simulation Conference (HPCS 2008) In Conjunction With The 22nd European Conference on Modelling and Simulation (ECMS 2008): 3-6 June 2008; Nicosia, Cyprus"},{"key":"5069_CR4","volume-title":"Proceedings of ACM HPDC","author":"S Bae","year":"2010","unstructured":"Bae S, Choi J, Qiu J, Fox G: Dimension reduction and visualization of large high-dimensional data via interpolation. Proceedings of ACM HPDC. 2010, conference: 20-25 June 2010; Chicago"},{"key":"5069_CR5","volume-title":"Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC","author":"J Ekanayake","year":"2010","unstructured":"Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S, Qiu J, Fox G: Twister: a runtime for iterative mapReduce. Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC. 2010, conference: 20-25 June 2010; Chicago"},{"key":"5069_CR6","volume-title":"PhD thesis","author":"J Ekanayake","year":"2010","unstructured":"Ekanayake J: Architecture and performance of runtime environments for data intensive scalable computing. PhD thesis. 2010, Indiana University, School of Informatics and Computer Science"},{"key":"5069_CR7","unstructured":"Neethu S, Tang H, Doak T, Ye Y: Comparing bacterial communities inferred from 16S rRNA gene sequencing and shotgun metagenomics. Pac Symp Biocomput. 2011: 165-76."},{"key":"5069_CR8","first-page":"153","volume":"2010","author":"Y Ye","year":"2010","unstructured":"Ye Y: Identification and quantification of abundant species from pyrosequences of 16S rRNA by consensus alignment. Proceedings of BIBM. 2010, 2010: 153-157. : 18-21 December 2010; Hong Kong","journal-title":"Proceedings of BIBM"}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-13-S2-S9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/article\/10.1186\/1471-2105-13-S2-S9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/1471-2105-13-S2-S9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,1]],"date-time":"2021-09-01T18:30:23Z","timestamp":1630521023000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/1471-2105-13-S2-S9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2012,3,13]]},"references-count":8,"journal-issue":{"issue":"S2","published-print":{"date-parts":[[2012,12]]}},"alternative-id":["5069"],"URL":"https:\/\/doi.org\/10.1186\/1471-2105-13-s2-s9","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2012,3,13]]},"assertion":[{"value":"13 March 2012","order":1,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"S9"}}