{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,8,3]],"date-time":"2024-08-03T20:18:33Z","timestamp":1722716313504},"reference-count":18,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2016,2,24]],"date-time":"2016-02-24T00:00:00Z","timestamp":1456272000000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2016,2,24]],"date-time":"2016-02-24T00:00:00Z","timestamp":1456272000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["BMC Bioinformatics"],"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>Low-cost DNA sequencing allows organizations to accumulate massive amounts of genomic data and use that data to answer a diverse range of research questions. Presently, users must search for relevant genomic data using a keyword, accession number of meta-data tag. However, in this search paradigm the form of the query \u2013 a text-based string \u2013 is mismatched with the form of the target \u2013 a genomic profile.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Results<\/jats:title>\n                <jats:p>To improve access to massive genomic data resources, we have developed a fast search engine, GEMINI, that uses a genomic profile as a query to search for similar genomic profiles. GEMINI implements a nearest-neighbor search algorithm using a vantage-point tree to store a database of <jats:italic>n<\/jats:italic> profiles and in certain circumstances achieves an <jats:inline-formula><jats:alternatives><jats:tex-math>$\\mathcal {O}(\\log n)$<\/jats:tex-math><mml:math xmlns:mml=\"http:\/\/www.w3.org\/1998\/Math\/MathML\">\n                        <mml:mi>O<\/mml:mi>\n                        <mml:mo>(<\/mml:mo>\n                        <mml:mo>log<\/mml:mo>\n                        <mml:mi>n<\/mml:mi>\n                        <mml:mo>)<\/mml:mo>\n                      <\/mml:math><\/jats:alternatives><\/jats:inline-formula> expected query time in the limit. We tested GEMINI on breast and ovarian cancer gene expression data from The Cancer Genome Atlas project and show that it achieves a query time that scales as the logarithm of the number of records in practice on genomic data. In a database with 10<jats:sup>5<\/jats:sup> samples, GEMINI identifies the nearest neighbor in 0.05 sec compared to a brute force search time of 0.6 sec.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusions<\/jats:title>\n                <jats:p>GEMINI is a fast search engine that uses a query genomic profile to search for similar profiles in a very large genomic database. It enables users to identify similar profiles independent of sample label, data origin or other meta-data information.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s12859-016-0934-8","type":"journal-article","created":{"date-parts":[[2016,2,24]],"date-time":"2016-02-24T09:18:45Z","timestamp":1456305525000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":13,"title":["GEMINI: a computationally-efficient search engine for large gene expression datasets"],"prefix":"10.1186","volume":"17","author":[{"given":"Timothy","family":"DeFreitas","sequence":"first","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hachem","family":"Saddiki","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Patrick","family":"Flaherty","sequence":"additional","affiliation":[],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"297","published-online":{"date-parts":[[2016,2,24]]},"reference":[{"issue":"suppl 1","key":"934_CR1","doi-asserted-by":"publisher","first-page":"1005","DOI":"10.1093\/nar\/gkq1184","volume":"39","author":"T Barrett","year":"2011","unstructured":"Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, et al.NCBI GEO: archive for functional genomics data sets\u201410 years on. Nucl Acids Res. 2011; 39(suppl 1):1005\u201310.","journal-title":"Nucl Acids Res."},{"issue":"7311","key":"934_CR2","doi-asserted-by":"publisher","first-page":"52","DOI":"10.1038\/nature09298","volume":"467","author":"International HapMap 3 Consortium","year":"2010","unstructured":"International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature. 2010; 467(7311):52\u20138.","journal-title":"Nature"},{"issue":"7418","key":"934_CR3","doi-asserted-by":"publisher","first-page":"61","DOI":"10.1038\/nature11412","volume":"490","author":"TCGA Network","year":"2012","unstructured":"Network TCGA. Comprehensive molecular portraits of human breast tumours. Nature. 2012; 490(7418):61\u201370.","journal-title":"Nature"},{"issue":"2","key":"934_CR4","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1038\/nrg3394","volume":"14","author":"J Rung","year":"2013","unstructured":"Rung J, Brazma A. Reuse of public genome-wide gene expression data. Nat Rev Genet. 2013; 14(2):89\u201399.","journal-title":"Nat Rev Genet"},{"key":"934_CR5","unstructured":"Page L, et al. PageRank: Bringing order to the web. Vol. 72. Stanford Digital Libraries Working Paper. 1997."},{"issue":"10","key":"934_CR6","doi-asserted-by":"publisher","first-page":"925","DOI":"10.1038\/nmeth.2630","volume":"10","author":"GE Zinman","year":"2013","unstructured":"Zinman GE, Naiman S, Kanfi Y, Cohen H, Bar-Joseph Z. ExpressionBlast: mining large, unstructured expression databases. Nat Methods. 2013; 10(10):925\u20136.","journal-title":"Nat Methods"},{"issue":"3","key":"934_CR7","doi-asserted-by":"publisher","first-page":"43211","DOI":"10.1038\/nmeth.3249","volume":"12","author":"Q Zhu","year":"2015","unstructured":"Zhu Q, Wong AK, Krishnan A, Aure MR, Tadych A, Zhang R, et al.Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nat Methods. 2015; 12(3):43211\u20134.","journal-title":"Nat Methods"},{"issue":"1","key":"934_CR8","doi-asserted-by":"publisher","first-page":"548","DOI":"10.1186\/1471-2105-9-548","volume":"9","author":"R Chen","year":"2008","unstructured":"Chen R, Mallelwar R, Thosar A, Venkatasubrahmanyam S, Butte AJ. GeneChaser: identifying all biological and clinical conditions in which genes of interest are differentially expressed. BMC Bioinformatics. 2008; 9(1):548.","journal-title":"BMC Bioinformatics"},{"issue":"1","key":"934_CR9","doi-asserted-by":"publisher","first-page":"603","DOI":"10.1186\/1471-2105-11-603","volume":"11","author":"JM Engreitz","year":"2010","unstructured":"Engreitz JM, Morgan AA, Dudley JT, Chen R, Thathoo R, Altman RB, et al.Content-based microarray search using differential expression profiles. BMC Bioinformatics. 2010; 11(1):603.","journal-title":"BMC Bioinformatics"},{"issue":"1","key":"934_CR10","doi-asserted-by":"publisher","first-page":"14","DOI":"10.1007\/BF00264289","volume":"1","author":"DE Knuth","year":"1971","unstructured":"Knuth DE. Optimum binary search trees. Acta Informatica. 1971; 1(1):14\u201325.","journal-title":"Acta Informatica"},{"issue":"7","key":"934_CR11","doi-asserted-by":"publisher","first-page":"881","DOI":"10.1109\/TPAMI.2002.1017616","volume":"24","author":"T Kanungo","year":"2002","unstructured":"Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY. An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Mach Intell. 2002; 24(7):881\u201392.","journal-title":"IEEE Trans Pattern Anal Mach Intell"},{"issue":"2","key":"934_CR12","doi-asserted-by":"publisher","first-page":"369","DOI":"10.1145\/253262.253347","volume":"26","author":"N Katayama","year":"1997","unstructured":"Katayama N, Satoh S. The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. ACM SIGMOD Record. 1997; 26(2):369\u201380.","journal-title":"ACM SIGMOD Record"},{"issue":"2","key":"934_CR13","first-page":"322","volume":"19","author":"N Beckmann","year":"1990","unstructured":"Beckmann N, Kriegel HP, Schneider R, Seeger B. The R*-tree: An Efficient and Robust Access Method for Points and Rectangles. ACM. 1990; 19(2):322\u201331.","journal-title":"ACM"},{"issue":"194","key":"934_CR14","first-page":"311","volume":"93","author":"PN Yianilos","year":"1993","unstructured":"Yianilos PN. Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces. SODA. 1993; 93(194):311\u201321.","journal-title":"SODA"},{"key":"934_CR15","doi-asserted-by":"crossref","unstructured":"Nielsen F, Piro P, Barlaud M. Bregman Vantage Point Trees for Efficient Nearest Neighbor Queries. ICME. 2009:878\u201381.","DOI":"10.1109\/ICME.2009.5202635"},{"key":"934_CR16","unstructured":"Nguyen H. A python implementation of a vantage point tree. GitHub. 2014. https:\/\/github.com\/huyng\/algorithms\/tree\/master\/vptree."},{"key":"934_CR17","unstructured":"Harrison P. Python VP-tree implementation. 2006. http:\/\/www.logarithmic.net\/pfh\/blog\/01164790008."},{"key":"934_CR18","unstructured":"Archibald A. A python implementation of a KD tree. GitHub. 2008. https:\/\/github.com\/scipy\/scipy\/blob\/master\/scipy\/spatial\/kdtree.py."}],"container-title":["BMC Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-016-0934-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s12859-016-0934-8\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-016-0934-8","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s12859-016-0934-8.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,1]],"date-time":"2024-02-01T18:16:49Z","timestamp":1706811409000},"score":1,"resource":{"primary":{"URL":"https:\/\/bmcbioinformatics.biomedcentral.com\/articles\/10.1186\/s12859-016-0934-8"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2016,2,24]]},"references-count":18,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2016,12]]}},"alternative-id":["934"],"URL":"https:\/\/doi.org\/10.1186\/s12859-016-0934-8","relation":{},"ISSN":["1471-2105"],"issn-type":[{"value":"1471-2105","type":"electronic"}],"subject":[],"published":{"date-parts":[[2016,2,24]]},"assertion":[{"value":"26 March 2015","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 January 2016","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 February 2016","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}],"article-number":"102"}}