{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,2]],"date-time":"2026-04-02T08:01:43Z","timestamp":1775116903797,"version":"3.50.1"},"reference-count":36,"publisher":"Wiley","issue":"1","license":[{"start":{"date-parts":[[2008,11,16]],"date-time":"2008-11-16T00:00:00Z","timestamp":1226793600000},"content-version":"vor","delay-in-days":320,"URL":"http:\/\/creativecommons.org\/licenses\/by\/3.0\/"}],"content-domain":{"domain":["onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["Advances in Bioinformatics"],"published-print":{"date-parts":[[2008,1]]},"abstract":"<jats:p>A vast amount of microbial sequencing data is being generated through large\u2010scale projects in ecology, agriculture, and human health. Efficient high\u2010throughput methods are needed to analyze the mass amounts of metagenomic data, all DNA present in an environmental sample. A major obstacle in metagenomics is the inability to obtain accuracy using technology that yields short reads. We construct the unique <jats:italic>N<\/jats:italic>\u2010mer frequency profiles of 635 microbial genomes publicly available as of February 2008. These profiles are used to train a naive Bayes classifier (NBC) that can be used to identify the genome of any fragment. We show that our method is comparable to BLAST for small 25 bp fragments but does not have the ambiguity of BLAST\u2032s tied top scores. We demonstrate that this approach is scalable to identify any fragment from hundreds of genomes. It also performs quite well at the strain, species, and genera levels and achieves strain resolution despite classifying ubiquitous genomic fragments (gene and nongene regions). Cross\u2010validation analysis demonstrates that species\u2010accuracy achieves 90% for highly\u2010represented species containing an average of 8 strains. 
We demonstrate that such a tool can be used on the Sargasso Sea dataset, and our analysis shows that NBC can be further enhanced.<\/jats:p>","DOI":"10.1155\/2008\/205969","type":"journal-article","created":{"date-parts":[[2008,11,17]],"date-time":"2008-11-17T15:03:18Z","timestamp":1226934198000},"update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":80,"title":["Metagenome Fragment Classification Using <i>N<\/i>\u2010Mer Frequency Profiles"],"prefix":"10.1155","volume":"2008","author":[{"given":"Gail","family":"Rosen","sequence":"first","affiliation":[]},{"given":"Elaine","family":"Garbarine","sequence":"additional","affiliation":[]},{"given":"Diamantino","family":"Caseiro","sequence":"additional","affiliation":[]},{"given":"Robi","family":"Polikar","sequence":"additional","affiliation":[]},{"given":"Bahrad","family":"Sokhansanj","sequence":"additional","affiliation":[]}],"member":"311","published-online":{"date-parts":[[2008,11,16]]},"reference":[{"key":"e_1_2_8_1_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0022-5193(03)00082-1"},{"key":"e_1_2_8_2_2","doi-asserted-by":"crossref","unstructured":"YeoG.andgeneyeo@mit.edu BurgeC. B. 
cburge@mit.edu Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals Proceedings of the 7th Annual International Conference on Computational Molecular Biology (RECOMB \u203203) April 2003 Berlin Germany 322\u2013331 https:\/\/doi.org\/10.1145\/640075.640118.","DOI":"10.1145\/640075.640118"},{"key":"e_1_2_8_3_2","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btm484"},{"key":"e_1_2_8_4_2","doi-asserted-by":"publisher","DOI":"10.1006\/tpbi.2002.1589"},{"key":"e_1_2_8_5_2","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.162359199"},{"key":"e_1_2_8_6_2","doi-asserted-by":"publisher","DOI":"10.1101\/gr.5969107"},{"key":"e_1_2_8_7_2","doi-asserted-by":"publisher","DOI":"10.1128\/AEM.02181-07"},{"key":"e_1_2_8_8_2","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkn496"},{"key":"e_1_2_8_9_2","doi-asserted-by":"publisher","DOI":"10.1006\/jmbi.1990.9999"},{"key":"e_1_2_8_10_2","doi-asserted-by":"publisher","DOI":"10.1128\/AEM.00062-07"},{"key":"e_1_2_8_11_2","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkn038"},{"key":"e_1_2_8_12_2","doi-asserted-by":"crossref","first-page":"443","DOI":"10.1016\/0022-2836(70)90057-4","article-title":"A general method applicable to the search for similarities in the amino acid sequence of two proteins","volume":"48","author":"Needleman S. B.","year":"1970","journal-title":"Journal of Molecular Biology"},{"key":"e_1_2_8_13_2","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1016\/0022-2836(81)90087-5","article-title":"Identification of common molecular subsequences","volume":"147","author":"Smith T. 
F.","year":"1981","journal-title":"Journal of Molecular Biology"},{"key":"e_1_2_8_14_2","doi-asserted-by":"publisher","DOI":"10.1016\/0378-1119(88)90330-7"},{"key":"e_1_2_8_15_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.tim.2007.01.007"},{"key":"e_1_2_8_16_2","doi-asserted-by":"publisher","DOI":"10.1146\/annurev.micro.55.1.709"},{"key":"e_1_2_8_17_2","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkl069"},{"key":"e_1_2_8_18_2","doi-asserted-by":"publisher","DOI":"10.1128\/AEM.69.8.4927-4934.2003"},{"key":"e_1_2_8_19_2","doi-asserted-by":"publisher","DOI":"10.1111\/j.1462-2920.2004.00575.x"},{"key":"e_1_2_8_20_2","doi-asserted-by":"publisher","DOI":"10.1101\/gr.306102"},{"key":"e_1_2_8_21_2","doi-asserted-by":"publisher","DOI":"10.1186\/1471-2105-5-163"},{"key":"e_1_2_8_22_2","doi-asserted-by":"publisher","DOI":"10.1101\/gr.335003"},{"key":"e_1_2_8_23_2","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gki489"},{"key":"e_1_2_8_24_2","doi-asserted-by":"crossref","unstructured":"GanapathirajuM. Klein-SeetharamanJ. RosenfeldR.et al. Comparative n-gram analysis of whole-genome sequences Proceedings of the Human Language Technologies Conference (HLT \u203202) March 2002 San Diego Calif USA.","DOI":"10.3115\/1289189.1289259"},{"key":"e_1_2_8_25_2","doi-asserted-by":"crossref","unstructured":"ApostolicoA. BockM. E. andLonardiS. Monotony of surprise and large-scale quest for unusual words Proceedings of the 6th Annual International Conference on Computational Molecular Biology (RECOMB \u203202) April 2002 Washington DC USA 22\u201331 https:\/\/doi.org\/10.1145\/565196.565200.","DOI":"10.1145\/565196.565200"},{"key":"e_1_2_8_26_2","doi-asserted-by":"publisher","DOI":"10.1038\/nmeth976"},{"key":"e_1_2_8_27_2","unstructured":"RishI. 
An empirical study of the naive bayes classifier Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI \u203201) August 2001 Seattle Wash USA 41\u201346."},{"key":"e_1_2_8_28_2","volume-title":"Human Behavior and the Principle of Least-Effort","author":"Zipf G. K.","year":"1949"},{"key":"e_1_2_8_29_2","doi-asserted-by":"crossref","unstructured":"HampikianG.andAndersenT. Absent sequences: nullomers and primes 12 Proceedings of the Pacific Symposium on Biocomputing January 2007 Boise Idaho USA 355\u2013366 https:\/\/doi.org\/10.1142\/9789812772435_0034.","DOI":"10.1142\/9789812772435_0034"},{"key":"e_1_2_8_30_2","unstructured":"FofanovV. Y. PutontiC. ChumakovS. PettittB. M. andFofanovY. Fast algorithm for the analysis of the presence of short oligonucleotide sequences in genomic sequences May2005 no. #UH-CS-05-11 University of Houston Houston Tex USA."},{"key":"e_1_2_8_31_2","doi-asserted-by":"publisher","DOI":"10.1101\/gr.186401"},{"key":"e_1_2_8_32_2","volume-title":"Data Mining: Practical Machine Learning Tools and Techniques","author":"Witten I. 
H.","year":"2005"},{"key":"e_1_2_8_33_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.1093857"},{"key":"e_1_2_8_34_2","doi-asserted-by":"publisher","DOI":"10.1126\/science.1114057"},{"key":"e_1_2_8_35_2","doi-asserted-by":"publisher","DOI":"10.1038\/nature04203"},{"key":"e_1_2_8_36_2","doi-asserted-by":"publisher","DOI":"10.1128\/AEM.00599-08"}],"container-title":["Advances in Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/downloads.hindawi.com\/archive\/2008\/205969.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/downloads.hindawi.com\/archive\/2008\/205969.xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1155\/2008\/205969","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,7,1]],"date-time":"2024-07-01T13:46:08Z","timestamp":1719841568000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1155\/2008\/205969"}},"subtitle":[],"editor":[{"given":"Rita","family":"Casadio","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2008,1]]},"references-count":36,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2008,1]]}},"alternative-id":["10.1155\/2008\/205969"],"URL":"https:\/\/doi.org\/10.1155\/2008\/205969","archive":["Portico"],"relation":{},"ISSN":["1687-8027","1687-8035"],"issn-type":[{"value":"1687-8027","type":"print"},{"value":"1687-8035","type":"electronic"}],"subject":[],"published":{"date-parts":[[2008,1]]},"assertion":[{"value":"2008-06-05","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2008-09-30","order":1,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2008-11-16","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"205969"}}