{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,20]],"date-time":"2026-01-20T03:57:04Z","timestamp":1768881424286,"version":"3.49.0"},"reference-count":91,"publisher":"World Scientific Pub Co Pte Lt","issue":"01","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J. Bioinform. Comput. Biol."],"published-print":{"date-parts":[[2003,4]]},"abstract":"<jats:p> We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We also give relevant techniques for each of these three steps. For generating candidate features, we present various types of features based on the idea of k-grams. For selecting relevant features, we discuss signal-to-noise, t-statistics, and entropy measures, as well as a correlation-based feature selection method. For integrating selected features, we use machine learning methods, including C4.5, SVM, and Naive Bayes. We illustrate this methodology on the problem of recognizing translation initiation sites. We discuss how to generate and select features that are useful for understanding the distinction between ATG sites that are translation initiation sites and those that are not. We also discuss how to use such features to build reliable systems for recognizing translation initiation sites in DNA sequences. <\/jats:p>","DOI":"10.1142\/s0219720003000216","type":"journal-article","created":{"date-parts":[[2003,4,28]],"date-time":"2003-04-28T10:49:23Z","timestamp":1051526963000},"page":"139-167","source":"Crossref","is-referenced-by-count":61,"title":["DATA MINING TOOLS FOR BIOLOGICAL SEQUENCES"],"prefix":"10.1142","volume":"01","author":[{"given":"HUIQING","family":"LIU","sequence":"first","affiliation":[{"name":"Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"LIMSOON","family":"WONG","sequence":"additional","affiliation":[{"name":"Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"219","published-online":{"date-parts":[[2012,1,25]]},"reference":[{"key":"rf1","first-page":"2","volume":"6","author":"Agarwal P.","journal-title":"Intelligent Systems for Molecular Biology"},{"key":"rf2","doi-asserted-by":"publisher","DOI":"10.1002\/prot.340120410"},{"key":"rf3","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/18.1.198"},{"key":"rf4","first-page":"389","volume":"26","author":"Bajic V. B.","journal-title":"Informatica"},{"key":"rf5","first-page":"15","author":"Baldi P.","journal-title":"Theoretical and Computational Methods in Genome Research"},{"key":"rf6","doi-asserted-by":"publisher","DOI":"10.1089\/cmb.1994.1.311"},{"key":"rf7","volume-title":"Bioinformatics: The Machine Learning Approach","author":"Baldi P.","year":"1999"},{"key":"rf8","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/17.6.509"},{"key":"rf9","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/27.1.260"},{"key":"rf10","doi-asserted-by":"publisher","DOI":"10.1038\/ng0893-332"},{"key":"rf11","doi-asserted-by":"publisher","DOI":"10.1126\/science.8091218"},{"key":"rf12","doi-asserted-by":"publisher","DOI":"10.1016\/0097-8485(93)85004-V"},{"key":"rf13","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/23.17.3554"},{"key":"rf14","volume-title":"Classification and Regression Trees","author":"Breiman L.","year":"1984"},{"key":"rf15","volume-title":"Genetics: Analysis and Principles","author":"Brooker R. J.","year":"1999"},{"key":"rf16","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.97.1.262"},{"key":"rf17","doi-asserted-by":"publisher","DOI":"10.1021\/tx9900627"},{"key":"rf18","doi-asserted-by":"publisher","DOI":"10.1023\/A:1009715923555"},{"key":"rf19","doi-asserted-by":"publisher","DOI":"10.1142\/p196"},{"key":"rf20","volume-title":"Backpropagation: Theory, Architectures, and Applications","author":"Chauvin Y.","year":"1995"},{"key":"rf21","doi-asserted-by":"publisher","DOI":"10.1016\/S0959-440X(97)80057-7"},{"key":"rf22","first-page":"436","volume":"68","author":"Demsar J.","journal-title":"Studies Health Technology and Informatics"},{"key":"rf24","volume-title":"Pattern Classification and Scene Analysis","author":"Duda R.","year":"1973"},{"key":"rf25","doi-asserted-by":"publisher","DOI":"10.1198\/016214502753479248"},{"key":"rf26","doi-asserted-by":"publisher","DOI":"10.1017\/CBO9780511790492"},{"key":"rf27","doi-asserted-by":"publisher","DOI":"10.1016\/S0959-440X(96)80056-X"},{"key":"rf28","doi-asserted-by":"publisher","DOI":"10.1110\/ps.8.5.978"},{"key":"rf30","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/20.24.6441"},{"key":"rf31","doi-asserted-by":"publisher","DOI":"10.1006\/jmbi.1996.0874"},{"key":"rf33","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/16.10.906"},{"key":"rf34","doi-asserted-by":"publisher","DOI":"10.1016\/0022-2836(87)90689-9"},{"key":"rf36","doi-asserted-by":"publisher","DOI":"10.1002\/1098-2272(200009)19:2<97::AID-GEPI1>3.0.CO;2-9"},{"key":"rf37","doi-asserted-by":"publisher","DOI":"10.1126\/science.286.5439.531"},{"key":"rf38","doi-asserted-by":"publisher","DOI":"10.1146\/annurev.cellbio.14.1.399"},{"key":"rf40","volume-title":"Data Mining: Concepts and Techniques","author":"Han J.","year":"2000"},{"key":"rf41","doi-asserted-by":"publisher","DOI":"10.1007\/978-0-387-21606-5"},{"key":"rf42","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/18.2.343"},{"key":"rf43","unstructured":"D.\u00a0Heckerman, Advances in Knowledge Discovery and Data Mining (MIT Press, Cambridge, MA, 1996)\u00a0pp. 273\u2013305."},{"key":"rf44","doi-asserted-by":"publisher","DOI":"10.1107\/S0907444900004261"},{"key":"rf45","doi-asserted-by":"publisher","DOI":"10.1038\/nbt1098-966"},{"key":"rf46","doi-asserted-by":"publisher","DOI":"10.1067\/mcp.2000.106827"},{"key":"rf47","doi-asserted-by":"publisher","DOI":"10.1089\/10665270050081405"},{"key":"rf48","volume-title":"An Introduction to Bayesian Networks","author":"Jensen F. V.","year":"1996"},{"key":"rf50","doi-asserted-by":"publisher","DOI":"10.1023\/A:1005816823636"},{"key":"rf51","doi-asserted-by":"publisher","DOI":"10.2307\/2986296"},{"key":"rf52","first-page":"335","author":"Kasif S.","journal-title":"Computational Methods in Molecular Biology"},{"key":"rf53","doi-asserted-by":"publisher","DOI":"10.1083\/jcb.115.4.887"},{"key":"rf54","doi-asserted-by":"publisher","DOI":"10.1016\/S0378-1119(99)00210-3"},{"key":"rf55","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/15.20.8125"},{"key":"rf56","first-page":"45","author":"Krogh A.","journal-title":"Computational Methods in Molecular Biology"},{"key":"rf57","doi-asserted-by":"publisher","DOI":"10.1006\/jmbi.1994.1104"},{"key":"rf58","doi-asserted-by":"publisher","DOI":"10.1016\/S0933-3657(96)00351-X"},{"key":"rf60","doi-asserted-by":"publisher","DOI":"10.1016\/0167-9473(93)E0056-A"},{"key":"rf61","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/19.1.71"},{"key":"rf66","first-page":"815","volume":"7","author":"Loh W. Y.","journal-title":"Statistica Sinica"},{"key":"rf67","doi-asserted-by":"publisher","DOI":"10.1080\/01621459.1988.10478652"},{"key":"rf68","doi-asserted-by":"publisher","DOI":"10.1097\/00000478-199504000-00001"},{"key":"rf69","doi-asserted-by":"publisher","DOI":"10.1002\/(SICI)1097-0142(20000401)88:7<1599::AID-CNCR14>3.0.CO;2-J"},{"key":"rf70","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/25.5.0955"},{"key":"rf71","doi-asserted-by":"publisher","DOI":"10.1126\/science.283.5405.1168"},{"key":"rf72","doi-asserted-by":"publisher","DOI":"10.1287\/opre.43.4.570"},{"key":"rf73","doi-asserted-by":"publisher","DOI":"10.1101\/gr.10.2.204"},{"key":"rf75","volume-title":"Machine Learning","author":"Mitchell T. M.","year":"1997"},{"key":"rf76","first-page":"3","volume":"3","author":"Nielsen H.","journal-title":"Protein Sci."},{"key":"rf77","first-page":"182","volume":"4","author":"Pedersen A. G.","journal-title":"Intelligent Systems for Molecular Biology"},{"key":"rf78","first-page":"226","volume":"5","author":"Pedersen A. G.","journal-title":"Intelligent Systems for Molecular Biology"},{"key":"rf79","volume-title":"Advances in Kernel Methods \u2014 Support Vector Learning","author":"Platt J.","year":"1998"},{"key":"rf80","doi-asserted-by":"publisher","DOI":"10.1016\/0022-2836(88)90564-5"},{"key":"rf81","first-page":"81","volume":"1","author":"Quinlan J. R.","journal-title":"Mach. Learn."},{"key":"rf82","volume-title":"C4.5: Program for Machine Learning","author":"Quinlan J. R.","year":"1993"},{"key":"rf83","doi-asserted-by":"publisher","DOI":"10.1016\/S0300-9572(99)00089-1"},{"key":"rf85","doi-asserted-by":"publisher","DOI":"10.1016\/S0893-6080(99)00097-0"},{"key":"rf86","doi-asserted-by":"publisher","DOI":"10.1089\/cmb.1996.3.163"},{"key":"rf87","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/17.10.890"},{"key":"rf88","doi-asserted-by":"publisher","DOI":"10.1002\/prot.340190108"},{"key":"rf89","doi-asserted-by":"publisher","DOI":"10.1038\/323533a0"},{"key":"rf91","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/26.2.544"},{"key":"rf92","volume-title":"Learning with Kernels","author":"Scholkopf B.","year":"2002"},{"key":"rf93","doi-asserted-by":"publisher","DOI":"10.2214\/ajr.175.2.1750399"},{"key":"rf94","first-page":"468","volume":"43","author":"Selker H. P.","journal-title":"J. Investig. Med."},{"key":"rf96","doi-asserted-by":"publisher","DOI":"10.1006\/jmbi.1995.0198"},{"key":"rf97","first-page":"121","author":"Steeg E. W.","journal-title":"Artificial Intelligence and Molecular Biology"},{"key":"rf98","first-page":"447","author":"Stultz C. M.","journal-title":"Protein Structural Biology in Biomedical Research"},{"key":"rf99","doi-asserted-by":"publisher","DOI":"10.3171\/jns.1995.82.5.0764"},{"key":"rf100","doi-asserted-by":"publisher","DOI":"10.1053\/ejvs.1999.0974"},{"key":"rf101","doi-asserted-by":"publisher","DOI":"10.1073\/pnas.88.24.11261"},{"key":"rf103","doi-asserted-by":"publisher","DOI":"10.1007\/978-1-4757-2440-0"},{"key":"rf104","doi-asserted-by":"publisher","DOI":"10.1002\/1097-0339(200009)23:3<171::AID-DC6>3.0.CO;2-F"},{"key":"rf107","doi-asserted-by":"publisher","DOI":"10.1016\/S1535-6108(02)00032-6"},{"key":"rf109","doi-asserted-by":"publisher","DOI":"10.1101\/gr.8.3.319"},{"key":"rf111","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/16.9.799"}],"container-title":["Journal of Bioinformatics and Computational Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.worldscientific.com\/doi\/pdf\/10.1142\/S0219720003000216","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2019,8,6]],"date-time":"2019-08-06T23:08:06Z","timestamp":1565132886000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.worldscientific.com\/doi\/abs\/10.1142\/S0219720003000216"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2003,4]]},"references-count":91,"journal-issue":{"issue":"01","published-online":{"date-parts":[[2012,1,25]]},"published-print":{"date-parts":[[2003,4]]}},"alternative-id":["10.1142\/S0219720003000216"],"URL":"https:\/\/doi.org\/10.1142\/s0219720003000216","relation":{},"ISSN":["0219-7200","1757-6334"],"issn-type":[{"value":"0219-7200","type":"print"},{"value":"1757-6334","type":"electronic"}],"subject":[],"published":{"date-parts":[[2003,4]]}}}