{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,30]],"date-time":"2025-10-30T22:45:30Z","timestamp":1761864330772,"version":"3.37.3"},"reference-count":34,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,8,17]],"date-time":"2021-08-17T00:00:00Z","timestamp":1629158400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,8,17]],"date-time":"2021-08-17T00:00:00Z","timestamp":1629158400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n                <jats:title>Background<\/jats:title>\n                <jats:p>Gene expression provides a means for an organism to produce gene products necessary for the organism to live. Variation in the significant gene expression levels can distinguish the gene and the tissue in which the gene is expressed. Tissue-specific gene expression, often determined by single nucleotide polymorphisms (SNPs), provides potential molecular markers or therapeutic targets for disease progression. Therefore, SNPs are good candidates for identifying disease progression. The current bioinformatics literature uses gene network modeling to summarize complex interactions between transcription factors, genes, and gene products. Here, our focus is on the SNPs\u2019 impact on tissue-specific gene expression levels. To the best of our knowledge, we are not aware of any studies that distinguish tissue-specific genes using SNP expression levels.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Method<\/jats:title>\n                <jats:p>We propose a novel feature extraction method based on highly expressed SNPs using k-mers as features. We also propose optimal k-mer and feature sizes used in our approach. Determining the optimal sizes is still an open research question as it depends on the dataset and purpose of the analysis. Therefore, we evaluate our algorithm\u2019s performance on a range of k-mer and feature sizes using a multinomial naive Bayes (MNB) classifier on genes in the 49 human tissues from the Genotype-Tissue Expression (GTEx) portal.<\/jats:p>\n              <\/jats:sec><jats:sec>\n                <jats:title>Conclusions<\/jats:title>\n                <jats:p>Our approach achieves practical performance results with k-mers of size 3. Based on the purpose of the analysis and the number of tissue-specific genes under study, feature sizes [7, 8, 9] and [8, 9, 10] are typically optimal for the machine learning model.<\/jats:p>\n              <\/jats:sec>","DOI":"10.1186\/s40537-021-00497-9","type":"journal-article","created":{"date-parts":[[2021,8,17]],"date-time":"2021-08-17T11:03:08Z","timestamp":1629198188000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["A novel feature extraction method based on highly expressed SNPs for tissue-specific gene prediction"],"prefix":"10.1186","volume":"8","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7882-5676","authenticated-orcid":false,"given":"Jasbir","family":"Dhaliwal","sequence":"first","affiliation":[]},{"given":"John","family":"Wagner","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,8,17]]},"reference":[{"key":"497_CR1","doi-asserted-by":"publisher","first-page":"860","DOI":"10.1038\/35057062","volume":"409","author":"IHGS Consortium","year":"2001","unstructured":"Consortium IHGS. Initial sequencing and analysis of the human genome. Nature. 2001;409:860\u2013921.","journal-title":"Nature"},{"key":"497_CR2","doi-asserted-by":"publisher","first-page":"789","DOI":"10.1038\/nature02168","volume":"426","author":"TIH Project","year":"2003","unstructured":"Project TIH. The international hapmap project consortium. Nature. 2003;426:789\u201396.","journal-title":"Nature"},{"issue":"3","key":"497_CR3","doi-asserted-by":"publisher","first-page":"219","DOI":"10.1016\/j.bmhimx.2017.03.003","volume":"74","author":"H Quezada","year":"2017","unstructured":"Quezada H, Guzm\u00e1n-Ortiz A, D\u00edaz-S\u00e1nchez H, Valle-Rios R, Aguirre-Hern\u00e1ndez J. Omics-based biomarkers: current status and potential use in the clinic. Bolet\u00edn M\u00e9dico del Hospital Infantil de M\u00e9xico. 2017;74(3):219\u201326.","journal-title":"Bolet\u00edn M\u00e9dico del Hospital Infantil de M\u00e9xico"},{"issue":"12","key":"497_CR4","doi-asserted-by":"publisher","first-page":"1240","DOI":"10.1056\/NEJMoa0706728","volume":"358","author":"S Kathiresan","year":"2008","unstructured":"Kathiresan S, Melander O, Anevski D, Guiducci C, Burtt N, Roos C, Hirschhorn JN, Berglund G, Hedblad B, Groop L, Altshuler DM, Newton-Cheh C, Orho-Melander M. Polymorphisms associated with cholesterol and risk of cardiovascular events. N Engl J Med. 2008;358(12):1240\u20139.","journal-title":"N Engl J Med"},{"issue":"4","key":"497_CR5","doi-asserted-by":"publisher","first-page":"577","DOI":"10.1590\/S0004-27302008000400001","volume":"52","author":"D Miranda","year":"2008","unstructured":"Miranda D, Romano-Silva MA, De Marco L. Single nucleotide polymorphisms (snps) and the search for obesity-related genes. Arquivos Brasileiros de Endocrinologia Metabologia. 2008;52(4):577\u20138.","journal-title":"Arquivos Brasileiros de Endocrinologia Metabologia"},{"issue":"12","key":"497_CR6","doi-asserted-by":"publisher","first-page":"967","DOI":"10.1038\/nrc2540","volume":"8","author":"J Bertout","year":"2008","unstructured":"Bertout J, Patel S, Simon M. The impact of o2 availability on human cancer. Nat Rev Cancer. 2008;8(12):967\u201375.","journal-title":"Nat Rev Cancer"},{"issue":"12","key":"497_CR7","doi-asserted-by":"publisher","first-page":"1003110","DOI":"10.1371\/journal.pgen.1003110","volume":"8","author":"G Alkorta-Aranburu","year":"2012","unstructured":"Alkorta-Aranburu G, Beall CM, Witonsky DB, Gebremedhin A, Pritchard JK, Rienzo AD. The genetic architecture of adaptations to high altitude in Ethiopia. PLoS Genet. 2012;8(12):1003110.","journal-title":"PLoS Genet"},{"issue":"7","key":"497_CR8","doi-asserted-by":"publisher","first-page":"0180365","DOI":"10.1371\/journal.pone.0180365","volume":"12","author":"M Christiansen","year":"2017","unstructured":"Christiansen M, Larsen S, Nyegaard M, Neergaard-Petersen S, Ajjan R, W\u00fcrtz M, Grove EL, Hvas A-M, Jensen HK, Kristensen S. Coronary artery disease-associated genetic variants and biomarkers of inflammation. PLoS ONE. 2017;12(7):0180365.","journal-title":"PLoS ONE"},{"issue":"3","key":"497_CR9","doi-asserted-by":"publisher","first-page":"738","DOI":"10.1002\/cpt.1241","volume":"105","author":"SR Rashkin","year":"2019","unstructured":"Rashkin SR, Chua KC, Ho C, Mulkey F, Jiang C, Mushiroda T, Kubo M, Friedman PN, Rugo HS, McLeod HL, Ratain MJ, Castillos F, Naughton M, Overmoyer B, Toppmeyer D, Witte JS, Owzar K, Kroetz DL. A pharmacogenetic prediction model of progression-free survival in breast cancer using genome-wide genotyping data from calgb 40502 (alliance). Clin Pharmacol Ther. 2019;105(3):738\u201345.","journal-title":"Clin Pharmacol Ther"},{"issue":"6","key":"497_CR10","doi-asserted-by":"publisher","first-page":"1008","DOI":"10.1016\/j.ajhg.2013.05.002","volume":"92","author":"Z Wei","year":"2013","unstructured":"Wei Z, Wang W, Bradfield J, Li J, Cardinale C, Frackelton E, Kim C, Mentch F, Van Steen K, Visscher PM, Baldassano RN, Hakonarson H, the International IBD Genetics Consortium. Large sample size, wide variant spectrum, and advanced machine-learning technique boost risk prediction for inflammatory bowel disease. Am J Hum Genet. 2013;92(6):1008\u201312.","journal-title":"Am J Hum Genet."},{"key":"497_CR11","doi-asserted-by":"crossref","unstructured":"Montanez CAC, Fergus P, Montanez AC, Hussain A, Al-Jumeily D, Chalmers C. Deep learning classification of polygenic obesity using genome wide association study snps. In: International Joint Conference on Neural Networks (IJCNN), pp. 1\u20138; 2018.","DOI":"10.1109\/IJCNN.2018.8489048"},{"key":"497_CR12","doi-asserted-by":"publisher","first-page":"281","DOI":"10.1186\/s12911-019-1004-8","volume":"19","author":"S Uddin","year":"2019","unstructured":"Uddin S, Khan A, Hossain ME, Moni MA. Comparing different supervised machine learning algorithms for disease prediction. BMC Med Inf Decis Making. 2019;19:281.","journal-title":"BMC Med Inf Decis Making"},{"issue":"2","key":"497_CR13","doi-asserted-by":"publisher","first-page":"97","DOI":"10.1152\/physiolgenomics.00040.2001","volume":"7","author":"L Hsiao","year":"2001","unstructured":"Hsiao L, Dangond F, Yoshida T, Hong R, Jensen R, Misra J, Dillon W, Lee KF, Clark KE, Haverty P, Weng Z, Mutter GL, Frosch MP, MacDonald ME, Milford EL, Crum CP, Bueno R, Pratt RE, Mahadevappa M, Warrington JA, Stephanopoulos G, Stephanopoulos G, Gullans S. A compendium of gene expression in normal human tissues. Physiol Genomics. 2001;7(2):97\u2013104.","journal-title":"Physiol Genomics."},{"key":"497_CR14","unstructured":"NIH National Human Genome Research Institute. The Genotype-Tissue Expression Project (GTEx). https:\/\/www.genome.gov\/27549432\/gtex-surgical-donors. Accessed 20 Oct 2020"},{"key":"497_CR15","unstructured":"NIH National Institutes of\u00a0Health Office\u00a0of Strategic Coordination - The Common\u00a0Fund. Genotype-Tissue Expression. https:\/\/commonfund.nih.gov\/gtex. Accessed 26 Apr 2021"},{"issue":"4","key":"497_CR16","doi-asserted-by":"publisher","first-page":"1077","DOI":"10.1016\/j.celrep.2017.10.001","volume":"21","author":"AR Sonawane","year":"2017","unstructured":"Sonawane AR, Platig J, Fagny M, Chen C-Y, Paulson JN, Lopes-Ramos CM, DeMeo DL, Quackenbush J, Glass K, Kuijjer ML. Understanding tissue-specific gene regulation. Cell Rep. 2017;21(4):1077\u201388.","journal-title":"Cell Rep."},{"key":"497_CR17","unstructured":"NIH National Library of\u00a0Medicine National Center\u00a0for Biotechnology\u00a0Information. ClinVar Genomic variation as it relates to human health. https:\/\/www.ncbi.nlm.nih.gov\/clinvar\/variation\/1062\/. Accessed 7 July 2021"},{"key":"497_CR18","first-page":"125","volume":"1","author":"C Haruechaiyasak","year":"2008","unstructured":"Haruechaiyasak C, Kongyoung S, Dailey M. A comparative study on Thai word segmentation approaches. International Conference on Electrical Engineering\/Electronics, Computer, Telecommunications and Information Technology. 2008;1:125\u20138.","journal-title":"International Conference on Electrical Engineering\/Electronics, Computer, Telecommunications and Information Technology"},{"key":"497_CR19","doi-asserted-by":"publisher","first-page":"421","DOI":"10.1162\/tacl_a_00033","volume":"6","author":"Y Shao","year":"2018","unstructured":"Shao Y, Hardmeier C, Nivre J. Universal word segmentation: implementation and interpretation. Trans Assoc Comput Linguist. 2018;6:421\u201335.","journal-title":"Trans Assoc Comput Linguist"},{"key":"497_CR20","unstructured":"Clercq GD. Deep learning for classification of dna functional sequences. Ghent University; 2019. Master\u2019s thesis."},{"key":"497_CR21","unstructured":"Brownlee J. How to Encode Text Data for Machine Learning with scikit-learn. https:\/\/machinelearningmastery.com\/prepare-text-data-machine-learning-scikit-learn\/. Accessed 10 Oct 2020"},{"key":"497_CR22","unstructured":"Lebret RP. Word embeddings for natural language processing. PhD thesis, Ecole Polytechnique F\u00e9d\u00e9rale de Lausanne. 2016."},{"key":"497_CR23","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1186\/1471-2105-10-S14-S9","volume":"10","author":"P Kuksa","year":"2009","unstructured":"Kuksa P, Pavlovic V. Efficient alignment-free dna barcode analytics. BMC Bioinform. 2009;10:9.","journal-title":"BMC Bioinform"},{"issue":"3","key":"497_CR24","doi-asserted-by":"publisher","first-page":"173","DOI":"10.1016\/j.artmed.2015.06.002","volume":"64","author":"A Fiannaca","year":"2015","unstructured":"Fiannaca A, La Rosa M, Rizzo R, Urso A. A k-mer-based barcode dna classification methodology based on spectral representation and a neural gas network. Artifi Intell Med. 2015;64(3):173\u201384.","journal-title":"Artifi Intell Med"},{"key":"497_CR25","doi-asserted-by":"crossref","unstructured":"Rizzo R, Fiannaca A, Rosa ML, Urso A. A deep learning approach to DNA sequence classification. In: Angelini C, Rancoita P, Rovetta, S. (eds.) Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB). Lecture Notes in Computer Science, vol. 9874, pp. 129\u201340. Springer, 2016.","DOI":"10.1007\/978-3-319-44332-4_10"},{"key":"497_CR26","doi-asserted-by":"publisher","first-page":"280","DOI":"10.4236\/jbise.2016.95021","volume":"9","author":"NG Nguyen","year":"2016","unstructured":"Nguyen NG, Tran VA, Ngo DL, Phan D, Lumbanraja FR, Faisal MR, Abapihi1 B, Kubo M, Satou K. Dna sequence classification by convolutional neural network. J Biomed Sci Eng. 2016;9:280\u20136.","journal-title":"J Biomed Sci Eng"},{"issue":"4","key":"497_CR27","doi-asserted-by":"publisher","first-page":"517","DOI":"10.1016\/j.cell.2005.06.026","volume":"122","author":"DK Pokholok","year":"2005","unstructured":"Pokholok DK, Harbison CT, Levine S, Cole M, Hannett NM, Lee TI, Bell GW, Walker K, Rolfe PA, Herbolsheimer E, Zeitlinger J,  Lewitter F, Gifford DK, Young RA. Genome-wide map of nucleosome acetylation and methylation in yeast. Cell. 2005;122(4):517\u201327.","journal-title":"Cell."},{"issue":"19","key":"497_CR28","doi-asserted-by":"publisher","first-page":"4103","DOI":"10.1093\/nar\/gkf543","volume":"30","author":"C Math\u00e9","year":"2002","unstructured":"Math\u00e9 C, Sagot M, Schiex T, Rouz\u00e9 P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 2002;30(19):4103\u201317.","journal-title":"Nucleic Acids Res"},{"issue":"24","key":"497_CR29","doi-asserted-by":"publisher","first-page":"6441","DOI":"10.1093\/nar\/20.24.6441","volume":"20","author":"J Fickett","year":"1992","unstructured":"Fickett J, Tung C-S. Assessment of protein coding measures. Nucleic Acids Res. 1992;20(24):6441\u201350.","journal-title":"Nucleic Acids Res"},{"key":"497_CR30","unstructured":"GTExPortal: GTExPortal. http:\/\/gtexportal.org\/home\/. Accessed 20 Feb 2021"},{"key":"497_CR31","doi-asserted-by":"publisher","first-page":"103","DOI":"10.1023\/A:1007413511361","volume":"29","author":"P Domingos","year":"1997","unstructured":"Domingos P, Pazzani M. On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn. 1997;29:103\u201330.","journal-title":"Mach Learn"},{"issue":"2","key":"497_CR32","first-page":"1","volume":"1","author":"M Ismail","year":"2020","unstructured":"Ismail M, Hassan N, Bafjaish SS. Comparative analysis of Naive Bayesian techniques in health-related for classification task. J Soft Comput Data Mining. 2020;1(2):1\u201310.","journal-title":"J Soft Comput Data Mining"},{"key":"497_CR33","doi-asserted-by":"crossref","unstructured":"Ashari A, Paryudi I, Tjoa AM. Performance comparison between na\u00efve bayes, decision tree and k-nearest neighbor in searching alternative design in an energy simulation tool. International Journal of Advanced Computer Science and Applications. 2013;4(11).","DOI":"10.14569\/IJACSA.2013.041105"},{"issue":"Suppl 1","key":"497_CR34","doi-asserted-by":"publisher","first-page":"97","DOI":"10.1093\/bioinformatics\/17.suppl_1.S97","volume":"17","author":"V Hatzivassiloglou","year":"2001","unstructured":"Hatzivassiloglou V, Dubou\u00e9 P, Rzhetsky A. Disambiguating proteins, genes, and rna in text: a machine learning approach. Bioinformatics. 2001;17(Suppl 1):97\u2013106.","journal-title":"Bioinformatics"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-021-00497-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-021-00497-9\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-021-00497-9.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,8,17]],"date-time":"2021-08-17T11:17:58Z","timestamp":1629199078000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-021-00497-9"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,8,17]]},"references-count":34,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["497"],"URL":"https:\/\/doi.org\/10.1186\/s40537-021-00497-9","relation":{},"ISSN":["2196-1115"],"issn-type":[{"type":"electronic","value":"2196-1115"}],"subject":[],"published":{"date-parts":[[2021,8,17]]},"assertion":[{"value":"27 April 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 July 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 August 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent for publication"}},{"value":"The authors declare that they have no competing interests.","order":4,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"109"}}