{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T22:35:59Z","timestamp":1775687759169,"version":"3.50.1"},"reference-count":44,"publisher":"Oxford University Press (OUP)","issue":"17","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2007,9,1]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Motivation: Cluster analysis is one of the most important data mining tools for investigating high-throughput biological data. The existence of many scattered objects that should not be clustered has been found to hinder performance of most traditional clustering algorithms in such a high-dimensional complex situation. Very often, additional prior knowledge from databases or previous experiments is also available in the analysis. Excluding scattered objects and incorporating existing prior information are desirable to enhance the clustering performance.<\/jats:p><jats:p>Results: In this article, a class of loss functions is proposed for cluster analysis and applied in high-throughput genomic and proteomic data. Two major extensions from K-means are involved: penalization and weighting. The additive penalty term is used to allow a set of scattered objects without being clustered. Weights are introduced to account for prior information of preferred or prohibited cluster patterns to be identified. Their relationship with the classification likelihood of Gaussian mixture models is explored. Incorporation of good prior information is also shown to improve the global optimization issue in clustering. Applications of the proposed method on simulated data as well as high-throughput data sets from tandem mass spectrometry (MS\/MS) and microarray experiments are presented. 
Our results demonstrate its superior performance over most existing methods and its computational simplicity and extensibility in the application of large complex biological data sets.<\/jats:p><jats:p>Availability: \u00a0http:\/\/www.pitt.edu\/~ctseng\/research\/software.html<\/jats:p><jats:p>Contact: \u00a0ctseng@pitt.edu<\/jats:p><jats:p>Supplementary information: Supplementary data are available at Bioinformatics online.<\/jats:p>","DOI":"10.1093\/bioinformatics\/btm320","type":"journal-article","created":{"date-parts":[[2007,6,28]],"date-time":"2007-06-28T05:03:05Z","timestamp":1183006985000},"page":"2247-2255","source":"Crossref","is-referenced-by-count":77,"title":["Penalized and weighted <i>K<\/i>-means for clustering with scattered objects and prior information in high-throughput biological data"],"prefix":"10.1093","volume":"23","author":[{"given":"George C.","family":"Tseng","sequence":"first","affiliation":[{"name":"Department of Biostatistics, University of Pittsburgh, Pittsburgh, USA"}]}],"member":"286","published-online":{"date-parts":[[2007,6,27]]},"reference":[{"key":"2023041105583313400_","first-page":"59","article-title":"A probabilistic framework for semi-supervised clustering","author":"Basu","year":"2004"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"561","DOI":"10.1016\/S0167-9473(02)00163-9","article-title":"Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models","volume":"41","author":"Biernacki","year":"2003","journal-title":"Comput. Stat. 
Data Anal"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","DOI":"10.1093\/oso\/9780198538493.001.0001","volume-title":"Neural Networks for Pattern Recognition","author":"Bishop","year":"1995"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1207\/s15327906mbr2402_1","article-title":"Replicating cluster analysis: method, consistency, and validity","volume":"24","author":"Breckenridge","year":"1989","journal-title":"Multivariate Behav. Res"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"315","DOI":"10.1016\/0167-9473(92)90042-E","article-title":"A classification EM algorithm for clustering and two stochastic versions","volume":"14","author":"Celeux","year":"1992","journal-title":"Comput. Stat. Data Anal"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"687","DOI":"10.1081\/BIP-200025659","article-title":"A knowledge-based clustering algorithm driven by Gene Ontology","volume":"14","author":"Cheng","year":"2004","journal-title":"J. Biopharm. Stat"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1586\/14737159.3.4.411","article-title":"Cancer diagnosis using proteomic patterns","volume":"3","author":"Conrads","year":"2003","journal-title":"Expert Rev. Mol. Diagn"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"294","DOI":"10.1080\/01621459.1998.10474110","article-title":"Detecting features in spatial point processes with clutter via model-based clustering","volume":"93","author":"Dasgupta","year":"1998","journal-title":"J. Am. Stat. 
Assoc"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"1453","DOI":"10.1093\/bioinformatics\/bth078","article-title":"Open source clustering software","volume":"20","author":"De Hoon","year":"2004","journal-title":"Bioinformatics"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"0036.1","DOI":"10.1186\/gb-2002-3-7-research0036","article-title":"A prediction-based resampling method for estimating the number of clusters in a dataset","volume":"3","author":"Dudoit","year":"2002","journal-title":"Genome Biol"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"611","DOI":"10.1198\/016214502760047131","article-title":"Model-based clustering, discriminant analysis, and density estimation","volume":"97","author":"Fraley","year":"2002","journal-title":"J. Am. Stat. Assoc"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"455","DOI":"10.2307\/2347733","article-title":"Classification and mixture approach to clustering via maximum likelihood","volume":"38","author":"Ganesalingam","year":"1989","journal-title":"Appl. Stat"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","DOI":"10.1201\/9780367805302","volume-title":"Classification","author":"Gordon","year":"1999","edition":"2nd"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"303","DOI":"10.1023\/A:1016308404627","article-title":"Techniques of cluster algorithms in data mining","volume":"6","author":"Grabmeier","year":"2002","journal-title":"Data Mining Knowl. 
Discov"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"145","DOI":"10.1093\/bioinformatics\/18.suppl_1.S145","article-title":"Co-clustering of biological networks and gene expression data","volume":"18","author":"Hanisch","year":"2002","journal-title":"Bioinformatics"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"100","DOI":"10.2307\/2346830","article-title":"A K-means clustering algorithm","volume":"28","author":"Hartigan","year":"1979","journal-title":"Appl. Stat"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"0003.1","DOI":"10.1186\/gb-2000-1-2-research0003","article-title":"Gene shaving as a method for identifying distinct sets of genes with similar expression patterns","volume":"1","author":"Hastie","year":"2000","journal-title":"Genome Biol"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"1259","DOI":"10.1093\/bioinformatics\/btl065","article-title":"Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data","volume":"22","author":"Huang","year":"2006","journal-title":"Bioinformatics"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"5800","DOI":"10.1021\/ac0480949","article-title":"Statistical characterization of charge state and residue dependence of low energy CID peptide dissociation patterns","volume":"77","author":"Huang","year":"2005","journal-title":"Anal. Chem"},{"key":"2023041105583313400_","article-title":"A data mining scheme for identifying peptide structural motifs responsible for different MS\/MS fragmentation intensity patterns","volume-title":"Journal of Proteomic Research","author":"Huang","year":"2007"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1007\/BF01908075","article-title":"Comparing partitions","volume":"2","author":"Hubert","year":"1985","journal-title":"J. 
Classific"},{"key":"2023041105583313400_","volume-title":"Algorithms for Clustering Data","author":"Jain","year":"1988"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"158","DOI":"10.1198\/1061860043001","article-title":"A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model","volume":"13","author":"Jain","year":"2004","journal-title":"J. Comput. Graph. Stat"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","DOI":"10.1007\/978-1-4612-0921-8","volume-title":"Applied Multivariate Data Analysis","author":"Jobson","year":"1992"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","DOI":"10.1002\/9780470316801","volume-title":"Finding Groups in Data","author":"Kaufman","year":"1990"},{"key":"2023041105583313400_","volume-title":"Mixture Models","author":"McLachlan","year":"1987"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"413","DOI":"10.1093\/bioinformatics\/18.3.413","article-title":"A mixture model-based approach to the clustering of microarray expression data","volume":"18","author":"McLachlan","year":"2002","journal-title":"Bioinformatics"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"1194","DOI":"10.1093\/bioinformatics\/18.9.1194","article-title":"Bayesian infinite mixture model-based clustering of gene expression profiles","volume":"18","author":"Medvedovic","year":"2002","journal-title":"Bioinformatics"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1016\/S0167-739X(97)00018-6","article-title":"A comparative study of clustering methods","volume":"13","author":"Messatfa","year":"1997","journal-title":"Future Generation Comput. 
Syst"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"159","DOI":"10.1007\/BF02294245","article-title":"An examination of procedures for determining the number of clusters in a data set","volume":"50","author":"Milligan","year":"1985","journal-title":"Psychometrika"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"795","DOI":"10.1093\/bioinformatics\/btl011","article-title":"Incorporating gene functions as priors in model-based clustering of microarray gene expression data","volume":"22","author":"Pan","year":"2006","journal-title":"Bioinformatics"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"2388","DOI":"10.1093\/bioinformatics\/btl393","article-title":"Semi-supervised learning via penalized mixture model with application to microarray sample classification","volume":"22","author":"Pan","year":"2006","journal-title":"Bioinformatics"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","DOI":"10.1017\/CBO9780511812651","volume-title":"Pattern Recognition and Neural Network","author":"Ripley","year":"1996"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"i264","DOI":"10.1093\/bioinformatics\/btg1037","article-title":"Discovering molecular pathways from protein interaction and gene expression data","volume":"19","author":"Segal","year":"2003","journal-title":"Bioinformatics"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"1787","DOI":"10.1093\/bioinformatics\/btg232","article-title":"CLICK and EXPANDER: a system for clustering and visualizing gene expression data","volume":"19","author":"Sharan","year":"2003","journal-title":"Bioinformatics"},{"key":"2023041105583313400_","volume-title":"Cluster Analysis Algorithms","author":"Spaeth","year":"1984"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"3273","DOI":"10.1091\/mbc.9.12.3273","article-title":"Comprehensive identification of cell cycle-regulated genes 
of the yeast Saccharomyces cerevisiae by microarray hybridization","volume":"9","author":"Spellman","year":"1998","journal-title":"Mol. Biol. Cell"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"2907","DOI":"10.1073\/pnas.96.6.2907","article-title":"Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation","volume":"96","author":"Tamayo","year":"1999","journal-title":"PNAS"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"281","DOI":"10.1038\/10343","article-title":"Systematic determination of genetic network architecture","volume":"22","author":"Tavazoie","year":"1999","journal-title":"Nat. Genet"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"2405","DOI":"10.1093\/bioinformatics\/btl406","article-title":"Evaluation and comparison of gene clustering methods in microarray analysis","volume":"22","author":"Thalamuthu","year":"2006","journal-title":"Bioinformatics"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"511","DOI":"10.1198\/106186005X59243","article-title":"Cluster validation by prediction strength","volume":"14","author":"Tibshirani","year":"2005","journal-title":"J. Comput. Graph. Stat"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1111\/j.0006-341X.2005.031032.x","article-title":"Tight clustering: a resampling-based approach for identifying stable and tight patterns in data","volume":"61","author":"Tseng","year":"2005","journal-title":"Biometrics"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"255","DOI":"10.1038\/ng906","article-title":"Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters","volume":"31","author":"Wu","year":"2002","journal-title":"Nat. 
Genet"},{"key":"2023041105583313400_","doi-asserted-by":"crossref","first-page":"977","DOI":"10.1093\/bioinformatics\/17.10.977","article-title":"Model-based clustering and data transformations for gene expression data","volume":"17","author":"Yeung","year":"2001","journal-title":"Bioinformatics"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/23\/17\/2247\/49817674\/bioinformatics_23_17_2247.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/23\/17\/2247\/49817674\/bioinformatics_23_17_2247.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,2,15]],"date-time":"2024-02-15T14:15:21Z","timestamp":1708006521000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/23\/17\/2247\/260413"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2007,6,27]]},"references-count":44,"journal-issue":{"issue":"17","published-print":{"date-parts":[[2007,9,1]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btm320","relation":{},"ISSN":["1367-4811","1367-4803"],"issn-type":[{"value":"1367-4811","type":"electronic"},{"value":"1367-4803","type":"print"}],"subject":[],"published-other":{"date-parts":[[2007,9,1]]},"published":{"date-parts":[[2007,6,27]]}}}