{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T06:21:51Z","timestamp":1772173311134,"version":"3.50.1"},"update-to":[{"DOI":"10.1371\/journal.pcbi.1012301","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2024,9,13]],"date-time":"2024-09-13T00:00:00Z","timestamp":1726185600000}}],"reference-count":33,"publisher":"Public Library of Science (PLoS)","issue":"9","license":[{"start":{"date-parts":[[2024,9,3]],"date-time":"2024-09-03T00:00:00Z","timestamp":1725321600000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/100010269","name":"Wellcome Trust","doi-asserted-by":"publisher","award":["WT220788"],"award-info":[{"award-number":["WT220788"]}],"id":[{"id":"10.13039\/100010269","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100010269","name":"Wellcome Trust","doi-asserted-by":"publisher","award":["WT220024"],"award-info":[{"award-number":["WT220024"]}],"id":[{"id":"10.13039\/100010269","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000265","name":"Medical Research Council","doi-asserted-by":"publisher","award":["MC_UU_00002\/4"],"award-info":[{"award-number":["MC_UU_00002\/4"]}],"id":[{"id":"10.13039\/501100000265","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100000265","name":"Medical Research Council","doi-asserted-by":"publisher","award":["MC UU 00002\/13"],"award-info":[{"award-number":["MC UU 00002\/13"]}],"id":[{"id":"10.13039\/501100000265","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["www.ploscompbiol.org"],"crossmark-restriction":false},"short-container-title":["PLoS Comput Biol"],"abstract":"<jats:p>Clustering is widely used in bioinformatics and many other fields, with applications from exploratory analysis to prediction. Many types of data have associated uncertainty or measurement error, but this is rarely used to inform the clustering. We present Dirichlet Process Mixtures with Uncertainty (DPMUnc), an extension of a Bayesian nonparametric clustering algorithm which makes use of the uncertainty associated with data points. We show that DPMUnc out-performs existing methods on simulated data. We cluster immune-mediated diseases (IMD) using GWAS summary statistics, which have uncertainty linked with the sample size of the study. DPMUnc separates autoimmune from autoinflammatory diseases and isolates other subgroups such as adult-onset arthritis. We additionally consider how DPMUnc can be used to cluster gene expression datasets that have been summarised using gene signatures. We first introduce a novel procedure for generating a summary of a gene signature on a dataset different to the one where it was discovered, which incorporates a measure of the variability in expression across signature genes within each individual. We summarise three public gene expression datasets containing patients with a range of IMD, using three relevant gene signatures. We find association between disease and the clusters returned by DPMUnc, with clustering structure replicated across the datasets. The significance of this work is two-fold. Firstly, we demonstrate that when data has associated uncertainty, this uncertainty should be used to inform clustering and we present a method which does this, DPMUnc. Secondly, we present a procedure for using gene signatures in datasets other than where they were originally defined. We show the value of this procedure by summarising gene expression data from patients with immune-mediated diseases using relevant gene signatures, and clustering these patients using DPMUnc.<\/jats:p>","DOI":"10.1371\/journal.pcbi.1012301","type":"journal-article","created":{"date-parts":[[2024,9,3]],"date-time":"2024-09-03T13:38:32Z","timestamp":1725370712000},"page":"e1012301","update-policy":"https:\/\/doi.org\/10.1371\/journal.pcbi.corrections_policy","source":"Crossref","is-referenced-by-count":0,"title":["Bayesian clustering with uncertain data"],"prefix":"10.1371","volume":"20","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-4070-1317","authenticated-orcid":true,"given":"Kath","family":"Nicholls","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4070-1317","authenticated-orcid":true,"given":"Paul D. W.","family":"Kirk","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9755-1703","authenticated-orcid":true,"given":"Chris","family":"Wallace","sequence":"additional","affiliation":[]}],"member":"340","published-online":{"date-parts":[[2024,9,3]]},"reference":[{"issue":"11","key":"pcbi.1012301.ref001","doi-asserted-by":"crossref","first-page":"1370","DOI":"10.1109\/TKDE.2004.68","article-title":"Cluster Analysis for Gene Expression Data: A Survey","volume":"16","author":"D Jiang","year":"2004","journal-title":"IEEE Transactions on Knowledge and Data Engineering"},{"issue":"5439","key":"pcbi.1012301.ref002","doi-asserted-by":"crossref","first-page":"531","DOI":"10.1126\/science.286.5439.531","article-title":"Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring","volume":"286","author":"TR Golub","year":"1999","journal-title":"Science"},{"issue":"1","key":"pcbi.1012301.ref003","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1038\/s41467-018-03424-4","article-title":"A Comprehensive Evaluation of Module Detection Methods for Gene Expression Data","volume":"9","author":"W Saelens","year":"2018","journal-title":"Nature Communications"},{"issue":"8","key":"pcbi.1012301.ref004","doi-asserted-by":"crossref","first-page":"e1002254","DOI":"10.1371\/journal.pgen.1002254","article-title":"Pervasive Sharing of Genetic Effects in Autoimmune Disease","volume":"7","author":"C Cotsapas","year":"2011","journal-title":"PLoS genetics"},{"issue":"2","key":"pcbi.1012301.ref005","doi-asserted-by":"crossref","first-page":"e1007139","DOI":"10.1371\/journal.pgen.1007139","article-title":"An Efficient Bayesian Meta-Analysis Approach for Studying Cross-Phenotype Genetic Associations","volume":"14","author":"A Majumdar","year":"2018","journal-title":"PLoS genetics"},{"issue":"2","key":"pcbi.1012301.ref006","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1109\/TIT.1982.1056489","article-title":"Least Squares Quantization in PCM","volume":"28","author":"S Lloyd","year":"1982","journal-title":"IEEE Transactions on Information Theory"},{"key":"pcbi.1012301.ref007","unstructured":"MacQueen J. Some Methods for Classification and Analysis of Multivariate Observations. In: Proc. Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965\/66), Vol. I: Statistics. Univ. California Press, Berkeley, Calif.; 1967. p. 281\u2013297."},{"issue":"1","key":"pcbi.1012301.ref008","doi-asserted-by":"crossref","first-page":"289","DOI":"10.32614\/RJ-2016-021","article-title":"mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models","volume":"8","author":"L Scrucca","year":"2016","journal-title":"The R Journal"},{"key":"pcbi.1012301.ref009","unstructured":"Chang Y, Chen J, Cho M, Castaldi P, Silverman E, Dy J. Clustering from multiple uncertain experts. In: Artificial Intelligence and Statistics. PMLR; 2017. p. 28\u201336."},{"key":"pcbi.1012301.ref010","doi-asserted-by":"crossref","unstructured":"Gullo F, Ponti G, Tagarelli A. Clustering uncertain data via k-medoids. In: International Conference on Scalable Uncertainty Management. Springer; 2008. p. 229\u2013242.","DOI":"10.1007\/978-3-540-87993-0_19"},{"key":"pcbi.1012301.ref011","doi-asserted-by":"crossref","unstructured":"Gullo F, Tagarelli A. Uncertain centroid based partitional clustering of uncertain data. arXiv preprint arXiv:12036401. 2012;.","DOI":"10.14778\/2180912.2180914"},{"key":"pcbi.1012301.ref012","doi-asserted-by":"crossref","unstructured":"Z\u00fcfle A, Emrich T, Schmid KA, Mamoulis N, Zimek A, Renz M. Representative clustering of uncertain data. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining; 2014. p. 243\u2013252.","DOI":"10.1145\/2623330.2623725"},{"issue":"1","key":"pcbi.1012301.ref013","doi-asserted-by":"crossref","first-page":"119","DOI":"10.1109\/TKDE.2011.201","article-title":"Maximum likelihood estimation from uncertain data in the belief function framework","volume":"25","author":"T Denoeux","year":"2011","journal-title":"IEEE Transactions on knowledge and data engineering"},{"issue":"430","key":"pcbi.1012301.ref014","doi-asserted-by":"crossref","first-page":"577","DOI":"10.1080\/01621459.1995.10476550","article-title":"Bayesian Density Estimation and Inference Using Mixtures","volume":"90","author":"MD Escobar","year":"1995","journal-title":"Journal of the American Statistical Association"},{"key":"pcbi.1012301.ref015","unstructured":"Rasmussen CE. The Infinite Gaussian Mixture Model. In: Proceedings of the 12th International Conference on Neural Information Processing Systems. NIPS\u201999. Cambridge, MA, USA: MIT Press; 1999. p. 554\u2013560."},{"issue":"2","key":"pcbi.1012301.ref016","doi-asserted-by":"crossref","first-page":"249","DOI":"10.1080\/10618600.2000.10474879","article-title":"Markov Chain Sampling Methods for Dirichlet Process Mixture Models","volume":"9","author":"RM Neal","year":"2000","journal-title":"Journal of Computational and Graphical Statistics"},{"issue":"2","key":"pcbi.1012301.ref017","doi-asserted-by":"crossref","first-page":"129","DOI":"10.1109\/TIT.1982.1056489","article-title":"Least squares quantization in PCM","volume":"28","author":"S Lloyd","year":"1982","journal-title":"IEEE Transactions on Information Theory"},{"issue":"2","key":"pcbi.1012301.ref018","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1111\/1467-9868.00293","article-title":"Estimating the Number of Clusters in a Data Set via the Gap Statistic","volume":"63","author":"R Tibshirani","year":"2001","journal-title":"Journal of the Royal Statistical Society Series B (Statistical Methodology)"},{"issue":"4","key":"pcbi.1012301.ref019","first-page":"401","article-title":"On a Measure of Divergence between Two Multinomial Populations","volume":"7","author":"A Bhattacharyya","year":"1946","journal-title":"Sankhy\u0101: The Indian Journal of Statistics (1933-1960)"},{"issue":"2","key":"pcbi.1012301.ref020","doi-asserted-by":"crossref","first-page":"367","DOI":"10.1214\/09-BA414","article-title":"Improved Criteria for Clustering Based on the Posterior Similarity Matrix","volume":"4","author":"A Fritsch","year":"2009","journal-title":"Bayesian Analysis"},{"issue":"1","key":"pcbi.1012301.ref021","doi-asserted-by":"crossref","first-page":"106","DOI":"10.1186\/s13073-020-00797-4","article-title":"Genetic Feature Engineering Enables Characterisation of Shared Risk Factors in Immune-Mediated Diseases","volume":"12","author":"OS Burren","year":"2020","journal-title":"Genome Medicine"},{"key":"pcbi.1012301.ref022","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1155\/2012\/969657","article-title":"Is Multiple Sclerosis an Autoimmune Disease?","volume":"2012","author":"B Wootla","year":"2012","journal-title":"Autoimmune Diseases"},{"issue":"587","key":"pcbi.1012301.ref023","doi-asserted-by":"crossref","first-page":"eabd5666","DOI":"10.1126\/scitranslmed.abd5666","article-title":"Transcriptional Networks in At-Risk Individuals Identify Signatures of Type 1 Diabetes Progression","volume":"13","author":"LP Xhonneux","year":"2021","journal-title":"Science translational medicine"},{"issue":"7","key":"pcbi.1012301.ref024","doi-asserted-by":"crossref","first-page":"2538","DOI":"10.2337\/db13-1777","article-title":"A Type I Interferon Transcriptional Signature Precedes Autoimmunity in Children Genetically at Risk for Type 1 Diabetes","volume":"63","author":"RC Ferreira","year":"2014","journal-title":"Diabetes"},{"issue":"4","key":"pcbi.1012301.ref025","doi-asserted-by":"crossref","first-page":"670","DOI":"10.1016\/j.immuni.2007.09.006","article-title":"Molecular Signature of CD8+ T Cell Exhaustion during Chronic Viral Infection","volume":"27","author":"EJ Wherry","year":"2007","journal-title":"Immunity"},{"issue":"7562","key":"pcbi.1012301.ref026","doi-asserted-by":"crossref","first-page":"612","DOI":"10.1038\/nature14468","article-title":"T-Cell Exhaustion, Co-Stimulation and Clinical Outcome in Autoimmunity and Infection","volume":"523","author":"EF McKinney","year":"2015","journal-title":"Nature"},{"issue":"1","key":"pcbi.1012301.ref027","doi-asserted-by":"crossref","first-page":"150","DOI":"10.1016\/j.immuni.2008.05.012","article-title":"A Modular Analysis Framework for Blood Genomics Studies: Application to Systemic Lupus Erythematosus","volume":"29","author":"D Chaussabel","year":"2008","journal-title":"Immunity"},{"issue":"6","key":"pcbi.1012301.ref028","doi-asserted-by":"crossref","first-page":"1208","DOI":"10.1136\/ard.2009.108043","article-title":"Novel Expression Signatures Identified by Transcriptional Analysis of Separated Leucocyte Subsets in Systemic Lupus Erythematosus and Vasculitis","volume":"69","author":"PA Lyons","year":"2010","journal-title":"Annals of the Rheumatic Diseases"},{"issue":"1","key":"pcbi.1012301.ref029","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1007\/BF01908075","article-title":"Comparing Partitions","volume":"2","author":"L Hubert","year":"1985","journal-title":"Journal of Classification"},{"key":"pcbi.1012301.ref030","unstructured":"Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K. cluster: Cluster Analysis Basics and Extensions; 2022. R package version 2.1.4\u2014For new features, see the\u2019Changelog\u2019 file (in the package source). Available from: https:\/\/CRAN.R-project.org\/package=cluster."},{"issue":"suppl_1","key":"pcbi.1012301.ref031","doi-asserted-by":"crossref","first-page":"S96","DOI":"10.1093\/bioinformatics\/18.suppl_1.S96","article-title":"Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression","volume":"18","author":"W Huber","year":"2002","journal-title":"Bioinformatics"},{"issue":"16","key":"pcbi.1012301.ref032","doi-asserted-by":"crossref","first-page":"3439","DOI":"10.1093\/bioinformatics\/bti525","article-title":"BioMart and Bioconductor: A Powerful Link between Biological Databases and Microarray Data Analysis","volume":"21","author":"S Durinck","year":"2005","journal-title":"Bioinformatics"},{"issue":"8","key":"pcbi.1012301.ref033","doi-asserted-by":"crossref","first-page":"1184","DOI":"10.1038\/nprot.2009.97","article-title":"Mapping Identifiers for the Integration of Genomic Datasets with the R\/Bioconductor Package biomaRt","volume":"4","author":"S Durinck","year":"2009","journal-title":"Nature Protocols"}],"updated-by":[{"DOI":"10.1371\/journal.pcbi.1012301","type":"new_version","label":"New version","source":"publisher","updated":{"date-parts":[[2024,9,13]],"date-time":"2024-09-13T00:00:00Z","timestamp":1726185600000}}],"container-title":["PLOS Computational Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1012301","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,13]],"date-time":"2024-09-13T14:27:26Z","timestamp":1726237646000},"score":1,"resource":{"primary":{"URL":"https:\/\/dx.plos.org\/10.1371\/journal.pcbi.1012301"}},"subtitle":[],"editor":[{"given":"Ferhat","family":"Ay","sequence":"first","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2024,9,3]]},"references-count":33,"journal-issue":{"issue":"9","published-online":{"date-parts":[[2024,9,3]]}},"URL":"https:\/\/doi.org\/10.1371\/journal.pcbi.1012301","relation":{"has-preprint":[{"id-type":"doi","id":"10.1101\/2022.12.07.519476","asserted-by":"object"}]},"ISSN":["1553-7358"],"issn-type":[{"value":"1553-7358","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,9,3]]}}}