{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,22]],"date-time":"2026-02-22T07:46:35Z","timestamp":1771746395412,"version":"3.50.1"},"reference-count":51,"publisher":"Oxford University Press (OUP)","issue":"2","license":[{"start":{"date-parts":[[2019,8,1]],"date-time":"2019-08-01T00:00:00Z","timestamp":1564617600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/academic.oup.com\/journals\/pages\/open_access\/funder_policies\/chorus\/standard_publication_model"}],"funder":[{"name":"Singapore Ministry of Education Academic Research Fund","award":["R-253-000-139-114"],"award-info":[{"award-number":["R-253-000-139-114"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2020,1,15]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:sec>\n                  <jats:title>Motivation<\/jats:title>\n                  <jats:p>The identification of sub-populations of patients with similar characteristics, called patient subtyping, is important for realizing the goals of precision medicine. Accurate subtyping is crucial for tailoring therapeutic strategies that can potentially lead to reduced mortality and morbidity. Model-based clustering, such as Gaussian mixture models, provides a principled and interpretable methodology that is widely used to identify subtypes. However, they impose identical marginal distributions on each variable; such assumptions restrict their modeling flexibility and deteriorate clustering performance.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Results<\/jats:title>\n                  <jats:p>In this paper, we use the statistical framework of copulas to decouple the modeling of marginals from the dependencies between them. Current copula-based methods cannot scale to high dimensions due to challenges in parameter inference. 
We develop HD-GMCM, which addresses these challenges and, to our knowledge, is the first copula-based clustering method that can fit high-dimensional data. Our experiments on real high-dimensional gene-expression and clinical datasets show that HD-GMCM outperforms state-of-the-art model-based clustering methods, by virtue of modeling non-Gaussian data and being robust to outliers through the use of Gaussian mixture copulas. We present a case study on lung cancer data from TCGA. Clusters obtained from HD-GMCM can be interpreted based on the dependencies they model, which offers a new way of characterizing subtypes. Empirically, such modeling uncovers not only latent structure that leads to better clustering but also meaningful clinical subtypes in terms of survival rates of patients.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Availability and implementation<\/jats:title>\n                  <jats:p>An implementation of HD-GMCM in R is available at: https:\/\/bitbucket.org\/cdal\/hdgmcm\/.<\/jats:p>\n               <\/jats:sec>\n               <jats:sec>\n                  <jats:title>Supplementary information<\/jats:title>\n                  <jats:p>Supplementary data are available at Bioinformatics online.<\/jats:p>\n               <\/jats:sec>","DOI":"10.1093\/bioinformatics\/btz599","type":"journal-article","created":{"date-parts":[[2019,7,26]],"date-time":"2019-07-26T11:10:22Z","timestamp":1564139422000},"page":"621-628","source":"Crossref","is-referenced-by-count":23,"title":["Gaussian mixture copulas for high-dimensional clustering and dependency-based subtyping"],"prefix":"10.1093","volume":"36","author":[{"given":"Siva Rajesh","family":"Kasa","sequence":"first","affiliation":[{"name":"Department of Information Systems and Analytics, School of Computing , National University of Singapore, 117418 Singapore"}]},{"given":"Sakyajit","family":"Bhattacharya","sequence":"additional","affiliation":[{"name":"TCS 
Innovation Labs , Kolkata 700156, India"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6748-6864","authenticated-orcid":false,"given":"Vaibhav","family":"Rajan","sequence":"additional","affiliation":[{"name":"Department of Information Systems and Analytics, School of Computing , National University of Singapore, 117418 Singapore"}]}],"member":"286","published-online":{"date-parts":[[2019,8,1]]},"reference":[{"key":"2023013112063533000_btz599-B1","doi-asserted-by":"crossref","first-page":"1269","DOI":"10.1093\/bioinformatics\/btr112","article-title":"Mixtures of common t-factor analyzers for clustering high-dimensional microarray data","volume":"27","author":"Baek","year":"2011","journal-title":"Bioinformatics"},{"key":"2023013112063533000_btz599-B2","doi-asserted-by":"crossref","first-page":"45","DOI":"10.1007\/s11634-013-0155-1","article-title":"A LASSO-penalized BIC for mixture model selection","volume":"8","author":"Bhattacharya","year":"2014","journal-title":"Adv. Data Anal. Class"},{"key":"2023013112063533000_btz599-B3","article-title":"Unsupervised learning using Gaussian mixture copula model","author":"Bhattacharya","year":"2014"},{"key":"2023013112063533000_btz599-B4","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v070.i02","article-title":"GMCM: unsupervised clustering and meta-analysis using Gaussian mixture copula models","volume":"70","author":"Bilgrau","year":"2016","journal-title":"J. Stat. Software"},{"key":"2023013112063533000_btz599-B5","author":"Boulesteix","year":"2011"},{"key":"2023013112063533000_btz599-B6","doi-asserted-by":"crossref","first-page":"52","DOI":"10.1016\/j.csda.2012.12.008","article-title":"Model-based clustering of high-dimensional data: a review","volume":"71","author":"Bouveyron","year":"2014","journal-title":"Comput. Stat. 
Data Anal"},{"key":"2023013112063533000_btz599-B7","doi-asserted-by":"crossref","first-page":"502","DOI":"10.1016\/j.csda.2007.02.009","article-title":"High-dimensional data clustering","volume":"52","author":"Bouveyron","year":"2007","journal-title":"Comput. Stat. Data Anal"},{"key":"2023013112063533000_btz599-B8","doi-asserted-by":"crossref","first-page":"12253","DOI":"10.1073\/pnas.1304376110","article-title":"Biclustering with heterogeneous variance","volume":"110","author":"Chen","year":"2013","journal-title":"Proc. Natl. Acad. Sci. USA"},{"key":"2023013112063533000_btz599-B9","author":"Chung","year":"2018"},{"key":"2023013112063533000_btz599-B10","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1007\/978-3-642-35407-6_3","volume-title":"Copulae in Mathematical and Quantitative Finance","author":"Elidan","year":"2013"},{"key":"2023013112063533000_btz599-B11","doi-asserted-by":"crossref","first-page":"18","DOI":"10.1214\/18-SS119","article-title":"Variable selection methods for model-based clustering","volume":"12","author":"Fop","year":"2018","journal-title":"Stat. Surv"},{"key":"2023013112063533000_btz599-B12","doi-asserted-by":"crossref","DOI":"10.1145\/2020408.2020509","article-title":"Online heterogeneous mixture modeling with marginal and copula selection","author":"Fujimaki","year":"2011"},{"key":"2023013112063533000_btz599-B13","doi-asserted-by":"crossref","first-page":"543","DOI":"10.1093\/biomet\/82.3.543","article-title":"A semiparametric estimation procedure of dependence parameters in multivariate families of distributions","volume":"82","author":"Genest","year":"1995","journal-title":"Biometrika"},{"key":"2023013112063533000_btz599-B14","volume-title":"The EM algorithm for mixtures of factor analyzers. 
Technical Report CRG-TR-96-1","author":"Ghahramani","year":"1997"},{"key":"2023013112063533000_btz599-B15","doi-asserted-by":"crossref","DOI":"10.1201\/b17895","volume-title":"Introduction to High-Dimensional Statistics","author":"Giraud","year":"2014"},{"key":"2023013112063533000_btz599-B16","doi-asserted-by":"crossref","first-page":"929","DOI":"10.1016\/j.cell.2014.06.049","article-title":"Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin","volume":"158","author":"Hoadley","year":"2014","journal-title":"Cell"},{"key":"2023013112063533000_btz599-B17","doi-asserted-by":"crossref","first-page":"265","DOI":"10.1214\/07-AOAS107","article-title":"Extending the rank likelihood for semiparametric copula estimation","volume":"1","author":"Hoff","year":"2007","journal-title":"Ann. Appl. Stat"},{"key":"2023013112063533000_btz599-B18","author":"Hothorn","year":"2018"},{"key":"2023013112063533000_btz599-B19","doi-asserted-by":"crossref","DOI":"10.1002\/0471725250","volume-title":"Robust Statistics","author":"Huber","year":"1981"},{"key":"2023013112063533000_btz599-B20","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1007\/BF01908075","article-title":"Comparing partitions","volume":"2","author":"Hubert","year":"1985","journal-title":"J. Class"},{"key":"2023013112063533000_btz599-B21","author":"James","year":"2017"},{"key":"2023013112063533000_btz599-B22","doi-asserted-by":"crossref","DOI":"10.1201\/b17116","volume-title":"Dependence Modeling with Copulas","author":"Joe","year":"2014"},{"key":"2023013112063533000_btz599-B23","doi-asserted-by":"crossref","first-page":"1025","DOI":"10.1198\/016214507000000590","article-title":"Variable selection in finite mixture of regression models","volume":"102","author":"Khalili","year":"2007","journal-title":"J. Am. Stat. 
Assoc"},{"key":"2023013112063533000_btz599-B24","doi-asserted-by":"crossref","first-page":"1079","DOI":"10.1007\/s11222-015-9590-5","article-title":"Model-based clustering using copulas with applications","volume":"26","author":"Kosmidis","year":"2016","journal-title":"Stat. Comput"},{"key":"2023013112063533000_btz599-B25","doi-asserted-by":"crossref","first-page":"1752","DOI":"10.1214\/11-AOAS466","article-title":"Measuring reproducibility of high-throughput experiments","volume":"5","author":"Li","year":"2011","journal-title":"Ann. Appl. Stat"},{"key":"2023013112063533000_btz599-B26","doi-asserted-by":"crossref","first-page":"1536","DOI":"10.1093\/bioinformatics\/bty858","article-title":"Multimodal network diffusion predicts future disease\u2013gene\u2013chemical associations","volume":"35","author":"Lin","year":"2019","journal-title":"Bioinformatics"},{"key":"2023013112063533000_btz599-B27","doi-asserted-by":"crossref","first-page":"1049","DOI":"10.1007\/s11222-016-9670-1","article-title":"Variable selection for model-based clustering using the integrated complete-data likelihood","volume":"27","author":"Marbac","year":"2017","journal-title":"Stat. Comput"},{"key":"2023013112063533000_btz599-B28","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1016\/S0167-9473(02)00183-4","article-title":"Modelling high-dimensional data by mixtures of factor analyzers","volume":"41","author":"McLachlan","year":"2003","journal-title":"Comput. Stat. Data Anal"},{"key":"2023013112063533000_btz599-B29","doi-asserted-by":"crossref","first-page":"285","DOI":"10.1007\/s11222-008-9056-0","article-title":"Parsimonious Gaussian mixture models","volume":"18","author":"McNicholas","year":"2008","journal-title":"Stat. 
Comput"},{"key":"2023013112063533000_btz599-B30","doi-asserted-by":"crossref","first-page":"711","DOI":"10.1016\/j.csda.2009.02.011","article-title":"Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models","volume":"54","author":"McNicholas","year":"2010","journal-title":"Comput. Stat. Data Anal"},{"key":"2023013112063533000_btz599-B31","author":"McNicholas","year":"2018"},{"key":"2023013112063533000_btz599-B32","doi-asserted-by":"crossref","first-page":"736","DOI":"10.1007\/s10618-013-0317-y","article-title":"Subspace clustering of high-dimensional data: a predictive approach","volume":"28","author":"McWilliams","year":"2014","journal-title":"Data Min. Knowl. Disc"},{"key":"2023013112063533000_btz599-B33","doi-asserted-by":"crossref","first-page":"511","DOI":"10.1111\/1467-9868.00082","article-title":"The EM algorithm\u2014an old folk-song sung to a fast new tune","volume":"59","author":"Meng","year":"1997","journal-title":"J. R. Stat. Soc. B"},{"key":"2023013112063533000_btz599-B34","doi-asserted-by":"crossref","first-page":"489","DOI":"10.1056\/NEJMp1114866","article-title":"Preparing for precision medicine","volume":"366","author":"Mirnezami","year":"2012","journal-title":"N. Engl. J. Med"},{"key":"2023013112063533000_btz599-B35","doi-asserted-by":"crossref","first-page":"334","DOI":"10.1080\/10618600.2017.1366911","article-title":"Representing sparse Gaussian DAGs as sparse R-vines allowing for non-Gaussian dependence","volume":"27","author":"M\u00fcller","year":"2018","journal-title":"J. Comput. Graph. Stat"},{"key":"2023013112063533000_btz599-B36","first-page":"1145","article-title":"Penalized model-based clustering with application to variable selection","volume":"8","author":"Pan","year":"2007","journal-title":"J. Mach. Learn. 
Res"},{"key":"2023013112063533000_btz599-B37","doi-asserted-by":"crossref","first-page":"767","DOI":"10.1007\/978-3-540-71297-8_34","volume-title":"Handbook of Financial Time Series","author":"Patton","year":"2009"},{"key":"2023013112063533000_btz599-B38","author":"Rajan","year":"2016"},{"key":"2023013112063533000_btz599-B39","volume-title":"R: A Language and Environment for Statistical Computing","year":"2018"},{"key":"2023013112063533000_btz599-B40","author":"Rey","year":"2012"},{"key":"2023013112063533000_btz599-B41","doi-asserted-by":"crossref","first-page":"70","DOI":"10.1109\/MIS.2015.60","article-title":"Subtyping: what it is and its role in precision medicine","volume":"30","author":"Saria","year":"2015","journal-title":"IEEE Intell. Syst"},{"key":"2023013112063533000_btz599-B42","first-page":"229","article-title":"Fonctions de r\u00e9partition \u00e0 n dimensions et leurs marges","volume":"8","author":"Sklar","year":"1959","journal-title":"Publ. Inst. Statist. Univ. Paris"},{"key":"2023013112063533000_btz599-B43","doi-asserted-by":"crossref","first-page":"2890","DOI":"10.1093\/bioinformatics\/btx322","article-title":"Molecular heterogeneity at the network level: high-dimensional testing, clustering and a TCGA case study","volume":"33","author":"St\u00e4dler","year":"2017","journal-title":"Bioinformatics"},{"key":"2023013112063533000_btz599-B44","doi-asserted-by":"crossref","first-page":"1331","DOI":"10.1007\/s10994-016-5624-2","article-title":"Vine copulas for mixed data: multi-view clustering for mixed data beyond meta-Gaussian dependencies","volume":"106","author":"Tekumalla","year":"2017","journal-title":"Mach. 
Learn"},{"key":"2023013112063533000_btz599-B45","author":"Tewari","year":"2011"},{"key":"2023013112063533000_btz599-B46","doi-asserted-by":"crossref","first-page":"2405","DOI":"10.1093\/bioinformatics\/btl406","article-title":"Evaluation and comparison of gene clustering methods in microarray analysis","volume":"22","author":"Thalamuthu","year":"2006","journal-title":"Bioinformatics"},{"key":"2023013112063533000_btz599-B47","first-page":"2837","article-title":"Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance","volume":"11","author":"Vinh","year":"2010","journal-title":"J. Mach. Learn. Res"},{"key":"2023013112063533000_btz599-B48","doi-asserted-by":"crossref","first-page":"1113","DOI":"10.1038\/ng.2764","article-title":"The cancer genome atlas pan-cancer analysis project","volume":"45","author":"Weinstein","year":"2013","journal-title":"Nat. Genet"},{"key":"2023013112063533000_btz599-B49","first-page":"1.0","article-title":"MPM: multivariate Projection Methods","author":"Wouters","year":"2011","journal-title":"R Package Version"},{"key":"2023013112063533000_btz599-B50","doi-asserted-by":"crossref","first-page":"501","DOI":"10.1093\/bioinformatics\/btp707","article-title":"Penalized mixtures of factor analyzers with application to clustering high-dimensional microarray data","volume":"26","author":"Xie","year":"2010","journal-title":"Bioinformatics"},{"key":"2023013112063533000_btz599-B51","doi-asserted-by":"crossref","first-page":"81","DOI":"10.1016\/j.ijmedinf.2018.03.003","article-title":"SCADI: a standard dataset for self-care problems classification of children with physical and motor disability","volume":"114","author":"Zarchi","year":"2018","journal-title":"Int. J. Med. 
Inform"}],"container-title":["Bioinformatics"],"original-title":[],"language":"en","link":[{"URL":"http:\/\/academic.oup.com\/bioinformatics\/advance-article-pdf\/doi\/10.1093\/bioinformatics\/btz599\/29217452\/btz599.pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/2\/621\/48990951\/btz599.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article-pdf\/36\/2\/621\/48990951\/btz599.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,1,31]],"date-time":"2023-01-31T21:23:14Z","timestamp":1675200194000},"score":1,"resource":{"primary":{"URL":"https:\/\/academic.oup.com\/bioinformatics\/article\/36\/2\/621\/5542387"}},"subtitle":[],"editor":[{"given":"Jonathan","family":"Wren","sequence":"additional","affiliation":[]}],"short-title":[],"issued":{"date-parts":[[2019,8,1]]},"references-count":51,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2020,1,15]]}},"URL":"https:\/\/doi.org\/10.1093\/bioinformatics\/btz599","relation":{},"ISSN":["1367-4803","1367-4811"],"issn-type":[{"value":"1367-4803","type":"print"},{"value":"1367-4811","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2020,1,15]]},"published":{"date-parts":[[2019,8,1]]}}}