{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,11,28]],"date-time":"2025-11-28T04:34:37Z","timestamp":1764304477540,"version":"build-2065373602"},"reference-count":35,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2017,7,28]],"date-time":"2017-07-28T00:00:00Z","timestamp":1501200000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Informatics"],"abstract":"<jats:p>While big data is revolutionizing scientific research, the tasks of data management and analytics are becoming more challenging than ever. One way to remit the difficulty is to obtain the multilevel hierarchy embedded in the data. Knowing the hierarchy enables not only the revelation of the nature of the data, it is also often the first step in big data analytics. However, current algorithms for learning the hierarchy are typically not scalable to large volumes of data with high dimensionality. To tackle this challenge, in this paper, we propose a new scalable approach for constructing the tree structure from data. Our method builds the tree in a bottom-up manner, with adapted incremental k-means. By referencing the distribution of point distances, one can flexibly control the height of the tree and the branching of each node. Dimension reduction is also conducted as a pre-process, to further boost the computing efficiency. The algorithm takes a parallel design and is implemented with CUDA (Compute Unified Device Architecture), so that it can be efficiently applied to big data. We test the algorithm with two real-world datasets, and the results are visualized with extended circular dendrograms and other visualization techniques.<\/jats:p>","DOI":"10.3390\/informatics4030024","type":"journal-article","created":{"date-parts":[[2017,7,28]],"date-time":"2017-07-28T10:14:34Z","timestamp":1501236874000},"page":"24","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":8,"title":["Big Data Management with Incremental K-Means Trees\u2013GPU-Accelerated Construction and Visualization"],"prefix":"10.3390","volume":"4","author":[{"given":"Jun","family":"Wang","sequence":"first","affiliation":[{"name":"Visual Analytics and Imaging Lab, Computer Science Department, Stony Brook University, Stony Brook, NY 11794, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0674-0910","authenticated-orcid":false,"given":"Alla","family":"Zelenyuk","sequence":"additional","affiliation":[{"name":"Chemical and Material Sciences Division, Pacific Northwest National Laboratory, Richland, WA 99352, USA"}]},{"given":"Dan","family":"Imre","sequence":"additional","affiliation":[{"name":"Imre Consulting, Richland, WA 99352, USA"}]},{"given":"Klaus","family":"Mueller","sequence":"additional","affiliation":[{"name":"Visual Analytics and Imaging Lab, Computer Science Department, Stony Brook University, Stony Brook, NY 11794, USA"}]}],"member":"1968","published-online":{"date-parts":[[2017,7,28]]},"reference":[{"key":"ref_1","unstructured":"(2008). Big data. Nature, 455, 1\u2013136."},{"key":"ref_2","unstructured":"(2011). Dealing with data. Science, 331, 639\u2013806."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"2227","DOI":"10.1109\/TPAMI.2014.2321376","article-title":"Scalable nearest neighbor algorithms for high dimensional data","volume":"36","author":"Muja","year":"2014","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Dasgupta, S., and Hsu, D. (2008, January 5\u20139). Hierarchical sampling for active learning. Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland.","DOI":"10.1145\/1390156.1390183"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1109\/TETC.2014.2330519","article-title":"A survey of clustering algorithms for big data: Taxonomy and empirical analysis","volume":"2","author":"Fahad","year":"2014","journal-title":"IEEE Trans. Emerg. Top. Comput."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"63","DOI":"10.1109\/MCG.2012.87","article-title":"The top 10 challenges in extreme-scale visual analytics","volume":"32","author":"Wong","year":"2012","journal-title":"IEEE Comput. Graph. Appl."},{"key":"ref_7","first-page":"281","article-title":"Some methods for classification and analysis of multivariate observations","volume":"Volume 1","author":"Lucien","year":"1967","journal-title":"Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability"},{"key":"ref_8","unstructured":"Farivar, R., Rebolledo, D., and Chan, E. (2008, January 14\u201317). A parallel implementation of k-means clustering on GPUs. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 2008, Las Vegas, NV, USA."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"191","DOI":"10.1016\/0098-3004(84)90020-7","article-title":"FCM: The fuzzy c-means clustering algorithm","volume":"10","author":"Bezdek","year":"1984","journal-title":"Comput. Geosci."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Zhang, T., Ramakrishnan, R., and Livny, M. (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases, ACM.","DOI":"10.1145\/233269.233324"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"35","DOI":"10.1016\/S0306-4379(01)00008-4","article-title":"CURE: An efficient clustering algorithm for large databases","volume":"26","author":"Guha","year":"2001","journal-title":"Inf. Syst."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"68","DOI":"10.1109\/2.781637","article-title":"Chameleon: hierarchical clustering using dynamic modeling","volume":"32","author":"Karypis","year":"1999","journal-title":"Computer"},{"key":"ref_13","unstructured":"Ester, M., Kriegel, H.P., Sander, J., and Xu, X. (1996, January 2\u20134). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Ankerst, M., Breunig, M.M., Kriegel, H., and Sander, J. (1999). OPTICS: Ordering Points to Identify the Clustering Structure, ACM.","DOI":"10.1145\/304182.304187"},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P. (1998). Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, ACM.","DOI":"10.1145\/276304.276314"},{"key":"ref_16","first-page":"1","article-title":"STING: A statistical information grid approach to spatial data mining","volume":"97","author":"Wang","year":"1997","journal-title":"Int. Conf. Very Large Data"},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"297","DOI":"10.1007\/s003579900058","article-title":"MCLUST: Software for Model-Based Cluster Analysis","volume":"16","author":"Fraley","year":"1999","journal-title":"J. Classif."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1111\/j.2517-6161.1977.tb01600.x","article-title":"Maximum Likelihood from Incomplete Data via the EM Algorithm","volume":"39","author":"Dempster","year":"1977","journal-title":"J. R. Stat. Soc. Ser. B"},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"509","DOI":"10.1145\/361002.361007","article-title":"Multidimensional binary search trees used for associative searching","volume":"18","author":"Bentley","year":"1975","journal-title":"Commun. ACM"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"209","DOI":"10.1145\/355744.355745","article-title":"An Algorithm for Finding Best Matches in Logarithmic Expected Time","volume":"3","author":"Freidman","year":"1977","journal-title":"ACM Trans. Math. Softw."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Silpa-Anan, C., and Hartley, R. (2008, January 23\u201328). Optimised KD-trees for fast image descriptor matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA.","DOI":"10.1109\/CVPR.2008.4587638"},{"key":"ref_22","unstructured":"Lamrous, S., and Taileb, M. (December, January 28). Divisive Hierarchical K-Means. Proceedings of the International Conference on Computational Inteligence for Modelling Control and Automation and International Conference on Intelligent Agents Web Technologies and International Commerce, Sydney, Australia."},{"key":"ref_23","unstructured":"Jose Antonio, M.H., Montero, J., Yanez, J., and Gomez, D. (2010, January 15\u201316). A divisive hierarchical k-means based algorithm for image segmentation. Proceedings of the IEEE International Conference on Intelligent Systems and Knowledge Engineering, Hangzhou, China."},{"key":"ref_24","unstructured":"Nister, D., and Stewenius, H. (2006, January 17\u201322). Scalable Recognition with a Vocabulary Tree. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA."},{"key":"ref_25","unstructured":"Imrich, P., Mueller, K., Mugno, R., Imre, D., Zelenyuk, A., and Zhu, W. (2002, January 28\u201329). Interactive Poster: Visual data mining with the interactive dendrogram. Proceedings of the IEEE Information Visualization Symposium, Boston, MA, USA."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Satish, N., Harris, M., and Garland, M. (2009, January 23\u201329). Designing efficient sorting algorithms for manycore gpus. Proceedings of the IEEE International Parallel and Distributed Processing Symposium, Rome, Italy.","DOI":"10.1109\/IPDPS.2009.5161005"},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Papenhausen, E., Wang, B., Ha, S., Zelenyuk, A., Imre, D., and Mueller, K. (2013, January 6\u20139). GPU-accelerated incremental correlation clustering of large data with visual feedback. Proceedings of the IEEE International Conference on Big Data, Silicon Valley, CA, USA.","DOI":"10.1109\/BigData.2013.6691716"},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Herrero-lopez, S., Williams, J.R., and Sanchez, A. (2010, January 14). Parallel Multiclass Classification Using SVMs on GPUs. Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, Pittsburgh, PA, USA.","DOI":"10.1145\/1735688.1735692"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Liang, S., Liu, Y., Wang, C., and Jian, L. (2009, January 10\u201311). A CUDA-based parallel implementation of K-nearest neighbor algorithm. Proceedings of the International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, Zhangjiajie, China.","DOI":"10.1109\/CYBERC.2009.5399145"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014, January 3\u20137). Caffe: Convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA.","DOI":"10.1145\/2647868.2654889"},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"39","DOI":"10.1109\/2945.981850","article-title":"A flexible approach for visual data mining","volume":"8","author":"Kreuseler","year":"2002","journal-title":"IEEE Trans. Vis. Comput. Graph."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"1693","DOI":"10.1109\/TVCG.2014.2346626","article-title":"Cupid: Cluster-Based Exploration of Geometry Generators with Parallel Coordinates and Radial Trees","volume":"20","author":"Beham","year":"2014","journal-title":"IEEE Trans. Vis. Comput. Graph."},{"key":"ref_33","first-page":"2579","article-title":"Visualizing high-dimensional data using t-sne","volume":"9","author":"Hinton","year":"2008","journal-title":"J. Mach. Learn. Res."},{"key":"ref_34","unstructured":"Harris, M. (2008). Optimizing Parallel Reduction in CUDA. NVIDIA CUDA SDK 2. 2008, NVIDIA Developer."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1080\/02786820802709243","article-title":"SPLAT II: An Aircraft Compatible, Ultra-Sensitive, High Precision Instrument for In-Situ Characterization of the Size and Composition of Fine and Ultrafine Particles","volume":"43","author":"Zelenyuk","year":"2009","journal-title":"Aerosol Sci. Technol."}],"container-title":["Informatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2227-9709\/4\/3\/24\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T18:44:23Z","timestamp":1760208263000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2227-9709\/4\/3\/24"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,7,28]]},"references-count":35,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2017,9]]}},"alternative-id":["informatics4030024"],"URL":"https:\/\/doi.org\/10.3390\/informatics4030024","relation":{},"ISSN":["2227-9709"],"issn-type":[{"type":"electronic","value":"2227-9709"}],"subject":[],"published":{"date-parts":[[2017,7,28]]}}}