{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,27]],"date-time":"2026-02-27T15:20:59Z","timestamp":1772205659028,"version":"3.50.1"},"reference-count":40,"publisher":"Association for Computing Machinery (ACM)","issue":"2","license":[{"start":{"date-parts":[[2023,6,13]],"date-time":"2023-06-13T00:00:00Z","timestamp":1686614400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"name":"Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 2","award":["MOE-T2EP20122-0010"],"award-info":[{"award-number":["MOE-T2EP20122-0010"]}]},{"name":"National Research Foundation, Singapore and Infocomm Media Development Authority under its Future Communications Research & Development Programme","award":["FCP-SUTD-RG-2022-006"],"award-info":[{"award-number":["FCP-SUTD-RG-2022-006"]}]},{"name":"Key R&D Program of Hubei","award":["2020BAA020"],"award-info":[{"award-number":["2020BAA020"]}]},{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["2020AAA0108501"],"award-info":[{"award-number":["2020AAA0108501"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2023,6,13]]},"abstract":"<jats:p>Data Stream Clustering (DSC) plays an important role in mining continuous and unlabeled data streams in real-world applications. Over the last decades, numerous DSC algorithms have been proposed with promising clustering accuracy and efficiency. Despite the significant differences among existing DSC algorithms, they are commonly built around four key design aspects: summarizing data structure, window model, outlier detection mechanism, and offline refinement strategy. However, there is a lack of empirical studies on these key design aspects in the same codebase using real-world workloads with distinct characteristics. As a result, it is difficult for researchers to improve upon the state-of-the-art. In this paper, we conduct such a study of DSC on its four key design aspects. We implemented state-of-the-art variants of all of these design choices in an open-sourced platform from scratch and evaluated them using both real-world and synthetic workloads. Our analysis identifies the fundamental issues and trade-offs of each design choice in terms of both accuracy and efficiency. We even find that combining flexible design choices led to the development of a new algorithm called Benne, which can be tuned to achieve either better accuracy or better efficiency compared to the state-of-the-art.<\/jats:p>","DOI":"10.1145\/3589307","type":"journal-article","created":{"date-parts":[[2023,6,20]],"date-time":"2023-06-20T20:26:45Z","timestamp":1687292805000},"page":"1-26","source":"Crossref","is-referenced-by-count":9,"title":["Data Stream Clustering: An In-depth Empirical Study"],"prefix":"10.1145","volume":"1","author":[{"ORCID":"https:\/\/orcid.org\/0009-0007-6483-9357","authenticated-orcid":false,"given":"Xin","family":"Wang","sequence":"first","affiliation":[{"name":"Ohio State University, Columbus, OH, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0005-9070-7433","authenticated-orcid":false,"given":"Zhengru","family":"Wang","sequence":"additional","affiliation":[{"name":"Nvidia, Shanghai, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0981-5567","authenticated-orcid":false,"given":"Zhenyu","family":"Wu","sequence":"additional","affiliation":[{"name":"University of Manchester, Manchester, United Kingdom"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-9927-6925","authenticated-orcid":false,"given":"Shuhao","family":"Zhang","sequence":"additional","affiliation":[{"name":"Singapore University of Technology and Design, Singapore, Singapore"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8451-8656","authenticated-orcid":false,"given":"Xuanhua","family":"Shi","sequence":"additional","affiliation":[{"name":"Huazhong University of Science and Technology, Wuhan, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7904-8821","authenticated-orcid":false,"given":"Li","family":"Lu","sequence":"additional","affiliation":[{"name":"Sichuan University, Chengdu, China"}]}],"member":"320","published-online":{"date-parts":[[2023,6,20]]},"reference":[{"key":"e_1_2_2_1_1","unstructured":"[n.d.]. Covertype. http:\/\/ archive.ics.uci.edu\/ ml\/ datasets\/ Covertype."},{"key":"e_1_2_2_2_1","unstructured":"[n.d.]. Sensor. https:\/\/ www.cse.fau.edu\/ xqzhu\/ stream.html."},{"key":"e_1_2_2_3_1","unstructured":"[n.d.]. Ticat https:\/\/ github.com\/ innerr\/ ticat."},{"key":"e_1_2_2_4_1","volume-title":"Ackermann and et al","author":"Marcel","year":"2012","unstructured":"Marcel R. Ackermann and et al. 2012. StreamKM: A Clustering Algorithm for Data Streams. ACM J. Exp. Algorithmics 17 (May 2012), 30."},{"key":"e_1_2_2_5_1","volume-title":"Proceedings of the 29th International Conference on Very Large Data Bases -","volume":"29","author":"Aggarwal Charu C.","unstructured":"Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. 2003. A Framework for Clustering Evolving Data Streams. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29 (Berlin, Germany) (VLDB '03). VLDB Endowment, 81--92."},{"key":"e_1_2_2_6_1","doi-asserted-by":"publisher","DOI":"10.1145\/507515.507519"},{"key":"e_1_2_2_7_1","doi-asserted-by":"publisher","DOI":"10.1145\/507515.507519"},{"key":"e_1_2_2_8_1","unstructured":"Alessio Bechini and et al. 2020. TSF-DBSCAN: a Novel Fuzzy Density-based Approach for Clustering Unbounded Data Streams. IEEE Transactions on Fuzzy Systems (2020)."},{"key":"e_1_2_2_9_1","doi-asserted-by":"publisher","DOI":"10.5555\/1756006.1859903"},{"key":"e_1_2_2_10_1","doi-asserted-by":"publisher","DOI":"10.5555\/3495724.3496455"},{"key":"e_1_2_2_11_1","doi-asserted-by":"publisher","DOI":"10.1145\/545151.545176"},{"key":"e_1_2_2_12_1","doi-asserted-by":"publisher","DOI":"10.1137\/1.9781611972764.29"},{"key":"e_1_2_2_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3075564.3078887"},{"key":"e_1_2_2_14_1","doi-asserted-by":"publisher","DOI":"10.1145\/1281192.1281210"},{"key":"e_1_2_2_15_1","first-page":"1","article-title":"SAMOA: Scalable Advanced Massive Online Analysis","volume":"16","author":"Francisci Morales Gianmarco De","year":"2015","unstructured":"Gianmarco De Francisci Morales and Albert Bifet. 2015. SAMOA: Scalable Advanced Massive Online Analysis. J. Mach. Learn. Res. 16, 1 (Jan. 2015), 149--153.","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_2_2_16_1","unstructured":"M. Deepa P. Revathy and P. G. Student. 2012. Validation of Document Clustering based on Purity and Entropy measures."},{"key":"e_1_2_2_17_1","volume-title":"A density-based algorithm for discovering clusters in large spatial databases with noise","author":"Ester Martin","unstructured":"Martin Ester, Hans-Peter Kriegel, J\u00f6rg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. AAAI Press, 226--231."},{"key":"e_1_2_2_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/360402.360419"},{"key":"e_1_2_2_19_1","doi-asserted-by":"publisher","DOI":"10.1145\/3164135.3164136"},{"key":"e_1_2_2_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2016.2522412"},{"key":"e_1_2_2_21_1","doi-asserted-by":"crossref","unstructured":"Michael Hahsler Matthew Bolanos and John Forrest. 2015. streamMOA: Interface for MOA Stream Clustering Algorithms. https:\/\/cran.r-project.org\/web\/packages\/streamMOA\/","DOI":"10.32614\/CRAN.package.streamMOA"},{"key":"e_1_2_2_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE.2016.7498264"},{"key":"e_1_2_2_23_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10115-010-0342-8"},{"key":"e_1_2_2_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2020408.2020555"},{"key":"e_1_2_2_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1982.1056489"},{"key":"e_1_2_2_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2018.2876857"},{"key":"e_1_2_2_27_1","unstructured":"J. Macqueen. 1967. Some methods for classification and analysis of multivariate observations. In In 5-th Berkeley Symposium on Mathematical Statistics and Probability. 281--297."},{"key":"e_1_2_2_28_1","doi-asserted-by":"publisher","DOI":"10.1002\/sam.11380"},{"key":"e_1_2_2_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/TKDE.2010.61"},{"key":"e_1_2_2_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2010.160"},{"key":"e_1_2_2_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/1060745.1060753"},{"key":"e_1_2_2_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/SFCS.2001.959917"},{"key":"e_1_2_2_33_1","doi-asserted-by":"publisher","DOI":"10.4018\/IJAEIS.2020010104"},{"key":"e_1_2_2_34_1","doi-asserted-by":"publisher","DOI":"10.1145\/2522968.2522981"},{"key":"e_1_2_2_35_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10618-020-00698--5"},{"key":"e_1_2_2_36_1","volume-title":"IEEE Symposium on Computational Intelligence for Security and Defense Applications. 1--6.","author":"Tavallaee Mahbod","unstructured":"Mahbod Tavallaee, Ebrahim Bagheri, Wei Lu, and Ali A. Ghorbani. 2009. A detailed analysis of the KDD CUP 99 data set. In IEEE Symposium on Computational Intelligence for Security and Defense Applications. 1--6."},{"key":"e_1_2_2_37_1","doi-asserted-by":"publisher","DOI":"10.1145\/1552303.1552307"},{"key":"e_1_2_2_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/235968.233324"},{"key":"e_1_2_2_39_1","volume-title":"BIRCH: A new data clustering algorithm and its applications. Data mining and knowledge discovery 1, 2","author":"Zhang Tian","year":"1997","unstructured":"Tian Zhang, Raghu Ramakrishnan, and Miron Livny. 1997. BIRCH: A new data clustering algorithm and its applications. Data mining and knowledge discovery 1, 2 (1997), 141--182."},{"key":"e_1_2_2_40_1","doi-asserted-by":"publisher","DOI":"10.5555\/3225662.3225976"}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3589307","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3589307","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,6,17]],"date-time":"2025-06-17T16:46:13Z","timestamp":1750178773000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3589307"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,13]]},"references-count":40,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2023,6,13]]}},"alternative-id":["10.1145\/3589307"],"URL":"https:\/\/doi.org\/10.1145\/3589307","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,6,13]]}}}