{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T08:59:50Z","timestamp":1775638790919,"version":"3.50.1"},"reference-count":52,"publisher":"Association for Computing Machinery (ACM)","issue":"4","license":[{"start":{"date-parts":[[2024,9,30]],"date-time":"2024-09-30T00:00:00Z","timestamp":1727654400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.acm.org\/publications\/policies\/copyright_policy#Background"}],"funder":[{"DOI":"10.13039\/501100006374","name":"NSF","doi-asserted-by":"publisher","award":["IIS-1815866,IIS-1910880,CSSI-2103832,CNS-1852498,NRT-HDR-2021871,DBI-2327954"],"award-info":[{"award-number":["IIS-1815866,IIS-1910880,CSSI-2103832,CNS-1852498,NRT-HDR-2021871,DBI-2327954"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100006374","name":"U.S. Department of Education","doi-asserted-by":"publisher","award":["P200A180088"],"award-info":[{"award-number":["P200A180088"]}],"id":[{"id":"10.13039\/501100006374","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Proc. ACM Manag. Data"],"published-print":{"date-parts":[[2024,10,1]]},"abstract":"<jats:p>Log anomaly detection, critical in identifying system failures and preempting security breaches, finds irregular patterns within large volumes of log data. Modern log anomaly detectors rely on training deep learning models on clean anomaly-free log data. However, such clean log data requires expensive and tedious human labeling. In this paper, we thus propose a robust log anomaly detection framework, Pluto, that automatically selects a clean representative sample subset of the polluted log sequence data to train a Transformer-based anomaly detection model. Pluto features three innovations. 
First, because anomalies tend to concentrate in localized regions of the embedding space of log data, Pluto partitions the sequence embedding space generated by the model into regions, quantifies the pollution level of each region via Gaussian mixture modeling, and discards the regions estimated to be highly polluted. Second, for the remaining, more lightly polluted regions, we select the samples that maximally purify the eigenvector spectrum; this selection can be cast as the NP-hard facility location problem, allowing us to leverage its greedy solution with a (1-(1\/e)) approximation guarantee on optimality. Third, by iteratively alternating between this subset selection, re-training the model on the latest subset, and filtering the subset using dynamic training artifacts generated by the latest model, the selected data is progressively refined. The final sample set is used to retrain the final anomaly detection model. Our experiments on four real-world log benchmark datasets demonstrate that, by retaining 77.7% (BGL) to 96.6% (ThunderBird) of the normal sequences while effectively removing 90.3% (BGL) to 100.0% (ThunderBird, HDFS) of the anomalies, Pluto delivers an absolute F-1 improvement of up to 68.86% (2.16% \u2192 71.02%) over state-of-the-art sample selection methods. 
The implementation of this work is available at https:\/\/github.com\/LeiMa0324\/Pluto-SIGMOD25.<\/jats:p>","DOI":"10.1145\/3677139","type":"journal-article","created":{"date-parts":[[2024,9,30]],"date-time":"2024-09-30T17:41:44Z","timestamp":1727718104000},"page":"1-25","source":"Crossref","is-referenced-by-count":2,"title":["Pluto: Sample Selection for Robust Anomaly Detection on Polluted Log Data"],"prefix":"10.1145","volume":"2","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-9252-2492","authenticated-orcid":false,"given":"Lei","family":"Ma","sequence":"first","affiliation":[{"name":"Worcester Polytechnic Institute, Worcester, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9909-8607","authenticated-orcid":false,"given":"Lei","family":"Cao","sequence":"additional","affiliation":[{"name":"University of Arizona, Tucson, AZ, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0285-6019","authenticated-orcid":false,"given":"Peter M.","family":"VanNostrand","sequence":"additional","affiliation":[{"name":"Worcester Polytechnic Institute, Worcester, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8102-3081","authenticated-orcid":false,"given":"Dennis M.","family":"Hofmann","sequence":"additional","affiliation":[{"name":"Worcester Polytechnic Institute, Worcester, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9817-660X","authenticated-orcid":false,"given":"Yao","family":"Su","sequence":"additional","affiliation":[{"name":"Worcester Polytechnic Institute, Worcester, MA, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-5375-9254","authenticated-orcid":false,"given":"Elke A.","family":"Rundensteiner","sequence":"additional","affiliation":[{"name":"Worcester Polytechnic Institute, Worcester, MA, USA"}]}],"member":"320","published-online":{"date-parts":[[2024,9,30]]},"reference":[{"key":"e_1_2_1_1_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICMIRA.2013.45"},{"key":"e_1_2_1_2_1","volume-title":"International Conference on Machine Learning. 
PMLR, 233--242","author":"Arpit Devansh","year":"2017","unstructured":"Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. 2017. A closer look at memorization in deep networks. In International Conference on Machine Learning. PMLR, 233--242."},{"key":"e_1_2_1_3_1","doi-asserted-by":"publisher","DOI":"10.1145\/342009.335388"},{"key":"e_1_2_1_4_1","doi-asserted-by":"publisher","DOI":"10.1145\/1541880.1541882"},{"key":"e_1_2_1_5_1","volume-title":"Anomaly detection for discrete sequences: A survey","author":"Chandola Varun","year":"2010","unstructured":"Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2010. Anomaly detection for discrete sequences: A survey. IEEE transactions on knowledge and data engineering 24, 5 (2010), 823--839."},{"key":"e_1_2_1_6_1","volume-title":"International Conference on Machine Learning. PMLR, 1062--1070","author":"Chen Pengfei","year":"2019","unstructured":"Pengfei Chen, Ben Ben Liao, Guangyong Chen, and Shengyu Zhang. 2019. Understanding and utilizing deep neural networks trained with noisy labels. In International Conference on Machine Learning. PMLR, 1062--1070."},{"key":"e_1_2_1_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2016.0103"},{"key":"e_1_2_1_8_1","doi-asserted-by":"publisher","DOI":"10.1145\/3133956.3134015"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1111\/j.1469-1809.1936.tb02137.x"},{"key":"e_1_2_1_10_1","doi-asserted-by":"publisher","DOI":"10.56021\/9781421407944"},{"key":"e_1_2_1_11_1","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN52387.2021.9534113"},{"key":"e_1_2_1_12_1","volume-title":"Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in neural information processing systems 31","author":"Han Bo","year":"2018","unstructured":"Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. 
2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in neural information processing systems 31 (2018)."},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.52202\/068431-2329"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICWS.2017.13"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1109\/ISSRE.2016.21"},{"key":"e_1_2_1_16_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588918"},{"key":"e_1_2_1_17_1","unstructured":"Lu Jiang Zhengyuan Zhou Thomas Leung Li-Jia Li and Li Fei-Fei. 2018. MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels. arXiv:1712.05055 [cs.CV]"},{"key":"e_1_2_1_18_1","first-page":"24137","article-title":"Fine samples for learning with noisy labels","volume":"34","author":"Kim Taehyeon","year":"2021","unstructured":"Taehyeon Kim, Jongwoo Ko, JinHwan Choi, Se-Young Yun, et al. 2021. Fine samples for learning with noisy labels. Advances in Neural Information Processing Systems 34 (2021), 24137--24149.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.mlwa.2023.100470"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3510003.3510155"},{"key":"e_1_2_1_21_1","volume-title":"International conference on machine learning. PMLR, 3763--3772","author":"Lee Kimin","year":"2019","unstructured":"Kimin Lee, Sukmin Yun, Kibok Lee, Honglak Lee, Bo Li, and Jinwoo Shin. 2019. Robust inference via generative classifiers for handling noisy labels. In International conference on machine learning. 
PMLR, 3763--3772."},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1145\/3514221.3517861"},{"key":"e_1_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDM.2007.46"},{"key":"e_1_2_1_24_1","doi-asserted-by":"publisher","DOI":"10.1145\/2889160.2889232"},{"key":"e_1_2_1_25_1","volume-title":"2008 Eighth IEEE International Conference on Data Mining","author":"Liu Fei Tony","year":"2008","unstructured":"Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining. IEEE, 413--422."},{"key":"e_1_2_1_26_1","doi-asserted-by":"publisher","DOI":"10.1214\/20-AOS2044"},{"key":"e_1_2_1_27_1","doi-asserted-by":"publisher","DOI":"10.1137\/0725041"},{"key":"e_1_2_1_28_1","doi-asserted-by":"publisher","DOI":"10.1007\/BFb0006528"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.5555\/2886521.2886572"},{"key":"e_1_2_1_30_1","first-page":"11465","article-title":"Coresets for robust training of deep neural networks against noisy labels","volume":"33","author":"Mirzasoleiman Baharan","year":"2020","unstructured":"Baharan Mirzasoleiman, Kaidi Cao, and Jure Leskovec. 2020. Coresets for robust training of deep neural networks against noisy labels. Advances in Neural Information Processing Systems 33 (2020), 11465--11477.","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_1_31_1","volume-title":"An analysis of approximations for maximizing submodular set functions--I. Mathematical programming 14","author":"Nemhauser George L","year":"1978","unstructured":"George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. 1978. An analysis of approximations for maximizing submodular set functions--I. Mathematical programming 14 (1978), 265--294."},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/DSN.2007.103"},{"key":"e_1_2_1_33_1","volume-title":"Generalization guarantees for neural networks via harnessing the low-rank structure of the jacobian. 
arXiv preprint arXiv:1906.05392","author":"Oymak Samet","year":"2019","unstructured":"Samet Oymak, Zalan Fabian, Mingchen Li, and Mahdi Soltanolkotabi. 2019. Generalization guarantees for neural networks via harnessing the low-rank structure of the jacobian. arXiv preprint arXiv:1906.05392 (2019)."},{"key":"e_1_2_1_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/JPROC.2021.3052449"},{"key":"e_1_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDCSW.2011.20"},{"key":"e_1_2_1_36_1","volume-title":"Estimating the support of a high-dimensional distribution. Neural computation 13, 7","author":"Sch\u00f6lkopf Bernhard","year":"2001","unstructured":"Bernhard Sch\u00f6lkopf, John C Platt, John Shawe-Taylor, Alex J Smola, and Robert C Williamson. 2001. Estimating the support of a high-dimensional distribution. Neural computation 13, 7 (2001), 1443--1471."},{"key":"e_1_2_1_37_1","volume-title":"International Conference on Machine Learning. PMLR, 5739--5748","author":"Shen Yanyao","year":"2019","unstructured":"Yanyao Shen and Sujay Sanghavi. 2019. Learning with bad training data via iterative trimmed loss minimization. In International Conference on Machine Learning. PMLR, 5739--5748."},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.1145\/1390156.1390274"},{"key":"e_1_2_1_39_1","volume-title":"How does early stopping help generalization against label noise? arXiv preprint arXiv:1911.08059","author":"Song Hwanjun","year":"2019","unstructured":"Hwanjun Song, Minseok Kim, Dongmin Park, and Jae-Gil Lee. 2019. How does early stopping help generalization against label noise? arXiv preprint arXiv:1911.08059 (2019)."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.1109\/TNNLS.2022.3152527"},{"key":"e_1_2_1_41_1","first-page":"2579","article-title":"Visualizing Data using t-SNE","volume":"9","author":"van der Maaten Laurens","year":"2008","unstructured":"Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. 
Journal of Machine Learning Research 9, 86 (2008), 2579--2605. http:\/\/jmlr.org\/papers\/v9\/vandermaaten08a.html","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3447548.3467125"},{"key":"e_1_2_1_43_1","volume-title":"A topological filter for learning with label noise. Advances in neural information processing systems 33","author":"Wu Pengxiang","year":"2020","unstructured":"Pengxiang Wu, Songzhu Zheng, Mayank Goswami, Dimitris Metaxas, and Chao Chen. 2020. A topological filter for learning with label noise. Advances in neural information processing systems 33 (2020), 21382--21393."},{"key":"e_1_2_1_44_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00013"},{"key":"e_1_2_1_45_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629575.1629587"},{"key":"e_1_2_1_46_1","doi-asserted-by":"publisher","DOI":"10.1145\/1629575.1629587"},{"key":"e_1_2_1_47_1","volume-title":"International Conference on Machine Learning. PMLR, 10789--10798","author":"Yao Quanming","year":"2020","unstructured":"Quanming Yao, Hansi Yang, Bo Han, Gang Niu, and James Tin-Yau Kwok. 2020. Searching to exploit memorization effect in learning with noisy labels. In International Conference on Machine Learning. PMLR, 10789--10798."},{"key":"e_1_2_1_48_1","volume-title":"Identifying Hard Noise in Long-Tailed Sample Distribution. In European Conference on Computer Vision.","author":"Yi Xuanyu","year":"2022","unstructured":"Xuanyu Yi, Kaihua Tang, Xiansheng Hua, Joo Hwee Lim, and Hanwang Zhang. 2022. Identifying Hard Noise in Long-Tailed Sample Distribution. In European Conference on Computer Vision."},{"key":"e_1_2_1_49_1","volume-title":"International Conference on Machine Learning. PMLR, 7164--7173","author":"Yu Xingrui","year":"2019","unstructured":"Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor Tsang, and Masashi Sugiyama. 2019. How does disagreement help generalization against label corruption?. 
In International Conference on Machine Learning. PMLR, 7164--7173."},{"key":"e_1_2_1_50_1","volume-title":"International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Sy8gdB9xx","author":"Zhang Chiyuan","year":"2017","unstructured":"Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2017. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=Sy8gdB9xx"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539155"},{"key":"e_1_2_1_52_1","volume-title":"International Conference on Learning Representations.","author":"Zhou Tianyi","year":"2021","unstructured":"Tianyi Zhou, Shengjie Wang, and Jeff Bilmes. 2021. Robust curriculum learning: from clean label detection to noisy label self-correction. In International Conference on Learning Representations."}],"container-title":["Proceedings of the ACM on Management of Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3677139","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3677139","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,31]],"date-time":"2026-03-31T17:11:03Z","timestamp":1774977063000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3677139"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,9,30]]},"references-count":52,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2024,10,1]]}},"alternative-id":["10.1145\/3677139"],"URL":"https:\/\/doi.org\/10.1145\/3677139","relation":{},"ISSN":["2836-6573"],"issn-type":[{"value":"2836-6573","type":"electronic"}],"subject":[],"published":{"date-parts":[[2024,9,30]]}}}