{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T00:56:51Z","timestamp":1760057811410,"version":"build-2065373602"},"reference-count":19,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2025,2,25]],"date-time":"2025-02-25T00:00:00Z","timestamp":1740441600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Agencia Estatal de Investigaci\u00f3n in Spain","award":["PID2020-113462RB-I00"],"award-info":[{"award-number":["PID2020-113462RB-I00"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>Network traffic datasets are essential for the construction of traffic models, often using machine learning (ML) techniques. Among other applications, these models can be employed to solve complex optimization problems or to identify anomalous behaviors, i.e., behaviors that deviate from the established model. However, the performance of the ML model depends, among other factors, on the quality of the data used to train it. Benchmark datasets, with a profound impact on research findings, are often assumed to be of good quality by default. In this paper, we derive four variants of a benchmark dataset in network anomaly detection (UGR\u201916, a flow-based real-world traffic dataset designed for anomaly detection), and show that the choice among variants has a larger impact on model performance than the ML technique used to build the model. To analyze this phenomenon, we propose a methodology to investigate the causes of these differences and to assess the quality of the data labeling. Our results underline the importance of paying more attention to data quality assessment in network anomaly detection.<\/jats:p>","DOI":"10.3390\/data10030033","type":"journal-article","created":{"date-parts":[[2025,2,25]],"date-time":"2025-02-25T06:13:18Z","timestamp":1740463998000},"page":"33","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["Data Quality Tools to Enhance a Network Anomaly Detection Benchmark"],"prefix":"10.3390","volume":"10","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-9804-8122","authenticated-orcid":false,"given":"Jos\u00e9","family":"Camacho","sequence":"first","affiliation":[{"name":"Research Centre for Information and Communication Technologies (CITIC-UGR), University of Granada, 18014 Granada, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7159-8706","authenticated-orcid":false,"given":"Rafael A.","family":"Rodr\u00edguez-G\u00f3mez","sequence":"additional","affiliation":[{"name":"Research Centre for Information and Communication Technologies (CITIC-UGR), University of Granada, 18014 Granada, Spain"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2025,2,25]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","unstructured":"Kalmbach, P., Zerwas, J., Babarczi, P., Blenk, A., Kellerer, W., and Schmid, S. (2018, January 24). Empowering self-driving networks. Proceedings of the Afternoon Workshop on Self-Driving Networks, Budapest, Hungary.","DOI":"10.1145\/3229584.3229587"},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1251","DOI":"10.1109\/COMST.2020.2964534","article-title":"Machine Learning for Resource Management in Cellular and IoT Networks: Potentials, Current Solutions, and Open Challenges","volume":"22","author":"Hussain","year":"2020","journal-title":"IEEE Commun. Surv. Tutor."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3472753","article-title":"A Survey on Data-driven Network Intrusion Detection","volume":"54","author":"Chou","year":"2021","journal-title":"ACM Comput. Surv."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"5371","DOI":"10.1109\/ACCESS.2020.3048319","article-title":"Tight Arms Race: Overview of Current Malware Threats and Trends in Their Detection","volume":"9","author":"Caviglione","year":"2021","journal-title":"IEEE Access"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"41","DOI":"10.1186\/s40537-020-00318-5","article-title":"Cybersecurity data science: An overview from machine learning perspective","volume":"7","author":"Sarker","year":"2020","journal-title":"J. Big Data"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Camacho P\u00e1ez, J., Wasielewska, K., Espinosa, P., and Fuentes Garc\u00eda, M. (2023, January 8\u201312). Quality In\/Quality Out: Data quality more relevant than model choice in anomaly detection with the UGR\u201916. Proceedings of the NOMS 2023-2023 IEEE\/IFIP Network Operations and Management Symposium, Miami, FL, USA.","DOI":"10.1109\/NOMS56928.2023.10154333"},{"key":"ref_7","doi-asserted-by":"crossref","unstructured":"Camacho, J., and Wasielewska, K. (2022, January 25\u201329). Dataset Quality Assessment in Autonomous Networks with Permutation Testing. Proceedings of the NOMS 2022\u20132022 IEEE\/IFIP Network Operations and Management Symposium, Budapest, Hungary.","DOI":"10.1109\/NOMS54207.2022.9789767"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"411","DOI":"10.1016\/j.cose.2017.11.004","article-title":"UGR\u201816: A new dataset for the evaluation of cyclostationarity-based network IDSs","volume":"73","author":"Camacho","year":"2018","journal-title":"Comput. Secur."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"217","DOI":"10.1145\/1090191.1080118","article-title":"Mining anomalies using traffic feature distributions","volume":"35","author":"Lakhina","year":"2005","journal-title":"ACM SIGCOMM Comput. Commun. Rev."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Camacho, J., Maci\u00e1-Fern\u00e1ndez, G., D\u00edaz-Verdejo, J., and Garc\u00eda-Teodoro, P. (May, January 27). Tackling the big data 4 vs for anomaly detection. Proceedings of the 2014 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Toronto, ON, Canada.","DOI":"10.1109\/INFCOMW.2014.6849282"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"101603","DOI":"10.1016\/j.cose.2019.101603","article-title":"Multivariate Big Data Analysis for intrusion detection: 5 steps from the haystack to the needle","volume":"87","author":"Camacho","year":"2019","journal-title":"Comput. Secur."},{"key":"ref_12","unstructured":"Fuentes Garc\u00eda, N.M. (2020). Multivariate Statistical Network Monitoring for Network Security Based on Principal Component Analysis. [Ph.D. Thesis, Universidad de Granada]."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"118","DOI":"10.1016\/j.cose.2016.02.008","article-title":"PCA-based Multivariate Statistical Network Monitoring for anomaly detection","volume":"59","author":"Camacho","year":"2016","journal-title":"Comput. Secur."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"1207","DOI":"10.1162\/089976600300015565","article-title":"New Support Vector Algorithms","volume":"12","author":"Smola","year":"2000","journal-title":"Neural Comput."},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"1443","DOI":"10.1162\/089976601750264965","article-title":"Estimating the Support of a High-Dimensional Distribution","volume":"13","author":"Platt","year":"2001","journal-title":"Neural Comput."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Alpcan, T., and Ba\u015far, T. (2010). Network Security: A Decision and Game-Theoretic Approach, Cambridge University Press.","DOI":"10.1017\/CBO9780511760778"},{"key":"ref_17","unstructured":"Collins, M., and Collins, M.S. (2014). Network Security Through Data Analysis: Building Situational Awareness, O\u2019Reilly Media, Inc."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"194","DOI":"10.1016\/j.chemolab.2017.12.008","article-title":"Evaluation of diagnosis methods in PCA-based Multivariate Statistical Process Control","volume":"172","author":"Camacho","year":"2018","journal-title":"Chemom. Intell. Lab. Syst."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"592","DOI":"10.1002\/cem.1405","article-title":"Observation-based missing data methods for exploratory data analysis to unveil the connection between observations and variables in latent subspace models","volume":"25","author":"Camacho","year":"2011","journal-title":"J. Chemom."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/3\/33\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T16:42:04Z","timestamp":1760028124000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/10\/3\/33"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,25]]},"references-count":19,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2025,3]]}},"alternative-id":["data10030033"],"URL":"https:\/\/doi.org\/10.3390\/data10030033","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2025,2,25]]}}}