{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,1]],"date-time":"2026-05-01T03:09:45Z","timestamp":1777604985947,"version":"3.51.4"},"reference-count":40,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2020,7,17]],"date-time":"2020-07-17T00:00:00Z","timestamp":1594944000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"European Social Fund, through the Regional Operational Program Centro 2020","award":["UIDB\/05583\/2020"],"award-info":[{"award-number":["UIDB\/05583\/2020"]}]},{"name":"European Social Fund, through the Regional Operational Program Centro 2020","award":["UID\/CEC\/00326\/2020"],"award-info":[{"award-number":["UID\/CEC\/00326\/2020"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Electronics"],"abstract":"<jats:p>Computing and networking systems traditionally record their activity in log files, which have been used for multiple purposes, such as troubleshooting, accounting, post-incident analysis of security breaches, capacity planning and anomaly detection. In earlier systems those log files were processed manually by system administrators, or with the support of basic applications for filtering, compiling and pre-processing the logs for specific purposes. However, as the volume of these log files continues to grow (more logs per system, more systems per domain), it is becoming increasingly difficult to process those logs using traditional tools, especially for less straightforward purposes such as anomaly detection. On the other hand, as systems continue to become more complex, the potential of using large datasets built of logs from heterogeneous sources for detecting anomalies without prior domain knowledge becomes higher. Anomaly detection tools for such scenarios face two challenges. 
First, devising appropriate data analysis solutions for effectively detecting anomalies from large data sources, possibly without prior domain knowledge. Second, adopting data processing platforms able to cope with the large datasets and complex data analysis algorithms required for such purposes. In this paper we address those challenges by proposing an integrated scalable framework that aims at efficiently detecting anomalous events on large amounts of unlabeled data logs. Detection is supported by clustering and classification methods that take advantage of parallel computing environments. We validate our approach using the well-known NASA Hypertext Transfer Protocol (HTTP) logs datasets. Fourteen features were extracted in order to train a k-means model for separating anomalous and normal events in highly coherent clusters. A second model, making use of the XGBoost system implementing a gradient tree boosting algorithm, uses the previous binary clustered data for producing a set of simple interpretable rules. These rules represent the rationale for generalizing its application over a massive number of unseen events in a distributed computing environment. 
The classified anomaly events produced by our framework can be used, for instance, as candidates for further forensic and compliance auditing analysis in security management.<\/jats:p>","DOI":"10.3390\/electronics9071164","type":"journal-article","created":{"date-parts":[[2020,7,20]],"date-time":"2020-07-20T06:08:17Z","timestamp":1595225297000},"page":"1164","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":57,"title":["Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets"],"prefix":"10.3390","volume":"9","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7380-9511","authenticated-orcid":false,"given":"Jo\u00e3o","family":"Henriques","sequence":"first","affiliation":[{"name":"Department of Informatics Engineering, University of Coimbra, 3030-290 Coimbra, Portugal"},{"name":"Informatics Department, Polytechnic of Viseu, 3504-510 Viseu, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7558-2330","authenticated-orcid":false,"given":"Filipe","family":"Caldeira","sequence":"additional","affiliation":[{"name":"Department of Informatics Engineering, University of Coimbra, 3030-290 Coimbra, Portugal"},{"name":"CISeD\u2014Research Centre in Digital Services, Polytechnic of Viseu, 3504-510 Viseu, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9278-6503","authenticated-orcid":false,"given":"Tiago","family":"Cruz","sequence":"additional","affiliation":[{"name":"Department of Informatics Engineering, University of Coimbra, 3030-290 Coimbra, Portugal"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-5079-8327","authenticated-orcid":false,"given":"Paulo","family":"Sim\u00f5es","sequence":"additional","affiliation":[{"name":"Department of Informatics Engineering, University of Coimbra, 3030-290 Coimbra, 
Portugal"}]}],"member":"1968","published-online":{"date-parts":[[2020,7,17]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"72","DOI":"10.1016\/j.ijcip.2018.04.004","article-title":"Integrated protection of industrial control systems from cyber-attacks: The ATENA approach","volume":"21","author":"Adamsky","year":"2018","journal-title":"Int. J. Crit. Infrastruct. Prot."},{"key":"ref_2","unstructured":"Rosa, L., Proen\u00e7a, J., Henriques, J., Graveto, V., Cruz, T., Sim\u00f5es, P., Caldeira, F., and Monteiro, E. (2017, January 29\u201330). An Evolved Security Architecture for Distributed Industrial Automation and Control Systems. Proceedings of the 16th European Conference on Cyber Warfare and Security (ECCWS 2017), Dublin, Ireland."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1007\/s10115-007-0114-2","article-title":"Top 10 algorithms in data mining","volume":"14","author":"Wu","year":"2008","journal-title":"Knowl. Inf. Syst."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1189","DOI":"10.1214\/aos\/1013203451","article-title":"Greedy function approximation: A gradient boosting machine","volume":"29","author":"Friedman","year":"2001","journal-title":"Ann. Stat."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Chen, T., and Guestrin, C. (2016, January 13\u201317). Xgboost: A scalable tree boosting system. Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA.","DOI":"10.1145\/2939672.2939785"},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Rocklin, M. (2015, January 6\u201312). Dask: Parallel computation with blocked algorithms and task scheduling. 
Proceedings of the 14th Python in Science Conference, Austin, TX, USA.","DOI":"10.25080\/Majora-7b98e3ed-013"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1080\/00401706.1969.10490657","article-title":"Procedures for detecting outlying observations in samples","volume":"11","author":"Grubbs","year":"1969","journal-title":"Technometrics"},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"570","DOI":"10.1093\/comjnl\/bxr026","article-title":"A survey of outlier detection methods in network anomaly identification","volume":"54","author":"Gogoi","year":"2011","journal-title":"Comput. J."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Zheng, Y., Zhang, H., and Yu, Y. (2015, January 3\u20136). Detecting collective anomalies from multiple spatio-temporal datasets across different domains. Proceedings of the 23rd SIGSPATIAL international conference on advances in geographic information systems, Seattle, WA, USA.","DOI":"10.1145\/2820783.2820813"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"597","DOI":"10.1142\/S0219622006002258","article-title":"10 challenging problems in data mining research","volume":"5","author":"Yang","year":"2006","journal-title":"Int. J. Inf. Technol. Decis. Making"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"178","DOI":"10.1016\/j.proeng.2011.08.036","article-title":"Anomaly detection based on enhanced DBScan algorithm","volume":"15","author":"Chen","year":"2011","journal-title":"Procedia Eng."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Syarif, I., Prugel-Bennett, A., and Wills, G. (2012, January 24\u201326). Unsupervised clustering approach for network anomaly detection. Proceedings of the International Conference on Networked Digital Technologies, Dubai, UAE.","DOI":"10.1007\/978-3-642-30507-8_13"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Hoglund, A.J., Hatonen, K., and Sorvari, A.S. (2000, January 27). 
A computer host-based user anomaly detection system using the self-organizing map. Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Como, Italy.","DOI":"10.1109\/IJCNN.2000.861504"},{"key":"ref_14","unstructured":"Lichodzijewski, P., Zincir-Heywood, A.N., and Heywood, M.I. (2002, January 12\u201317). Host-based intrusion detection using self-organizing maps. Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN\u201902), Honolulu, HI, USA."},{"key":"ref_15","unstructured":"M\u00fcnz, G., Li, S., and Carle, G. (2017, June 15). Traffic Anomaly Detection Using K-Means Clustering. GI\/ITG Workshop MMBnet, 2007; pp. 13\u201314. Available online: https:\/\/pdfs.semanticscholar.org\/634e\/2f1a20755e7ab18e8e8094f48e140a32dacd.pdf."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Tian, L., and Jianwen, W. (2009, January 25\u201327). Research on network intrusion detection system based on improved k-means clustering algorithm. Proceedings of the 2009 International Forum on Computer Science-Technology and Applications, Chongqing, China.","DOI":"10.1109\/IFCSTA.2009.25"},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Eslamnezhad, M., and Varjani, A.Y. (2014, January 9\u201311). Intrusion detection based on MinMax K-means clustering. Proceedings of the 7\u2019th International Symposium on Telecommunications (IST\u20192014), Tehran, Iran.","DOI":"10.1109\/ISTEL.2014.7000814"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"29","DOI":"10.5121\/ijdkp.2014.4203","article-title":"A new clustering approach for anomaly intrusion detection","volume":"4","author":"Ranjan","year":"2014","journal-title":"Int. J. Data Min. Knowl. Manage. Process"},{"key":"ref_19","unstructured":"Makanju, A., Zincir-Heywood, A.N., and Milios, E.E. (2013, January 27\u201331). Investigating event log analysis with minimum apriori information. 
Proceedings of the 2013 IFIP\/IEEE International Symposium on Integrated Network Management (IM 2013), Ghent, Belgium."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"798","DOI":"10.3923\/itj.2011.798.806","article-title":"Filtering events using clustering in heterogeneous security logs","volume":"10","author":"Udzir","year":"2011","journal-title":"Inf. Technol. J."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1117","DOI":"10.3906\/elk-1302-19","article-title":"An unsupervised heterogeneous log-based framework for anomaly detection","volume":"24","author":"Hajamydeen","year":"2016","journal-title":"Turkish J. Electr. Eng. Comput. Sci."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Varuna, S., and Natesan, P. (2015, January 26\u201328). An integration of k-means clustering and na\u00efve bayes classifier for Intrusion Detection. Proceedings of the 2015 3rd International Conference on Signal Processing, Communication and Networking, ICSCN 2015, Tamil Nadu, India.","DOI":"10.1109\/ICSCN.2015.7219835"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"13","DOI":"10.33736\/jita.45.2014","article-title":"K-means clustering and naive bayes classification for intrusion detection","volume":"4","author":"Muda","year":"2016","journal-title":"J. IT Asia"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"753","DOI":"10.1016\/j.asej.2013.01.003","article-title":"A hybrid network intrusion detection framework based on random forests and weighted k-means","volume":"4","author":"Elbasiony","year":"2013","journal-title":"Ain Shams Eng. J."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Sequeira, K., and Zaki, M. (2002, January 23\u201326). ADMIT: Anomaly-based data mining for intrusions. 
Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada.","DOI":"10.1145\/775047.775103"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"439","DOI":"10.1016\/S0167-4048(02)00514-X","article-title":"Use of k-nearest neighbor classifier for intrusion detection","volume":"21","author":"Liao","year":"2002","journal-title":"Comput. Secur."},{"key":"ref_27","doi-asserted-by":"crossref","unstructured":"Breunig, M.M., Kriegel, H.P., Ng, R.T., and Sander, J. (2000, January 16\u201318). LOF: Identifying density-based local outliers. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA.","DOI":"10.1145\/342009.335388"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"622","DOI":"10.14778\/2180912.2180915","article-title":"Scalable k-means++","volume":"5","author":"Bahmani","year":"2012","journal-title":"Proc. VLDB Endowment"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"596","DOI":"10.1007\/s00454-011-9340-1","article-title":"K-means requires exponentially many iterations even in the plane","volume":"45","author":"Vattani","year":"2011","journal-title":"Discrete Comput. Geom."},{"key":"ref_30","unstructured":"Arthur, D., and Vassilvitskii, S. (2006, January 5\u20137). How slow is the k-means method?. Proceedings of the twenty-second annual symposium on Computational geometry, Sedona, AZ, USA."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1007\/BF00058655","article-title":"Bagging predictors","volume":"24","author":"Breiman","year":"1996","journal-title":"Mach. Learn."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. 
Learn."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"337","DOI":"10.1214\/aos\/1016218223","article-title":"Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors)","volume":"28","author":"Friedman","year":"2000","journal-title":"Ann. Statist."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"942","DOI":"10.1109\/TPAMI.2013.159","article-title":"Learning nonlinear functions using regularized greedy forest","volume":"36","author":"Johnson","year":"2014","journal-title":"IEEE Trans. Pattern Anal. Mach. Intell."},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., and Bowers, S. (2014, January 24). Practical lessons from predicting clicks on ads at facebook. Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, New York, NY, USA.","DOI":"10.1145\/2648584.2648589"},{"key":"ref_36","unstructured":"(2018, July 16). Kaggle. Available online: www.Kaggle.com."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"367","DOI":"10.1016\/S0167-9473(01)00065-2","article-title":"Stochastic gradient boosting","volume":"38","author":"Friedman","year":"2002","journal-title":"Comput. Stat. Data Anal."},{"key":"ref_38","unstructured":"(2020, February 15). H2O.ai. H2O Framework for Machine Learning. Available online: http:\/\/docs.h2o.ai\/h2o\/latest-stable\/h2o-docs\/index.html."},{"key":"ref_39","first-page":"1235","article-title":"MLlib: Machine Learning in Apache Spark","volume":"17","author":"Meng","year":"2016","journal-title":"J. Mach. Learn. Res."},{"key":"ref_40","unstructured":"NASA (2018, July 01). 
NASA HTTP, Available online: http:\/\/ita.ee.lbl.gov\/html\/contrib\/NASA-HTTP.html."}],"container-title":["Electronics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2079-9292\/9\/7\/1164\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T09:49:30Z","timestamp":1760176170000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2079-9292\/9\/7\/1164"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,7,17]]},"references-count":40,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2020,7]]}},"alternative-id":["electronics9071164"],"URL":"https:\/\/doi.org\/10.3390\/electronics9071164","relation":{},"ISSN":["2079-9292"],"issn-type":[{"value":"2079-9292","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,7,17]]}}}