{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T10:54:14Z","timestamp":1761562454367,"version":"build-2065373602"},"reference-count":37,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2018,12,20]],"date-time":"2018-12-20T00:00:00Z","timestamp":1545264000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["BDCC"],"abstract":"<jats:p>Cybersecurity ventures expect that cyber-attack damage costs will rise to $11.5 billion in 2019 and that a business will fall victim to a cyber-attack every 14 seconds. Notice here that the time frame for such an event is seconds. With petabytes of data generated each day, this is a challenging task for traditional intrusion detection systems (IDSs). Protecting sensitive information is a major concern for both businesses and governments. Therefore, the need for a real-time, large-scale and effective IDS is a must. In this work, we present a cloud-based, fault tolerant, scalable and distributed IDS that uses Apache Spark Structured Streaming and its Machine Learning library (MLlib) to detect intrusions in real-time. To demonstrate the efficacy and effectivity of this system, we implement the proposed system within Microsoft Azure Cloud, as it provides both processing power and storage capabilities. A decision tree algorithm is used to predict the nature of incoming data. For this task, the use of the MAWILab dataset as a data source will give better insights about the system capabilities against cyber-attacks. 
The experimental results showed a 99.95% accuracy and more than 55,175 events per second were processed by the proposed system on a small cluster.<\/jats:p>","DOI":"10.3390\/bdcc3010001","type":"journal-article","created":{"date-parts":[[2018,12,20]],"date-time":"2018-12-20T12:54:36Z","timestamp":1545310476000},"page":"1","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":30,"title":["Comparative Study between Big Data Analysis Techniques in Intrusion Detection"],"prefix":"10.3390","volume":"3","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7161-2897","authenticated-orcid":false,"given":"Mounir","family":"Hafsa","sequence":"first","affiliation":[{"name":"Higher Institute of Computer Science and Telecom (ISITCOM), University of Sousse, Hammam Sousse 4011, Tunisia"}]},{"given":"Farah","family":"Jemili","sequence":"additional","affiliation":[{"name":"MARS Research Lab LR17ES05, Higher Institute of Computer Science and Telecom (ISITCOM), University of Sousse, Hammam Sousse 4011, Tunisia"}]}],"member":"1968","published-online":{"date-parts":[[2018,12,20]]},"reference":[{"key":"ref_1","unstructured":"(2018, June 23). Top-5-Cybersecurity-Concerns-for-2018. Available online: https:\/\/www.csoonline.com\/article\/3241766\/cyber-attacks-espionage\/top-5-cybersecurity-concerns-for-2018.html."},{"key":"ref_2","unstructured":"(2018, August 09). Cisco Cybersecurity Reports. Available online: https:\/\/www.cisco.com\/c\/en\/us\/products\/security\/security-reports.html#~stickynav=2."},{"key":"ref_3","unstructured":"Myers, S., Musacchio, J., and Bao, N. (2010). Intrusion Detection Systems: A Feature and Capability Analysis, Baskin School of Engineering."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"964","DOI":"10.1016\/j.future.2016.11.031","article-title":"Secure integration of IoT and Cloud Computing","volume":"78","author":"Stergiou","year":"2018","journal-title":"FuTure Gener. Comput. 
Syst."},{"key":"ref_5","unstructured":"(2018, December 08). Apache Hadoop. Available online: www.apache.com\/hadoop."},{"key":"ref_6","unstructured":"(2018, December 08). Apache Spark. Available online: www.apache.com\/spark."},{"key":"ref_7","unstructured":"Ar, L., Levent, E., Vipin, K., Aysel, O., and Jaideep, S. (2003, January 27\u201331). A comparative study of anomaly detection schemes in network intrusion detection. Proceedings of the SIAM Conference on Applications of Dynamical. Systems, Snowbird, UT, USA."},{"key":"ref_8","first-page":"39","article-title":"Recognizing unexplained behavior in network traffic","volume":"55","author":"Massimiliano","year":"2013","journal-title":"Netw. Sci. Cybersecur."},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Manzoor, M.A., and Morgan, Y. (2016, January 13\u201315). Real-time Support Vector Machine based Network Intrusion Detection system using Apache Storm. Proceedings of the IEEE 7th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada.","DOI":"10.1109\/IEMCON.2016.7746264"},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1016\/j.procs.2018.01.091","article-title":"Performance evaluation of intrusion detection based on machine learning using Apache Spark","volume":"127","author":"Belouch","year":"2018","journal-title":"Procedia Comput. Sci."},{"key":"ref_11","doi-asserted-by":"crossref","unstructured":"Pallaprolu, S.C., Sankineni, R., Thevar, M., Karabatis, G., and Wang, J. (2017, January 25\u201330). Zero-Day Attack Identification in Streaming Data Using Semantics and Spark. 
Proceedings of the IEEE International Congress on Big Data (BigData Congress), Honolulu, HI, USA.","DOI":"10.1109\/BigDataCongress.2017.25"},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"824","DOI":"10.1016\/j.procs.2016.07.238","article-title":"A Framework for Fast and Efficient Cyber Security Network Intrusion Detection Using Apache Spark","volume":"93","author":"Gupta","year":"2016","journal-title":"Procedia Comput. Sci."},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Terzi, D.S., Terzi, R., and Sagiroglu, S. (2017, January 5\u20138). Big data analytics for network anomaly detection from netflow data. Proceedings of the International Conference on Computer Science and Engineering (UBMK), Antalya, Turkey.","DOI":"10.1109\/UBMK.2017.8093473"},{"key":"ref_14","unstructured":"(2018, June 02). Cisco Systems NetFlow Services Export Version 9. Available online: https:\/\/tools.ietf.org\/html\/rfc3954."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Casas, P., Soro, F., Vanerio, J., Settanni, G., and D\u2032Alconzo, A. (2017, January 25\u201327). Network security and anomaly detection with Big-DAMA, a big data analytics framework. Proceedings of the IEEE 6th International Conference on Cloud Networking (CloudNet), Prague, Czech Republic.","DOI":"10.1109\/CloudNet.2017.8071525"},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Callegari, C., Giordano, S., and Pagano, M. (2016, January 23\u201325). Statistical Network Anomaly Detection: An Experimental Study. Proceedings of the International Conference on Future Network Systems and Security, Paris, France.","DOI":"10.1007\/978-3-319-48021-3_2"},{"key":"ref_17","unstructured":"Fontugne, R., Borgnat, P., Abry, P., and Fukuda, K. (December, January 30). MAWILab: Combining diverse anomaly detectors for automated anomaly labeling and performance benchmarking. 
Proceedings of the International Conference on emerging Networking EXperiments and Technologies (CoNEXT), Philadelphia, PA, USA."},{"key":"ref_18","unstructured":"Fukuda Lab (2018, December 08). Documentation. Available online: http:\/\/www.fukuda-lab.org\/mawilab\/documentation.html."},{"key":"ref_19","unstructured":"Databricks (2018, May 06). About Databricks. Available online: https:\/\/databricks.com\/spark\/about."},{"key":"ref_20","unstructured":"Zubair, N. (2016). Pro Spark Streaming the Zen of Real-Time Analytics Using Apache Spark, Apress."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Armbrust, M., Das, T., Torres, J., Yavuz, B., Zhu, S., Xin, R., Ghodsi, A., Stoica, I., and Zaharia, M. (2018, January 10\u201315). Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. Proceedings of the International Conference on Management of Data, Houston, TX, USA.","DOI":"10.1145\/3183713.3190664"},{"key":"ref_22","unstructured":"(2018, May 06). Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1. Available online: https:\/\/databricks.com\/blog\/2017\/01\/19\/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html."},{"key":"ref_23","unstructured":"(2018, May 07). Benchmarking Structured Streaming on Databricks Runtime against State-of-the-Art Streaming Systems. Available online: https:\/\/databricks.com\/blog\/2017\/10\/11\/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html."},{"key":"ref_24","unstructured":"(2018, December 08). Yahoo Streaming Benchmarks. Available online: https:\/\/github.com\/yahoo\/streaming-benchmarks."},{"key":"ref_25","unstructured":"Apache (2018, June 07). Structured Streaming Programming Guide. Available online: https:\/\/spark.apache.org\/docs\/latest\/structured-streaming-programming-guide.html."},{"key":"ref_26","unstructured":"Microsoft (2018, May 05). Azure Regions. 
Available online: https:\/\/azure.microsoft.com\/en-us\/global-infrastructure\/regions\/."},{"key":"ref_27","unstructured":"(2018, May 05). What is Microsoft Azure and Why Use It?. Available online: https:\/\/www.sumologic.com\/resource\/white-paper\/what-is-microsoft-azure-and-why-use-it\/."},{"key":"ref_28","unstructured":"Microsoft (2018, December 08). Azure Storage Blobs. Available online: https:\/\/azure.microsoft.com\/en-us\/services\/storage\/blobs\/."},{"key":"ref_29","unstructured":"Microsoft (2018, December 08). Azure SDK for PYTHON. Available online: https:\/\/github.com\/Azure\/azure-sdk-for-python."},{"key":"ref_30","unstructured":"(2018, February 06). Apache Parquet vs. CSV Files\u2014DZone Database. Available online: https:\/\/dzone.com\/articles\/how-to-be-a-hero-with-powerful-parquet-google-and."},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Ullah, F., and Babar, M.A. (arXiv, 2018). Architectural Tactics for Big Data Cybersecurity Analytic Systems: A Review, arXiv.","DOI":"10.1016\/j.jss.2019.01.051"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"60","DOI":"10.1109\/MSP.2015.121","article-title":"Security Analytics: Essential Data Analytics Knowledge for Cybersecurity Professionals and Students","volume":"13","author":"Verma","year":"2015","journal-title":"IEEE Secur. Priv."},{"key":"ref_33","unstructured":"(2018, June 03). Mllib Evaluation Metrics. Available online: https:\/\/spark.apache.org\/docs\/2.1.0\/mllib-evaluation-metrics.html."},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Ivanov, T., and Taaffe, J. (2018, January 9\u201313). Exploratory Analysis of Spark Structured Streaming. Proceedings of the International Conference on Performance Engineering, Berlin, Germany.","DOI":"10.1145\/3185768.3186360"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Gaied, I., Jemili, F., and Korbaa, O. (2015, January 17\u201320). Intrusion detection based on Neuro-Fuzzy classification. 
Proceedings of the 2015 IEEE\/ACS 12th International Conference of Computer Systems and Applications (AICCSA), Marrakech, Morocco.","DOI":"10.1109\/AICCSA.2015.7507112"},{"key":"ref_36","doi-asserted-by":"crossref","unstructured":"Essid, M., and Jemili, F. (2016, January 9\u201312). Combining intrusion detection datasets using MapReduce. Proceedings of the 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Budapest, Hungary.","DOI":"10.1109\/SMC.2016.7844977"},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Li, Z. (2013). A Neural Network Based Distributed Intrusion Detection System on Cloud Platform, The University of Toledo.","DOI":"10.1109\/CCIS.2012.6664371"}],"container-title":["Big Data and Cognitive Computing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2504-2289\/3\/1\/1\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T15:35:05Z","timestamp":1760196905000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2504-2289\/3\/1\/1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2018,12,20]]},"references-count":37,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2019,3]]}},"alternative-id":["bdcc3010001"],"URL":"https:\/\/doi.org\/10.3390\/bdcc3010001","relation":{},"ISSN":["2504-2289"],"issn-type":[{"type":"electronic","value":"2504-2289"}],"subject":[],"published":{"date-parts":[[2018,12,20]]}}}