{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,4]],"date-time":"2026-02-04T19:06:33Z","timestamp":1770231993930,"version":"3.49.0"},"reference-count":35,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2020,9,17]],"date-time":"2020-09-17T00:00:00Z","timestamp":1600300800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2020,9,17]],"date-time":"2020-09-17T00:00:00Z","timestamp":1600300800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["J Big Data"],"published-print":{"date-parts":[[2020,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:sec>\n<jats:title>Introduction<\/jats:title>\n<jats:p>Nowadays, large volumes of data are generated daily at a high rate. Data from health systems, social networks, finance, government, marketing, and bank transactions, as well as from sensors and smart devices, are increasing. The tools and models have to be optimized. In this paper, we applied and compared Machine Learning algorithms (Linear Regression, Na\u00efve Bayes, Decision Tree) to predict diabetes. Furthermore, we performed analytics on flight delays. The main contribution of this paper is to give an overview of Big Data tools and machine learning models. We highlight some metrics that allow us to choose a more accurate model. We predicted diabetes using three machine learning models and then compared their performance. 
Furthermore, we analyzed flight delays and produced a dashboard that can help airline managers get a 360\u00b0 view of their flights and make strategic decisions.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Case description<\/jats:title>\n<jats:p>We applied three Machine Learning algorithms for predicting diabetes and compared their performance to see which model gives the best results. We performed analytics on flight datasets to support decision making and predict flight delays.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Discussion and evaluation<\/jats:title>\n<jats:p>The experiment shows that Linear Regression, Naive Bayes, and Decision Tree give the same accuracy (0.766), but Decision Tree outperforms the other two models with the greatest score (1) and the smallest error (0). For the flight delay analytics, the model could show, for example, the airport that recorded the most flight delays.<\/jats:p>\n<\/jats:sec><jats:sec>\n<jats:title>Conclusions<\/jats:title>\n<jats:p>Several tools and machine learning models for big data analytics have been discussed in this paper. We concluded that, for the same datasets, the model used for prediction has to be chosen carefully. 
In our future works, we will test different models in other fields (climate, banking, insurance.).<\/jats:p>\n<\/jats:sec>","DOI":"10.1186\/s40537-020-00355-0","type":"journal-article","created":{"date-parts":[[2020,9,17]],"date-time":"2020-09-17T15:15:30Z","timestamp":1600355730000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":34,"title":["Using Big Data-machine learning models for diabetes prediction and flight delays analytics"],"prefix":"10.1186","volume":"7","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-0307-887X","authenticated-orcid":false,"given":"Th\u00e9rence","family":"Nibareke","sequence":"first","affiliation":[]},{"given":"Jalal","family":"Laassiri","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2020,9,17]]},"reference":[{"key":"355_CR1","doi-asserted-by":"publisher","first-page":"546","DOI":"10.1016\/j.future.2018.04.032","volume":"86","author":"W Inoubli","year":"2018","unstructured":"Inoubli W, Aridhi S, Mezni H, Maddouri M, Mephu Nguifo E. An experimental survey on big data frameworks. Future Gener Comput Syst. 2018;86:546\u201364.","journal-title":"Future Gener Comput Syst"},{"key":"355_CR2","doi-asserted-by":"publisher","first-page":"109","DOI":"10.1016\/j.procs.2018.08.243","volume":"136","author":"M Petrov","year":"2018","unstructured":"Petrov M, Butakov N, Nasonov D, Melnik M. Adaptive performance model for dynamic scaling Apache Spark Streaming. Procedia Comput Sci. 2018;136:109\u201317.","journal-title":"Procedia Comput Sci"},{"key":"355_CR3","doi-asserted-by":"publisher","first-page":"203","DOI":"10.1016\/j.procs.2016.06.043","volume":"89","author":"M Brahmwar","year":"2016","unstructured":"Brahmwar M, Kumar M, Sikka G. Tolhit\u2014a scheduling algorithm for Hadoop Cluster. Procedia Comput Sci. 
2016;89:203\u20138.","journal-title":"Procedia Comput Sci"},{"key":"355_CR4","doi-asserted-by":"publisher","first-page":"183","DOI":"10.1016\/j.procs.2018.10.166","volume":"141","author":"S Al-Saqqa","year":"2018","unstructured":"Al-Saqqa S, Al-Naymat G, Awajan A. A large-scale sentiment data classification for online reviews under apache spark. Procedia Comput Sci. 2018;141:183\u20139.","journal-title":"Procedia Comput Sci"},{"key":"355_CR5","doi-asserted-by":"publisher","first-page":"244","DOI":"10.1016\/j.future.2017.12.004","volume":"82","author":"W Zheng","year":"2018","unstructured":"Zheng W, Qin Y, Bugingo E, Zhang D, Chen J. Cost optimization for deadline-aware scheduling of big-data processing jobs on clouds. Future Gener Comput Syst. 2018;82:244\u201355.","journal-title":"Future Gener Comput Syst"},{"key":"355_CR6","doi-asserted-by":"publisher","first-page":"91","DOI":"10.1016\/j.egyr.2017.11.002","volume":"4","author":"H Akhavan-Hejazi","year":"2018","unstructured":"Akhavan-Hejazi H, Mohsenian-Rad H. Power systems big data analytics: an assessment of paradigm shift barriers and prospects. Energy Rep. 2018;4:91\u2013100.","journal-title":"Energy Rep"},{"key":"355_CR7","doi-asserted-by":"publisher","first-page":"1890","DOI":"10.1016\/j.sbspro.2015.06.429","volume":"195","author":"C Uzunkaya","year":"2015","unstructured":"Uzunkaya C, Ensari T, Kavurucu Y. Hadoop ecosystem and its analysis on tweets. Procedia Soc Behav Sci. 2015;195:1890\u20137.","journal-title":"Procedia Soc Behav Sci"},{"key":"355_CR8","doi-asserted-by":"publisher","first-page":"423","DOI":"10.1016\/j.future.2018.07.043","volume":"90","author":"NS Naik","year":"2019","unstructured":"Naik NS, Negi A, Anitha R. A data locality based scheduler to enhance MapReduce performance in heterogeneous environments. Future Gener Comput Syst. 
2019;90:423\u201334.","journal-title":"Future Gener Comput Syst"},{"key":"355_CR9","doi-asserted-by":"publisher","first-page":"596","DOI":"10.1016\/j.procs.2018.07.294","volume":"126","author":"OA Sarumi","year":"2018","unstructured":"Sarumi OA, Leung CK, Adetunmbi AO. Spark-based data analytics of sequence motifs in large omics data. Procedia Comput Sci. 2018;126:596\u2013605.","journal-title":"Procedia Comput Sci"},{"key":"355_CR10","doi-asserted-by":"publisher","first-page":"1076","DOI":"10.1016\/j.future.2017.07.003","volume":"86","author":"\u00c1B Hern\u00e1ndez","year":"2018","unstructured":"Hern\u00e1ndez \u00c1B, Perez MS, Gupta S, Munt\u00e9s-Mulero V. Using machine learning to optimize parallelism in big data applications. Future Gener Comput Syst. 2018;86:1076\u201392.","journal-title":"Future Gener Comput Syst"},{"key":"355_CR11","doi-asserted-by":"publisher","first-page":"413","DOI":"10.1016\/j.future.2018.05.084","volume":"88","author":"N Hidalgo","year":"2018","unstructured":"Hidalgo N, Rosas E, Vasquez C, Wladdimiro D. Measuring stream processing systems adaptability under dynamic workloads. Future Gener Comput Syst. 2018;88:413\u201323.","journal-title":"Future Gener Comput Syst"},{"key":"355_CR12","doi-asserted-by":"publisher","first-page":"392","DOI":"10.1016\/j.future.2018.12.002","volume":"95","author":"S Lu","year":"2019","unstructured":"Lu S, Wei X, Rao B, Tak B, Wang L, Wang L. LADRA: log-based abnormal task detection and root-cause analysis in big data processing with Spark. Future Gener Comput Syst. 2019;95:392\u2013403.","journal-title":"Future Gener Comput Syst"},{"key":"355_CR13","doi-asserted-by":"publisher","DOI":"10.1016\/j.jksuci.2018.09.022","author":"ANM JayaLakshmi","year":"2018","unstructured":"JayaLakshmi ANM, Krishna Kishore KV. Performance evaluation of DNN with other machine learning techniques in a cluster using Apache Spark and MLlib. J King Saud Univ Comput Inf Sci. 2018. 
https:\/\/doi.org\/10.1016\/j.jksuci.2018.09.022.","journal-title":"J King Saud Univ Comput Inf Sci"},{"issue":"3","key":"355_CR14","doi-asserted-by":"publisher","first-page":"161","DOI":"10.1016\/j.dcan.2017.10.002","volume":"4","author":"MS Mahdavinejad","year":"2018","unstructured":"Mahdavinejad MS, Rezvan M, Barekatain M, Adibi P, Barnaghi P, Sheth AP. Machine learning for internet of things data analysis: a survey. Digit Commun Netw. 2018;4(3):161\u201375.","journal-title":"Digit Commun Netw"},{"key":"355_CR15","doi-asserted-by":"publisher","first-page":"85","DOI":"10.1016\/j.jnca.2017.11.017","volume":"103","author":"V Rao Chandakanna","year":"2018","unstructured":"Rao Chandakanna V. REHDFS: a random read\/write enhanced HDFS. J Netw Comput Appl. 2018;103:85\u2013100.","journal-title":"J Netw Comput Appl"},{"key":"355_CR16","doi-asserted-by":"crossref","unstructured":"Landset S, Khoshgoftaar TM, Richter AN, Hasanin T. A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J Big Data. 2(1). 2015. http:\/\/www.journalofbigdata.com\/content\/2\/1\/24.","DOI":"10.1186\/s40537-015-0032-1"},{"key":"355_CR17","doi-asserted-by":"publisher","first-page":"456","DOI":"10.1016\/j.procs.2015.04.015","volume":"50","author":"V Subramaniyaswamy","year":"2015","unstructured":"Subramaniyaswamy V, Vijayakumar V, Logesh R, Indragandhi V. Unstructured data analysis on big data using map reduce. Procedia Comput Sci. 2015;50:456\u201365.","journal-title":"Procedia Comput Sci"},{"key":"355_CR18","doi-asserted-by":"crossref","unstructured":"Raj P. The Hadoop ecosystem technologies and tools. In: Advances in computers, vol. 109. Elsevier; 2018. pp. 279\u2013320.","DOI":"10.1016\/bs.adcom.2017.09.002"},{"issue":"4","key":"355_CR19","doi-asserted-by":"publisher","first-page":"3767","DOI":"10.1016\/j.aej.2018.03.006","volume":"57","author":"S Mustafa","year":"2018","unstructured":"Mustafa S, Elghandour I, Ismail MA. 
A machine learning approach for predicting execution time of spark jobs. Alex Eng J. 2018;57(4):3767\u201378.","journal-title":"Alex Eng J"},{"key":"355_CR20","unstructured":"Chambers B, Zaharia M. Spark: The definitive guide; 2018. p. 600."},{"key":"355_CR21","doi-asserted-by":"publisher","first-page":"182","DOI":"10.1016\/j.inffus.2017.09.005","volume":"41","author":"F Carcillo","year":"2018","unstructured":"Carcillo F, Dal Pozzolo A, Le Borgne Y-A, Caelen O, Mazzer Y, Bontempi G. SCARFF: a scalable framework for streaming credit card fraud detection with spark. Inf Fusion. 2018;41:182\u201394.","journal-title":"Inf Fusion"},{"key":"355_CR22","unstructured":"McDonald C. Getting started with Apache Spark from inception to production; 2018. p. 174."},{"key":"355_CR23","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.pmcj.2018.09.003","volume":"51","author":"E Garcia-Ceja","year":"2018","unstructured":"Garcia-Ceja E, Riegler M, Nordgreen T, Jakobsen P, Oedegaard KJ, T\u00f8rresen J. Mental health monitoring with multimodal sensing and machine learning: a survey. Pervasive Mob Comput. 2018;51:1\u201326.","journal-title":"Pervasive Mob Comput"},{"issue":"1","key":"355_CR24","doi-asserted-by":"publisher","first-page":"13","DOI":"10.1186\/s40537-019-0175-6","volume":"6","author":"N Sneha","year":"2019","unstructured":"Sneha N, Gangil T. Analysis of diabetes mellitus for early prediction using optimal features selection. J Big Data. 2019;6(1):13. https:\/\/doi.org\/10.1186\/s40537-019-0175-6.","journal-title":"J Big Data"},{"issue":"1","key":"355_CR25","doi-asserted-by":"publisher","first-page":"26","DOI":"10.1186\/s40537-017-0082-7","volume":"4","author":"N Jayanthi","year":"2017","unstructured":"Jayanthi N, Babu BV, Rao NS. Survey on clinical prediction models for diabetes prediction. J Big Data. 2017;4(1):26. 
https:\/\/doi.org\/10.1186\/s40537-017-0082-7.","journal-title":"J Big Data"},{"issue":"1","key":"355_CR26","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1186\/s40294-016-0023-x","volume":"4","author":"K Farooq","year":"2016","unstructured":"Farooq K, Hussain A. A novel ontology and machine learning driven hybrid cardiovascular clinical prognosis as a complex adaptive clinical system. Complex Adapt Syst Model. 2016;4(1):12. https:\/\/doi.org\/10.1186\/s40294-016-0023-x.","journal-title":"Complex Adapt Syst Model"},{"key":"355_CR27","unstructured":"Sternberg A, Soares J, Carvalho D, et al. A review on flight delay prediction. 2017. arXiv preprint arXiv:1703.06118. https:\/\/arxiv.org\/abs\/1703.06118."},{"key":"355_CR28","doi-asserted-by":"crossref","unstructured":"Chen J, Li M. Chained predictions of flight delay using machine learning. In: AIAA Scitech 2019 Forum. 2019. p. 1661. https:\/\/www.researchgate.net\/publication\/330185077.","DOI":"10.2514\/6.2019-1661"},{"key":"355_CR29","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-018-0136-5","author":"M Zettam","year":"2018","unstructured":"Zettam M, Laassiri J, Enneya N. A MapReduce-based Adjoint method for preventing brain disease. J Big Data. 2018. https:\/\/doi.org\/10.1186\/s40537-018-0136-5.","journal-title":"J Big Data"},{"key":"355_CR30","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-019-0180-9","author":"IM Al-Zuabi","year":"2019","unstructured":"Al-Zuabi IM, Jafar A, Aljoumaa K. Predicting customer\u2019s gender and age depending on mobile phone data. J Big Data. 2019. https:\/\/doi.org\/10.1186\/s40537-019-0180-9.","journal-title":"J Big Data"},{"key":"355_CR31","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-019-0169-4","author":"K Dahdouh","year":"2019","unstructured":"Dahdouh K, Dakkak A, Oughdir L, Ibriz A. Large-scale e-learning recommender system based on Spark and Hadoop. J Big Data. 2019. 
https:\/\/doi.org\/10.1186\/s40537-019-0169-4.","journal-title":"J Big Data"},{"issue":"1","key":"355_CR32","doi-asserted-by":"publisher","first-page":"104","DOI":"10.1186\/s40537-019-0271-7","volume":"6","author":"A Ed-daoudy","year":"2019","unstructured":"Ed-daoudy A, Maalmi K. A new Internet of Things architecture for real-time prediction of various diseases using machine learning on big data environment. J Big Data. 2019;6(1):104. https:\/\/doi.org\/10.1186\/s40537-019-0271-7.","journal-title":"J Big Data"},{"issue":"1","key":"355_CR33","doi-asserted-by":"publisher","first-page":"238","DOI":"10.1186\/2193-1801-2-238","volume":"2","author":"F Hosseinzadeh","year":"2013","unstructured":"Hosseinzadeh F, Kayvanjoo AH, Ebrahimi M, et al. Prediction of lung tumor types based on protein attributes by machine learning algorithms. SpringerPlus. 2013;2(1):238.","journal-title":"SpringerPlus"},{"issue":"1","key":"355_CR34","doi-asserted-by":"publisher","first-page":"97","DOI":"10.1186\/1475-925X-10-97","volume":"10","author":"M Behera","year":"2011","unstructured":"Behera M, Fowler EE, Owonikoko TK, et al. Statistical learning methods as a preprocessing step for survival analysis: evaluation of concept using lung cancer data. Biomed Eng Online. 2011;10(1):97.","journal-title":"Biomed Eng Online"},{"key":"355_CR35","doi-asserted-by":"crossref","unstructured":"Chakrabarty N. A data mining approach to flight arrival delay prediction for american airlines. 2019. 
arXiv preprint arXiv:1903.06740.","DOI":"10.1109\/IEMECONX.2019.8876970"}],"container-title":["Journal of Big Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-020-00355-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s40537-020-00355-0\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s40537-020-00355-0.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2021,9,16]],"date-time":"2021-09-16T23:37:58Z","timestamp":1631835478000},"score":1,"resource":{"primary":{"URL":"https:\/\/journalofbigdata.springeropen.com\/articles\/10.1186\/s40537-020-00355-0"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,9,17]]},"references-count":35,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2020,12]]}},"alternative-id":["355"],"URL":"https:\/\/doi.org\/10.1186\/s40537-020-00355-0","relation":{},"ISSN":["2196-1115"],"issn-type":[{"value":"2196-1115","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,9,17]]},"assertion":[{"value":"23 October 2019","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"3 September 2020","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"17 September 2020","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"Not applicable.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Ethics approval and consent to participate"}},{"value":"Not applicable.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Consent 
for publication"}},{"value":"The authors declare that they have no competing interests.","order":3,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"78"}}