{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T02:09:46Z","timestamp":1760234986627,"version":"build-2065373602"},"reference-count":42,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2021,7,7]],"date-time":"2021-07-07T00:00:00Z","timestamp":1625616000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure. Nodes can easily become unreachable. Furthermore, it is not guaranteed that messages are delivered to their destination. As a result, fault tolerance mechanisms are of paramount importance to achieve resiliency and guarantee continuous progress. In this paper, a fault-tolerant distributed k-means algorithm is proposed on a grid of commodity machines. Machines in such an environment are connected in a peer-to-peer fashion and managed by a gossip protocol with the actor model used as the concurrency model. The fact that no synchronization is needed makes it a good fit for parallel processing. Using the passive replication technique for the leader node and the active replication technique for the workers, the system exhibited robustness against failures. The results showed that the distributed k-means algorithm with no fault-tolerant mechanisms achieved up to a 34% improvement over the Hadoop-based k-means algorithm, while the robust one achieved up to a 12% improvement. The experiments also showed that the overhead, using such techniques, was negligible. 
Moreover, the results indicated that losing up to 10% of the messages had no real impact on the overall performance.<\/jats:p>","DOI":"10.3390\/data6070073","type":"journal-article","created":{"date-parts":[[2021,7,7]],"date-time":"2021-07-07T12:31:25Z","timestamp":1625661085000},"page":"73","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines"],"prefix":"10.3390","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2414-0193","authenticated-orcid":false,"given":"Salah","family":"Taamneh","sequence":"first","affiliation":[{"name":"Department of Computer Science, The Hashemite University, Zarqa 13133, Jordan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-4633-9870","authenticated-orcid":false,"given":"Mo\u2019taz","family":"Al-Hami","sequence":"additional","affiliation":[{"name":"Department of Computer Information Systems, The Hashemite University, Zarqa 13133, Jordan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2962-9449","authenticated-orcid":false,"given":"Hani","family":"Bani-Salameh","sequence":"additional","affiliation":[{"name":"Department of Software Engineering, The Hashemite University, Zarqa 13133, Jordan"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-8959-9969","authenticated-orcid":false,"given":"Alaa E.","family":"Abdallah","sequence":"additional","affiliation":[{"name":"Department of Computer Science, The Hashemite University, Zarqa 13133, Jordan"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2021,7,7]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"321","DOI":"10.1016\/j.cmpb.2016.10.006","article-title":"Optimizing R with SparkR on a commodity cluster for 
biomedical research","volume":"137","author":"Sedlmayr","year":"2016","journal-title":"Comput. Methods Programs Biomed."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"944","DOI":"10.1016\/j.procs.2015.05.230","article-title":"Leveraging workflows and clouds for a multi-frontal solver for finite element meshes","volume":"51","author":"Balis","year":"2015","journal-title":"Procedia Comput. Sci."},{"key":"ref_3","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1002\/spe.2341","article-title":"Iterative big data clustering algorithms: A review","volume":"46","author":"Mohebi","year":"2016","journal-title":"Softw. Pract. Exp."},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"2705","DOI":"10.1007\/s11227-018-2310-0","article-title":"A vectorized k-means algorithm for compressed datasets: Design and experimental analysis","volume":"74","author":"Cebrian","year":"2018","journal-title":"J. Supercomput."},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"195","DOI":"10.1016\/j.neucom.2018.02.072","article-title":"K-means: A revisit","volume":"291","author":"Zhao","year":"2018","journal-title":"Neurocomputing"},{"key":"ref_6","first-page":"1","article-title":"May-happen-in-parallel analysis for actor-based concurrency","volume":"17","author":"Albert","year":"2015","journal-title":"ACM Trans. Comput. Log. (TOCL)"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"439","DOI":"10.1016\/j.future.2020.12.011","article-title":"Akka framework based on the Actor model for executing distributed Fog Computing applications","volume":"117","author":"Srirama","year":"2021","journal-title":"Future Gener. Comput. Syst."},{"doi-asserted-by":"crossref","unstructured":"Yuan, C., and Yang, H. (2019). Research on K-value selection method of k-means clustering algorithm. J. Multidiscip. Sci. 
J., 2.","key":"ref_8","DOI":"10.3390\/j2020016"},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"379","DOI":"10.1016\/j.jocs.2017.09.008","article-title":"Learning automata clustering","volume":"24","author":"Rezvanian","year":"2018","journal-title":"J. Comput. Sci."},{"doi-asserted-by":"crossref","unstructured":"Arsan, T., and Hameez, M.M.N. (2019). A clustering-based approach for improving the accuracy of UWB sensor-based indoor positioning system. Mob. Inf. Syst.","key":"ref_10","DOI":"10.1155\/2019\/6372073"},{"key":"ref_11","first-page":"183","article-title":"Optimized k-means clustering model based on gap statistic","volume":"10","author":"Mahmoud","year":"2019","journal-title":"Int. J. Adv. Comput. Sci. Appl."},{"key":"ref_12","doi-asserted-by":"crossref","first-page":"283","DOI":"10.1007\/s11390-015-1522-5","article-title":"Accelerating iterative big data computing through mpi","volume":"30","author":"Liang","year":"2015","journal-title":"J. Comput. Sci. Technol."},{"key":"ref_13","doi-asserted-by":"crossref","first-page":"80","DOI":"10.1504\/IJGUC.2016.077487","article-title":"A novel near-parallel version of k-means algorithm for n-dimensional data objects using mpi","volume":"7","author":"Savvas","year":"2016","journal-title":"Int. J. Grid Util. Comput."},{"doi-asserted-by":"crossref","unstructured":"Savvas, I.K., and Sofianidou, G.N. (2014, January 23\u201325). Parallelizing k-means algorithm for 1-d data using mpi. Proceedings of the 2014 IEEE 23rd International WETICE Conference, Parma, Italy.","key":"ref_14","DOI":"10.1109\/WETICE.2014.13"},{"key":"ref_15","first-page":"27","article-title":"Comparative Study between Parallel K-Means and Parallel K-Medoids with Message Passing Interface (MPI)","volume":"2","author":"Nhita","year":"2016","journal-title":"Int. J. Inf. Commun. Technol. 
(IJoICT)"},{"key":"ref_16","first-page":"1017","article-title":"A parallel clustering algorithm with mpi-mkmeans","volume":"8","author":"Zhang","year":"2013","journal-title":"J. Comput."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"200","DOI":"10.1016\/j.fcij.2018.03.003","article-title":"An analysis of MapReduce efficiency in document clustering using parallel k-means algorithm","volume":"3","author":"Sardar","year":"2018","journal-title":"Future Comput. Inform. J."},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"467","DOI":"10.1016\/j.future.2020.01.026","article-title":"Fault tolerance of MPI applications in exascale systems: The ULFM solution","volume":"106","author":"Losada","year":"2020","journal-title":"Future Gener. Comput. Syst."},{"unstructured":"Park, D., Wang, J., and Kee, Y.S. (2016). In-storage computing for Hadoop MapReduce framework: Challenges and possibilities. IEEE Trans. Comput.","key":"ref_19"},{"doi-asserted-by":"crossref","unstructured":"Bani-Salameh, H., Al-Qawaqneh, M., and Taamneh, S. (2021). Investigating the Adoption of Big Data Management in Healthcare in Jordan. Data, 6.","key":"ref_20","DOI":"10.3390\/data6020016"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s40537-017-0087-2","article-title":"Clustering large datasets using k-means modified inter and intra clustering (KM-I2C) in Hadoop","volume":"4","author":"Sreedhar","year":"2017","journal-title":"J. Big Data"},{"key":"ref_22","doi-asserted-by":"crossref","first-page":"95","DOI":"10.1007\/s40031-019-00388-x","article-title":"Data Categorization Using Hadoop MapReduce-Based Parallel K-Means Clustering","volume":"100","author":"Ansari","year":"2019","journal-title":"J. Inst. Eng. (India) Ser. 
B"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"1349","DOI":"10.1080\/09720529.2019.1692444","article-title":"Performance evaluation of k-means clustering on Hadoop infrastructure","volume":"22","author":"Vats","year":"2019","journal-title":"J. Discret. Math. Sci. Cryptogr."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"239","DOI":"10.1007\/s10723-019-09503-0","article-title":"Improved k-means clustering algorithm for big data mining under Hadoop parallel framework","volume":"18","author":"Lu","year":"2019","journal-title":"J. Grid Comput."},{"key":"ref_25","first-page":"8","article-title":"Comparing apache spark and map reduce with performance analysis using k-means","volume":"113","author":"Gopalani","year":"2015","journal-title":"Int. J. Comput. Appl."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1186\/s13677-016-0053-0","article-title":"Performance characterization and analysis for Hadoop k-means iteration","volume":"5","author":"Issa","year":"2016","journal-title":"J. Cloud Comput."},{"key":"ref_27","first-page":"2734","article-title":"Apache Hadoop performance evaluation with resources monitoring tools, and parameters optimization: IOT emerging demand","volume":"99","author":"Maabreh","year":"2021","journal-title":"J. Theor. Appl. Inf. Technol."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"2657","DOI":"10.1007\/s11227-016-1949-7","article-title":"Moving metadata from ad hoc files to database tables for robust, highly available, and scalable HDFS","volume":"73","author":"Won","year":"2017","journal-title":"J. Supercomput."},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"182","DOI":"10.1016\/j.envsoft.2018.10.004","article-title":"The land transformation model-cluster framework: Applying k-means and the Spark computing environment for large scale land change analytics","volume":"111","author":"Omrani","year":"2019","journal-title":"Environ. Model. 
Softw."},{"key":"ref_30","doi-asserted-by":"crossref","first-page":"514","DOI":"10.1007\/s11227-016-1896-3","article-title":"Cloud implementation of the k-means algorithm for hyperspectral image analysis","volume":"73","author":"Haut","year":"2017","journal-title":"J. Supercomput."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"183","DOI":"10.1007\/s10776-019-00440-z","article-title":"Intelligent Classification Method of Remote Sensing Image Based on Big Data in Spark Environment","volume":"26","author":"Xing","year":"2019","journal-title":"Int. J. Wirel. Inf. Netw."},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"2110","DOI":"10.14778\/2831360.2831365","article-title":"Clash of the titans: Mapreduce vs. spark for large scale data analytics","volume":"8","author":"Shi","year":"2015","journal-title":"Proc. VLDB Endow."},{"doi-asserted-by":"crossref","unstructured":"Abu\u00edn, J.M., Lopes, N., Ferreira, L., Pena, T.F., and Schmidt, B. (2020). Big Data in metagenomics: Apache Spark vs MPI. PLoS ONE, 15.","key":"ref_33","DOI":"10.1371\/journal.pone.0239741"},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"316","DOI":"10.1007\/s11227-016-1863-z","article-title":"Assessing resilient versus stop-and-restart fault-tolerant solutions in MPI applications","volume":"73","author":"Losada","year":"2017","journal-title":"J. Supercomput."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"e3847","DOI":"10.1002\/cpe.3847","article-title":"Small files storing and computing optimization in Hadoop parallel rendering","volume":"29","author":"Zhang","year":"2017","journal-title":"Concurr. Comput. Pract. 
Exp."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"102573","DOI":"10.1016\/j.parco.2019.102573","article-title":"Distributed ant colony optimization based on actor model","volume":"90","author":"Starzec","year":"2019","journal-title":"Parallel Comput."},{"key":"ref_37","doi-asserted-by":"crossref","first-page":"116","DOI":"10.1016\/j.jss.2018.05.034","article-title":"Coordinated actor model of self-adaptive track-based traffic control systems","volume":"143","author":"Bagheri","year":"2018","journal-title":"J. Syst. Softw."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"586","DOI":"10.1109\/JSAC.2019.2894287","article-title":"NFVactor: A resilient NFV system using the distributed actor model","volume":"37","author":"Duan","year":"2019","journal-title":"IEEE J. Sel. Areas Commun."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"379","DOI":"10.3233\/MGS-200336","article-title":"Parallel and fault-tolerant k-means clustering based on the actor model","volume":"16","author":"Taamneh","year":"2020","journal-title":"Multiagent Grid Syst."},{"unstructured":"Gupta, M. (2012). Akka Essentials, Packt Publishing Ltd.","key":"ref_40"},{"doi-asserted-by":"crossref","unstructured":"Friesen, J. (2019). Processing JSON with Jackson. Java XML and JSON, Springer.","key":"ref_41","DOI":"10.1007\/978-1-4842-4330-5"},{"doi-asserted-by":"crossref","unstructured":"Hayashibara, N., Defago, X., Yared, R., and Katayama, T. (2004, January 18\u201320). The \u03c6 accrual failure detector. 
Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, Florianopolis, Brazil.","key":"ref_42","DOI":"10.1109\/RELDIS.2004.1353004"}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/6\/7\/73\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T06:27:18Z","timestamp":1760164038000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/6\/7\/73"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,7,7]]},"references-count":42,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2021,7]]}},"alternative-id":["data6070073"],"URL":"https:\/\/doi.org\/10.3390\/data6070073","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2021,7,7]]}}}