{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,12]],"date-time":"2025-10-12T02:47:35Z","timestamp":1760237255974,"version":"build-2065373602"},"reference-count":46,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2020,3,24]],"date-time":"2020-03-24T00:00:00Z","timestamp":1585008000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Algorithms"],"abstract":"<jats:p>At the dawn of the 10V or big data era, there are a considerable number of sources such as smart phones, IoT devices, social media, smart city sensors, as well as the health care system, all of which constitute but a small portion of the data lakes feeding the entire big data ecosystem. This 10V data growth poses two primary challenges, namely storing and processing. Concerning the latter, new frameworks have been developed including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task typically executed on distributed platforms and as a consequence many algorithmic techniques have been developed tailored for these platforms. This article extensively relies in two ways on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem. First, a vast number of classifiers are applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is performed ab ovo on the same datasets. Specifically, the singular value decomposition of the data matrix first determines a set of transformed attributes which in turn drive the classifiers of MLlib. The twofold purpose of the proposed architecture is to reduce complexity while maintaining a similar if not better level of the metrics of accuracy, recall, and F1.
The intuition behind this approach stems from the engineering principle of breaking down complex problems to simpler and more manageable tasks. The experiments based on the same Spark cluster indicate that the proposed architecture outperforms the individual classifiers with respect to both complexity and the abovementioned metrics.<\/jats:p>","DOI":"10.3390\/a13030071","type":"journal-article","created":{"date-parts":[[2020,3,24]],"date-time":"2020-03-24T13:04:04Z","timestamp":1585055044000},"page":"71","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":16,"title":["Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark"],"prefix":"10.3390","volume":"13","author":[{"given":"Athanasios","family":"Alexopoulos","sequence":"first","affiliation":[{"name":"Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0975-1877","authenticated-orcid":false,"given":"Georgios","family":"Drakopoulos","sequence":"additional","affiliation":[{"name":"Department of Informatics, Ionian University, 49100 Corfu, Greece"}]},{"given":"Andreas","family":"Kanavos","sequence":"additional","affiliation":[{"name":"Computer Engineering and Informatics Department, University of Patras, 26504 Patras, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-6916-3129","authenticated-orcid":false,"given":"Phivos","family":"Mylonas","sequence":"additional","affiliation":[{"name":"Department of Informatics, Ionian University, 49100 Corfu, Greece"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-9555-4775","authenticated-orcid":false,"given":"Gerasimos","family":"Vonitsanos","sequence":"additional","affiliation":[{"name":"Computer Engineering and Informatics Department, University of Patras, 26504 Patras, 
Greece"}]}],"member":"1968","published-online":{"date-parts":[[2020,3,24]]},"reference":[{"key":"ref_1","first-page":"112","article-title":"Incremental Learning for Large Scale Classification Systems","volume":"Volume 520","author":"Iliadis","year":"2018","journal-title":"IFIP International Conference on Artificial Intelligence Applications and Innovations"},{"key":"ref_2","unstructured":"Hand, D.J., Mannila, H., and Smyth, P. (2001). Principles of Data Mining, MIT Press."},{"key":"ref_3","first-page":"37","article-title":"From Data Mining to Knowledge Discovery in Databases","volume":"17","author":"Fayyad","year":"1996","journal-title":"AI Mag."},{"key":"ref_4","unstructured":"Witten, I.H., Eibe, F., Hall, M.A., and Pal, C.J. (2016). Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann. [3rd ed.]."},{"key":"ref_5","first-page":"2313","article-title":"The Dantzig Selector: Statistical Estimation when p is Much Larger than n","volume":"35","author":"Candes","year":"2007","journal-title":"Ann. Stat."},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"799","DOI":"10.3150\/09-BEJ187","article-title":"The Dantzig Selector and Sparsity Oracle Inequalities","volume":"15","author":"Koltchinskii","year":"2009","journal-title":"Bernoulli"},{"key":"ref_7","unstructured":"Koperski, K., Han, J., and Stefanovic, N. (1998, January 11\u201315). An Efficient Two-Step Method for Classification of Spatial Data. Proceedings of the International Symposium on Spatial Data Handling (SDH), Vancouver, BC, Canada."},{"key":"ref_8","unstructured":"Christen, P. (2007, January 3\u20134). A Two-Step Classification Approach to Unsupervised Record Linkage. 
Proceedings of the 6th Australasian Data Mining Conference (AusDM), Gold Coast, Australia."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"2318","DOI":"10.1016\/j.patcog.2015.01.019","article-title":"An Evidential Classifier based on Feature Selection and Two-Step Classification Strategy","volume":"48","author":"Lian","year":"2015","journal-title":"Pattern Recogn."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Drakopoulos, G., Stathopoulou, F., Kanavos, A., Paraskevas, M., Tzimas, G., Mylonas, P., and Iliadis, L. (2019). A Genetic Algorithm for Spatiosocial Tensor Clustering: Exploiting TensorFlow Potential, Springer.","DOI":"10.1007\/s12530-019-09274-9"},{"key":"ref_11","first-page":"47","article-title":"Implementation of Machine Learning and Data Mining to Improve Cybersecurity and Limit Vulnerabilities to Cyber Attacks","volume":"Volume 855","author":"Yang","year":"2020","journal-title":"Nature-Inspired Computation in Data Mining and Machine Learning"},{"key":"ref_12","first-page":"113","article-title":"Prospects of Machine and Deep Learning in Analysis of Vital Signs for the Improvement of Healthcare Services","volume":"Volume 855","author":"Yang","year":"2020","journal-title":"Nature-Inspired Computation in Data Mining and Machine Learning"},{"key":"ref_13","unstructured":"Moniruzzaman, A.B.M., and Hossain, S.A. (2020, March 22). NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison. Available online: http:\/\/article.nadiapub.com\/IJDTA\/vol6_no4\/1.pdf."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"330","DOI":"10.14778\/1920841.1920886","article-title":"Dremel: Interactive Analysis of Web-Scale Datasets","volume":"3","author":"Melnik","year":"2010","journal-title":"Proc. VLDB Endow."},{"key":"ref_15","doi-asserted-by":"crossref","unstructured":"Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., and Czajkowski, G. 
(2010, January 6\u201311). Pregel: A system for large-scale graph processing. Proceedings of the ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN, USA.","DOI":"10.1145\/1807167.1807184"},{"key":"ref_16","unstructured":"Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D. (2007, January 21). Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. Proceedings of the 2nd ACM SIGOPS\/EuroSys European Conference on Computer Systems, New York, NA, USA."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., and Hellerstein, J.M. (2020, March 22). Distributed GraphLab: A Framework for Machine Learning in the Cloud. Available online: http:\/\/vldb.org\/pvldb\/vol5\/p716_yuchenglow_vldb2012.pdf.","DOI":"10.14778\/2212351.2212354"},{"key":"ref_18","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1145\/2934664","article-title":"Apache Spark: A Unified Engine for Big Data Processing","volume":"59","author":"Zaharia","year":"2016","journal-title":"Comm. ACM"},{"key":"ref_19","doi-asserted-by":"crossref","unstructured":"Shvachko, K., Kuang, H., Radia, S., and Chansler, R. (2010, January 3\u20137). The Hadoop Distributed File System. Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, NV, USA.","DOI":"10.1109\/MSST.2010.5496972"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"107","DOI":"10.1145\/1327452.1327492","article-title":"MapReduce: Simplified Data Processing on Large Clusters","volume":"51","author":"Dean","year":"2008","journal-title":"Comm. ACM"},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"2110","DOI":"10.14778\/2831360.2831365","article-title":"Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics","volume":"8","author":"Shi","year":"2015","journal-title":"Proc.
VLDB Endow."},{"key":"ref_22","unstructured":"Koliopoulos, A., Yiapanis, P., Tekiner, F., Nenadic, G., and Keane, J.A. (July, January 27). A Parallel Distributed Weka Framework for Big Data Mining Using Spark. Proceedings of the IEEE International Congress on Big Data, New York, NY, USA."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"10","DOI":"10.1145\/1656274.1656278","article-title":"The WEKA Data Mining Software: An Update","volume":"11","author":"Hall","year":"2009","journal-title":"SIGKDD Explor."},{"key":"ref_24","unstructured":"Yang, L., and Shi, Z. (July, January 29). An Efficient Data Mining Framework on Hadoop using Java Persistence API. Proceedings of the 10th IEEE International Conference on Computer and Information Technology (CIT), Bradford, UK."},{"key":"ref_25","doi-asserted-by":"crossref","unstructured":"Zhao, W., Ma, H., and He, Q. (2009). Parallel K-Means Clustering Based on MapReduce, Springer.","DOI":"10.1007\/978-3-642-10665-1_71"},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"29","DOI":"10.1145\/2481244.2481249","article-title":"Big Graph Mining: Algorithms and Discoveries","volume":"14","author":"Kang","year":"2012","journal-title":"SIGKDD Explor."},{"key":"ref_27","unstructured":"Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., and Isard, M. (2016, January 2\u20134). TensorFlow: A system for large-scale machine learning. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Kyriazidou, I., Drakopoulos, G., Kanavos, A., Makris, C., and Mylonas, P. (2020, March 22). Towards Predicting Mentions to Verified Twitter Accounts: Building Prediction Models over MongoDB with Keras. 
Available online: https:\/\/www.researchgate.net\/profile\/Georgios_Drakopoulos\/publication\/334681267_Towards_Predicting_Mentions_To_Verified_Twitter_Accounts_Building_Prediction_Models_Over_MongoDB_With_Keras\/links\/5d39dfe792851cd046869a4c\/Towards-Predicting-Mentions-To-Verified-Twitter-Accounts-Building-Prediction-Models-Over-MongoDB-With-Keras.pdf.","DOI":"10.5220\/0007810200002366"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Zhang, T. (2004, January 4). Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms. Proceedings of the 21st International Conference on Machine Learning (ICML), Banff, AB, Canada.","DOI":"10.1145\/1015330.1015332"},{"key":"ref_30","unstructured":"Younis, O., and Fahmy, S. (2004, January 7\u201311). Distributed Clustering in Ad-hoc Sensor Networks: A Hybrid, Energy-Efficient Approach. Proceedings of the 23rd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), Hong Kong, China."},{"key":"ref_31","doi-asserted-by":"crossref","first-page":"88","DOI":"10.1007\/978-3-540-24741-8_7","article-title":"DBDC: Density Based Distributed Clustering","volume":"Volume 2992","author":"Bertino","year":"2004","journal-title":"Advances in Database Technology - EDBT 2004"},{"key":"ref_32","unstructured":"Gorodetsky, V., Karsaev, O., and Samoilov, V. (2003, January 13\u201317). Multi-agent Technology for Distributed Data Mining and Classification. Proceedings of the IEEE\/WIC International Conference on Intelligent Agent Technology (IAT), Halifax, NS, Canada."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"425","DOI":"10.1109\/TITB.2009.2036722","article-title":"Structural Action Recognition in Body Sensor Networks: Distributed Classification Based on String Matching","volume":"14","author":"Ghasemzadeh","year":"2010","journal-title":"IEEE Trans. Inform. Tech. 
Biomed."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"104","DOI":"10.1109\/TSP.2010.2086450","article-title":"Distributed Classification of Multiple Observation Sets by Consensus","volume":"59","author":"Kokiopoulou","year":"2011","journal-title":"IEEE Trans. Signal Process."},{"key":"ref_35","doi-asserted-by":"crossref","first-page":"193","DOI":"10.1007\/3-540-36978-3_13","article-title":"Collaborative Signal Processing for Distributed Classification in Sensor Networks","volume":"Volume 2634","author":"Zhao","year":"2003","journal-title":"Information Processing in Sensor Networks"},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"2582","DOI":"10.1016\/j.comnet.2008.05.008","article-title":"Distributed Classification of Acoustic Targets in Wireless audio-sensor Networks","volume":"52","author":"Malhotra","year":"2008","journal-title":"Comput. Netw."},{"key":"ref_37","doi-asserted-by":"crossref","unstructured":"Raychaudhuri, S. (2008, January 7\u201310). Introduction to Monte Carlo simulation. Proceedings of the Winter Simulation Conference, Miami, FL, USA.","DOI":"10.1109\/WSC.2008.4736059"},{"key":"ref_38","first-page":"1235","article-title":"MLlib: Machine Learning in Apache Spark","volume":"17","author":"Meng","year":"2016","journal-title":"J. Mach. Learn. Res."},{"key":"ref_39","first-page":"15","article-title":"An Apache Spark Implementation for Sentiment Analysis on Twitter Data","volume":"Volume 10230","author":"Sellis","year":"2016","journal-title":"Algorithmic Aspects of Cloud Computing"},{"key":"ref_40","first-page":"146","article-title":"Survey of Machine Learning Algorithms on Spark Over DHT-based Structures","volume":"Volume 10230","author":"Sellis","year":"2016","journal-title":"Algorithmic Aspects of Cloud Computing"},{"key":"ref_41","doi-asserted-by":"crossref","unstructured":"Kanavos, A., Nodarakis, N., Sioutas, S., Tsakalidis, A., Tsolis, D., and Tzimas, G. (2017). 
Large Scale Implementations for Twitter Sentiment Classification. Algorithms, 10.","DOI":"10.3390\/a10010033"},{"key":"ref_42","first-page":"249","article-title":"Supervised Machine Learning: A Review of Classification Techniques","volume":"31","author":"Kotsiantis","year":"2007","journal-title":"Inform. (Slovenia)"},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"5","DOI":"10.1023\/A:1010933404324","article-title":"Random Forests","volume":"45","author":"Breiman","year":"2001","journal-title":"Mach. Learn."},{"key":"ref_44","unstructured":"Mohan, A., Chen, Z., and Weinberger, K.Q. (2020, March 22). Web-Search Ranking with Initialized Gradient Boosted Regression Trees. Available online: http:\/\/proceedings.mlr.press\/v14\/mohan11a\/mohan11a.pdf."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"4308","DOI":"10.1038\/ncomms5308","article-title":"Searching for Exotic Particles in High-energy Physics with Deep Learning","volume":"5","author":"Baldi","year":"2014","journal-title":"Nat. Comm."},{"key":"ref_46","doi-asserted-by":"crossref","unstructured":"Reiss, A., and Stricker, D. (2012, January 18\u201322). Introducing a New Benchmarked Dataset for Activity Monitoring. 
Proceedings of the International Symposium on Wearable Computers (ISWC), Newcastle, UK.","DOI":"10.1109\/ISWC.2012.13"}],"container-title":["Algorithms"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1999-4893\/13\/3\/71\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T09:11:08Z","timestamp":1760173868000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1999-4893\/13\/3\/71"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,3,24]]},"references-count":46,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2020,3]]}},"alternative-id":["a13030071"],"URL":"https:\/\/doi.org\/10.3390\/a13030071","relation":{},"ISSN":["1999-4893"],"issn-type":[{"type":"electronic","value":"1999-4893"}],"subject":[],"published":{"date-parts":[[2020,3,24]]}}}