{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,3]],"date-time":"2026-04-03T11:19:16Z","timestamp":1775215156722,"version":"3.50.1"},"reference-count":34,"publisher":"Walter de Gruyter GmbH","issue":"1","license":[{"start":{"date-parts":[[2023,1,1]],"date-time":"2023-01-01T00:00:00Z","timestamp":1672531200000},"content-version":"unspecified","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":[],"published-print":{"date-parts":[[2023,7,17]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>The popularity of artificial intelligence applications is on the rise, and they are producing better outcomes in numerous fields of research. However, the effectiveness of these applications relies heavily on the quantity and quality of data used. While the volume of data available has increased significantly in recent years, this does not always lead to better results, as the information content of the data is also important. This study aims to evaluate a new data preprocessing technique called semi-pivoted QR (SPQR) approximation for machine learning. This technique is designed for approximating sparse matrices and acts as a feature selection algorithm. To the best of our knowledge, it has not been previously applied to data preprocessing in machine learning algorithms. The study aims to evaluate the impact of SPQR on the performance of an unsupervised clustering algorithm and compare its results to those obtained using principal component analysis (PCA) as the preprocessing algorithm. The evaluation is conducted on various publicly available datasets. 
The findings suggest that the SPQR algorithm can produce outcomes comparable to those achieved using PCA without altering the original dataset.<\/jats:p>","DOI":"10.1515\/comp-2022-0278","type":"journal-article","created":{"date-parts":[[2023,7,17]],"date-time":"2023-07-17T14:39:23Z","timestamp":1689604763000},"source":"Crossref","is-referenced-by-count":36,"title":["Data preprocessing impact on machine learning algorithm performance"],"prefix":"10.1515","volume":"13","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-8107-7047","authenticated-orcid":false,"given":"Alberto","family":"Amato","sequence":"first","affiliation":[{"name":"Department of Electrical and Information Engineering Politecnico di Bari , Bari , Italy"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-4878-185X","authenticated-orcid":false,"given":"Vincenzo","family":"Di Lecce","sequence":"additional","affiliation":[{"name":"Department of Electrical and Information Engineering Politecnico di Bari , Bari , Italy"}]}],"member":"374","published-online":{"date-parts":[[2023,7,17]]},"reference":[{"key":"2023090110141808746_j_comp-2022-0278_ref_001","doi-asserted-by":"crossref","unstructured":"G. Tuff\u00e9ry, \u201cFactor analysis,\u201d in Data mining and statistics for decision making, Wiley, 2011, pp. 175\u2013180.","DOI":"10.1002\/9780470979174"},{"key":"2023090110141808746_j_comp-2022-0278_ref_002","doi-asserted-by":"crossref","unstructured":"G. W. Stewart, \u201cFour algorithms for the efficient computation of truncated pivoted QR approximations to a sparse matrix,\u201d Numer. Math., vol. 83, pp. 313\u2013323, 1999.","DOI":"10.1007\/s002110050451"},{"key":"2023090110141808746_j_comp-2022-0278_ref_003","doi-asserted-by":"crossref","unstructured":"M. Popolizio, A. Amato, V. Piuri, and V. Di Lecce, \u201cImproving Classification Performance Using The Semi-Pivoted QR approximation algorithm,\u201d In 2nd FICR International Conference on Rising Threats in Expert Applications and Solutions. 
7\u20138 January 2022.","DOI":"10.1007\/978-981-19-1122-4_29"},{"key":"2023090110141808746_j_comp-2022-0278_ref_004","unstructured":"D. Dua and C. Graff, UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science, 2019. http:\/\/archive.ics.uci.edu\/ml"},{"key":"2023090110141808746_j_comp-2022-0278_ref_005","doi-asserted-by":"crossref","unstructured":"C. Boutsidis, J. Sun, and N. Anerousis, \u201cClustered subset selection and its applications on it service metrics,\u201d Proceedings of the 17th ACM conference on Information and knowledge management (CIKM \u201808). New York, NY, USA: Association for Computing Machinery, 2008, pp. 599\u2013608. 10.1145\/1458082.1458162.","DOI":"10.1145\/1458082.1458162"},{"key":"2023090110141808746_j_comp-2022-0278_ref_006","doi-asserted-by":"crossref","unstructured":"A. T\u0103u\u0163an, A. Rossi, R. de Francisco, and B. Ionescu, \u201cDimensionality reduction for EEG-based sleep stage detection: comparison of autoencoders, principal component analysis and factor analysis,\u201d Biomed. Eng.\/Biomedizi Tech., vol. 66, no. 2, pp. 125\u2013136, 2021. 10.1515\/bmt-2020-0139.","DOI":"10.1515\/bmt-2020-0139"},{"key":"2023090110141808746_j_comp-2022-0278_ref_007","doi-asserted-by":"crossref","unstructured":"M. Balasubramanian and E. L. Schwartz, \u201cThe isomap algorithm and topological stability,\u201d Science, vol. 295, no. 5552, p. 7, 2002.","DOI":"10.1126\/science.295.5552.7a"},{"key":"2023090110141808746_j_comp-2022-0278_ref_008","doi-asserted-by":"crossref","unstructured":"S. T. Roweis and L. K. Saul, \u201cNonlinear dimensionality reduction by locally linear embedding,\u201d Science, vol. 290, no. 5500, pp. 2323\u20132326, 2000.","DOI":"10.1126\/science.290.5500.2323"},{"key":"2023090110141808746_j_comp-2022-0278_ref_009","doi-asserted-by":"crossref","unstructured":"D. L. Donoho and C. 
Grimes, \u201cHessian eigenmaps: Locally linear embedding techniques for high-dimensional data,\u201d Proc. Natl. Acad. Sci. U S A., 2003, vol. 100, no. 10, pp. 5591\u20135596.","DOI":"10.1073\/pnas.1031596100"},{"key":"2023090110141808746_j_comp-2022-0278_ref_010","doi-asserted-by":"crossref","unstructured":"M. Belkin and P. Niyogi, \u201cLaplacian eigenmaps for dimensionality reduction and data representation,\u201d Neural Comput, vol. 15, no. 6, pp. 1373\u20131396, 2003.","DOI":"10.1162\/089976603321780317"},{"key":"2023090110141808746_j_comp-2022-0278_ref_011","doi-asserted-by":"crossref","unstructured":"H. Huang and H. Feng, \u201cGene classification using parameter-free semi-supervised manifold learning,\u201d IEEE\/ACM Trans. Comput. Biology, Bioinf., vol. 9, no. 3, pp. 818\u2013827, May\u2013Jun 2012.","DOI":"10.1109\/TCBB.2011.152"},{"key":"2023090110141808746_j_comp-2022-0278_ref_012","doi-asserted-by":"crossref","unstructured":"J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis, Cambridge, UK: Cambridge University Press, 2004.","DOI":"10.1017\/CBO9780511809682"},{"key":"2023090110141808746_j_comp-2022-0278_ref_013","unstructured":"C. Giraud, Introduction to high-dimensional statistics, vol. 138, Boca Raton, FL, USA: CRC Press, 2014."},{"key":"2023090110141808746_j_comp-2022-0278_ref_014","unstructured":"R. Rubinstein, M. Zibulevsky, and M. Elad, Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit. No. CS Technion report CS-2008-08, Computer Science Department, Technion, 2008."},{"key":"2023090110141808746_j_comp-2022-0278_ref_015","unstructured":"R. A. Johnson, D. W. Wichern, Applied multivariate statistical analysis, Englewood Cliffs, NJ, USA: Prentice, 1992, p. 4."},{"key":"2023090110141808746_j_comp-2022-0278_ref_016","doi-asserted-by":"crossref","unstructured":"M. C. Thrun and A. 
Ultsch, \u201cUncovering High-dimensional Structures of Projections from Dimensionality Reduction Methods,\u201d MethodsX, vol. 7, p. 101093, 2020. 10.1016\/j.mex.2020.101093.","DOI":"10.1016\/j.mex.2020.101093"},{"key":"2023090110141808746_j_comp-2022-0278_ref_017","doi-asserted-by":"crossref","unstructured":"M. W. Berry, S. A. Pulatova, and G. W. Stewart, \u201cComputing sparse reduced-rank approximations to sparse matrices,\u201d ACM Trans. Math. Softw., vol. 31, pp. 252\u2013269, 2005.","DOI":"10.1145\/1067967.1067972"},{"key":"2023090110141808746_j_comp-2022-0278_ref_018","doi-asserted-by":"crossref","unstructured":"G. W. Stewart, \u201cError analysis of the quasi-Gram\u2013Schmidt algorithm,\u201d SIAM J. Matrix Anal. Appl., vol. 27, no. 2, pp. 493\u2013506, 2004.","DOI":"10.1137\/040607794"},{"key":"2023090110141808746_j_comp-2022-0278_ref_019","doi-asserted-by":"crossref","unstructured":"M. Popolizio, A. Amato, V. Piuri, and V. Di Lecce, \u201cImproving classification performance using the semi-pivoted QR approximation algorithm,\u201d in Rising Threats in Expert Applications and Solutions. Lecture Notes in Networks and Systems, vol. 434, V. S. Rathore, S. C. Sharma, J. M. R. Tavares, C. Moreira, B. Surendiran, Eds., Singapore: Springer, 2022. 10.1007\/978-981-19-1122-4_29.","DOI":"10.1007\/978-981-19-1122-4_29"},{"key":"2023090110141808746_j_comp-2022-0278_ref_020","doi-asserted-by":"crossref","unstructured":"J. Minguill\u00f3n, J. Meneses, E. Aibar, N. Ferran-Ferrer, and S. F\u00e0bregues, \u201cExploring the gender gap in the Spanish Wikipedia: Differences in engagement and editing practices,\u201d PLoS One, vol. 16, no. 2, p. e0246702, 2021.","DOI":"10.1371\/journal.pone.0246702"},{"key":"2023090110141808746_j_comp-2022-0278_ref_021","doi-asserted-by":"crossref","unstructured":"P. Borah, D. K. Bhattacharyya, and J. K. 
Kalita, \u201cMalware dataset generation and evaluation,\u201d in 2020 IEEE 4th Conference on Information & Communication Technology (CICT), IEEE, 2020.","DOI":"10.1109\/CICT51604.2020.9312053"},{"key":"2023090110141808746_j_comp-2022-0278_ref_022","doi-asserted-by":"crossref","unstructured":"A. P. Singh, V. Jain, S. Chaudhari, F. A. Kraemer, S. Werner, and V. Garg, \u201cMachine learning-based occupancy estimation using multivariate sensor nodes,\u201d in 2018 IEEE Globecom Workshops (GC Wkshps), 2018.","DOI":"10.1109\/GLOCOMW.2018.8644432"},{"key":"2023090110141808746_j_comp-2022-0278_ref_023","doi-asserted-by":"crossref","unstructured":"S. E. Golovenkin, J. Bac, A. Chervov, E. M. Mirkes, Y. V. Orlova, E. Barillot, et al., \u201cTrajectories, bifurcations, and pseudo-time in large clinical datasets: applications to myocardial infarction and diabetes data,\u201d GigaScience, vol. 9, no. 11, p. giaa128, 2020, 10.1093\/gigascience\/giaa128","DOI":"10.1093\/gigascience\/giaa128"},{"key":"2023090110141808746_j_comp-2022-0278_ref_024","doi-asserted-by":"crossref","unstructured":"A. Saxena, M. Prasad, A. Gupta, N. Bharill, O. P. Patel, A. Tiwari, et al., \u201cA review of clustering techniques and developments,\u201d Neurocomputing, vol. 267, pp. 664\u2013681, 2017, 10.1016\/j.neucom.2017.06.053.","DOI":"10.1016\/j.neucom.2017.06.053"},{"key":"2023090110141808746_j_comp-2022-0278_ref_025","doi-asserted-by":"crossref","unstructured":"W. Pedrycz, \u201cAlgorithms of fuzzy clustering with partial supervision,\u201d Pattern Recog. Lett., vol. 3, pp. 13\u201320, 1985.","DOI":"10.1016\/0167-8655(85)90037-6"},{"key":"2023090110141808746_j_comp-2022-0278_ref_026","doi-asserted-by":"crossref","unstructured":"J. C. Dunn, \u201cA fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,\u201d J. Cybern., vol. 3, pp. 
32\u201357, 1973.","DOI":"10.1080\/01969727308546046"},{"key":"2023090110141808746_j_comp-2022-0278_ref_027","doi-asserted-by":"crossref","unstructured":"J. C. Bezdek, R. Ehrlich, and W. Full, \u201cFCM: The fuzzy c-means clustering algorithm,\u201d Comput. Geosci., vol. 10, no. 2\u20133, pp. 191\u2013203, 1984.","DOI":"10.1016\/0098-3004(84)90020-7"},{"key":"2023090110141808746_j_comp-2022-0278_ref_028","doi-asserted-by":"crossref","unstructured":"A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, \u201cContent based image retrieval at the end of the early years,\u201d IEEE Trans. PAMI, vol. 22, no. 12, pp. 1349\u20131380, Dec 2000.","DOI":"10.1109\/34.895972"},{"key":"2023090110141808746_j_comp-2022-0278_ref_029","doi-asserted-by":"crossref","unstructured":"W. M. Rand, \u201cObjective Criteria for the Evaluation of Clustering Methods,\u201d J. Am. Stat. Assoc., vol. 66, no. 336, pp. 846\u2013850, 1971, 10.2307\/2284239.","DOI":"10.1080\/01621459.1971.10482356"},{"key":"2023090110141808746_j_comp-2022-0278_ref_030","doi-asserted-by":"crossref","unstructured":"P. J. Rousseeuw, \u201cSilhouettes: A graphical aid to the interpretation and validation of cluster analysis,\u201d J. Comput. Appl. Math., vol. 20, pp. 53\u201365, 1987.","DOI":"10.1016\/0377-0427(87)90125-7"},{"key":"2023090110141808746_j_comp-2022-0278_ref_031","doi-asserted-by":"crossref","unstructured":"B. Venkatesh and J. Anuradha, \u201cFuzzy Rank Based Parallel Online Feature Selection Method using Multiple Sliding Windows,\u201d Open Comput. Sci., vol. 11, no. 1, pp. 275\u2013287, 2021, 10.1515\/comp-2020-0169.","DOI":"10.1515\/comp-2020-0169"},{"key":"2023090110141808746_j_comp-2022-0278_ref_032","doi-asserted-by":"crossref","unstructured":"S. Visalakshi and V. Radha, \u201cA literature review of feature selection techniques and applications: Review of feature selection in data mining,\u201d in 2014 IEEE International Conference on Computational Intelligence and Computing Research, 2014, pp. 
1\u20136. 10.1109\/ICCIC.2014.7238499.","DOI":"10.1109\/ICCIC.2014.7238499"},{"key":"2023090110141808746_j_comp-2022-0278_ref_033","doi-asserted-by":"crossref","unstructured":"P. Kromer, J. Plato and V. Snael, \u201cGenetic algorithm for the column subset selection problem,\u201d in 2014 Eighth International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), Birmingham, UK, 2014, pp. 16\u201322. 10.1109\/CISIS.2014.3","DOI":"10.1109\/CISIS.2014.3"},{"key":"2023090110141808746_j_comp-2022-0278_ref_034","doi-asserted-by":"crossref","unstructured":"I. T. Jolliffe, Principal component analysis, New York: Springer Verlag, 1986.","DOI":"10.1007\/978-1-4757-1904-8"}],"container-title":["Open Computer Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.degruyter.com\/document\/doi\/10.1515\/comp-2022-0278\/xml","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.degruyter.com\/document\/doi\/10.1515\/comp-2022-0278\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,1]],"date-time":"2023-09-01T10:20:35Z","timestamp":1693563635000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.degruyter.com\/document\/doi\/10.1515\/comp-2022-0278\/html"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,1,1]]},"references-count":34,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2023,5,25]]},"published-print":{"date-parts":[[2023,5,25]]}},"alternative-id":["10.1515\/comp-2022-0278"],"URL":"https:\/\/doi.org\/10.1515\/comp-2022-0278","relation":{},"ISSN":["2299-1093"],"issn-type":[{"value":"2299-1093","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,1,1]]},"article-number":"20220278"}}