{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T01:31:06Z","timestamp":1760059866702,"version":"build-2065373602"},"reference-count":32,"publisher":"MDPI AG","issue":"3","license":[{"start":{"date-parts":[[2025,7,16]],"date-time":"2025-07-16T00:00:00Z","timestamp":1752624000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"name":"Tianjin Manufacturing High Quality Development Special Foundation","award":["20232185","2021YFB3300903","U23A20299","2024-CN-FW-0287","70306901"],"award-info":[{"award-number":["20232185","2021YFB3300903","U23A20299","2024-CN-FW-0287","70306901"]}]},{"DOI":"10.13039\/501100012166","name":"National Key R&amp;D Program of China","doi-asserted-by":"publisher","award":["20232185","2021YFB3300903","U23A20299","2024-CN-FW-0287","70306901"],"award-info":[{"award-number":["20232185","2021YFB3300903","U23A20299","2024-CN-FW-0287","70306901"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["20232185","2021YFB3300903","U23A20299","2024-CN-FW-0287","70306901"],"award-info":[{"award-number":["20232185","2021YFB3300903","U23A20299","2024-CN-FW-0287","70306901"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"name":"Kingbase Foundation","award":["20232185","2021YFB3300903","U23A20299","2024-CN-FW-0287","70306901"],"award-info":[{"award-number":["20232185","2021YFB3300903","U23A20299","2024-CN-FW-0287","70306901"]}]},{"name":"Roycom Foundation","award":["20232185","2021YFB3300903","U23A20299","2024-CN-FW-0287","70306901"],"award-info":[{"award-number":["20232185","2021YFB3300903","U23A20299","2024-CN-FW-0287","70306901"]}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Informatics"],"abstract":"<jats:p>Hard disk failure prediction is an important proactive maintenance method for storage systems. Recent years have seen significant progress in hard disk failure prediction using high-quality SMART datasets. However, in industrial applications, data loss often occurs during SMART data collection, transmission, and storage. Existing machine learning-based hard disk failure prediction models perform poorly on low-quality datasets. Therefore, this paper proposes a hard disk fault prediction technique based on low-quality datasets. Firstly, based on the original Backblaze dataset, we construct a low-quality dataset, Backblaze-, by simulating sector damage in actual scenarios and deleting 10% to 99% of the data. Time series features like the Absolute Sum of First Difference (ASFD) were introduced to amplify the differences between positive and negative samples and reduce the sensitivity of the model to SMART data loss. Considering the impact of different quality datasets on time window selection, we propose a time window selection formula that selects different time windows based on the proportion of data loss. It is found that the poorer the dataset quality, the longer the time window selection should be. The proposed model achieves a True Positive Rate (TPR) of 99.46%, AUC of 0.9971, and F1 score of 0.9871, with a False Positive Rate (FPR) under 0.04%, even with 80% data loss, maintaining performance close to that on the original dataset.<\/jats:p>","DOI":"10.3390\/informatics12030073","type":"journal-article","created":{"date-parts":[[2025,7,16]],"date-time":"2025-07-16T15:48:22Z","timestamp":1752680902000},"page":"73","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["DFPoLD: A Hard Disk Failure Prediction on Low-Quality Datasets"],"prefix":"10.3390","volume":"12","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-9066-8148","authenticated-orcid":false,"given":"Shuting","family":"Wei","sequence":"first","affiliation":[{"name":"School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0006-6374-9109","authenticated-orcid":false,"given":"Xiaoyu","family":"Lu","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China"},{"name":"Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, China"}]},{"ORCID":"https:\/\/orcid.org\/0009-0002-1589-7124","authenticated-orcid":false,"given":"Hongzhang","family":"Yang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China"}]},{"given":"Chenfeng","family":"Tu","sequence":"additional","affiliation":[{"name":"Haojing Cloud Computing Technology Corporation, Nanjing 211153, China"}]},{"given":"Jiangpu","family":"Guo","sequence":"additional","affiliation":[{"name":"Roycom Information Technology Corporation, Tianjin 301721, China"}]},{"given":"Hailong","family":"Sun","sequence":"additional","affiliation":[{"name":"China Electronics System Technology Corporation, Beijing 100036, China"}]},{"given":"Yu","family":"Feng","sequence":"additional","affiliation":[{"name":"Beijing Kingbase Technology Inc., Beijing 100006, China"}]}],"member":"1968","published-online":{"date-parts":[[2025,7,16]]},"reference":[{"unstructured":"Schroeder, B., and Gibson, G.A. (2007, January 13\u201316). Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?. Proceedings of the 5th USENIX Conference on File and Storage Technologies, San Jose, CA, USA.","key":"ref_1"},{"doi-asserted-by":"crossref","unstructured":"Xu, S., and Xu, X. (2023, January 24\u201326). ConvTrans-TPS: A Convolutional Transformer Model for Disk Failure Prediction in Large-Scale Network Storage Systems. Proceedings of the 26th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Rio de Janeiro, Brazil.","key":"ref_2","DOI":"10.1109\/CSCWD57460.2023.10152728"},{"doi-asserted-by":"crossref","unstructured":"Srinivaas, A., Sakthivel, N.R., and Nair, B.B. (2025). Machine Learning Approaches for Fault Detection in Internal Combustion Engines: A Review and Experimental Investigation. Informatics, 12.","key":"ref_3","DOI":"10.3390\/informatics12010025"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"1559","DOI":"10.1142\/S0218194022500620","article-title":"Approach to Predict Software Vulnerability Based on Multiple-Level N-gram Feature Extraction and Heterogeneous Ensemble Learning","volume":"32","author":"Zhang","year":"2022","journal-title":"Int. J. Softw. Eng. Knowl. Eng."},{"key":"ref_5","first-page":"8878364","article-title":"Hard disk drive failure prediction for mobile edge computing based on an LSTM recurrent neural network","volume":"2021","author":"Shen","year":"2021","journal-title":"Mob. Inf. Syst."},{"doi-asserted-by":"crossref","unstructured":"Coursey, A., Nath, G., Prabhu, S., and Sengupta, S. (2021, January 15\u201318). Remaining useful life estimation of hard disk drives using bidirectional lstm networks. Proceedings of the 2021 IEEE International Conference on Big Data, Orlando, Florida, USA.","key":"ref_6","DOI":"10.1109\/BigData52589.2021.9671605"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"107339","DOI":"10.1016\/j.engappai.2023.107339","article-title":"Cost aware LSTM model for predicting hard disk drive failures based on extremely imbalanced SMART sensors data","volume":"127","author":"Ahmed","year":"2024","journal-title":"Eng. Appl. Artif. Intell."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"460","DOI":"10.1016\/j.future.2023.05.020","article-title":"SPAE: Lifelong disk failure prediction via end-to-end GAN-based anomaly detection with ensemble update","volume":"148","author":"Liu","year":"2023","journal-title":"Future Gener. Comput. Syst."},{"doi-asserted-by":"crossref","unstructured":"Gargiulo, F., Duellmann, D., Arpaia, P., and Moriello, R.S.L. (2021). Predicting hard disk failure by means of automatized labeling and machine learning approach. Appl. Sci., 11.","key":"ref_9","DOI":"10.3390\/app11188293"},{"doi-asserted-by":"crossref","unstructured":"Burrello, A., Pagliari, D.J., Bartolini, A., Benini, L., Macii, E., and Poncino, M. (2021). Predicting hard disk failures in data centers using temporal convolutional neural networks. Euro-Par 2020: Parallel Processing Workshops, Proceedings of the 26th International Conference on Parallel and Distributed Computing, Warsaw, Poland, 24\u201328 August, 2020, Springer International Publishing.","key":"ref_10","DOI":"10.1007\/978-3-030-71593-9_22"},{"key":"ref_11","doi-asserted-by":"crossref","first-page":"3502","DOI":"10.1109\/TC.2016.2538237","article-title":"Health status assessment and failure prediction for hard drives with recurrent neural networks","volume":"65","author":"Xu","year":"2016","journal-title":"IEEE Trans. Comput."},{"unstructured":"Lu, S., Luo, B., Patel, T., Yao, Y., Tiwari, D., and Shi, W. (2020, January 25\u201327). Making disk failure predictions SMARTer!. Proceedings of the 18th USENIX Conference on File and Storage Technologies, Boston, MA, USA.","key":"ref_12"},{"key":"ref_13","first-page":"9","article-title":"Monitoring hard disks with SMART","volume":"2004","author":"Allen","year":"2004","journal-title":"Linux J."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"e5669","DOI":"10.1002\/cpe.5669","article-title":"Cost-efficiency disk failure prediction via threshold-moving","volume":"32","author":"Jiang","year":"2020","journal-title":"Concurr. Comput. Pract. Exp."},{"unstructured":"Zhang, J., Huang, P., Zhou, K., Xie, M., and Schelter, S. (2020, January 15\u201317). HDDse: Enabling High-Dimensional Disk State Embedding for Generic Failure Detection System of Heterogeneous Disks in Large Data Centers. Proceedings of the 2020 USENIX Annual Technical Conference, Boston, MA, USA.","key":"ref_15"},{"doi-asserted-by":"crossref","unstructured":"Wang, H., Zhuge, Q., Sha, E.H.-M., Xu, R., and Song, Y. (2023). Optimizing Efficiency of Machine Learning Based Hard Disk Failure Prediction by Two-Layer Classification-Based Feature Selection. Appl. Sci., 13.","key":"ref_16","DOI":"10.3390\/app13137544"},{"doi-asserted-by":"crossref","unstructured":"Zhang, M., Ge, W., Tang, R., and Liu, P. (2023). Hard disk failure prediction based on blending ensemble learning. Appl. Sci., 13.","key":"ref_17","DOI":"10.3390\/app13053288"},{"doi-asserted-by":"crossref","unstructured":"Ge, W., Liu, P., Zhang, M., Zhang, Z., and Lai, Y. (2024, January 5\u20137). DiskTransformer: A Transformer Network for Hard Disk Failure Prediction. Proceedings of the 7th International Conference on Artificial Intelligence and Big Data (ICAIBD), Beijing, China.","key":"ref_18","DOI":"10.1109\/ICAIBD62003.2024.10604547"},{"unstructured":"Lu, X., Tu, C., Yang, H., Guo, J., and Sun, H. (September, January 30). FPTSF: A Failure Prediction of Hard Disks Based on Time Series Features Towards Low Quality Dataset. Proceedings of the Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, Jinhua, China.","key":"ref_19"},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"1189","DOI":"10.1214\/aos\/1013203451","article-title":"Greedy Function Approximation: A Gradient Boosting Machine","volume":"29","author":"Friedman","year":"2001","journal-title":"Ann. Stat."},{"unstructured":"Han, S., Wu, J., Xu, E., and He, C. (2019). Robust Data Preprocessing for Machine-Learning-Based Disk Failure Prediction in Cloud Production Environments. arXiv.","key":"ref_21"},{"doi-asserted-by":"crossref","unstructured":"Wang, H., Yang, Y., and Yang, H. (2021, January 5\u20138). Hard Disk Failure Prediction Based on Lightgbm with CID. Proceedings of the 2021 IEEE Symposium on Computers and Communications (ISCC), Athens, Greece.","key":"ref_22","DOI":"10.1109\/ISCC53001.2021.9631504"},{"unstructured":"Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017, January 4\u20139). Lightgbm: A highly efficient gradient boosting decision tree. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. Available online: https:\/\/dl.acm.org\/doi\/10.5555\/3294996.3295074.","key":"ref_23"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"13911","DOI":"10.1007\/s11227-021-03838-w","article-title":"A novel LSTM\u2013CNN\u2013grid search-based deep neural network for sentiment analysis","volume":"77","author":"Priyadarshini","year":"2021","journal-title":"J. Supercomput."},{"key":"ref_25","first-page":"875","article-title":"Grid search in hyperparameter optimization of machine learning models for prediction of HIV\/AIDS test results","volume":"44","author":"Belete","year":"2022","journal-title":"Int. J. Comput. Appl."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"5633","DOI":"10.1007\/s00500-020-05560-w","article-title":"An improved grid search algorithm to optimize SVR for prediction","volume":"25","author":"Sun","year":"2021","journal-title":"Soft Comput."},{"doi-asserted-by":"crossref","unstructured":"Shekhar, S., Bansode, A., and Salim, A. (2021, January 8\u201310). A comparative study of hyper-parameter optimization tools. Proceedings of the 2021 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), Brisbane, Australia.","key":"ref_27","DOI":"10.1109\/CSDE53843.2021.9718485"},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"295","DOI":"10.1016\/j.neucom.2020.07.061","article-title":"On hyperparameter optimization of machine learning algorithms: Theory and practice","volume":"415","author":"Yang","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_29","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1016\/j.cjche.2022.04.004","article-title":"Hyper-parameter optimization of multiple machine learning algorithms for molecular property prediction using hyperopt library","volume":"52","author":"Zhang","year":"2022","journal-title":"Chin. J. Chem. Eng."},{"doi-asserted-by":"crossref","unstructured":"Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019, January 4\u20138). Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Anchorage, AK, USA.","key":"ref_30","DOI":"10.1145\/3292500.3330701"},{"doi-asserted-by":"crossref","unstructured":"Lai, J.P., Lin, Y.L., Lin, H.C., Shih, C.Y., Wang, Y.P., and Pai, P.F. (2023). Tree-based machine learning models with optuna in predicting impedance values for circuit analysis. Micromachines, 14.","key":"ref_31","DOI":"10.3390\/mi14020265"},{"key":"ref_32","doi-asserted-by":"crossref","first-page":"55","DOI":"10.1016\/j.ress.2017.03.004","article-title":"Hard drive failure prediction using decision trees","volume":"164","author":"Li","year":"2017","journal-title":"Reliab. Eng. Syst. Saf."}],"container-title":["Informatics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2227-9709\/12\/3\/73\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,9]],"date-time":"2025-10-09T18:10:46Z","timestamp":1760033446000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2227-9709\/12\/3\/73"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,7,16]]},"references-count":32,"journal-issue":{"issue":"3","published-online":{"date-parts":[[2025,9]]}},"alternative-id":["informatics12030073"],"URL":"https:\/\/doi.org\/10.3390\/informatics12030073","relation":{},"ISSN":["2227-9709"],"issn-type":[{"type":"electronic","value":"2227-9709"}],"subject":[],"published":{"date-parts":[[2025,7,16]]}}}