{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,4]],"date-time":"2025-12-04T06:18:20Z","timestamp":1764829100878,"version":"build-2065373602"},"reference-count":38,"publisher":"MDPI AG","issue":"7","license":[{"start":{"date-parts":[[2023,7,23]],"date-time":"2023-07-23T00:00:00Z","timestamp":1690070400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62176264"],"award-info":[{"award-number":["62176264"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Entropy"],"abstract":"<jats:p>Traditional PDF document detection technology usually builds a rule or feature library for specific vulnerabilities and therefore is only fit for single detection targets and lacks anti-detection ability. To address these shortcomings, we build a double-layer detection model for malicious PDF documents based on an entropy method with multiple features. First, we address the single detection target problem with the fusion of 222 multiple features, including 130 basic features (such as objects, structure, content stream, metadata, etc.) and 82 dangerous features (such as suspicious and encoding function, etc.), which can effectively resist obfuscation and encryption. Second, we generate the best set of features (a total of 153) by creatively applying an entropy method based on RReliefF and MIC (EMBORAM) to PDF samples with 37 typical document vulnerabilities, which can effectively resist anti-detection methods, such as filling data and imitation attacks. 
Finally, we build a double-layer processing framework to detect samples efficiently through the AdaBoost-optimized random forest algorithm and the robustness-optimized support vector machine algorithm. Compared to the traditional static detection method, this model performs better for various evaluation criteria. The average time of document detection is 1.3 ms, while the accuracy rate reaches 95.9%.<\/jats:p>","DOI":"10.3390\/e25071099","type":"journal-article","created":{"date-parts":[[2023,7,24]],"date-time":"2023-07-24T00:47:03Z","timestamp":1690159623000},"page":"1099","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":2,"title":["Double-Layer Detection Model of Malicious PDF Documents Based on Entropy Method with Multiple Features"],"prefix":"10.3390","volume":"25","author":[{"given":"Enzhou","family":"Song","sequence":"first","affiliation":[{"name":"Information Technology Institute, Information Engineering University, Zhengzhou 450001, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Tao","family":"Hu","sequence":"additional","affiliation":[{"name":"Information Technology Institute, Information Engineering University, Zhengzhou 450001, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Peng","family":"Yi","sequence":"additional","affiliation":[{"name":"Information Technology Institute, Information Engineering University, Zhengzhou 450001, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wenbo","family":"Wang","sequence":"additional","affiliation":[{"name":"Information Technology Institute, Information Engineering University, Zhengzhou 450001, China"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2023,7,23]]},"reference":[{"key":"ref_1","first-page":"54","article-title":"A Survey of Research on Malicious Document 
Detection","volume":"6","author":"Yu","year":"2021","journal-title":"J. Cyber Secur."},{"key":"ref_2","doi-asserted-by":"crossref","unstructured":"Nissim, N., Cohen, A., Moskovitch, R., Shabtai, A., Edry, M., Bar-Ad, O., and Elovici, Y. (2014, January 24\u201326). ALPD: Active Learning Framework for Enhancing the Detection of Malicious PDF Files. Proceedings of the 2014 IEEE Joint Intelligence and Security Informatics Conference, The Hague, The Netherlands.","DOI":"10.1109\/JISIC.2014.23"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Wang, Y. (2021, January 23\u201325). The De-Obfuscation Method in the Static Detection of Malicious PDF Documents. Proceedings of the 7th Annual International Conference on Network and Information Systems for Computers, Guiyang, China.","DOI":"10.1109\/ICNISC54316.2021.00016"},{"key":"ref_4","first-page":"3831","article-title":"PDF Document Detection Model Based on System Calls and Data Provenance","volume":"42","author":"Lei","year":"2022","journal-title":"J. Comput. Appl."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Lu, X., Wang, F., Jiang, C., and Lio, P. (2021). A Universal Malicious Documents Static Detection Framework Based on Feature Generalization. Appl. Sci., 11.","DOI":"10.3390\/app112412134"},{"key":"ref_6","unstructured":"Maiorca, D., Giacinto, G., and Corona, I. (2012). International Conference on Machine Learning and Data Mining in Pattern Recognition, Springer."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"324","DOI":"10.1016\/j.eswa.2016.07.010","article-title":"SFEM: Structural Feature Extraction Methodology for the Detection of Malicious Office Documents Using Machine Learning Methods","volume":"63","author":"Cohen","year":"2016","journal-title":"Expert Syst. Appl."},{"key":"ref_8","first-page":"1537","article-title":"The Malware Detection Based on Data Breach Actions","volume":"54","author":"Wang","year":"2017","journal-title":"J. Comput. Res. 
Dev."},{"key":"ref_9","unstructured":"Feng, D., Yu, M., and Wang, Y. (2017, January 25\u201326). Detecting Malicious PDF Files Using Semi-Supervised Learning Method. Proceedings of the International Conference on Advanced Computer Science Applications and Technologies, Beijing, China."},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Corona, I., Maiorca, D., Ariu, D., and Giacinto, G. (2014, January 3\u20137). Lux0R: Detection of malicious PDF-embedded JavaScript code through discriminant analysis of API references. Proceedings of the Workshop on Artificial Intelligent and Security Workshop, Scottsdale, AZ, USA.","DOI":"10.1145\/2666652.2666657"},{"key":"ref_11","unstructured":"Tzermias, Z., Sykiotakis, G., Polychronakis, M., and Markatos, E.P. (2011). Proceedings of the Fourth European Workshop on System Security, ACM."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Maiorca, D., Ariu, D., Corona, I., and Giacinto, G. (2015, January 9\u201311). An evasion resilient approach to the detection of malicious PDF files. Proceedings of the 2015 International Conference on Information Systems Security and Privacy, Angers, France.","DOI":"10.1007\/978-3-319-27668-7_5"},{"key":"ref_13","first-page":"118","article-title":"Malicious PDF document detection based on mixed feature","volume":"40","author":"Du","year":"2019","journal-title":"J. Commun."},{"key":"ref_14","doi-asserted-by":"crossref","first-page":"151","DOI":"10.1007\/s11416-012-0166-z","article-title":"A practical approach on clustering malicious PDF documents","volume":"8","author":"Vatamanu","year":"2012","journal-title":"J. Comput. Virol."},{"key":"ref_15","unstructured":"Maiorca, D., Ariu, D., Corona, I., and Giacinto, G. (2015, January 9\u201311). A Structural and Content-Based Approach for a Precise and Robust Detection of Malicious PDF Files. 
Proceedings of the 1st International Conference on Information Systems Security and Privacy, Angers, France."},{"key":"ref_16","doi-asserted-by":"crossref","unstructured":"Lu, X., Zhuge, J., Wang, R., Cao, Y., and Chen, Y. (2013, January 7\u201310). De-Obfuscation and Detection of Malicious PDF Files with High Accuracy. Proceedings of the 46th Hawaii International Conference on System Sciences, Wailea, HI, USA.","DOI":"10.1109\/HICSS.2013.166"},{"key":"ref_17","unstructured":"(2022, December 01). ISO 32000-2:2020. Available online: https:\/\/www.pdfa.org\/resource\/iso-32000-pdf\/."},{"key":"ref_18","unstructured":"\u0160rndic, N., and Laskov, P. (2013, January 25\u201327). Detection of malicious pdf files based on hierarchical document structure. Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, USA."},{"key":"ref_19","unstructured":"Jose, T.S., and Santos, D.L. (2018, January 22\u201324). Malicious PDF Documents Detection using Machine Learning Techniques. Proceedings of the 4th International Conference on Information Systems Security and Privacy, Madeira, Portugal."},{"key":"ref_20","doi-asserted-by":"crossref","unstructured":"Laskov, P. (2011, January 5\u20139). Static detection of malicious JavaScript-bearing PDF documents. Proceedings of the Twenty-Seventh Computer Security Applications Conference, Orlando, FL, USA.","DOI":"10.1145\/2076732.2076785"},{"key":"ref_21","unstructured":"Nedim, \u0160., and Pavel, L. (2014, January 17\u201318). Practical Evasion of a Learning-Based Classifier: A Case Study. Proceedings of the 2014 IEEE Symposium on Security and Privacy, San Jose, CA, USA."},{"key":"ref_22","doi-asserted-by":"crossref","unstructured":"Chandran, P.P., and Jeyakarthic, M. (2022, January 9\u201311). Jeyakarthic: Intelligent Optimal Gated Recurrent Unit based Malicious PDF Detection and Classification Model. 
Proceedings of the 2022 International Conference on Applied Artificial Intelligence and Computing, Salem, India.","DOI":"10.1109\/ICAAIC53929.2022.9793116"},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"22","DOI":"10.1186\/s13635-016-0045-0","article-title":"Hidost: A Static Machine-Learning-Based Detector of Malicious Files","volume":"2016","author":"Laskov","year":"2016","journal-title":"EURASIP J. Inf. Secur."},{"key":"ref_24","first-page":"33","article-title":"PDF file vulnerability detection","volume":"57","author":"Wen","year":"2017","journal-title":"J. Tsinghua Univ. (Sci. Technol.)"},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"314","DOI":"10.1016\/j.future.2020.09.015","article-title":"Improving malicious PDF classifier with feature engineering: A data-driven approach","volume":"115","author":"Falah","year":"2021","journal-title":"Future Gener. Comput. Syst."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"101","DOI":"10.7763\/IJET.2016.V8.866","article-title":"A Method for Shellcode Extraction from Malicious Document Files Using Entropy and Emulation","volume":"8","author":"Iwamoto","year":"2016","journal-title":"Int. J. Eng. Technol."},{"key":"ref_27","unstructured":"Xu, M., and Kim, T. (2017, January 16\u201318). Plat Pal: Detecting malicious documents with platform diversity. Proceedings of the USENIX Security Symposium, Vancouver, BC, Canada."},{"key":"ref_28","doi-asserted-by":"crossref","unstructured":"Liu, D., Wang, H., and Stavrou, A. (2014, January 23\u201326). Detecting malicious javascript in pdf through document instrumentation. Proceedings of the 44th Annual IEEE\/IFIP International Conference on Dependable Systems and Networks, Atlanta, GA, USA.","DOI":"10.1109\/DSN.2014.92"},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Yu, M., Jiang, J., Li, G., Li, J., Lou, C., Liu, C., Huang, W., and Wang, Y. (2019, January 10\u201312). 
A Unified Malicious Documents Detection Model Based on Two Layers of Abstraction. Proceedings of the IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems, Zhangjiajie, China.","DOI":"10.1109\/HPCC\/SmartCity\/DSS.2019.00322"},{"key":"ref_30","unstructured":"(2022, December 18). PeePDF. Available online: https:\/\/github.com\/jesparza\/peepdf."},{"key":"ref_31","unstructured":"(2023, January 11). PDFParser. Available online: https:\/\/github.com\/smalot\/pdfparser."},{"key":"ref_32","unstructured":"(2022, December 01). PDFTear. Available online: https:\/\/github.com\/Cryin\/PDFTear."},{"key":"ref_33","unstructured":"(2023, January 12). PDFRate. Available online: https:\/\/github.com\/csmutz\/pdfrate."},{"key":"ref_34","unstructured":"Fernandez, F. (2022, January 28\u201330). Heuristic engines. Proceedings of the 20th Virus Bulletin Conference, Prague, Czech Republic."},{"key":"ref_35","first-page":"1007","article-title":"On Robustness Properties of Convex Risk Minimization Methods for Pattern Recognition","volume":"5","author":"Christmann","year":"2004","journal-title":"J. Mach. Learn. Res."},{"key":"ref_36","unstructured":"(2022, December 02). Google. Available online: https:\/\/www.google.com.hk\/."},{"key":"ref_37","unstructured":"(2022, December 05). Yahoo. Available online: https:\/\/www.yahoo.com\/."},{"key":"ref_38","unstructured":"(2023, March 02). React-pdf. 
Available online: https:\/\/github.com\/wojtekmaj\/react-pdf\/."}],"container-title":["Entropy"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/1099-4300\/25\/7\/1099\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,10]],"date-time":"2025-10-10T20:17:24Z","timestamp":1760127444000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/1099-4300\/25\/7\/1099"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,7,23]]},"references-count":38,"journal-issue":{"issue":"7","published-online":{"date-parts":[[2023,7]]}},"alternative-id":["e25071099"],"URL":"https:\/\/doi.org\/10.3390\/e25071099","relation":{},"ISSN":["1099-4300"],"issn-type":[{"type":"electronic","value":"1099-4300"}],"subject":[],"published":{"date-parts":[[2023,7,23]]}}}