{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,20]],"date-time":"2026-02-20T18:31:17Z","timestamp":1771612277639,"version":"3.50.1"},"reference-count":40,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,2,24]],"date-time":"2025-02-24T00:00:00Z","timestamp":1740355200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,2,24]],"date-time":"2025-02-24T00:00:00Z","timestamp":1740355200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100012166","name":"National Key R&D Program of China","doi-asserted-by":"crossref","award":["31020"],"award-info":[{"award-number":["31020"]}],"id":[{"id":"10.13039\/501100012166","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Cybersecurity"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>In recent years, new malicious email attacks have emerged. We summarize two major challenges in the current field of malicious email detection using machine learning algorithms. (1) Current works on malicious email detection use different datasets and lack a unified and comprehensive open source dataset standard for evaluating detection performance. In addition, outdated data makes it difficult to detect new types of malicious email attacks. (2) There are limitations in feature selection and extraction. Relying only on static features or body textual features cannot satisfy the detection of both common phishing or spam email and new malicious emails that exploit protocol vulnerabilities. To address these problems, we propose the Exploiting Protocol Vulnerability Malicious Email (EPVME) dataset, which contains 49,136 malicious email samples. 
The EPVME dataset is constructed by summarizing and simulating the novel types of malicious email attacks that exploit email protocol vulnerabilities. In our dataset, the coverage of the types of malicious emails and the number of them are significantly increased. By collecting the currently available open source datasets, we build a large-scale dataset with 660,985 samples. Through two sets of comparative experiments on the dataset containing EPVME, we verify the necessity, reliability, and validity of the EPVME dataset. By using a large and comprehensive open source email dataset, we hope to help subsequent work on malicious email detection achieve comparative performance. Furthermore, we propose a new feature selection and construction method that combines both static features and textual features. We extract 79 static features from both the header and body parts of email samples, perform textual feature extraction on the pre-processed body parts, and combine various machine learning algorithms for detection model construction and experimental comparison. 
Our detection model can achieve an accuracy of 99.968% and a false positive rate of 0.099%.<\/jats:p>","DOI":"10.1186\/s42400-024-00309-6","type":"journal-article","created":{"date-parts":[[2025,2,24]],"date-time":"2025-02-24T01:02:21Z","timestamp":1740358941000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["A combined feature selection approach for malicious email detection based on a comprehensive email dataset"],"prefix":"10.1186","volume":"8","author":[{"given":"Han","family":"Zhang","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0009-0003-0055-8362","authenticated-orcid":false,"given":"Yong","family":"Shi","sequence":"additional","affiliation":[]},{"given":"Ming","family":"Liu","sequence":"additional","affiliation":[]},{"given":"Libo","family":"Chen","sequence":"additional","affiliation":[]},{"given":"Songyang","family":"Wu","sequence":"additional","affiliation":[]},{"given":"Zhi","family":"Xue","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,2,24]]},"reference":[{"key":"309_CR1","doi-asserted-by":"publisher","DOI":"10.1016\/j.cose.2021.102414","volume":"110","author":"AA Alhogail","year":"2021","unstructured":"Alhogail AA, Alsabih A (2021) Applying machine learning and natural language processing to detect phishing email. Comput Secur 110:102414. https:\/\/doi.org\/10.1016\/j.cose.2021.102414","journal-title":"Comput Secur"},{"key":"309_CR2","first-page":"4871","volume-title":"Domainkeys identified mail (dkim) signatures","author":"E Allman","year":"2007","unstructured":"Allman E, Callas J, Delany M, Libbey M, Fenton J, Thomas M (2007) Domainkeys identified mail (dkim) signatures. 
Technical report, RFC, p 4871"},{"issue":"2","key":"309_CR3","doi-asserted-by":"publisher","first-page":"145","DOI":"10.1016\/j.psra.2016.09.017","volume":"18","author":"AS Aski","year":"2016","unstructured":"Aski AS, Sourati NK (2016) Proposed efficient algorithm to filter spam using machine learning techniques. Pacific Sci Rev Nat Sci Eng 18(2):145\u2013149. https:\/\/doi.org\/10.1016\/j.psra.2016.09.017","journal-title":"Pacific Sci Rev Nat Sci Eng"},{"key":"309_CR4","doi-asserted-by":"crossref","unstructured":"Bountaka, P, Koutroumpouchos K, Xenakis C (2021) A comparison of natural language processing and machine learning methods for phishing email detection. In: ARES 2021: the 16th international conference on availability, reliability and security, Vienna, Austria, 2021, pp 127:1\u2013127:12. ACM","DOI":"10.1145\/3465481.3469205"},{"key":"309_CR5","doi-asserted-by":"publisher","unstructured":"Callas J, Donnerhacke L, Finney H, Shaw D, Thayer R (2007) Openpgp message format. RFC. https:\/\/doi.org\/10.17487\/RFC4880","DOI":"10.17487\/RFC4880"},{"key":"309_CR6","doi-asserted-by":"crossref","unstructured":"Chen T, Guestrin C (2016) Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 785\u2013794. ACM","DOI":"10.1145\/2939672.2939785"},{"key":"309_CR7","doi-asserted-by":"publisher","first-page":"143","DOI":"10.1016\/j.eswa.2018.05.031","volume":"110","author":"A Cohen","year":"2018","unstructured":"Cohen A, Nissim N, Elovici Y (2018) Novel set of general descriptive features for enhanced detection of malicious emails using machine learning methods. Expert Syst Appl 110:143\u2013169","journal-title":"Expert Syst Appl"},{"issue":"1","key":"309_CR8","doi-asserted-by":"publisher","first-page":"21","DOI":"10.1109\/TIT.1967.1053964","volume":"13","author":"T Cover","year":"1967","unstructured":"Cover T, Hart P (1967) Nearest neighbor pattern classification. 
IEEE Trans Inf Theory 13(1):21\u201327","journal-title":"IEEE Trans Inf Theory"},{"key":"309_CR9","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1016\/j.compeleceng.2019.01.004","volume":"74","author":"M Diale","year":"2019","unstructured":"Diale M, \u00c7elik T, van der Walt C (2019) Unsupervised feature learning for spam email filtering. Comput Electr Eng 74:89\u2013104. https:\/\/doi.org\/10.1016\/j.compeleceng.2019.01.004","journal-title":"Comput Electr Eng"},{"key":"309_CR10","doi-asserted-by":"crossref","unstructured":"Ding X, Liu B, Jiang Z, Wang Q, Xin L (2021) Spear phishing emails detection based on machine learning. In: 2021 IEEE 24th international conference on computer supported cooperative work in design (CSCWD), pp 354\u2013359. IEEE","DOI":"10.1109\/CSCWD49262.2021.9437758"},{"issue":"13","key":"309_CR11","doi-asserted-by":"publisher","first-page":"4425","DOI":"10.3390\/app10134425","volume":"10","author":"Y Fang","year":"2020","unstructured":"Fang Y, Xu Y, Jia P, Huang C (2020) Providing email privacy by preventing webmail from loading malicious XSS payloads. Appl Sci 10(13):4425. https:\/\/doi.org\/10.3390\/app10134425","journal-title":"Appl Sci"},{"key":"309_CR12","doi-asserted-by":"publisher","first-page":"56329","DOI":"10.1109\/ACCESS.2019.2913705","volume":"7","author":"Y Fang","year":"2019","unstructured":"Fang Y, Zhang C, Huang C, Liu L, Yang Y (2019) Phishing email detection using improved RCNN model with multilevel vectors and attention mechanism. IEEE Access 7:56329\u201356340. https:\/\/doi.org\/10.1109\/ACCESS.2019.2913705","journal-title":"IEEE Access"},{"key":"309_CR13","unstructured":"Foundation AS (2022) Spam assassin project, ham email corpus"},{"key":"309_CR14","doi-asserted-by":"crossref","unstructured":"Ho TK (1995) Random decision forests. In: Proceedings of 3rd international conference on document analysis and recognition, Vol.\u00a01, pp. 278\u2013282. 
IEEE","DOI":"10.1109\/ICDAR.1995.598994"},{"key":"309_CR15","doi-asserted-by":"crossref","unstructured":"Hosmer\u00a0Jr DW, Lemeshow S, Sturdivant RX (2013) Applied logistic regression, Vol.\u00a0398. Wiley","DOI":"10.1002\/9781118548387"},{"key":"309_CR16","unstructured":"Kaelbling L (2015) Enron email dataset"},{"key":"309_CR17","doi-asserted-by":"crossref","unstructured":"Kaspersky (2022) Apt trends report q3 2022","DOI":"10.1016\/j.fopow.2022.10.011"},{"key":"309_CR18","doi-asserted-by":"crossref","unstructured":"Kaur H, Sharma A (2016) Improved email spam classification method using integrated particle swarm optimization and decision tree. In: 2016 2nd international conference on next generation computing technologies (NGCT). IEEE","DOI":"10.1109\/NGCT.2016.7877470"},{"key":"309_CR19","doi-asserted-by":"publisher","unstructured":"Kitterman S (2014) Sender policy framework (spf) for authorizing use of domains in email, version 1. RFC 7208 (Proposed Standard), Internet Engineering Task Force. https:\/\/doi.org\/10.17487\/RFC7208","DOI":"10.17487\/RFC7208"},{"key":"309_CR20","doi-asserted-by":"publisher","unstructured":"Kucherawy MS, Zwicky ED (2015) Domain-based message authentication, reporting, and conformance (DMARC). RFC. https:\/\/doi.org\/10.17487\/RFC7489","DOI":"10.17487\/RFC7489"},{"issue":"1","key":"309_CR21","doi-asserted-by":"publisher","first-page":"278","DOI":"10.1109\/TBDATA.2020.2978915","volume":"8","author":"Q Li","year":"2022","unstructured":"Li Q, Cheng M, Wang J, Sun B (2022) LSTM based phishing detection for big email data. IEEE Trans Big Data 8(1):278\u2013288. https:\/\/doi.org\/10.1109\/TBDATA.2020.2978915","journal-title":"IEEE Trans Big Data"},{"key":"309_CR22","doi-asserted-by":"crossref","unstructured":"Macdonald C, Ounis I, Soboroff I (2007) Overview of the TREC 2007 blog track. 
In: Voorhees EM, Buckland LP (eds) Proceedings of the sixteenth text retrieval conference, TREC 2007, Gaithersburg, Maryland, USA, November 5-9, 2007, Vol.\u00a0500-274 of NIST Special Publication. National Institute of Standards and Technology (NIST)","DOI":"10.6028\/NIST.SP.500-274.blog-overview"},{"key":"309_CR23","doi-asserted-by":"publisher","DOI":"10.1016\/j.comnet.2022.108826","volume":"206","author":"S Magdy","year":"2022","unstructured":"Magdy S, Abouelseoud Y, Mikhail M (2022) Efficient spam and phishing emails filtering based on deep learning. Comput Networks 206:108826. https:\/\/doi.org\/10.1016\/j.comnet.2022.108826","journal-title":"Comput Networks"},{"key":"309_CR24","unstructured":"Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781"},{"key":"309_CR25","unstructured":"M\u00fcller J, Brinkmann M, Poddebniak D, B\u00f6ck H, Schinzel S, Somorovsky J, Schwenk J (2019) Johnny, you are fired! - spoofing openpgp and s\/mime signatures in emails. In: 28th USENIX security symposium, USENIX Security 2019, Santa Clara, CA, USA, 2019, pp 1011\u20131028. USENIX Association"},{"key":"309_CR26","doi-asserted-by":"publisher","first-page":"205","DOI":"10.1145\/3471621.3471862","volume-title":"RAID \u201921: 24th international symposium on research in attacks, intrusions and defenses, San Sebastian, Spain, 2021","author":"M Nabeel","year":"2021","unstructured":"Nabeel M, Altinisik E, Sun H, Khalil I, Wang WH, Yu T (2021) CADUE: content-agnostic detection of unwanted emails for enterprise security. In: Bilge L, Dumitras T (eds) RAID \u201921: 24th international symposium on research in attacks, intrusions and defenses, San Sebastian, Spain, 2021. ACM, pp 205\u2013219"},{"key":"309_CR27","unstructured":"Nazario J (2021) nazario phishingcorpus"},{"key":"309_CR28","doi-asserted-by":"publisher","unstructured":"Postel JB (1982) Simple mail transfer protocol. 
https:\/\/doi.org\/10.17487\/RFC0821","DOI":"10.17487\/RFC0821"},{"key":"309_CR29","doi-asserted-by":"publisher","first-page":"81","DOI":"10.1007\/BF00116251","volume":"1","author":"JR Quinlan","year":"1986","unstructured":"Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81\u2013106","journal-title":"Mach Learn"},{"key":"309_CR30","unstructured":"Ramos J et\u00a0al (2003) Using tf-idf to determine word relevance in document queries. In: Proceedings of the first instructional conference on machine learning, Vol.\u00a0242, pp 29\u201348. Citeseer"},{"key":"309_CR31","doi-asserted-by":"publisher","unstructured":"Ramsdell B, Turner S (2010) Secure\/multipurpose internet mail extensions (S\/MIME) version 3.2 message specification. RFC. https:\/\/doi.org\/10.17487\/RFC5751","DOI":"10.17487\/RFC5751"},{"key":"309_CR32","unstructured":"Shen K, Wang C, Guo M, Zheng X, Lu C, Liu B, Zhao Y, Hao S, Duan H, Pan Q et\u00a0al (2021) Weak links in authentication chains: a large-scale analysis of email sender spoofing attacks. In: 30th USENIX Security Symposium (USENIX Security 21)"},{"key":"309_CR33","unstructured":"Sunknighteric (2023) Epvme dataset"},{"key":"309_CR34","doi-asserted-by":"crossref","unstructured":"Toolan F, Carthy J (2010) Feature selection for spam and phishing detection. In: 2010 eCrime Researchers Summit, pp 1\u201312. IEEE","DOI":"10.1109\/ecrime.2010.5706696"},{"key":"309_CR35","doi-asserted-by":"crossref","unstructured":"Vishagini V, Rajan AK (2018) An improved spam detection method with weighted support vector machine. In: 2018 International conference on data science and engineering (ICDSE). IEEE","DOI":"10.1109\/ICDSE.2018.8527737"},{"key":"309_CR36","doi-asserted-by":"crossref","unstructured":"Wong M, Schlitt W (2006) Sender policy framework (spf) for authorizing use of domains in e-mail, version 1. 
Technical report, RFC 4408","DOI":"10.17487\/rfc4408"},{"key":"309_CR37","unstructured":"Wooyun (2019) Wooyun email xss dataset"},{"key":"309_CR38","doi-asserted-by":"crossref","unstructured":"Zhang H, Chen L, Liu M, Shi Y, Wu S, Xue Z (2023) Both sides needed: a two-dimensional measurement study of email security based on spf and dmarc. In: 2023 19th international conference on mobility, sensing and networking (MSN), pp 855\u2013861","DOI":"10.1109\/MSN60784.2023.00126"},{"key":"309_CR39","doi-asserted-by":"crossref","unstructured":"Zhang H, Mi D, Chen L, Liu M, Shi Y, Xue Z (2023) Subdomain protection is needed: an SPF and DMARC-based empirical measurement study and proactive solution of email security. In: 2023 42nd international symposium on reliable distributed systems (SRDS), pp 140\u2013150","DOI":"10.1109\/SRDS60354.2023.00023"},{"key":"309_CR40","doi-asserted-by":"crossref","unstructured":"Zhang J, Li W, Gong L, Gu Z, Wu J (2019) Targeted malicious email detection using hypervisor-based dynamic analysis and ensemble learning. In: 2019 IEEE global communications conference (GLOBECOM), pp 1\u20136. 
IEEE","DOI":"10.1109\/GLOBECOM38437.2019.9014069"}],"container-title":["Cybersecurity"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s42400-024-00309-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1186\/s42400-024-00309-6\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1186\/s42400-024-00309-6.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,24]],"date-time":"2025-02-24T01:02:35Z","timestamp":1740358955000},"score":1,"resource":{"primary":{"URL":"https:\/\/cybersecurity.springeropen.com\/articles\/10.1186\/s42400-024-00309-6"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,24]]},"references-count":40,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["309"],"URL":"https:\/\/doi.org\/10.1186\/s42400-024-00309-6","relation":{},"ISSN":["2523-3246"],"issn-type":[{"value":"2523-3246","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,24]]},"assertion":[{"value":"11 November 2023","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 July 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"24 February 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"14"}}