{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,2]],"date-time":"2026-01-02T07:44:05Z","timestamp":1767339845515,"version":"3.44.0"},"reference-count":34,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2024,6,12]],"date-time":"2024-06-12T00:00:00Z","timestamp":1718150400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2024,6,12]],"date-time":"2024-06-12T00:00:00Z","timestamp":1718150400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/501100006752","name":"Universidade do Porto","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100006752","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Data Sci Anal"],"published-print":{"date-parts":[[2025,9]]},"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>Phishing attacks aim to steal sensitive information and, unfortunately, are becoming common practice on the web. Email phishing is one of the most common types of attacks on the web and can have a big impact on individuals and enterprises. There is still a gap in prevention when it comes to detecting phishing emails, as new attacks are usually not detected. The goal of this work was to develop a model capable of identifying phishing emails based on machine learning approaches. The work was performed in collaboration with E-goi, a multi-channel marketing automation company. The data consisted of emails collected from the E-goi servers in the electronic mail format. The task was a classification problem with unbalanced classes, with the minority class corresponding to the phishing emails and accounting for less than 1% of the total emails. 
Several models were evaluated after careful data selection and feature extraction based on the email content and the literature regarding these types of problems. Due to the imbalance present in the data, several sampling methods based on under-sampling techniques were tested to see their impact on the model\u2019s ability to detect phishing emails. The final model consisted of a neural network able to detect more than 80% of phishing emails without compromising the remaining emails sent by E-goi clients.<\/jats:p>","DOI":"10.1007\/s41060-024-00579-w","type":"journal-article","created":{"date-parts":[[2024,6,12]],"date-time":"2024-06-12T08:02:36Z","timestamp":1718179356000},"page":"2001-2020","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":10,"title":["A case study on phishing detection with a machine learning net"],"prefix":"10.1007","volume":"20","author":[{"given":"Ana","family":"Bezerra","sequence":"first","affiliation":[]},{"given":"Ivo","family":"Pereira","sequence":"additional","affiliation":[]},{"given":"Miguel \u00c2ngelo","family":"Rebelo","sequence":"additional","affiliation":[]},{"given":"Duarte","family":"Coelho","sequence":"additional","affiliation":[]},{"given":"Daniel Alves de","family":"Oliveira","sequence":"additional","affiliation":[]},{"given":"Joaquim F. Pinto","family":"Costa","sequence":"additional","affiliation":[]},{"given":"Ricardo P. M.","family":"Cruz","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2024,6,12]]},"reference":[{"key":"579_CR1","doi-asserted-by":"crossref","unstructured":"Dhamija, R., Tygar, J.D., Hearst, M.: Why phishing works. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 
581\u2013590 (2006)","DOI":"10.1145\/1124772.1124861"},{"issue":"1","key":"579_CR2","doi-asserted-by":"publisher","first-page":"74","DOI":"10.1145\/2063176.2063197","volume":"55","author":"J Hong","year":"2012","unstructured":"Hong, J.: The state of phishing attacks. Commun. ACM 55(1), 74\u201381 (2012)","journal-title":"Commun. ACM"},{"issue":"3","key":"579_CR3","doi-asserted-by":"publisher","first-page":"675","DOI":"10.1016\/j.is.2010.11.003","volume":"36","author":"W Kim","year":"2011","unstructured":"Kim, W., Jeong, O.-R., Kim, C., So, J.: The dark side of the internet: attacks, costs and responses. Inf. Syst. 36(3), 675\u2013705 (2011)","journal-title":"Inf. Syst."},{"key":"579_CR4","unstructured":"Greene, R.J.E.: The 48 laws of power. Penguin Publishing Group, London, United Kingdom (2020)"},{"key":"579_CR5","doi-asserted-by":"publisher","first-page":"19","DOI":"10.1016\/j.ijhcs.2018.12.004","volume":"125","author":"A Ferreira","year":"2019","unstructured":"Ferreira, A., Teles, S.: Persuasion: how phishing emails can influence users and bypass security measures. Int. J. Hum.-Comput. Stud. 125, 19\u201331 (2019). https:\/\/doi.org\/10.1016\/j.ijhcs.2018.12.004","journal-title":"Int. J. Hum.-Comput. Stud."},{"issue":"3","key":"579_CR6","doi-asserted-by":"publisher","first-page":"316","DOI":"10.1080\/15564886.2020.1829224","volume":"16","author":"AK Ghazi-Tehrani","year":"2021","unstructured":"Ghazi-Tehrani, A.K., Pontell, H.N.: Phishing evolves: analyzing the enduring cybercrime. Vict. Offenders 16(3), 316\u2013342 (2021)","journal-title":"Vict. Offenders"},{"key":"579_CR7","unstructured":"SecurityScorecard: 12 types of phishing attacks and how to identify them. Accessed 08 Feb 2022 (2021). https:\/\/securityscorecard.com\/blog\/types-of-phishing-attacks-and-how-to-identify-them"},{"key":"579_CR8","doi-asserted-by":"publisher","unstructured":"Prasad, R., Rohokale, V.: Phishing, pp. 33\u201342. Springer, Cham (2020). 
https:\/\/doi.org\/10.1007\/978-3-030-31703-4-3","DOI":"10.1007\/978-3-030-31703-4-3"},{"key":"579_CR9","doi-asserted-by":"publisher","DOI":"10.1155\/2014\/425731","author":"A Akinyelu","year":"2014","unstructured":"Akinyelu, A., Adewumi, A.: Classification of phishing email using random forest machine learning technique. J. Appl. Math. (2014). https:\/\/doi.org\/10.1155\/2014\/425731","journal-title":"J. Appl. Math."},{"key":"579_CR10","unstructured":"Shahrivari, V., Darabi, M.M., Izadi, M.: Phishing detection using machine learning techniques. CoRR (2020)"},{"key":"579_CR11","unstructured":"Zhang, N., Yuan, Y.: Phishing detection using neural network. CS229 lecture notes (2012)"},{"key":"579_CR12","unstructured":"Shahrivari, V., Darabi, M.M., Izadi, M.: Phishing detection using machine learning techniques. arXiv (2020)"},{"key":"579_CR13","doi-asserted-by":"publisher","unstructured":"Afroz, S., Greenstadt, R.: PhishZoo: detecting phishing websites by looking at them. In: 2011 IEEE Fifth International Conference on Semantic Computing, pp. 368\u2013375 (2011). https:\/\/doi.org\/10.1109\/ICSC.2011.52","DOI":"10.1109\/ICSC.2011.52"},{"key":"579_CR14","doi-asserted-by":"crossref","unstructured":"Branco, B., Abreu, P., Gomes, A.S., Almeida, M.S., Ascens\u00e3o, J.T., Bizarro, P.: Interleaved sequence RNNs for fraud detection. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3101\u20133109 (2020)","DOI":"10.1145\/3394486.3403361"},{"key":"579_CR15","unstructured":"Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. 
Advances in neural information processing systems (2017)"},{"issue":"3","key":"579_CR16","doi-asserted-by":"publisher","first-page":"3713","DOI":"10.1007\/s11042-022-13428-4","volume":"82","author":"D Khurana","year":"2017","unstructured":"Khurana, D., Koli, A., Khatter, K., Singh, S.: Natural language processing: state of the art, current trends and challenges. Multimed. Tools Appl. 82(3), 3713\u20133744 (2017)","journal-title":"Multimed. Tools Appl."},{"key":"579_CR17","unstructured":"Rothman, D.: Transformers for natural language processing: build innovative deep neural network architectures for NLP with Python, PyTorch, TensorFlow, BERT, RoBERTa, and more. Packt Publishing, Birmingham, UK (2021). https:\/\/books.google.pt\/books?id=Cr0YEAAAQBAJ"},{"key":"579_CR18","unstructured":"Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: North American Chapter of the Association for Computational Linguistics (2019). https:\/\/api.semanticscholar.org\/CorpusID:52967399"},{"key":"579_CR19","unstructured":"Hugging Face: Transformers. https:\/\/huggingface.co\/docs\/transformers\/index. Accessed: 2022-03-08"},{"key":"579_CR20","unstructured":"LookFantastic: Promotion Campaign LookFantastic. Accessed 19 May 2022 (2022). https:\/\/www.lookfantastic.pt\/myreferrals.list"},{"key":"579_CR21","doi-asserted-by":"crossref","unstructured":"Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using siamese BERT-networks. CoRR abs\/1908.10084 (2019)","DOI":"10.18653\/v1\/D19-1410"},{"issue":"3","key":"579_CR22","doi-asserted-by":"publisher","first-page":"5718","DOI":"10.1016\/j.eswa.2008.06.108","volume":"36","author":"S-J Yen","year":"2009","unstructured":"Yen, S.-J., Lee, Y.-S.: Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst. Appl. 36(3), 5718\u20135727 (2009)","journal-title":"Expert Syst. 
Appl."},{"key":"579_CR23","unstructured":"Rahman, M.M., Davis, D.: Cluster based under-sampling for unbalanced cardiovascular data. In: Proceedings of the World Congress on Engineering, vol. 3, pp. 3\u20135 (2013)"},{"key":"579_CR24","doi-asserted-by":"crossref","unstructured":"Lin, W.-C., Tsai, C.-F., Hu, Y.-H., Jhang, J.-S.: Clustering-based undersampling in class-imbalanced data. Inf. Sci. 409, 17\u201326 (2017)","DOI":"10.1016\/j.ins.2017.05.008"},{"key":"579_CR25","unstructured":"Huang, A.: Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference, vol. 4, pp. 9\u201356 (2008)"},{"key":"579_CR26","doi-asserted-by":"publisher","first-page":"53","DOI":"10.1016\/0377-0427(87)90125-7","volume":"20","author":"PJ Rousseeuw","year":"1987","unstructured":"Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53\u201365 (1987)","journal-title":"J. Comput. Appl. Math."},{"issue":"2","key":"579_CR27","doi-asserted-by":"publisher","first-page":"411","DOI":"10.1111\/1467-9868.00293","volume":"63","author":"R Tibshirani","year":"2001","unstructured":"Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the GAP statistic. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 63(2), 411\u2013423 (2001)","journal-title":"J. R. Stat. Soc.: Ser. B (Stat. Methodol.)"},{"key":"579_CR28","doi-asserted-by":"crossref","unstructured":"Heiberger, R.M., Neuwirth, E.: One-way ANOVA. In: R Through Excel, pp. 165\u2013191. Springer, New York City, USA (2009)","DOI":"10.1007\/978-1-4419-0052-4_7"},{"key":"579_CR29","unstructured":"Scikit-learn: Cross-validation: evaluating estimator performance. https:\/\/scikit-learn.org\/stable\/modules\/cross_validation.html. 
Accessed 30 March 2022"},{"issue":"10","key":"579_CR30","doi-asserted-by":"publisher","first-page":"1340","DOI":"10.1093\/bioinformatics\/btq134","volume":"26","author":"A Altmann","year":"2010","unstructured":"Altmann, A., Tolo\u015fi, L., Sander, O., Lengauer, T.: Permutation importance: a corrected feature importance measure. Bioinformatics 26(10), 1340\u20131347 (2010)","journal-title":"Bioinformatics"},{"key":"579_CR31","doi-asserted-by":"crossref","unstructured":"Ribeiro, M.T., Singh, S., Guestrin, C.: \u201cWhy should I trust you?\u201d: explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135\u20131144 (2016)","DOI":"10.1145\/2939672.2939778"},{"key":"579_CR32","volume-title":"Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems","author":"A G\u00e9ron","year":"2017","unstructured":"G\u00e9ron, A.: Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 1st edn. O\u2019Reilly Media Inc, Massachusetts, USA (2017)","edition":"1"},{"issue":"1","key":"579_CR33","first-page":"1","volume":"12","author":"L Gon\u00e7alves","year":"2014","unstructured":"Gon\u00e7alves, L., Subtil, A., Oliveira, M.R., Zea Bermudez, P.: ROC curve estimation: an overview. REVSTAT-Stat. J. 12(1), 1\u201320 (2014)","journal-title":"REVSTAT-Stat. J."},{"key":"579_CR34","doi-asserted-by":"crossref","unstructured":"Randhawa, R.H., Aslam, N., Alauthman, M., Rafiq, H.: Evasion generative adversarial network for low data regimes. 
IEEE Transactions on Artificial Intelligence (2022)","DOI":"10.1109\/TAI.2022.3196283"}],"container-title":["International Journal of Data Science and Analytics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41060-024-00579-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s41060-024-00579-w\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41060-024-00579-w.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,5]],"date-time":"2025-09-05T20:11:32Z","timestamp":1757103092000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s41060-024-00579-w"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2024,6,12]]},"references-count":34,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2025,9]]}},"alternative-id":["579"],"URL":"https:\/\/doi.org\/10.1007\/s41060-024-00579-w","relation":{},"ISSN":["2364-415X","2364-4168"],"issn-type":[{"type":"print","value":"2364-415X"},{"type":"electronic","value":"2364-4168"}],"subject":[],"published":{"date-parts":[[2024,6,12]]},"assertion":[{"value":"2 February 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 May 2024","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"12 June 2024","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare no competing 
interests","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Conflict of interest"}}]}}