Modern Phishing URL Detection Using Feature Selection and Comparative Classification Models

M Teguh Prastyo; Febri Vahlevie

Authors

M Teguh Prastyo Institut Bisnis & Informatika Darmajaya
Febri Vahlevie Institut Bisnis & Informatika Darmajaya

Keywords:

phishing URL detection, feature selection, Information Gain, comparative classification, cybersecurity

Abstract

Phishing URLs remain a critical cybersecurity threat because attackers increasingly exploit domain similarity, webpage imitation, and structural manipulation to deceive users and bypass conventional blacklist-based detection. This study proposes an engineering-oriented phishing URL detection pipeline that combines feature selection with comparative classification models, implemented in RapidMiner. The PhiUSIIL Phishing URL Dataset was used; after preprocessing, 234,903 URL records were retained, comprising 134,834 legitimate URLs and 100,069 phishing URLs. During preprocessing, non-predictive attributes were removed, records with invalid target labels were filtered out, missing predictor values were handled, and the target label was converted into a binominal class with phishing as the positive class. Information Gain was applied to identify the most discriminative attributes, and the top 20 features were used for model comparison. Five classification models were evaluated using stratified 10-fold cross-validation: Decision Tree, Random Forest, Naive Bayes, Logistic Regression, and Gradient Boosted Trees. All models achieved accuracy above 99.95%, indicating strong class separability within the selected-feature scenario. Random Forest produced the most balanced performance, with accuracy, precision, recall, and F1-score each reported as 100.00% (rounded to two decimal places) and an AUC of 1.000, while misclassifying only three phishing URLs as legitimate. The findings demonstrate that the selected URL similarity and webpage structural features can support efficient and interpretable phishing detection. However, the near-perfect performance should be interpreted as strong internal validation rather than evidence of generalization, and future work should include validation on external datasets and ablation testing of the dominant features.
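The Information Gain ranking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' RapidMiner process: it is a plain-Python approximation that assumes every feature has already been discretized, and the feature names (`has_https`, `has_title`, `noise_flag`), the toy records, and the helper functions are all hypothetical. Information Gain is computed here as the label entropy minus the label entropy conditioned on the feature value, and the top-k features are kept (the study keeps k = 20; the toy example keeps k = 2).

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """H(label) - H(label | feature) for one discrete feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(rows, labels, k):
    """Rank features by information gain and keep the k highest-scoring ones."""
    scores = {
        name: information_gain([row[name] for row in rows], labels)
        for name in rows[0]
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical toy records; these are NOT taken from the PhiUSIIL dataset.
rows = [
    {"has_https": 0, "has_title": 0, "noise_flag": 0},
    {"has_https": 0, "has_title": 0, "noise_flag": 1},
    {"has_https": 0, "has_title": 0, "noise_flag": 0},
    {"has_https": 0, "has_title": 1, "noise_flag": 1},
    {"has_https": 1, "has_title": 1, "noise_flag": 0},
    {"has_https": 1, "has_title": 1, "noise_flag": 1},
    {"has_https": 1, "has_title": 1, "noise_flag": 0},
    {"has_https": 1, "has_title": 0, "noise_flag": 1},
]
labels = ["phishing"] * 4 + ["legitimate"] * 4

ranked = top_k_features(rows, labels, k=2)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

In this toy data `has_https` separates the two classes perfectly and therefore ranks first, while `noise_flag` carries no information about the label and is dropped. A feature that alone separates the classes this cleanly is the kind of dominant attribute that the abstract suggests examining through ablation testing.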




Published by LPPM Sekolah Tinggi Teknologi Nusantara Lampung