[1] Yang Peng, Zeng Peng, Zhao Guangzhen, et al. Phishing website detection method based on logistic regression and XGBoost[J]. Journal of Southeast University (Natural Science Edition), 2019, 49(2): 207-212. [doi:10.3969/j.issn.1001-0505.2019.02.001]

Phishing website detection method based on logistic regression and XGBoost

Journal of Southeast University (Natural Science Edition) [ISSN: 1001-0505 / CN: 32-1178/N]

Volume:
49
Issue:
No. 2, 2019
Pages:
207-212
Section:
Computer Science and Engineering
Publication date:
2019-03-20

Article Info

Title:
Phishing website detection method based on logistic regression and XGBoost
Author(s):
Yang Peng, Zeng Peng, Zhao Guangzhen, Lü Peipei
School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
Key Laboratory of Computer Network and Information Integration of Ministry of Education, Southeast University, Nanjing 211189, China
Keywords:
phishing websites; logistic regression; ensemble learning; eXtreme gradient boosting (XGBoost)
CLC number:
TP393
DOI:
10.3969/j.issn.1001-0505.2019.02.001
Abstract:
To balance the speed and accuracy of phishing website detection, a detection method based on logistic regression and eXtreme gradient boosting (XGBoost) is proposed. Starting from the URL of a webpage, HTML features, uniform resource locator (URL) features, and text vector features based on term frequency-inverse document frequency (TF-IDF) are extracted. Logistic regression converts the high-dimensional, sparse text features into probability features. An XGBoost classification model is built on the resulting fused features, a time complexity analysis of the method is given, and real-world data are collected as the experimental data set. The experimental results show that the logistic regression step reduces the dimension of the fused features, so detection is faster than with direct fusion. The fused features carry more useful information for the classifier to learn from than any single feature type, so detection precision is higher than that of single-feature methods: precision reaches 96.67% and recall reaches 96.6%.
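The feature fusion idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the four URL features, the toy pages, the function names, and the training settings are all hypothetical, HTML features are left out, and the final XGBoost classifier is omitted so that the example needs only the Python standard library. It shows only how a sparse TF-IDF vector might be collapsed into one logistic regression probability and appended to a short hand-crafted feature vector.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def tfidf_vectors(docs):
    """Return sparse {term: weight} dicts and the idf table for tokenized docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf, so every term seen in training gets a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: (c / len(doc)) * idf[t] for t, c in Counter(doc).items()}
               for doc in docs]
    return vectors, idf


def predict_proba(vec, weights, bias):
    """Logistic regression probability for one sparse vector."""
    z = bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(vectors, labels, epochs=200, lr=0.5):
    """Plain stochastic gradient descent on the log loss (illustrative settings)."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = predict_proba(vec, weights, bias) - y
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * err * v
            bias -= lr * err
    return weights, bias


def url_features(url):
    """Hypothetical URL features; the paper's actual feature set is not listed here."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return [
        float(len(url)),                          # URL length
        float(host.count(".")),                   # dots in the host name
        1.0 if "@" in url else 0.0,               # '@' anywhere in the URL
        1.0 if parsed.scheme == "https" else 0.0,  # HTTPS scheme
    ]


def fuse(url, tokens, idf, weights, bias):
    """Fused vector: URL features plus a single text probability feature."""
    counts = Counter(tokens)
    vec = {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}
    return url_features(url) + [predict_proba(vec, weights, bias)]


# Toy data: (url, page tokens, label), with 1 = phishing and 0 = legitimate.
pages = [
    ("http://secure-login.example.test/verify@account",
     "verify your account password now".split(), 1),
    ("http://bank-update.example.test/login",
     "confirm password urgent account suspended".split(), 1),
    ("https://news.example.test/today",
     "weather sports local news today".split(), 0),
    ("https://blog.example.test/recipes",
     "easy dinner recipes for today".split(), 0),
]

vectors, idf = tfidf_vectors([tokens for _, tokens, _ in pages])
labels = [label for _, _, label in pages]
weights, bias = train_logistic(vectors, labels)
fused = [fuse(url, tokens, idf, weights, bias) for url, tokens, _ in pages]

for (url, _, label), row in zip(pages, fused):
    print(label, [round(x, 3) for x in row])
```

Each fused row has five entries regardless of vocabulary size, which is the dimension reduction the abstract credits for the speed gain over direct fusion. In a fuller pipeline these rows, extended with HTML features, would be the training input to a gradient-boosted tree classifier such as XGBoost.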

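Memo

Received: 2018-10-11.
Author biography: Yang Peng (born 1975), male, Ph.D., associate professor, pengyang@seu.edu.cn.
Funding: National Natural Science Foundation of China (61472080); Consulting Research Project of the Chinese Academy of Engineering (2018-XY-07); Collaborative Innovation Center of Novel Software Technology and Industrialization.
Cite this article: Yang Peng, Zeng Peng, Zhao Guangzhen, et al. Phishing website detection method based on logistic regression and XGBoost[J]. Journal of Southeast University (Natural Science Edition), 2019, 49(2): 207-212. DOI:10.3969/j.issn.1001-0505.2019.02.001.
Last Update: 2019-03-20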