CCF BDCI Web Attack Detection and Classification: Top-8 Solution
Competition page: https://www.datafountain.cn/competitions/596
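Background: A business platform captures more than 200 million Web attacks per month on average, covering common injection attacks, code execution, and other types. Traditional threat detection matches rules against known attack signatures, so it cannot detect unknown vulnerabilities or attack techniques. Identifying unknown threats quickly and accurately, and classifying the different attacks correctly, is essential to improving Web attack detection. Using machine learning and deep learning to identify and classify attack payloads has become an innovative approach to this problem and helps advance the research and application of AI in threat detection and analysis.
Task: Teams analyze the provided training set and build an AI model through feature engineering, machine learning, deep learning, and similar methods, so that every sample is classified correctly and quickly, while continually improving the model's precision and recall. Once the model is optimized and stable, an unlabeled test set is used to evaluate each team's classification results, and model quality is judged by accuracy.
Final defense: In the final defense, the judges score each entry on innovation, usability, and similar criteria. The final ranking combines the preliminary-round score with innovation and usability: final score = reproduced preliminary score * 80% + final-round score * 20%. The defense focuses on: (1) the model's ability to discover unknown attack types; (2) the model's time complexity; (3) other innovations.
Data overview: The training set has 6 different labels and about 35,000 samples in total. Its main fields are:
● ID: sample ID
● label: attack type ID
● other: HTTP protocol content
Evaluation: During the evaluation period an unlabeled test set is provided. Teams submit their model's classification for every record in it, i.e. each record gains a predict field (the model's classification) with the same meaning as the label field of the training set. The evaluation program compares predict with the ground-truth label, computes precision, recall, and F1, and ranks by F1.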
Label | Classified as positive | Classified as negative
Positive label | TP | FN
Negative label | FP | TN

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * Precision * Recall / (Precision + Recall)
Note: the F1 used here is macro F1.
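As a rough illustration of the formulas above, macro F1 can be computed by taking each class in turn as the positive label and averaging the per-class F1 values without weighting. The function name `macro_f1` and the toy labels are made up for this sketch; it is not the competition's evaluation program.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy example with three classes (illustrative values only)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Because every class counts equally, the rare classes weigh as much as the frequent ones in this metric.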
The code is as follows.
import glob
import re
from urllib.parse import quote, unquote, urlparse

import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from lightgbm import early_stopping, log_evaluation
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
from user_agents import parse
from xgboost import XGBClassifier

pd.set_option('display.max_columns', None)
def get_ua(row):
    # Parse the User-Agent string with the user_agents package and return
    # browser, OS and device information as strings.
    user_agent = parse(row['user_agent'])
    browser_family = str(user_agent.browser.family)
    os_family = str(user_agent.os.family)
    device_family = str(user_agent.device.family)
    device_brand = str(user_agent.device.brand)
    device_model = str(user_agent.device.model)
    return browser_family, os_family, device_family, device_brand, device_model
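# Per-class probabilities for the test set (judging by the file name, produced by a deberta-v3-large model)
prob = np.load('E://data//DF//Web攻击检测与分类识别//large/deberta-v3-large_probs.npy')
prob.shape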
(4000, 6)
train = pd.read_pickle('E:\\data\\DF\\Web攻击检测与分类识别/large/oof_df.pkl')
sub = pd.read_csv('E:\\data\\DF\\Web攻击检测与分类识别\\submit_example (10).csv')
test = pd.read_csv('E:\\data\\DF\\Web攻击检测与分类识别/test.csv')
train.head()
train.head() (transposed: one line per column, one entry per sample row)
column | row 0 | row 1 | row 2 | row 3 | row 4
id | 13429 | 18125 | 14538 | 7127 | 7
method | GET | GET | GET | GET | GET
user_agent | '||(select 1 from (select pg_sleep(8))x)||' | Dalvik/2.1.0 (Linux; U; Android 11; M2102K1C B... | Dalvik/2.1.0 (Linux; U; Android 11; M2011K2C B... | Dalvik/2.1.0 (Linux; U; Android 10; MI 9 MIUI/... | Dalvik/2.1.0 (Linux; U; Android 10; ELS-AN00 B...
url | /kelev/scripts/?C=M%3BO%3DA | /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... | /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2... | /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... | /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty...
refer | NAN | NAN | NAN | NAN | NAN
body | GET /kelev/scripts/?C=M%3BO%3DA HTTP/1.1 Accep... | GET /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&... | GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi... | NAN | GET /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid...
label | 1 | 1 | 1 | 1 | 1
text | method:GET[SEP]user_agent:'||(select 1 from (s... | method:GET[SEP]user_agent:Dalvik/2.1.0 (Linux;... | method:GET[SEP]user_agent:Dalvik/2.1.0 (Linux;... | method:GET[SEP]user_agent:Dalvik/2.1.0 (Linux;... | method:GET[SEP]user_agent:Dalvik/2.1.0 (Linux;...
fold | 0 | 0 | 0 | 0 | 0
0 | 0.000238 | 0.006598 | 0.000783 | 0.007603 | 0.000529
1 | 0.999110 | 0.992451 | 0.999017 | 0.991491 | 0.999257
2 | 0.000445 | 0.000763 | 0.000138 | 0.000725 | 0.000153
3 | 0.000058 | 0.000086 | 0.000031 | 0.000087 | 0.000020
4 | 0.000040 | 0.000067 | 0.000017 | 0.000062 | 0.000021
5 | 0.000110 | 0.000035 | 0.000013 | 0.000033 | 0.000019
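The text column in the output above starts with "method:GET[SEP]user_agent:...", which suggests that the HTTP fields were joined into one string as input for the language model. A minimal sketch of such a join follows; only the first two fields and the [SEP] separator are visible in the output, so the remaining field order, the `build_text` helper, and the sample record are assumptions made for illustration.

```python
def build_text(record, fields=("method", "user_agent", "url", "refer", "body"), sep="[SEP]"):
    """Join 'name:value' pairs with a separator token (field order after user_agent is assumed)."""
    return sep.join(f"{name}:{record.get(name, 'NAN')}" for name in fields)


# Hypothetical, shortened record; real samples are much longer
record = {
    "method": "GET",
    "user_agent": "Dalvik/2.1.0",
    "url": "/livemsg",
    "refer": "NAN",
    "body": "NAN",
}
text = build_text(record)
print(text)
```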
test.head() (transposed: one line per column, one entry per sample row)
column | row 0 | row 1 | row 2 | row 3 | row 4
id | 0 | 1 | 2 | 3 | 4
method | GET | GET | GET | GET | PUT
user_agent | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | Dalvik/2.1.0 (Linux; U; Android 11; M2102J2SC ... | Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/2... | NaN | Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/2...
url | /demo/aisec/upload.php?act='%7C%7C(select+1+fr... | /livemsg?ad_type=WL_WK&ty=web&pu=1&openudid=5f... | /create_user/?username=%3Cscript%3Ealert(docum... | /mmsns/WeDwicXmkOl4kjKsBycicI0H3q41r6syFFvu46h... | /naizau.jsp/
refer | http://demo.aisec.cn/demo/aisec/upload.php?t=0... | NaN | NaN | NaN | NaN
body | GET /demo/aisec/upload.php?act='%7C%7C(select+... | GET /livemsg?ad_type=WL_WK&ty=web&pu=1&openudi... | NaN | NaN | GET /login HTTP/1.1 Host: 111.160.211.18:8088 ...
id 0
method 0
user_agent 0
url 0
refer 0
body 0
label 0
text 0
fold 0
0 0
1 0
2 0
3 0
4 0
5 0
dtype: int64
Index(['id', 'method', 'user_agent', 'url', 'refer', 'body', 'label', 'text',
'fold', '0', '1', '2', '3', '4', '5'],
dtype='object')
# Attach the six per-class probabilities to the test set as features
test[['0', '1', '2', '3', '4', '5']] = prob

train = train.drop('text', axis=1)
train = train.drop('fold', axis=1)

train = train[['id', 'method', 'user_agent', 'url', 'refer', 'body', '0', '1', '2', '3', '4', '5', 'label']]

print("train.shape", train.shape)
print("test.shape", test.shape)
train.shape (33037, 13)
test.shape (4000, 12)
Data analysis
The training set has 6 different labels and about 35,000 samples in total. Its main fields are:
● lable: attack type ID
● other: HTTP protocol content
Index(['id', 'method', 'user_agent', 'url', 'refer', 'body', '0', '1', '2',
'3', '4', '5', 'label'],
dtype='object')
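The spelling 'lable' looks awkward, so rename the column to 'label'. In the frame loaded above the column is already called 'label', so the call changes nothing there.
train = train.rename(columns={'lable': 'label'})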
id int64
method object
user_agent object
url object
refer object
body object
0 float32
1 float32
2 float32
3 float32
4 float32
5 float32
label int64
dtype: object
train['label'].value_counts()
1 14038
2 9939
0 6489
3 1215
4 697
5 659
Name: label, dtype: int64
train['label'].value_counts().plot(kind='bar')
plt.show()
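The counts above show a clear class imbalance. As a small worked example, the snippet below turns the printed counts into class shares and an imbalance ratio; the numbers are copied from the value_counts() output, while the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {1: 14038, 2: 9939, 0: 6489, 3: 1215, 4: 697, 5: 659}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
imbalance_ratio = max(counts.values()) / min(counts.values())

for label in sorted(counts):
    print(f"label {label}: {counts[label]:5d} samples, {shares[label]:.1%} of the training set")
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```

With the smallest class at about 2% of the data, a metric such as macro F1 is much more sensitive to mistakes on labels 3, 4 and 5 than plain accuracy is.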
# Concatenate train and test so that features are built on both at once
data = pd.concat([train, test], axis=0).reset_index(drop=True)
data.nunique()
id 19497
method 21
user_agent 1087
url 36613
refer 941
body 22380
0 35915
1 31618
2 36061
3 36961
4 36965
5 36959
label 6
dtype: int64
# Fill missing text fields with the placeholder string 'NAN'
data['user_agent'] = data['user_agent'].fillna('NAN')
data['refer'] = data['refer'].fillna('NAN')
data['body'] = data['body'].fillna('NAN')
data['url'] = data['url'].fillna('NAN')

# Expand the parsed User-Agent into five new columns
ua_cols = ['browser_family', 'os_family', 'device_family', 'device_brand', 'device_model']
data[ua_cols] = data.apply(get_ua, axis=1, result_type="expand")
data.head()
data.head() (transposed: one line per column, one entry per sample row)
column | row 0 | row 1 | row 2 | row 3 | row 4
id | 13429 | 18125 | 14538 | 7127 | 7
method | GET | GET | GET | GET | GET
user_agent | '||(select 1 from (select pg_sleep(8))x)||' | Dalvik/2.1.0 (Linux; U; Android 11; M2102K1C B... | Dalvik/2.1.0 (Linux; U; Android 11; M2011K2C B... | Dalvik/2.1.0 (Linux; U; Android 10; MI 9 MIUI/... | Dalvik/2.1.0 (Linux; U; Android 10; ELS-AN00 B...
url | /kelev/scripts/?C=M%3BO%3DA | /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... | /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2... | /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... | /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty...
refer | NAN | NAN | NAN | NAN | NAN
body | GET /kelev/scripts/?C=M%3BO%3DA HTTP/1.1 Accep... | GET /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&... | GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi... | NAN | GET /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid...
0 | 0.000238 | 0.006598 | 0.000783 | 0.007603 | 0.000529
1 | 0.999110 | 0.992451 | 0.999017 | 0.991491 | 0.999257
2 | 0.000445 | 0.000763 | 0.000138 | 0.000725 | 0.000153
3 | 0.000058 | 0.000086 | 0.000031 | 0.000087 | 0.000020
4 | 0.000040 | 0.000067 | 0.000017 | 0.000062 | 0.000021
5 | 0.000110 | 0.000035 | 0.000013 | 0.000033 | 0.000019
label | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
browser_family | Other | Android | Android | Android | Android
os_family | Other | Android | Android | Android | Android
device_family | Other | M2102K1C | M2011K2C | XiaoMi MI 9 | ELS-AN00
device_brand | None | Generic_Android | Generic_Android | XiaoMi | Huawei
device_model | None | M2102K1C | M2011K2C | MI 9 | ELS-AN00
Basic features

import urllib
import urllib.parse
# String lengths of the raw fields and the differences between them
data['user_agent_len'] = data['user_agent'].apply(len)
data['url_len'] = data['url'].apply(len)
data['refer_len'] = data['refer'].apply(len)
data['body_len'] = data['body'].apply(len)
data['body_user_agent_len_diff'] = data['body_len'] - data['user_agent_len']
data['body_url_len_diff'] = data['body_len'] - data['url_len']
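One side effect of these length features is worth spelling out: missing fields were filled with the string 'NAN', so a missing body has length 3, and the difference features turn negative when the body is missing. The sketch below reproduces that arithmetic in plain Python; the `length_features` helper and the placeholder strings (chosen only to have lengths 64 and 1613) are assumptions for illustration.

```python
def length_features(user_agent, url, refer, body):
    """Length features in the same spirit as the pandas code above."""
    feats = {
        "user_agent_len": len(user_agent),
        "url_len": len(url),
        "refer_len": len(refer),
        "body_len": len(body),
    }
    feats["body_user_agent_len_diff"] = feats["body_len"] - feats["user_agent_len"]
    feats["body_url_len_diff"] = feats["body_len"] - feats["url_len"]
    return feats


# Placeholder strings: a 64-character user agent, a 1613-character URL,
# and refer/body filled with 'NAN' because they were missing
example = length_features(
    user_agent="A" * 64,
    url="/" + "u" * 1612,
    refer="NAN",
    body="NAN",
)
print(example)
```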
# Character n-gram TF-IDF on user_agent, reduced to 16 dimensions with truncated SVD
texts = data['user_agent'].values.tolist()
n_components = 16
tf = TfidfVectorizer(min_df=1, max_df=0.5, analyzer='char_wb', ngram_range=(1, 3))
X = tf.fit_transform(texts)
svd = TruncatedSVD(n_components=n_components, random_state=42)
X_svd = svd.fit_transform(X)
df_tfidf = pd.DataFrame(X_svd)
df_tfidf.columns = [f'user_agent_name_tfidf_{i}' for i in range(n_components)]
data = pd.concat([data, df_tfidf], axis=1)

# Same for url, 16 dimensions
texts = data['url'].values.tolist()
n_components = 16
tf = TfidfVectorizer(min_df=1, max_df=0.5, analyzer='char_wb', ngram_range=(1, 3))
X = tf.fit_transform(texts)
svd = TruncatedSVD(n_components=n_components, random_state=42)
X_svd = svd.fit_transform(X)
df_tfidf = pd.DataFrame(X_svd)
df_tfidf.columns = [f'url_name_tfidf_{i}' for i in range(n_components)]
data = pd.concat([data, df_tfidf], axis=1)

# Same for body, 32 dimensions
texts = data['body'].values.tolist()
n_components = 32
tf = TfidfVectorizer(min_df=1, max_df=0.5, analyzer='char_wb', ngram_range=(1, 3))
X = tf.fit_transform(texts)
svd = TruncatedSVD(n_components=n_components, random_state=42)
X_svd = svd.fit_transform(X)
df_tfidf = pd.DataFrame(X_svd)
df_tfidf.columns = [f'body_tfidf_{i}' for i in range(n_components)]
data = pd.concat([data, df_tfidf], axis=1)
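To make the analyzer='char_wb' setting more concrete, here is a plain-Python approximation of the character n-grams it produces: the text is lowercased and split on whitespace, each token is padded with one space on both sides, and n-grams are taken inside the padded token only. This is a simplified sketch written for this post, not scikit-learn's code; the library's own implementation is the reference.

```python
def char_wb_ngrams(text, min_n=1, max_n=3):
    """Approximate the character n-grams of analyzer='char_wb' (word-boundary aware)."""
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            if len(padded) <= n:
                # Token not longer than n: count it once and stop growing n
                ngrams.append(padded)
                break
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_ngrams("ab", min_n=1, max_n=2))
print(len(char_wb_ngrams("GET /livemsg")))
```

N-grams never span two whitespace-separated tokens, which keeps the vocabulary smaller than with analyzer='char'.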
# Per id: number of distinct values and number of rows for each field
# (ids repeat in the concatenated frame: 19497 distinct ids over 37037 rows)
for f in ['method', 'url', 'refer', 'body', 'browser_family', 'os_family', 'device_family', 'device_brand', 'device_model']:
    data[f'id_{f}_nunique'] = data.groupby(['id'])[f].transform('nunique')
    data[f'id_{f}_count'] = data.groupby(['id'])[f].transform('count')
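The group statistics above can be read as follows: every row receives the number of distinct values and the number of non-missing values that its id has for a given field. A small plain-Python sketch of the same idea is shown below; the helper name and the sample rows are hypothetical.

```python
from collections import defaultdict


def group_nunique_and_count(rows, key, field):
    """Return one (nunique, count) pair per row, computed over all rows sharing the same key."""
    values_by_key = defaultdict(list)
    for row in rows:
        if row[field] is not None:  # missing values are ignored, as pandas does
            values_by_key[row[key]].append(row[field])
    return [
        (len(set(values_by_key[row[key]])), len(values_by_key[row[key]]))
        for row in rows
    ]


# Hypothetical rows: id 7 appears three times with two distinct methods
rows = [
    {"id": 7, "method": "GET"},
    {"id": 7, "method": "GET"},
    {"id": 7, "method": "PUT"},
    {"id": 9, "method": "POST"},
]
stats = group_nunique_and_count(rows, key="id", field="method")
print(stats)
```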
# Index 4 of the urlparse result is the query string
re.split('[=&]', urlparse(data['url'][0])[4])
['C', 'M%3BO%3DA']
def get_url_query(s):
    # Split the query string on '=' and '&' and keep the odd positions, i.e. the parameter values
    li = re.split('[=&]', urlparse(s)[4])
    return [li[i] for i in range(len(li)) if i % 2 == 1]


def find_max_str_length(x):
    # Length of the longest value, 0 if there are none
    li = [len(i) for i in x]
    return max(li) if len(li) > 0 else 0


def find_str_length_std(x):
    # Standard deviation of the value lengths, -1 if there are none
    li = [len(i) for i in x]
    return np.std(li) if len(li) > 0 else -1


data['url_unquote'] = data['url'].apply(unquote)
data['url_query'] = data['url_unquote'].apply(lambda x: get_url_query(x))
data['url_query_num'] = data['url_query'].apply(len)
data['url_query_max_len'] = data['url_query'].apply(find_max_str_length)
data['url_query_len_std'] = data['url_query'].apply(find_str_length_std)

data['url'].apply(unquote)
0 /kelev/scripts/?C=M;O=A
1 /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...
2 /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2...
3 /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...
4 /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty...
...
37032 /livemsg?ad_type=WL_WK&ty=web&pu=1&openudid=64...
37033 /runtime.js
37034 /query?493521812
37035 /stats.php?rand=JtmT4wBtrpNy5RJnNX9wCUo
37036 /api/gateway.do?method=qihoo.sdk.user.mobile.l...
Name: url, Length: 37037, dtype: object
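The split-and-take-odd-positions trick works when every parameter has the form name=value, but it can misalign when a parameter has no '='. As a hedged alternative, not what this solution used, the standard library's parse_qsl keeps names and values paired. The helper names and the second URL below are made up for the comparison.

```python
import re
import statistics
from urllib.parse import parse_qsl, urlparse


def query_values_split(url):
    """Same idea as get_url_query: odd positions after splitting on '=' and '&'."""
    parts = re.split('[=&]', urlparse(url).query)
    return [parts[i] for i in range(len(parts)) if i % 2 == 1]


def query_values_parsed(url):
    """Values from parse_qsl; blank values are kept so the counts stay comparable."""
    return [value for _, value in parse_qsl(urlparse(url).query, keep_blank_values=True)]


def length_stats(values):
    """(number of values, max length, population std of lengths); -1 std when empty."""
    lengths = [len(v) for v in values]
    if not lengths:
        return 0, 0, -1
    return len(lengths), max(lengths), statistics.pstdev(lengths)


regular = "/livemsg?ad_type=WL_WK&oadid=&ty=web"
no_equals = "/search?a=1&flag&b=2"  # hypothetical URL with a bare flag parameter

print(query_values_split(regular), query_values_parsed(regular))
print(query_values_split(no_equals), query_values_parsed(no_equals))
print(length_stats(query_values_parsed(regular)))
```

Note that parse_qsl also percent-decodes values and turns '+' into a space, so the two approaches can differ on encoded input.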
data.head() after the basic features (transposed: one line per column, one entry per sample row)
column | row 0 | row 1 | row 2 | row 3 | row 4
id | 13429 | 18125 | 14538 | 7127 | 7
method | GET | GET | GET | GET | GET
user_agent | '||(select 1 from (select pg_sleep(8))x)||' | Dalvik/2.1.0 (Linux; U; Android 11; M2102K1C B... | Dalvik/2.1.0 (Linux; U; Android 11; M2011K2C B... | Dalvik/2.1.0 (Linux; U; Android 10; MI 9 MIUI/... | Dalvik/2.1.0 (Linux; U; Android 10; ELS-AN00 B...
url | /kelev/scripts/?C=M%3BO%3DA | /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... | /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2... | /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... | /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty...
refer | NAN | NAN | NAN | NAN | NAN
body | GET /kelev/scripts/?C=M%3BO%3DA HTTP/1.1 Accep... | GET /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&... | GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi... | NAN | GET /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid...
0 | 0.000238 | 0.006598 | 0.000783 | 0.007603 | 0.000529
1 | 0.999110 | 0.992451 | 0.999017 | 0.991491 | 0.999257
2 | 0.000445 | 0.000763 | 0.000138 | 0.000725 | 0.000153
3 | 0.000058 | 0.000086 | 0.000031 | 0.000087 | 0.000020
4 | 0.000040 | 0.000067 | 0.000017 | 0.000062 | 0.000021
5 | 0.000110 | 0.000035 | 0.000013 | 0.000033 | 0.000019
label | 1.0 | 1.0 | 1.0 | 1.0 | 1.0
browser_family | Other | Android | Android | Android | Android
os_family | Other | Android | Android | Android | Android
device_family | Other | M2102K1C | M2011K2C | XiaoMi MI 9 | ELS-AN00
device_brand | None | Generic_Android | Generic_Android | XiaoMi | Huawei
device_model | None | M2102K1C | M2011K2C | MI 9 | ELS-AN00
user_agent_len | 43 | 67 | 67 | 64 | 66
url_len | 27 | 1747 | 1688 | 1613 | 1467
refer_len | 3 | 3 | 3 | 3 | 3
body_len | 212 | 2016 | 1986 | 3 | 1704
body_user_agent_len_diff | 169 | 1949 | 1919 | -61 | 1638
body_url_len_diff | 185 | 269 | 298 | -1610 | 237
user_agent_name_tfidf_0 | 0.010070 | 0.035096 | 0.034866 | 0.038503 | 0.023016
user_agent_name_tfidf_1 | 0.009456 | 0.120188 | 0.170330 | 0.058521 | 0.076053
user_agent_name_tfidf_2 | 0.001205 | 0.270092 | 0.292857 | 0.184916 | 0.211935
user_agent_name_tfidf_3 | 0.003217 | 0.655747 | 0.713273 | 0.434441 | 0.431614
user_agent_name_tfidf_4 | 0.021082 | -0.042591 | -0.047369 | 0.026507 | -0.013015
user_agent_name_tfidf_5 | 0.000999 | -0.024738 | -0.025480 | -0.024867 | -0.028123
user_agent_name_tfidf_6 | -0.000847 | -0.019824 | -0.014846 | -0.016191 | 0.016032
user_agent_name_tfidf_7 | -0.002107 | -0.000642 | -0.002015 | 0.021308 | -0.039645
user_agent_name_tfidf_8 | 0.008443 | -0.015951 | -0.095211 | 0.058906 | 0.129501
user_agent_name_tfidf_9 | 0.002747 | -0.183311 | -0.199969 | 0.048677 | 0.388460
user_agent_name_tfidf_10 | 0.023997 | -0.026321 | 0.001352 | -0.015773 | -0.012173
user_agent_name_tfidf_11 | -0.003526 | -0.061993 | 0.004859 | -0.017188 | -0.016553
user_agent_name_tfidf_12 | -0.001894 | -0.018288 | 0.026900 | 0.012508 | -0.012854
user_agent_name_tfidf_13 | -0.013000 | 0.027496 | -0.000053 | 0.026593 | -0.049835
user_agent_name_tfidf_14 | -0.005918 | -0.012776 | -0.045812 | 0.009872 | -0.037182
user_agent_name_tfidf_15 | 0.009153 | -0.002250 | -0.003042 | 0.001220 | -0.024569
url_name_tfidf_0 | 0.066298 | 0.617250 | 0.662307 | 0.621003 | 0.615644
url_name_tfidf_1 | 0.059683 | -0.130247 | -0.134895 | -0.119104 | -0.116622
url_name_tfidf_2 | -0.057310 | 0.079094 | 0.074788 | 0.071898 | 0.066164
url_name_tfidf_3 | -0.006595 | -0.023195 | -0.033836 | -0.014246 | -0.019455
url_name_tfidf_4 | -0.001159 | 0.003702 | 0.004867 | -0.005748 | -0.007366
url_name_tfidf_5 | 0.094755 | -0.030486 | -0.012766 | -0.025066 | -0.022974
url_name_tfidf_6 | -0.021858 | -0.013201 | -0.014195 | -0.015300 | -0.016564
url_name_tfidf_7 | 0.023071 | -0.021136 | -0.019396 | -0.006643 | -0.006819
url_name_tfidf_8 | -0.001968 | 0.012303 | 0.016613 | 0.011277 | -0.000959
url_name_tfidf_9 | 0.043890 | -0.006278 | -0.006143 | -0.011099 | -0.011903
url_name_tfidf_10 | 0.022147 | -0.023727 | -0.018700 | -0.026949 | -0.022060
url_name_tfidf_11 | -0.006037 | -0.003599 | -0.012640 | -0.013011 | -0.011351
url_name_tfidf_12 | -0.005375 | 0.027036 | 0.030395 | 0.027294 | 0.002437
url_name_tfidf_13 | 0.014239 | 0.006405 | 0.004169 | 0.006678 | -0.008505
url_name_tfidf_14 | 0.077494 | -0.005268 | -0.005899 | 0.006156 | -0.008668
url_name_tfidf_15 | 0.081818 | 0.010234 | 0.005060 | 0.004784 | -0.004333
body_tfidf_0 | 0.000054 | 0.000132 | 0.000137 | 1.000000 | 0.000129
body_tfidf_1 | 0.115960 | 0.530101 | 0.557311 | -0.000356 | 0.535128
body_tfidf_2 | 0.105404 | -0.277845 | -0.298627 | -0.000224 | -0.294560
body_tfidf_3 | -0.027789 | 0.022171 | 0.021799 | -0.000019 | 0.024614
body_tfidf_4 | -0.002577 | -0.069403 | -0.072395 | 0.000078 | -0.084538
body_tfidf_5 | 0.064009 | -0.062703 | -0.035369 | 0.000016 | -0.049889
body_tfidf_6 | -0.039324 | -0.006813 | -0.009561 | -0.000132 | -0.019040
body_tfidf_7 | 0.006577 | 0.001156 | -0.020158 | -0.000012 | -0.000086
body_tfidf_8 | 0.023000 | -0.028995 | -0.030540 | -0.000029 | -0.029467
body_tfidf_9 | -0.064038 | -0.009364 | -0.019107 | 0.000011 | -0.026259
body_tfidf_10 | 0.021134 | -0.013567 | -0.011824 | -0.000044 | -0.010166
body_tfidf_11 | -0.058787 | 0.015499 | 0.016220 | -0.000041 | 0.009809
body_tfidf_12 | 0.011353 | -0.009968 | -0.010370 | -0.000041 | -0.017834
body_tfidf_13 | 0.066560 | -0.032960 | -0.021592 | 0.000027 | -0.018439
body_tfidf_14 | 0.010658 | -0.000036 | 0.002470 | -0.000017 | 0.007596
body_tfidf_15 | 0.189591 | -0.008127 | 0.001666 | -0.000027 | -0.017013
body_tfidf_16 | -0.067039 | -0.000813 | -0.004027 | -0.000017 | -0.005269
body_tfidf_17 | -0.116538 | 0.001447 | -0.000666 | -0.000096 | 0.002957
body_tfidf_18 | -0.014128 | 0.009261 | 0.012006 | 0.000005 | 0.006733
body_tfidf_19 | -0.029329 | -0.017541 | -0.009133 | 0.000026 | -0.010658
body_tfidf_20 | 0.047018 | -0.000682 | -0.007882 | 0.000013 | -0.006394
body_tfidf_21 | -0.023777 | -0.003697 | -0.001795 | 0.000016 | -0.005429
body_tfidf_22 | -0.035825 | 0.007340 | -0.003188 | -0.000005 | 0.011300
body_tfidf_23 | 0.005829 | -0.010968 | -0.015516 | -0.000005 | -0.024379
body_tfidf_24 | 0.013030 | 0.008710 | 0.010797 | 0.000015 | 0.006489
body_tfidf_25 | -0.022777 | -0.067023 | -0.083144 | -0.000004 | -0.061012
body_tfidf_26 | -0.007482 | -0.014870 | -0.021120 | 0.000011 | -0.019155
body_tfidf_27 | 0.039357 | -0.024112 | -0.029922 | 0.000021 | -0.021446
body_tfidf_28 | 0.046385 | 0.011792 | 0.020777 | -0.000004 | 0.021441
body_tfidf_29 | -0.007902 | 0.004538 | 0.001687 | 0.000045 | -0.001876
body_tfidf_30 | -0.029089 | 0.014397 | 0.006579 | 0.000019 | 0.002968
body_tfidf_31 | -0.051989 | -0.003550 | -0.001808 | -0.000019 | -0.005974
id_method_nunique | 1 | 1 | 1 | 1 | 2
id_method_count | 2 | 2 | 2 | 3 | 5
id_url_nunique | 2 | 2 | 2 | 3 | 5
id_url_count | 2 | 2 | 2 | 3 | 5
id_refer_nunique | 1 | 1 | 1 | 1 | 3
id_refer_count | 2 | 2 | 2 | 3 | 5
id_body_nunique | 2 | 2 | 2 | 2 | 5
id_body_count | 2 | 2 | 2 | 3 | 5
id_browser_family_nunique | 2 | 2 | 2 | 2 | 3
id_browser_family_count | 2 | 2 | 2 | 3 | 5
id_os_family_nunique | 2 | 2 | 2 | 2 | 3
id_os_family_count | 2 | 2 | 2 | 3 | 5
id_device_family_nunique | 1 | 2 | 2 | 3 | 3
id_device_family_count | 2 | 2 | 2 | 3 | 5
id_device_brand_nunique | 1 | 2 | 2 | 2 | 3
id_device_brand_count | 2 | 2 | 2 | 3 | 5
id_device_model_nunique | 1 | 2 | 2 | 3 | 3
id_device_model_count | 2 | 2 | 2 | 3 | 5
url_unquote | /kelev/scripts/?C=M;O=A | /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... | /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2... | /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... | /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty...
url_query | [M;O] | [WL_WK, , web, 0, 1, 210810, 116, 1, 8, fa0d30... | [WL_WK, web, 0, d24c93f6c8de719a00f1676f3a9a53... | [WL_WK, , web, 0, 1, 201209, 116, 1, 8, bbe035... | [v5004, WL_WK, , web, 0, 20220209V0BT5X00, 1, ...
url_query_num | 1 | 23 | 29 | 24 | 27
url_query_max_len | 3 | 1324 | 1154 | 1186 | 972
url_query_len_std | 0.000000 | 268.709026 | 209.374211 | 235.820461 | 182.378599
def find_url_filetype(x):
    # First '.' followed by lowercase letters in the path, e.g. '.php'; '__NaN__' if there is none
    try:
        return re.search(r'\.[a-z]+', x).group()
    except AttributeError:
        return '__NaN__'


data['url_path'] = data['url_unquote'].apply(lambda x: urlparse(x)[2])
data['url_filetype'] = data['url_path'].apply(lambda x: find_url_filetype(x))

# Path length and number of '/' separators
data['url_path_len'] = data['url_path'].apply(len)
data['url_path_num'] = data['url_path'].apply(lambda x: len(re.findall('/', x)))
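The regular expression above returns the first dot-plus-letters match anywhere in the path, which is not necessarily the extension of the last path segment. As a possible variant, offered only as a sketch and not as the method used in this solution, the extension could be taken from the last segment with posixpath. Both helper names and the sample paths other than /naizau.jsp/ are hypothetical.

```python
import posixpath
import re


def first_dot_match(path):
    """Behaviour of the regex approach: first '.' followed by lowercase letters."""
    match = re.search(r'\.[a-z]+', path)
    return match.group() if match else '__NaN__'


def last_segment_extension(path):
    """Extension of the last path segment, ignoring a trailing '/'."""
    last_segment = posixpath.basename(path.rstrip('/'))
    ext = posixpath.splitext(last_segment)[1]
    return ext.lower() if ext else '__NaN__'


for path in ['/demo/aisec/upload.php', '/app.v2/login.jsp', '/naizau.jsp/', '/livemsg']:
    print(path, first_dot_match(path), last_segment_extension(path))
```

The two functions agree on simple paths and differ when a directory name contains a dot.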
# Part of the User-Agent before the first '/' and its first space-separated token
data['ua_short'] = data['user_agent'].apply(lambda x: x.split('/')[0])
data['ua_first'] = data['user_agent'].apply(lambda x: x.split(' ')[0])
# Label-encode the categorical columns as integers
for col in tqdm(['method', 'refer', 'browser_family', 'os_family', 'device_family', 'device_brand', 'device_model', 'url_filetype', 'ua_short', 'ua_first']):
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 82.97it/s]
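For readers unfamiliar with label encoding, the idea is simply to map each distinct category to an integer, with the categories taken in sorted order. The sketch below mimics that with a dictionary; `fit_label_encoder` and the sample values are illustrative and not part of the original code.

```python
def fit_label_encoder(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    classes = sorted(set(values))
    return {value: code for code, value in enumerate(classes)}


methods = ['GET', 'POST', 'GET', 'PUT']  # illustrative values
mapping = fit_label_encoder(methods)
encoded = [mapping[m] for m in methods]
print(mapping)
print(encoded)
```

The integer codes carry no meaningful order, which tree models such as LightGBM usually tolerate.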
len(data.select_dtypes(include=['int', 'float']).columns.tolist())
109
# Keep only the numeric columns; everything except id and label is a feature
col = data.select_dtypes(include=['int', 'float']).columns.tolist()
data = data[col]
feature_names = [i for i in col if i not in ['id', 'label']]

# Rows with a label are training data, rows without one are test data
train = data[data['label'].notnull()].reset_index(drop=True)
test = data[~data['label'].notnull()].reset_index(drop=True)

x_train = train[feature_names]
y_train = train['label']
x_test = test[feature_names]
(33037, 107)
def lgb_model(train, target, test, k):
    feats = [f for f in train.columns if f not in ['id', 'label', 'url', 'url_count']]
    print('Current num of features:', len(feats))

    oof_probs = np.zeros((train.shape[0], 6))
    output_preds = 0
    offline_score = []
    feature_importance_df = pd.DataFrame()
    parameters = {
        'learning_rate': 0.03,
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'metric': 'multi_error',
        'num_class': 6,
        'num_leaves': 31,
        'feature_fraction': 0.6,
        'bagging_fraction': 0.8,
        'min_data_in_leaf': 15,
        'verbose': -1,
        'nthread': -1,
        'max_depth': 7
    }

    seeds = [2020]
    for seed in seeds:
        folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
        for i, (train_index, test_index) in enumerate(folds.split(train, target)):
            train_y, test_y = target.iloc[train_index], target.iloc[test_index]
            train_X, test_X = train[feats].iloc[train_index, :], train[feats].iloc[test_index, :]

            dtrain = lgb.Dataset(train_X, label=train_y)
            dval = lgb.Dataset(test_X, label=test_y)
            # Named `model` so that it does not shadow the enclosing function
            model = lgb.train(
                parameters,
                dtrain,
                num_boost_round=8000,
                valid_sets=[dval],
                callbacks=[early_stopping(100), log_evaluation(100)],
            )
            # Out-of-fold probabilities; accumulated so that several seeds would average correctly
            oof_probs[test_index] += model.predict(test_X[feats], num_iteration=model.best_iteration) / len(seeds)
            offline_score.append(model.best_score['valid_0']['multi_error'])
            # Test predictions averaged over folds and seeds
            output_preds += model.predict(test[feats], num_iteration=model.best_iteration) / folds.n_splits / len(seeds)
            print(offline_score)
            # Gain-based feature importance of this fold
            fold_importance_df = pd.DataFrame()
            fold_importance_df["feature"] = feats
            fold_importance_df["importance"] = model.feature_importance(importance_type='gain')
            fold_importance_df["fold"] = i + 1
            feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    # Note: despite the 'AUC' in the message, the values are multi_error (lower is better)
    print('OOF-MEAN-AUC:%.6f, OOF-STD-AUC:%.6f' % (np.mean(offline_score), np.std(offline_score)))
    print('feature importance:')
    print(feature_importance_df.groupby(['feature'])['importance'].mean().sort_values(ascending=False).head(50))

    return output_preds, oof_probs, np.mean(offline_score), feature_importance_df
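The training function relies on stratified K-fold splitting: every sample lands in exactly one validation fold, and each fold keeps roughly the class proportions of the whole set. The sketch below shows one simple way to build such folds in plain Python (shuffle each class, then deal its samples round-robin); it is an illustration of the idea under that assumption, not scikit-learn's StratifiedKFold algorithm, and the function name and toy labels are made up.

```python
import random
from collections import Counter, defaultdict


def stratified_kfold_indices(labels, k, seed):
    """Yield (train_idx, valid_idx) pairs with classes dealt round-robin over k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    all_idx = set(range(len(labels)))
    for valid in folds:
        yield sorted(all_idx - set(valid)), sorted(valid)


# Toy labels: 10 samples of class 0 and 5 of class 1
labels = [0] * 10 + [1] * 5
oof_fold = [None] * len(labels)
for fold_no, (train_idx, valid_idx) in enumerate(stratified_kfold_indices(labels, k=5, seed=2020)):
    for idx in valid_idx:
        oof_fold[idx] = fold_no  # the fold whose model would predict this sample

print(Counter(oof_fold))
```

Because each sample is predicted only by the model that did not see it, the out-of-fold predictions give an honest estimate of performance on unseen data.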
print('Start model training (train)')
lgb_preds, lgb_oof, lgb_score, feature_importance_df = lgb_model(train=train[feature_names],
                                                                 target=train['label'],
                                                                 test=test[feature_names], k=5)

Start model training (train)
Current num of features: 107
Training until validation scores don't improve for 100 rounds
[100] valid_0's multi_error: 0.0145278
[200] valid_0's multi_error: 0.0136199
[300] valid_0's multi_error: 0.0134685
Early stopping, best iteration is:
[203] valid_0's multi_error: 0.0133172
[0.013317191283292978]
Training until validation scores don't improve for 100 rounds
[100] valid_0's multi_error: 0.0119552
Early stopping, best iteration is:
[63] valid_0's multi_error: 0.0116525
[0.013317191283292978, 0.011652542372881356]
Training until validation scores don't improve for 100 rounds
[100] valid_0's multi_error: 0.011503
Early stopping, best iteration is:
[92] valid_0's multi_error: 0.011503
[0.013317191283292978, 0.011652542372881356, 0.011502951415165732]
Training until validation scores don't improve for 100 rounds
[100] valid_0's multi_error: 0.011503
Early stopping, best iteration is:
[55] valid_0's multi_error: 0.0107462
[0.013317191283292978, 0.011652542372881356, 0.011502951415165732, 0.010746178295746934]
Training until validation scores don't improve for 100 rounds
[100] valid_0's multi_error: 0.011503
Early stopping, best iteration is:
[21] valid_0's multi_error: 0.0112002
[0.013317191283292978, 0.011652542372881356, 0.011502951415165732, 0.010746178295746934, 0.011200242167398214]
OOF-MEAN-AUC:0.011684, OOF-STD-AUC:0.000873
feature importance:
feature
2 222576.439984
1 211736.666876
0 147032.174178
4 108800.598945
3 86021.756831
browser_family 62270.501839
5 39814.736140
body_tfidf_0 27317.752312
body_tfidf_1 23303.336642
url_name_tfidf_2 16427.787706
url_name_tfidf_14 15618.236285
url_name_tfidf_3 14344.577968
user_agent_len 9418.270741
body_user_agent_len_diff 9103.063841
user_agent_name_tfidf_2 8463.647873
url_query_max_len 6510.758108
user_agent_name_tfidf_5 4018.171339
body_tfidf_3 3672.676081
body_tfidf_23 3355.861338
user_agent_name_tfidf_4 3142.178290
url_name_tfidf_4 3127.492671
url_name_tfidf_13 2886.949569
url_name_tfidf_1 2849.448233
url_name_tfidf_5 2676.723486
user_agent_name_tfidf_7 2330.854639
user_agent_name_tfidf_14 2310.199494
refer_len 2145.159362
user_agent_name_tfidf_3 2103.610035
body_tfidf_10 1986.768004
url_name_tfidf_6 1866.483835
body_url_len_diff 1849.085455
body_tfidf_6 1846.585181
body_len 1821.060478
user_agent_name_tfidf_0 1743.131918
id_method_count 1659.460621
user_agent_name_tfidf_13 1627.554725
body_tfidf_5 1539.584926
url_name_tfidf_8 1485.817275
url_name_tfidf_9 1449.335372
url_name_tfidf_12 1429.778973
url_name_tfidf_10 1413.196746
url_name_tfidf_11 1296.268733
user_agent_name_tfidf_9 1290.671890
url_len 1284.274063
body_tfidf_12 1279.712368
body_tfidf_9 1209.730761
body_tfidf_8 1161.935181
url_name_tfidf_15 1158.114828
url_name_tfidf_7 998.550187
url_name_tfidf_0 998.193206
Name: importance, dtype: float64
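The summary line in the log above is labelled AUC, but the metric configured for training is multi_error, i.e. the misclassification rate on each validation fold. As a quick check, the mean and standard deviation can be recomputed from the five fold values printed in the log; one minus the mean error then approximates the overall out-of-fold accuracy (only approximately, because the folds differ slightly in size). Variable names are illustrative.

```python
import statistics

# Fold errors copied from the training log above
fold_errors = [
    0.013317191283292978,
    0.011652542372881356,
    0.011502951415165732,
    0.010746178295746934,
    0.011200242167398214,
]

mean_error = statistics.mean(fold_errors)
std_error = statistics.pstdev(fold_errors)  # population std, the same convention as np.std
approx_accuracy = 1 - mean_error

print(f"mean error {mean_error:.6f}, std {std_error:.6f}, approx. accuracy {approx_accuracy:.6f}")
```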
# The predicted class is the one with the highest averaged probability
sub['predict'] = np.argmax(lgb_preds, axis=1)
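The test prediction is an average of the probability matrices from the fold models, followed by an argmax per row. A tiny plain-Python sketch with made-up numbers shows why averaging probabilities is not the same as majority voting: in the second row the two folds disagree, and the average decides.

```python
def average_probabilities(fold_probs):
    """Element-wise mean of several (rows x classes) probability matrices."""
    n_folds = len(fold_probs)
    n_rows = len(fold_probs[0])
    n_classes = len(fold_probs[0][0])
    return [
        [sum(fold[r][c] for fold in fold_probs) / n_folds for c in range(n_classes)]
        for r in range(n_rows)
    ]


def argmax(row):
    """Index of the largest value (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


# Made-up probabilities from two fold models for two samples and three classes
fold_probs = [
    [[0.7, 0.2, 0.1], [0.1, 0.5, 0.4]],
    [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]],
]
averaged = average_probabilities(fold_probs)
predictions = [argmax(row) for row in averaged]
print(predictions)
```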
sub['predict'].value_counts()
2 855
1 828
0 804
3 666
4 447
5 400
Name: predict, dtype: int64
accuracy_score(train['label'], np.argmax(lgb_oof, axis=1))
0.9883161303992494
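# Macro F1 on the out-of-fold predictions; arguments in the documented order (y_true, y_pred).
# Macro F1 is symmetric in the two arguments, so the value is the same as with the swapped order.
f1_score(train['label'], np.argmax(lgb_oof, axis=1), average='macro')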
0.9650199644033531
print(classification_report(train['label'], np.argmax(lgb_oof, axis=1)))
              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      6489
         1.0       0.99      0.99      0.99     14038
         2.0       0.99      0.99      0.99      9939
         3.0       0.94      0.93      0.94      1215
         4.0       0.94      0.87      0.90       697
         5.0       0.97      0.98      0.98       659

    accuracy                           0.99     33037
   macro avg       0.97      0.96      0.97     33037
weighted avg       0.99      0.99      0.99     33037
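The gap between the macro and weighted averages in the report comes from how the classes are weighted. As a worked example, the two F1 averages can be recomputed from the per-class rows above; the inputs are the rounded values from the report, so the results only match to two decimals.

```python
# (f1-score, support) per class, copied from the report above (rounded to two decimals)
per_class = {
    0: (1.00, 6489),
    1: (0.99, 14038),
    2: (0.99, 9939),
    3: (0.94, 1215),
    4: (0.90, 697),
    5: (0.98, 659),
}

# Macro average: every class counts equally
macro_avg = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: every class counts in proportion to its support
total_support = sum(n for _, n in per_class.values())
weighted_avg = sum(f1 * n for f1, n in per_class.values()) / total_support

print(f"macro avg F1 {macro_avg:.2f}, weighted avg F1 {weighted_avg:.2f}")
```

The weaker scores on labels 3 and 4 pull the macro average down, while the weighted average is dominated by the three large classes.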