Jiangdong's Notes (江东的笔记)

Web Attack Detection and Classification: Approach

CCF BDCI Web Attack Detection and Classification: Top-8 Solution

Competition page: https://www.datafountain.cn/competitions/596

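Competition background:

A business platform captures more than 200 million Web attacks per month on average, covering common types such as injection attacks and code execution. Traditional threat detection matches rules against known attack signatures, so it cannot detect unknown vulnerabilities or attack techniques. Quickly and accurately identifying unknown threats, and classifying the different attacks correctly, is therefore essential to improving Web attack detection. Using machine learning and deep learning to identify and classify attack payloads has become an innovative approach to this problem, and it helps advance the research and application of AI in threat detection and analysis.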

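Competition task:

Teams analyze the provided training set and build AI models using feature engineering, machine learning, deep learning, and similar methods. The goal is to classify every sample correctly and quickly while steadily improving precision and recall. Once the models are tuned and stable, an unlabeled test set is used to evaluate each team's classification performance, and model quality is assessed by how correctly it classifies that test set (the evaluation criteria below specify macro F1 as the final metric).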

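Final defense:

In the final defense, expert reviewers score each entry on innovation, usability, and related criteria. The final ranking combines the preliminary score, innovation, usability, and other factors: final score = reproduced preliminary score * 80% + final-round score * 20%.
Note that the defense focuses on the following:
(1) The model's ability to discover unknown attack types
(2) The model's time complexity
(3) Other innovations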

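Data overview

The training set covers 6 different labels and contains about 35,000 records in total. Its main fields are:
● ID: sample ID
● label: attack type ID
● Others: HTTP request content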

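Evaluation criteria

During the evaluation period an unlabeled test set is provided. Teams submit the model's classification result for every record in that test set by adding a predict field (the model's classification result) to each record, with the same meaning as the label field in the training set.
The evaluation program compares the predictions (predict) with the ground truth (label) and computes precision, recall, and F1; the final ranking is based on F1.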

Label             Classified as positive    Classified as negative
Positive label    TP                        FN
Negative label    FP                        TN

Precision: Precision = TP / (TP + FP)
Recall: Recall = TP / (TP + FN)
F1: F1 = 2 * Precision * Recall / (Precision + Recall)
Note: this F1 is the macro F1.
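
As a minimal sketch of the metric described above, the following computes per-class precision, recall, and F1 from true and predicted labels, then averages the per-class F1 values with equal weight (macro F1). The function name `macro_f1` and the toy labels are illustrative assumptions, not the competition's scoring code.

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class (one-vs-rest) F1 scores."""
    f1_scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / n_classes


# Toy example (hypothetical labels): class 0 is frequent, class 1 is rare.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
score = macro_f1(y_true, y_pred, n_classes=2)
print(round(score, 4))
```

Because every class counts equally, the rare class's lower F1 pulls the score below the plain accuracy of 5/6, which is why the small attack classes matter under this metric.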

The code is as follows

# lgb with tfidf+refer removed: offline 0.9644, online 0.95997
# lgb with tfidf+refer kept: offline 0.96432, online 0.95828
# xgb: offline 0.96494
# cat: offline 0.9650, online 0.9613
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from lightgbm import early_stopping
from lightgbm import log_evaluation
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score,f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
from user_agents import parse
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from urllib.parse import quote, unquote, urlparse
import re
import glob

pd.set_option('display.max_columns', None)
# Parse a User-Agent string (for example an iPhone UA) into browser, OS, and device fields

def get_ua(row):
    user_agent = parse(row['user_agent'])

    browser_family = str(user_agent.browser.family)
    os_family = str(user_agent.os.family)
    device_family = str(user_agent.device.family)
    device_brand = str(user_agent.device.brand)
    device_model = str(user_agent.device.model)
    return browser_family, os_family, device_family, device_brand, device_model
prob = np.load('E://data//DF//Web攻击检测与分类识别//large/deberta-v3-large_probs.npy')
prob.shape
(4000, 6)
train=pd.read_pickle('E:\\data\\DF\\Web攻击检测与分类识别/large/oof_df.pkl')     # 33037 rows × 15 columns
sub = pd.read_csv('E:\\data\\DF\\Web攻击检测与分类识别\\submit_example (10).csv')

test=pd.read_csv('E:\\data\\DF\\Web攻击检测与分类识别/test.csv')
train.head()
id method user_agent url refer body label text fold 0 1 2 3 4 5
0 13429 GET '||(select 1 from (select pg_sleep(8))x)||' /kelev/scripts/?C=M%3BO%3DA NAN GET /kelev/scripts/?C=M%3BO%3DA HTTP/1.1 Accep... 1 method:GET[SEP]user_agent:'||(select 1 from (s... 0 0.000238 0.999110 0.000445 0.000058 0.000040 0.000110
1 18125 GET Dalvik/2.1.0 (Linux; U; Android 11; M2102K1C B... /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... NAN GET /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&... 1 method:GET[SEP]user_agent:Dalvik/2.1.0 (Linux;... 0 0.006598 0.992451 0.000763 0.000086 0.000067 0.000035
2 14538 GET Dalvik/2.1.0 (Linux; U; Android 11; M2011K2C B... /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2... NAN GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi... 1 method:GET[SEP]user_agent:Dalvik/2.1.0 (Linux;... 0 0.000783 0.999017 0.000138 0.000031 0.000017 0.000013
3 7127 GET Dalvik/2.1.0 (Linux; U; Android 10; MI 9 MIUI/... /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... NAN NAN 1 method:GET[SEP]user_agent:Dalvik/2.1.0 (Linux;... 0 0.007603 0.991491 0.000725 0.000087 0.000062 0.000033
4 7 GET Dalvik/2.1.0 (Linux; U; Android 10; ELS-AN00 B... /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty... NAN GET /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid... 1 method:GET[SEP]user_agent:Dalvik/2.1.0 (Linux;... 0 0.000529 0.999257 0.000153 0.000020 0.000021 0.000019
test.head()

id method user_agent url refer body
0 0 GET Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... /demo/aisec/upload.php?act='%7C%7C(select+1+fr... http://demo.aisec.cn/demo/aisec/upload.php?t=0... GET /demo/aisec/upload.php?act='%7C%7C(select+...
1 1 GET Dalvik/2.1.0 (Linux; U; Android 11; M2102J2SC ... /livemsg?ad_type=WL_WK&ty=web&pu=1&openudid=5f... NaN GET /livemsg?ad_type=WL_WK&ty=web&pu=1&openudi...
2 2 GET Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/2... /create_user/?username=%3Cscript%3Ealert(docum... NaN NaN
3 3 GET NaN /mmsns/WeDwicXmkOl4kjKsBycicI0H3q41r6syFFvu46h... NaN NaN
4 4 PUT Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/2... /naizau.jsp/ NaN GET /login HTTP/1.1 Host: 111.160.211.18:8088 ...
train.isnull().sum()
id            0
method        0
user_agent    0
url           0
refer         0
body          0
label         0
text          0
fold          0
0             0
1             0
2             0
3             0
4             0
5             0
dtype: int64
train.columns
Index(['id', 'method', 'user_agent', 'url', 'refer', 'body', 'label', 'text',
       'fold', '0', '1', '2', '3', '4', '5'],
      dtype='object')
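# Attach the deberta-v3-large class probabilities loaded above to the test set as six feature columns
test[['0', '1', '2', '3', '4', '5']] = prob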
# test['label']=pd.read_csv('models/v4/lgb.csv')['predict']
train=train.drop('text',axis=1)
train=train.drop('fold',axis=1)
train = train[['id', 'method', 'user_agent', 'url', 'refer', 'body', '0', '1',
               '2', '3', '4', '5', 'label']]
print("train.shape",train.shape)
print("test.shape",test.shape)
train.shape (33037, 13)
test.shape (4000, 12)

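Data analysis

The training set covers 6 different labels and contains about 35,000 records in total. Its main fields are:
● lable: attack type ID
● Others: HTTP request content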

# Inspect the training-set columns
train.columns
Index(['id', 'method', 'user_agent', 'url', 'refer', 'body', '0', '1', '2',
       '3', '4', '5', 'label'],
      dtype='object')

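The name 'lable' looks awkward, so rename it to 'label'. (In the DataFrame loaded above the column is already named 'label', so the rename is a no-op here.)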

train=train.rename(columns={'lable':'label'})
train.dtypes
id              int64
method         object
user_agent     object
url            object
refer          object
body           object
0             float32
1             float32
2             float32
3             float32
4             float32
5             float32
label           int64
dtype: object
# Count the samples per label
train['label'].value_counts()
1    14038
2     9939
0     6489
3     1215
4      697
5      659
Name: label, dtype: int64
train['label'].value_counts().plot(kind='bar')
plt.show()

[Figure: bar chart of the training-set label counts]

data=pd.concat([train,test],axis=0).reset_index(drop=True)
data.nunique()
id            19497
method           21
user_agent     1087
url           36613
refer           941
body          22380
0             35915
1             31618
2             36061
3             36961
4             36965
5             36959
label             6
dtype: int64
# Fill missing values with the placeholder string 'NAN'
data['user_agent']=data['user_agent'].fillna('NAN')
data['refer']=data['refer'].fillna('NAN')
data['body']=data['body'].fillna('NAN')
data['url']=data['url'].fillna('NAN')
# Extract user_agent features
ua_cols=['browser_family', 'os_family', 'device_family','device_brand','device_model']
data[ua_cols] = data.apply(get_ua, axis=1, result_type="expand")
data.head()

id method user_agent url refer body 0 1 2 3 4 5 label browser_family os_family device_family device_brand device_model
0 13429 GET '||(select 1 from (select pg_sleep(8))x)||' /kelev/scripts/?C=M%3BO%3DA NAN GET /kelev/scripts/?C=M%3BO%3DA HTTP/1.1 Accep... 0.000238 0.999110 0.000445 0.000058 0.000040 0.000110 1.0 Other Other Other None None
1 18125 GET Dalvik/2.1.0 (Linux; U; Android 11; M2102K1C B... /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... NAN GET /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&... 0.006598 0.992451 0.000763 0.000086 0.000067 0.000035 1.0 Android Android M2102K1C Generic_Android M2102K1C
2 14538 GET Dalvik/2.1.0 (Linux; U; Android 11; M2011K2C B... /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2... NAN GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi... 0.000783 0.999017 0.000138 0.000031 0.000017 0.000013 1.0 Android Android M2011K2C Generic_Android M2011K2C
3 7127 GET Dalvik/2.1.0 (Linux; U; Android 10; MI 9 MIUI/... /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... NAN NAN 0.007603 0.991491 0.000725 0.000087 0.000062 0.000033 1.0 Android Android XiaoMi MI 9 XiaoMi MI 9
4 7 GET Dalvik/2.1.0 (Linux; U; Android 10; ELS-AN00 B... /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty... NAN GET /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid... 0.000529 0.999257 0.000153 0.000020 0.000021 0.000019 1.0 Android Android ELS-AN00 Huawei ELS-AN00

Basic features

import urllib.parse
import urllib
data['user_agent_len']=data['user_agent'].apply(lambda x:len(x))
data['url_len']=data['url'].apply(lambda x:len(x))
data['refer_len']=data['refer'].apply(lambda x:len(x))
data['body_len']=data['body'].apply(lambda x:len(x))
data['body_user_agent_len_diff']=data['body_len']-data['user_agent_len']
data['body_url_len_diff']=data['body_len']-data['url_len']
# Extract character-level TF-IDF features from the user_agent column, then reduce them to 16 dimensions with SVD
texts = data['user_agent'].values.tolist()

n_components = 16  # target number of dimensions
tf = TfidfVectorizer(min_df=1, max_df=0.5, analyzer='char_wb', ngram_range=(1, 3))  # ngram_range = (2,5)
X = tf.fit_transform(texts)
svd = TruncatedSVD(n_components=n_components,
                   random_state=42)
X_svd = svd.fit_transform(X)
df_tfidf = pd.DataFrame(X_svd)
df_tfidf.columns = [f'user_agent_name_tfidf_{i}' for i in range(n_components)]
data = pd.concat([data, df_tfidf], axis=1)
texts=data['url'].values.tolist()

n_components = 16
tf = TfidfVectorizer(min_df= 1, max_df=0.5,analyzer = 'char_wb', ngram_range = (1,3))
X = tf.fit_transform(texts)
svd = TruncatedSVD(n_components=n_components,
random_state=42)
X_svd = svd.fit_transform(X)
df_tfidf = pd.DataFrame(X_svd)
df_tfidf.columns = [f'url_name_tfidf_{i}' for i in range(n_components)]
data=pd.concat([data,df_tfidf],axis=1)
# texts=data['refer'].values.tolist()

# n_components = 16
# tf = TfidfVectorizer(min_df= 1, max_df=0.5,analyzer = 'char_wb', ngram_range = (1,3))
# X = tf.fit_transform(texts)
# svd = TruncatedSVD(n_components=n_components,
# random_state=42)
# X_svd = svd.fit_transform(X)
# df_tfidf = pd.DataFrame(X_svd)
# df_tfidf.columns = [f'refer_tfidf_{i}' for i in range(n_components)]
# data=pd.concat([data,df_tfidf],axis=1)
texts=data['body'].values.tolist()

n_components = 32
tf = TfidfVectorizer(min_df= 1, max_df=0.5,analyzer = 'char_wb', ngram_range = (1,3))
X = tf.fit_transform(texts)
svd = TruncatedSVD(n_components=n_components,
random_state=42)
X_svd = svd.fit_transform(X)
df_tfidf = pd.DataFrame(X_svd)
df_tfidf.columns = [f'body_tfidf_{i}' for i in range(n_components)]
data=pd.concat([data,df_tfidf],axis=1)
# Per-id statistics: number of distinct values and number of rows that share the same id
for f in ['method', 'url', 'refer', 'body', 'browser_family', 'os_family', 'device_family', 'device_brand', 'device_model']:    # refer
    data[f'id_{f}_nunique'] = data.groupby(['id'])[f].transform('nunique')
    data[f'id_{f}_count'] = data.groupby(['id'])[f].transform('count')
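
To make the two aggregate features concrete, here is a standard-library sketch of what `groupby('id')[f].transform('nunique')` and `transform('count')` produce for each row: the number of distinct values and the number of rows within the row's id group. The helper name `id_group_stats` and the toy rows are hypothetical, and the sketch ignores missing values (which the notebook has already filled with 'NAN').

```python
from collections import defaultdict


def id_group_stats(ids, values):
    """For each row, return (nunique, count) of `values` within the row's id group."""
    groups = defaultdict(list)
    for row_id, value in zip(ids, values):
        groups[row_id].append(value)
    return [(len(set(groups[row_id])), len(groups[row_id])) for row_id in ids]


# Hypothetical rows: id 7 appears three times with two distinct methods.
ids = [7, 7, 13, 7]
methods = ['GET', 'POST', 'GET', 'GET']
stats = id_group_stats(ids, methods)
print(stats)  # [(2, 3), (2, 3), (1, 1), (2, 3)]
```

In the notebook these statistics are computed on the concatenated train and test data, so an id that occurs in both sets contributes to the same group.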
re.split('[=&]', urlparse(data['url'][0])[4])
['C', 'M%3BO%3DA']
def get_url_query(s):
    # Split the query string on '=' and '&' and keep the odd-indexed pieces (the values)
    li = re.split('[=&]', urlparse(s)[4])
    return [li[i] for i in range(len(li)) if i % 2 == 1]


def find_max_str_length(x):
    # Length of the longest string in x, or 0 if x is empty
    li = [len(i) for i in x]
    return max(li) if len(li) > 0 else 0


def find_str_length_std(x):
    # Standard deviation of the string lengths in x, or -1 if x is empty
    li = [len(i) for i in x]
    return np.std(li) if len(li) > 0 else -1


data['url_unquote'] = data['url'].apply(unquote)
data['url_query'] = data['url_unquote'].apply(lambda x: get_url_query(x))
data['url_query_num'] = data['url_query'].apply(len)
data['url_query_max_len'] = data['url_query'].apply(find_max_str_length)
data['url_query_len_std'] = data['url_query'].apply(find_str_length_std)
data['url'].apply(unquote)
0                                  /kelev/scripts/?C=M;O=A
1        /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...
2        /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2...
3        /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...
4        /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty...
                               ...                        
37032    /livemsg?ad_type=WL_WK&ty=web&pu=1&openudid=64...
37033                                          /runtime.js
37034                                     /query?493521812
37035              /stats.php?rand=JtmT4wBtrpNy5RJnNX9wCUo
37036    /api/gateway.do?method=qihoo.sdk.user.mobile.l...
Name: url, Length: 37037, dtype: object
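
The regex split in `get_url_query` assumes strictly alternating keys and values, so a parameter without `=` shifts every later value by one position. As a possible alternative (an assumption, not what the notebook uses), the standard library's `urllib.parse.parse_qsl` pairs keys and values explicitly. The helper name `query_values` and the sample URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlparse


def query_values(url):
    """Return the query-string values of `url`, keeping empty values."""
    query = urlparse(url).query
    # keep_blank_values=True keeps parameters such as 'oadid=' that have an empty value.
    return [value for _, value in parse_qsl(query, keep_blank_values=True)]


# Hypothetical URL in the style of the dataset.
values = query_values('/livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0')
print(values)  # ['WL_WK', '', 'web', '0']
```

`parse_qsl` percent-decodes the values itself, so it would be applied to the raw `url` column rather than to `url_unquote`, and the resulting features may differ slightly from the regex version.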
data.head()

id method user_agent url refer body 0 1 2 3 4 5 label browser_family os_family device_family device_brand device_model user_agent_len url_len refer_len body_len body_user_agent_len_diff body_url_len_diff user_agent_name_tfidf_0 user_agent_name_tfidf_1 user_agent_name_tfidf_2 user_agent_name_tfidf_3 user_agent_name_tfidf_4 user_agent_name_tfidf_5 user_agent_name_tfidf_6 user_agent_name_tfidf_7 user_agent_name_tfidf_8 user_agent_name_tfidf_9 user_agent_name_tfidf_10 user_agent_name_tfidf_11 user_agent_name_tfidf_12 user_agent_name_tfidf_13 user_agent_name_tfidf_14 user_agent_name_tfidf_15 url_name_tfidf_0 url_name_tfidf_1 url_name_tfidf_2 url_name_tfidf_3 url_name_tfidf_4 url_name_tfidf_5 url_name_tfidf_6 url_name_tfidf_7 url_name_tfidf_8 url_name_tfidf_9 url_name_tfidf_10 url_name_tfidf_11 url_name_tfidf_12 url_name_tfidf_13 url_name_tfidf_14 url_name_tfidf_15 body_tfidf_0 body_tfidf_1 body_tfidf_2 body_tfidf_3 body_tfidf_4 body_tfidf_5 body_tfidf_6 body_tfidf_7 body_tfidf_8 body_tfidf_9 body_tfidf_10 body_tfidf_11 body_tfidf_12 body_tfidf_13 body_tfidf_14 body_tfidf_15 body_tfidf_16 body_tfidf_17 body_tfidf_18 body_tfidf_19 body_tfidf_20 body_tfidf_21 body_tfidf_22 body_tfidf_23 body_tfidf_24 body_tfidf_25 body_tfidf_26 body_tfidf_27 body_tfidf_28 body_tfidf_29 body_tfidf_30 body_tfidf_31 id_method_nunique id_method_count id_url_nunique id_url_count id_refer_nunique id_refer_count id_body_nunique id_body_count id_browser_family_nunique id_browser_family_count id_os_family_nunique id_os_family_count id_device_family_nunique id_device_family_count id_device_brand_nunique id_device_brand_count id_device_model_nunique id_device_model_count url_unquote url_query url_query_num url_query_max_len url_query_len_std
0 13429 GET '||(select 1 from (select pg_sleep(8))x)||' /kelev/scripts/?C=M%3BO%3DA NAN GET /kelev/scripts/?C=M%3BO%3DA HTTP/1.1 Accep... 0.000238 0.999110 0.000445 0.000058 0.000040 0.000110 1.0 Other Other Other None None 43 27 3 212 169 185 0.010070 0.009456 0.001205 0.003217 0.021082 0.000999 -0.000847 -0.002107 0.008443 0.002747 0.023997 -0.003526 -0.001894 -0.013000 -0.005918 0.009153 0.066298 0.059683 -0.057310 -0.006595 -0.001159 0.094755 -0.021858 0.023071 -0.001968 0.043890 0.022147 -0.006037 -0.005375 0.014239 0.077494 0.081818 0.000054 0.115960 0.105404 -0.027789 -0.002577 0.064009 -0.039324 0.006577 0.023000 -0.064038 0.021134 -0.058787 0.011353 0.066560 0.010658 0.189591 -0.067039 -0.116538 -0.014128 -0.029329 0.047018 -0.023777 -0.035825 0.005829 0.013030 -0.022777 -0.007482 0.039357 0.046385 -0.007902 -0.029089 -0.051989 1 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 1 2 /kelev/scripts/?C=M;O=A [M;O] 1 3 0.000000
1 18125 GET Dalvik/2.1.0 (Linux; U; Android 11; M2102K1C B... /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... NAN GET /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&... 0.006598 0.992451 0.000763 0.000086 0.000067 0.000035 1.0 Android Android M2102K1C Generic_Android M2102K1C 67 1747 3 2016 1949 269 0.035096 0.120188 0.270092 0.655747 -0.042591 -0.024738 -0.019824 -0.000642 -0.015951 -0.183311 -0.026321 -0.061993 -0.018288 0.027496 -0.012776 -0.002250 0.617250 -0.130247 0.079094 -0.023195 0.003702 -0.030486 -0.013201 -0.021136 0.012303 -0.006278 -0.023727 -0.003599 0.027036 0.006405 -0.005268 0.010234 0.000132 0.530101 -0.277845 0.022171 -0.069403 -0.062703 -0.006813 0.001156 -0.028995 -0.009364 -0.013567 0.015499 -0.009968 -0.032960 -0.000036 -0.008127 -0.000813 0.001447 0.009261 -0.017541 -0.000682 -0.003697 0.007340 -0.010968 0.008710 -0.067023 -0.014870 -0.024112 0.011792 0.004538 0.014397 -0.003550 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... [WL_WK, , web, 0, 1, 210810, 116, 1, 8, fa0d30... 23 1324 268.709026
2 14538 GET Dalvik/2.1.0 (Linux; U; Android 11; M2011K2C B... /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2... NAN GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi... 0.000783 0.999017 0.000138 0.000031 0.000017 0.000013 1.0 Android Android M2011K2C Generic_Android M2011K2C 67 1688 3 1986 1919 298 0.034866 0.170330 0.292857 0.713273 -0.047369 -0.025480 -0.014846 -0.002015 -0.095211 -0.199969 0.001352 0.004859 0.026900 -0.000053 -0.045812 -0.003042 0.662307 -0.134895 0.074788 -0.033836 0.004867 -0.012766 -0.014195 -0.019396 0.016613 -0.006143 -0.018700 -0.012640 0.030395 0.004169 -0.005899 0.005060 0.000137 0.557311 -0.298627 0.021799 -0.072395 -0.035369 -0.009561 -0.020158 -0.030540 -0.019107 -0.011824 0.016220 -0.010370 -0.021592 0.002470 0.001666 -0.004027 -0.000666 0.012006 -0.009133 -0.007882 -0.001795 -0.003188 -0.015516 0.010797 -0.083144 -0.021120 -0.029922 0.020777 0.001687 0.006579 -0.001808 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2... [WL_WK, web, 0, d24c93f6c8de719a00f1676f3a9a53... 29 1154 209.374211
3 7127 GET Dalvik/2.1.0 (Linux; U; Android 10; MI 9 MIUI/... /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... NAN NAN 0.007603 0.991491 0.000725 0.000087 0.000062 0.000033 1.0 Android Android XiaoMi MI 9 XiaoMi MI 9 64 1613 3 3 -61 -1610 0.038503 0.058521 0.184916 0.434441 0.026507 -0.024867 -0.016191 0.021308 0.058906 0.048677 -0.015773 -0.017188 0.012508 0.026593 0.009872 0.001220 0.621003 -0.119104 0.071898 -0.014246 -0.005748 -0.025066 -0.015300 -0.006643 0.011277 -0.011099 -0.026949 -0.013011 0.027294 0.006678 0.006156 0.004784 1.000000 -0.000356 -0.000224 -0.000019 0.000078 0.000016 -0.000132 -0.000012 -0.000029 0.000011 -0.000044 -0.000041 -0.000041 0.000027 -0.000017 -0.000027 -0.000017 -0.000096 0.000005 0.000026 0.000013 0.000016 -0.000005 -0.000005 0.000015 -0.000004 0.000011 0.000021 -0.000004 0.000045 0.000019 -0.000019 1 3 3 3 1 3 2 3 2 3 2 3 3 3 2 3 3 3 /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... [WL_WK, , web, 0, 1, 201209, 116, 1, 8, bbe035... 24 1186 235.820461
4 7 GET Dalvik/2.1.0 (Linux; U; Android 10; ELS-AN00 B... /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty... NAN GET /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid... 0.000529 0.999257 0.000153 0.000020 0.000021 0.000019 1.0 Android Android ELS-AN00 Huawei ELS-AN00 66 1467 3 1704 1638 237 0.023016 0.076053 0.211935 0.431614 -0.013015 -0.028123 0.016032 -0.039645 0.129501 0.388460 -0.012173 -0.016553 -0.012854 -0.049835 -0.037182 -0.024569 0.615644 -0.116622 0.066164 -0.019455 -0.007366 -0.022974 -0.016564 -0.006819 -0.000959 -0.011903 -0.022060 -0.011351 0.002437 -0.008505 -0.008668 -0.004333 0.000129 0.535128 -0.294560 0.024614 -0.084538 -0.049889 -0.019040 -0.000086 -0.029467 -0.026259 -0.010166 0.009809 -0.017834 -0.018439 0.007596 -0.017013 -0.005269 0.002957 0.006733 -0.010658 -0.006394 -0.005429 0.011300 -0.024379 0.006489 -0.061012 -0.019155 -0.021446 0.021441 -0.001876 0.002968 -0.005974 2 5 5 5 3 5 5 5 3 5 3 5 3 5 3 5 3 5 /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty... [v5004, WL_WK, , web, 0, 20220209V0BT5X00, 1, ... 27 972 182.378599
def find_url_filetype(x):
    # Return the first '.xxx' run of lowercase letters in the path, or '__NaN__' if there is none
    try:
        return re.search(r'\.[a-z]+', x).group()
    except AttributeError:  # re.search returned None (no match)
        return '__NaN__'


data['url_path'] = data['url_unquote'].apply(lambda x: urlparse(x)[2])
data['url_filetype'] = data['url_path'].apply(lambda x: find_url_filetype(x))

data['url_path_len'] = data['url_path'].apply(len)
data['url_path_num'] = data['url_path'].apply(lambda x: len(re.findall('/', x)))
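
`re.search(r'\.[a-z]+', x)` returns the first dot followed by lowercase letters anywhere in the path, so a dotted directory name or an uppercase extension can produce an unexpected value. A hedged alternative, assuming the intent is the extension of the last path segment, could use `posixpath.splitext`. The helper name `url_extension` and the sample paths are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def url_extension(url):
    """Return the lowercase extension of the last path segment, or '__NaN__'."""
    # Drop a trailing slash so that '/naizau.jsp/' still yields '.jsp'.
    path = urlparse(url).path.rstrip('/')
    ext = posixpath.splitext(posixpath.basename(path))[1]
    return ext.lower() if ext else '__NaN__'


# Hypothetical paths; the first one would give '.old' with the regex version.
for sample in ['/kelev.old/scripts/index.php', '/naizau.jsp/', '/livemsg?ty=web']:
    print(sample, url_extension(sample))
```

Whether this is an improvement for the model is an open question: the regex version may pick up signal from dotted directory names, so the two variants are worth comparing offline.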

data['ua_short'] = data['user_agent'].apply(lambda x: x.split('/')[0])
data['ua_first'] = data['user_agent'].apply(lambda x: x.split(' ')[0])
# def strs_contains(strs, keyword):
# return True if re.search(keyword, strs, re.IGNORECASE) else False

# data['url_contains_select'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'select'))
# data['url_contains_select_from'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'select.*from'))
# data['url_contains_union'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'union'))
# data['url_contains_where'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'where'))
# data['url_contains_struts2'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'struts2'))
# data['url_contains_alert'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'alert'))
# data['url_contains_sudo'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'sudo'))
# data['url_contains_etc_passwd'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'etc.*passwd'))
# data['url_contains_dot_dot'] = data['url_unquote'].apply(lambda x: strs_contains(x, '%2e%2e%2f'))
# data['url_contains_dot_dot2'] = data['url_unquote'].apply(lambda x: strs_contains(x, '\.\./'))
# data['url_contains_javascript'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'javascript'))
# data['url_contains_shell'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'shell'))
# data['url_contains_java_lang'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'java.lang'))
# Label-encode the categorical columns
for col in tqdm(['method', 'refer', 'browser_family', 'os_family', 'device_family', 'device_brand', 'device_model', 'url_filetype', 'ua_short', 'ua_first']):
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 82.97it/s]
len(data.select_dtypes(include=['int','float']).columns.tolist())
109
col = data.select_dtypes(include=['int','float']).columns.tolist()
data = data[col]
feature_names = [i for i in col if i not in ['id','label']]
train = data[data['label'].notnull()].reset_index(drop = True)
test = data[~data['label'].notnull()].reset_index(drop = True)


x_train = train[feature_names]
y_train = train['label']
x_test = test[feature_names]
x_train.shape
(33037, 107)
def lgb_model(train, target, test, k):
    feats = [f for f in train.columns if f not in ['id', 'label', 'url', 'url_count']]
    # feats=import_cols
    print('Current num of features:', len(feats))

    oof_probs = np.zeros((train.shape[0], 6))
    output_preds = 0
    offline_score = []
    feature_importance_df = pd.DataFrame()
    parameters = {
        'learning_rate': 0.03,
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'metric': 'multi_error',
        'num_class': 6,
        'num_leaves': 31,
        'feature_fraction': 0.6,
        'bagging_fraction': 0.8,  # note: has no effect unless bagging_freq > 0
        'min_data_in_leaf': 15,
        'verbose': -1,
        'nthread': -1,
        'max_depth': 7
    }

    # parameters = {
    #     'learning_rate': 0.1,
    #     'metric': 'multiclass',
    #     'objective': 'multiclass',
    #     'num_classes': 6,
    #     'feature_fraction': 0.75,
    #     'bagging_fraction': 0.75,
    #     'bagging_freq': 2,
    #     'n_jobs': -1,
    #     'seed': 1029,
    #     'max_depth': 10,
    #     'num_leaves': 100,
    #     'lambda_l1': 0.5,
    #     'lambda_l2': 0.8,
    #     'verbose': -1
    # }

    seeds = [2020]
    for seed in seeds:
        folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
        for i, (train_index, test_index) in enumerate(folds.split(train, target)):
            train_y, test_y = target.iloc[train_index], target.iloc[test_index]
            train_X, test_X = train[feats].iloc[train_index, :], train[feats].iloc[test_index, :]

            dtrain = lgb.Dataset(train_X,
                                 label=train_y)
            dval = lgb.Dataset(test_X,
                               label=test_y)
            # Named `model` so that it does not shadow the enclosing function name
            model = lgb.train(
                parameters,
                dtrain,
                num_boost_round=8000,
                valid_sets=[dval],
                # feval = evalerror,
                callbacks=[early_stopping(100), log_evaluation(100)],
            )
            # Out-of-fold probabilities for the held-out rows, accumulated over seeds
            oof_probs[test_index] += model.predict(test_X, num_iteration=model.best_iteration) / len(seeds)
            offline_score.append(model.best_score['valid_0']['multi_error'])
            # Test probabilities averaged over folds and seeds
            output_preds += model.predict(test[feats],
                                          num_iteration=model.best_iteration) / folds.n_splits / len(seeds)
            print(offline_score)
            # feature importance
            fold_importance_df = pd.DataFrame()
            fold_importance_df["feature"] = feats
            fold_importance_df["importance"] = model.feature_importance(importance_type='gain')
            fold_importance_df["fold"] = i + 1
            feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    # The values printed here are the mean/std of multi_error (lower is better), although the label says "AUC"
    print('OOF-MEAN-AUC:%.6f, OOF-STD-AUC:%.6f' % (np.mean(offline_score), np.std(offline_score)))
    print('feature importance:')
    print(feature_importance_df.groupby(['feature'])['importance'].mean().sort_values(ascending=False).head(50))

    return output_preds, oof_probs, np.mean(offline_score), feature_importance_df
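
The function above follows the usual out-of-fold (OOF) pattern: each training row receives a prediction from the one fold model that did not see it, and the test prediction is the average over all fold models. The standard-library sketch below shows only that bookkeeping. It is an illustration under stated assumptions: the "model" is a class-frequency prior rather than LightGBM, the folds are assigned round-robin rather than with `StratifiedKFold`, and the names `fit_prior` and `oof_and_test_probs` are hypothetical.

```python
from collections import Counter


def fit_prior(labels, n_classes):
    """Stand-in 'model': the class frequencies of the labels it was fitted on."""
    counts = Counter(labels)
    return [counts.get(c, 0) / len(labels) for c in range(n_classes)]


def oof_and_test_probs(labels, n_test, n_classes, k):
    n = len(labels)
    oof = [[0.0] * n_classes for _ in range(n)]
    test = [[0.0] * n_classes for _ in range(n_test)]
    for fold in range(k):
        # Round-robin split: row i is validated in fold i % k.
        valid_idx = [i for i in range(n) if i % k == fold]
        train_labels = [labels[i] for i in range(n) if i % k != fold]
        probs = fit_prior(train_labels, n_classes)
        # Each training row is predicted only by the model that never saw it.
        for i in valid_idx:
            oof[i] = list(probs)
        # Test predictions are averaged over the k fold models.
        for row in test:
            for c in range(n_classes):
                row[c] += probs[c] / k
    return oof, test


labels = [0, 0, 1, 1, 0, 1]  # hypothetical toy labels
oof, test = oof_and_test_probs(labels, n_test=2, n_classes=2, k=3)
print(oof[0], test[0])
```

Since no row is ever scored by a model trained on it, metrics computed on the OOF matrix give an honest offline estimate, and the same matrix can be reused as input features for a second-stage model.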

# feature_names = list(
# filter(
# lambda x: x not in ['id','label','url', 'url_count','url_query'],
# train.columns))
print('Start training on train')
lgb_preds, lgb_oof, lgb_score, feature_importance_df = lgb_model(train=train[feature_names],
                                                                 target=train['label'],
                                                                 test=test[feature_names], k=5)
Start training on train
Current num of features: 107
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.0145278
[200]	valid_0's multi_error: 0.0136199
[300]	valid_0's multi_error: 0.0134685
Early stopping, best iteration is:
[203]	valid_0's multi_error: 0.0133172
[0.013317191283292978]
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.0119552
Early stopping, best iteration is:
[63]	valid_0's multi_error: 0.0116525
[0.013317191283292978, 0.011652542372881356]
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.011503
Early stopping, best iteration is:
[92]	valid_0's multi_error: 0.011503
[0.013317191283292978, 0.011652542372881356, 0.011502951415165732]
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.011503
Early stopping, best iteration is:
[55]	valid_0's multi_error: 0.0107462
[0.013317191283292978, 0.011652542372881356, 0.011502951415165732, 0.010746178295746934]
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.011503
Early stopping, best iteration is:
[21]	valid_0's multi_error: 0.0112002
[0.013317191283292978, 0.011652542372881356, 0.011502951415165732, 0.010746178295746934, 0.011200242167398214]
OOF-MEAN-AUC:0.011684, OOF-STD-AUC:0.000873
feature importance:
feature
2                           222576.439984
1                           211736.666876
0                           147032.174178
4                           108800.598945
3                            86021.756831
browser_family               62270.501839
5                            39814.736140
body_tfidf_0                 27317.752312
body_tfidf_1                 23303.336642
url_name_tfidf_2             16427.787706
url_name_tfidf_14            15618.236285
url_name_tfidf_3             14344.577968
user_agent_len                9418.270741
body_user_agent_len_diff      9103.063841
user_agent_name_tfidf_2       8463.647873
url_query_max_len             6510.758108
user_agent_name_tfidf_5       4018.171339
body_tfidf_3                  3672.676081
body_tfidf_23                 3355.861338
user_agent_name_tfidf_4       3142.178290
url_name_tfidf_4              3127.492671
url_name_tfidf_13             2886.949569
url_name_tfidf_1              2849.448233
url_name_tfidf_5              2676.723486
user_agent_name_tfidf_7       2330.854639
user_agent_name_tfidf_14      2310.199494
refer_len                     2145.159362
user_agent_name_tfidf_3       2103.610035
body_tfidf_10                 1986.768004
url_name_tfidf_6              1866.483835
body_url_len_diff             1849.085455
body_tfidf_6                  1846.585181
body_len                      1821.060478
user_agent_name_tfidf_0       1743.131918
id_method_count               1659.460621
user_agent_name_tfidf_13      1627.554725
body_tfidf_5                  1539.584926
url_name_tfidf_8              1485.817275
url_name_tfidf_9              1449.335372
url_name_tfidf_12             1429.778973
url_name_tfidf_10             1413.196746
url_name_tfidf_11             1296.268733
user_agent_name_tfidf_9       1290.671890
url_len                       1284.274063
body_tfidf_12                 1279.712368
body_tfidf_9                  1209.730761
body_tfidf_8                  1161.935181
url_name_tfidf_15             1158.114828
url_name_tfidf_7               998.550187
url_name_tfidf_0               998.193206
Name: importance, dtype: float64
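
The log above prints its summary under the label `AUC`, but the numbers are the per-fold `multi_error`, that is, the fraction of misclassified validation rows. A small worked check, using the five fold errors copied from the log, recovers the printed mean and standard deviation and shows that one minus the mean error approximates the out-of-fold accuracy. The variable names are illustrative.

```python
from statistics import mean, pstdev

# Per-fold multi_error values copied from the training log above.
fold_errors = [
    0.013317191283292978,
    0.011652542372881356,
    0.011502951415165732,
    0.010746178295746934,
    0.011200242167398214,
]

mean_error = mean(fold_errors)
std_error = pstdev(fold_errors)  # population standard deviation, like np.std's default
approx_accuracy = 1 - mean_error

print(f'{mean_error:.6f} {std_error:.6f} {approx_accuracy:.4f}')
```

The result is close to the out-of-fold accuracy reported further down (0.98832); the small difference presumably arises because the folds are not all exactly the same size.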
# xgb_params = {'n_estimators': 10000,
# 'learning_rate': 0.03689407512484644,
# 'max_depth': 8,
# 'objective': 'multi:softproba',
# 'colsample_bytree': 0.3723914688159835,
# 'subsample': 0.780714581166012,
# 'eval_metric': 'mlogloss',
# 'gamma': 0,
# 'nthread': 1,
# 'reg_lambda': 50.0,
# 'random_state': 42}

# cat_params = {'iterations': 8000,
# 'learning_rate': 0.03429054860458741,
# 'reg_lambda': 0.3242286463210283,
# 'subsample': 0.9433911589913944,
# 'random_strength': 22.4849972385133,
# 'depth': 8,
# 'thread_count': 1,
# # 'min_data_in_leaf': 4,
# 'leaf_estimation_iterations': 8,
# 'task_type': "CPU",
# 'bootstrap_type': 'Bernoulli',
# 'verbose': 50,
# 'early_stopping_rounds': 50,
# # 'eval_metric': 'AUC',
# 'loss_function':'MultiClass'
# }
# # lgb = LGBMClassifier(**lgb_params)
# xgb = XGBClassifier(**xgb_params)
# cat = CatBoostClassifier(**cat_params)


# # In[18]:


# def get_oof(feats, target, test, kfold, clf):
# oof_preds = np.zeros((feats.shape[0],6))
# sub_preds = np.zeros((test.shape[0],6))
# for i, (train_idx, valid_idx) in enumerate(kfold.split(feats, target)):
# train_X, train_y = feats.loc[train_idx], target.loc[train_idx]
# valid_X, valid_y = feats.loc[valid_idx], target.loc[valid_idx]

# clf.fit(train_X, train_y, eval_set=[(valid_X, valid_y)], verbose=100, early_stopping_rounds=50, )
# oof_preds[valid_idx] = clf.predict_proba(valid_X)
# sub_preds += clf.predict_proba(test)
# del train_X, train_y, valid_X, valid_y

# evalution_result = accuracy_score(target, np.argmax(oof_preds,axis=1))
# print('*' * 10)
# print('roc auc score:', evalution_result)
# print('*' * 20)
# sub_preds_result = sub_preds / kfold.n_splits
# return oof_preds, sub_preds_result


# kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
# # start from 45
# oof_preds_2, sub_preds_2 = get_oof(train[feature_names], train['label'], test[feature_names], kfold, xgb)
# oof_preds_3, sub_preds_3 = get_oof(train[feature_names], train['label'], test[feature_names], kfold, cat)
sub['predict']=np.argmax(lgb_preds,axis=1)
sub['predict'].value_counts()
2    855
1    828
0    804
3    666
4    447
5    400
Name: predict, dtype: int64
# sub.to_csv('E:/data/DF/Web攻击检测与分类识别/res/9-8-1.csv')
accuracy_score(train['label'],np.argmax(lgb_oof,axis=1))
0.9883161303992494
# lgb (y_true first, then y_pred; macro F1 gives the same value either way)
f1_score(train['label'], np.argmax(lgb_oof, axis=1), average='macro')

# xgb
# f1_score(train['label'], np.argmax(oof_preds_2, axis=1), average='macro')

# cat
# f1_score(train['label'], np.argmax(oof_preds_3, axis=1), average='macro')
0.9650199644033531
print(classification_report(train['label'],np.argmax(lgb_oof,axis=1)))
              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      6489
         1.0       0.99      0.99      0.99     14038
         2.0       0.99      0.99      0.99      9939
         3.0       0.94      0.93      0.94      1215
         4.0       0.94      0.87      0.90       697
         5.0       0.97      0.98      0.98       659

    accuracy                           0.99     33037
   macro avg       0.97      0.96      0.97     33037
weighted avg       0.99      0.99      0.99     33037