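Web Attack Detection and Classification: Solution Notes

Posted on 2022-12-05, filed under Competitions

Top-8 solution notes for the CCF BDCI Web Attack Detection and Classification competition

Competition page: https://www.datafountain.cn/competitions/596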

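Background:

A business platform captures more than 200 million web attacks per month on average, covering common injection attacks, code execution, and other types. Traditional threat detection matches traffic against rules derived from known attack signatures, so it cannot detect unknown vulnerabilities or attack techniques. Quickly and accurately identifying unknown threats, and assigning each attack to the correct class, is essential for improving web attack detection. Using machine learning and deep learning to identify and classify attack payloads has become a promising approach to this problem, and it helps advance the research and application of AI in threat detection and analysis.

Task:

Teams analyze the provided training set and build AI models through feature engineering, machine learning, deep learning, and similar methods, so that every sample is classified correctly and quickly, while continually improving precision and recall. Once the models are tuned and stable, an unlabeled test set is used to evaluate each team's classification results, and model quality is judged by how accurately the test set is classified.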

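Final defense:

In the final defense, the judges score each entry on innovation, usability, and related criteria. The final ranking combines the preliminary-round result with these criteria: final score = reproduced preliminary score * 80% + final-round score * 20%.
Note that the defense focuses on the following:
(1) the model's ability to discover unknown attack types
(2) the model's time complexity
(3) other innovations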
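
As a quick illustration of the weighting formula above, the sketch below combines two scores with the 80/20 split. The function name and the sample scores are hypothetical and only show the arithmetic; both scores are assumed to be on the same 0-1 scale.

```python
def final_score(prelim_score: float, defense_score: float) -> float:
    """Weighted final score: 80% reproduced preliminary score, 20% defense score."""
    return 0.8 * prelim_score + 0.2 * defense_score


# Hypothetical scores, both assumed to be on the same 0-1 scale.
print(final_score(0.96, 0.90))
```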

Data overview

The training set has 6 distinct labels and about 35,000 samples in total. Its fields are:
●　ID: sample ID
●　label: attack type ID
●　others: HTTP request content

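Evaluation

During the evaluation period an unlabeled test set is provided. Teams submit their model's classification result for every row by adding a predict field (the predicted class), whose meaning matches the label field of the training set.
The evaluation program compares predict against the ground-truth label and computes precision, recall, and F1. The final ranking is based on F1.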

Label	Predicted positive	Predicted negative
Actual positive	TP	FN
Actual negative	FP	TN

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * Precision * Recall / (Precision + Recall)
Note: the F1 used here is macro F1, i.e. the unweighted mean of the per-class F1 scores.
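
The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the official scorer: `macro_f1` is a hypothetical helper, the toy labels are made up, and a class with no predictions or no true samples is assumed to score 0.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, computed from TP/FP/FN counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the true class was t
            fn[t] += 1  # missed a sample of class t

    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy labels (made up for illustration).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(y_true, y_pred), 4))
```

Because every class counts equally, a small class that is classified poorly pulls macro F1 down much more than it pulls accuracy down.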

The code is as follows

# lgb, tfidf+refer features removed: offline 0.9644, online 0.95997
# lgb, tfidf+refer features kept: offline 0.96432, online 0.95828
# xgb: offline 0.96494
# cat: offline 0.9650, online 0.9613

import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from lightgbm import early_stopping
from lightgbm import log_evaluation
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score,f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
from user_agents import parse
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from urllib.parse import quote, unquote, urlparse
import re
import glob  

pd.set_option('display.max_columns', None)

# Parse a User-Agent string (e.g. an iPhone UA) into browser, OS, and device fields

def get_ua(row):
    # user_agents.parse returns an object exposing .browser, .os, and .device
    user_agent = parse(row['user_agent'])

    browser_family = str(user_agent.browser.family)
    os_family = str(user_agent.os.family)
    device_family = str(user_agent.device.family)
    device_brand = str(user_agent.device.brand)
    device_model = str(user_agent.device.model)
    return browser_family, os_family, device_family, device_brand, device_model

# Class probabilities for the test set, loaded from deberta-v3-large_probs.npy
prob = np.load('E://data//DF//Web攻击检测与分类识别//large/deberta-v3-large_probs.npy')
prob.shape

(4000, 6)

train=pd.read_pickle('E:\\data\\DF\\Web攻击检测与分类识别/large/oof_df.pkl')     # 33037 rows × 15 columns
sub = pd.read_csv('E:\\data\\DF\\Web攻击检测与分类识别\\submit_example (10).csv')

test=pd.read_csv('E:\\data\\DF\\Web攻击检测与分类识别/test.csv')
train.head()

	id	method	user_agent	url	refer	body	label	text	0	1	2	3	4	5
0	13429	GET	'\|\|(select 1 from (select pg_sleep(8))x)\|\|'	/kelev/scripts/?C=M%3BO%3DA	NAN	GET /kelev/scripts/?C=M%3BO%3DA HTTP/1.1 Accep...	1	method:GET[SEP]user_agent:'\|\|(select 1 from (s...	0.000238	0.999110	0.000445	0.000058	0.000040	0.000110
1	18125	GET	Dalvik/2.1.0 (Linux; U; Android 11; M2102K1C B...	/livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...	NAN	GET /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&...	1	method:GET[SEP]user_agent:Dalvik/2.1.0 (Linux;...	0.006598	0.992451	0.000763	0.000086	0.000067	0.000035
2	14538	GET	Dalvik/2.1.0 (Linux; U; Android 11; M2011K2C B...	/livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2...	NAN	GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi...	1	method:GET[SEP]user_agent:Dalvik/2.1.0 (Linux;...	0.000783	0.999017	0.000138	0.000031	0.000017	0.000013
3	7127	GET	Dalvik/2.1.0 (Linux; U; Android 10; MI 9 MIUI/...	/livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...	NAN	NAN	1	method:GET[SEP]user_agent:Dalvik/2.1.0 (Linux;...	0.007603	0.991491	0.000725	0.000087	0.000062	0.000033
4	7	GET	Dalvik/2.1.0 (Linux; U; Android 10; ELS-AN00 B...	/livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty...	NAN	GET /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid...	1	method:GET[SEP]user_agent:Dalvik/2.1.0 (Linux;...	0.000529	0.999257	0.000153	0.000020	0.000021	0.000019

test.head()

	id	method	user_agent	url	refer	body
0	0	GET	Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...	/demo/aisec/upload.php?act='%7C%7C(select+1+fr...	http://demo.aisec.cn/demo/aisec/upload.php?t=0...	GET /demo/aisec/upload.php?act='%7C%7C(select+...
1	1	GET	Dalvik/2.1.0 (Linux; U; Android 11; M2102J2SC ...	/livemsg?ad_type=WL_WK&ty=web&pu=1&openudid=5f...	NaN	GET /livemsg?ad_type=WL_WK&ty=web&pu=1&openudi...
2	2	GET	Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/2...	/create_user/?username=%3Cscript%3Ealert(docum...	NaN	NaN
3	3	GET	NaN	/mmsns/WeDwicXmkOl4kjKsBycicI0H3q41r6syFFvu46h...	NaN	NaN
4	4	PUT	Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/2...	/naizau.jsp/	NaN	GET /login HTTP/1.1 Host: 111.160.211.18:8088 ...

train.isnull().sum()

id            0
method        0
user_agent    0
url           0
refer         0
body          0
label         0
text          0
fold          0
0             0
1             0
2             0
3             0
4             0
5             0
dtype: int64

train.columns

Index(['id', 'method', 'user_agent', 'url', 'refer', 'body', 'label', 'text',
       'fold', '0', '1', '2', '3', '4', '5'],
      dtype='object')

test[['0', '1', '2', '3', '4', '5']]=prob


# test['label']=pd.read_csv('models/v4/lgb.csv')['predict']
train=train.drop('text',axis=1)
train=train.drop('fold',axis=1)

train=train[['id', 'method', 'user_agent', 'url', 'refer', 'body', '0', '1', '2', '3', '4', '5', 'label']]

print("train.shape",train.shape)
print("test.shape",test.shape)

train.shape (33037, 13)
test.shape (4000, 12)

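Data Analysis

The training set has 6 distinct labels and about 35,000 samples in total. In the raw data the fields are:
●　lable: attack type ID
●　others: HTTP request content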

# Inspect the training set columns
train.columns

Index(['id', 'method', 'user_agent', 'url', 'refer', 'body', '0', '1', '2',
       '3', '4', '5', 'label'],
      dtype='object')

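The raw column name 'lable' is a misspelling, so rename it to 'label'. In the frame loaded here the column is already called 'label', as the output above shows, so the call leaves it unchanged.

train=train.rename(columns={'lable':'label'})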

train.dtypes

id              int64
method         object
user_agent     object
url            object
refer          object
body           object
0             float32
1             float32
2             float32
3             float32
4             float32
5             float32
label           int64
dtype: object

# Count samples per label
train['label'].value_counts()

1    14038
2     9939
0     6489
3     1215
4      697
5      659
Name: label, dtype: int64

train['label'].value_counts().plot(kind='bar')
plt.show()

data=pd.concat([train,test],axis=0).reset_index(drop=True)
data.nunique()

id            19497
method           21
user_agent     1087
url           36613
refer           941
body          22380
0             35915
1             31618
2             36061
3             36961
4             36965
5             36959
label             6
dtype: int64

# Fill missing values with the placeholder string 'NAN'
data['user_agent']=data['user_agent'].fillna('NAN')
data['refer']=data['refer'].fillna('NAN')
data['body']=data['body'].fillna('NAN')
data['url']=data['url'].fillna('NAN')

# Extract user_agent features
ua_cols=['browser_family', 'os_family', 'device_family','device_brand','device_model']
data[ua_cols] = data.apply(get_ua, axis=1, result_type="expand")
data.head()

	id	method	user_agent	url	refer	body	0	1	2	3	4	5	label	browser_family	os_family	device_family	device_brand	device_model
0	13429	GET	'\|\|(select 1 from (select pg_sleep(8))x)\|\|'	/kelev/scripts/?C=M%3BO%3DA	NAN	GET /kelev/scripts/?C=M%3BO%3DA HTTP/1.1 Accep...	0.000238	0.999110	0.000445	0.000058	0.000040	0.000110	1.0	Other	Other	Other	None	None
1	18125	GET	Dalvik/2.1.0 (Linux; U; Android 11; M2102K1C B...	/livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...	NAN	GET /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&...	0.006598	0.992451	0.000763	0.000086	0.000067	0.000035	1.0	Android	Android	M2102K1C	Generic_Android	M2102K1C
2	14538	GET	Dalvik/2.1.0 (Linux; U; Android 11; M2011K2C B...	/livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2...	NAN	GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi...	0.000783	0.999017	0.000138	0.000031	0.000017	0.000013	1.0	Android	Android	M2011K2C	Generic_Android	M2011K2C
3	7127	GET	Dalvik/2.1.0 (Linux; U; Android 10; MI 9 MIUI/...	/livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...	NAN	NAN	0.007603	0.991491	0.000725	0.000087	0.000062	0.000033	1.0	Android	Android	XiaoMi MI 9	XiaoMi	MI 9
4	7	GET	Dalvik/2.1.0 (Linux; U; Android 10; ELS-AN00 B...	/livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty...	NAN	GET /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid...	0.000529	0.999257	0.000153	0.000020	0.000021	0.000019	1.0	Android	Android	ELS-AN00	Huawei	ELS-AN00

Basic Features

import urllib.parse
import urllib

data['user_agent_len']=data['user_agent'].apply(lambda x:len(x))
data['url_len']=data['url'].apply(lambda x:len(x))
data['refer_len']=data['refer'].apply(lambda x:len(x))
data['body_len']=data['body'].apply(lambda x:len(x))
data['body_user_agent_len_diff']=data['body_len']-data['user_agent_len']
data['body_url_len_diff']=data['body_len']-data['url_len']

# Character-level TF-IDF on the user_agent column, then truncated SVD down to 16 dimensions
texts=data['user_agent'].values.tolist()

n_components = 16      # target number of dimensions
tf = TfidfVectorizer(min_df= 1, max_df=0.5,analyzer = 'char_wb', ngram_range = (1,3))     # ngram_range = (2,5)
X = tf.fit_transform(texts)
svd = TruncatedSVD(n_components=n_components,
                   random_state=42)
X_svd = svd.fit_transform(X)
df_tfidf = pd.DataFrame(X_svd)
df_tfidf.columns = [f'user_agent_name_tfidf_{i}' for i in range(n_components)]
data=pd.concat([data,df_tfidf],axis=1)

texts=data['url'].values.tolist()

n_components = 16
tf = TfidfVectorizer(min_df= 1, max_df=0.5,analyzer = 'char_wb', ngram_range = (1,3))
X = tf.fit_transform(texts)
svd = TruncatedSVD(n_components=n_components,
                   random_state=42)
X_svd = svd.fit_transform(X)
df_tfidf = pd.DataFrame(X_svd)
df_tfidf.columns = [f'url_name_tfidf_{i}' for i in range(n_components)]
data=pd.concat([data,df_tfidf],axis=1)

# texts=data['refer'].values.tolist()

# n_components = 16
# tf = TfidfVectorizer(min_df= 1, max_df=0.5,analyzer = 'char_wb', ngram_range = (1,3))
# X = tf.fit_transform(texts)
# svd = TruncatedSVD(n_components=n_components,
#                    random_state=42)
# X_svd = svd.fit_transform(X)
# df_tfidf = pd.DataFrame(X_svd)
# df_tfidf.columns = [f'refer_tfidf_{i}' for i in range(n_components)]
# data=pd.concat([data,df_tfidf],axis=1)

texts=data['body'].values.tolist()

n_components = 32
tf = TfidfVectorizer(min_df= 1, max_df=0.5,analyzer = 'char_wb', ngram_range = (1,3))
X = tf.fit_transform(texts)
svd = TruncatedSVD(n_components=n_components,
                   random_state=42)
X_svd = svd.fit_transform(X)
df_tfidf = pd.DataFrame(X_svd)
df_tfidf.columns = [f'body_tfidf_{i}' for i in range(n_components)]
data=pd.concat([data,df_tfidf],axis=1)
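
To make the `analyzer='char_wb'`, `ngram_range=(1,3)` setting more concrete, here is a pure-Python approximation of how such character n-grams are formed: each whitespace-separated token is lowercased, padded with one space on each side, and sliced into n-grams that never cross token boundaries. `char_wb_ngrams` is a hypothetical helper written for illustration. It is not scikit-learn's own code and it ignores details such as accent stripping and tokens shorter than the n-gram size.

```python
def char_wb_ngrams(text, min_n=1, max_n=3):
    """Approximate word-boundary character n-grams for one document."""
    grams = []
    for word in text.lower().split():
        padded = f" {word} "  # pad so that n-grams can mark word edges
        for n in range(min_n, max_n + 1):
            if n > len(padded):
                break
            for start in range(len(padded) - n + 1):
                grams.append(padded[start:start + n])
    return grams


print(char_wb_ngrams("GET /a", min_n=2, max_n=3))
```

On HTTP fields this kind of tokenization tends to pick up short fragments of payloads and paths as separate features, which is presumably why it is applied to `user_agent`, `url`, and `body` before the SVD step.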


for f in ['method', 'url','refer', 'body','browser_family','os_family','device_family','device_brand','device_model']:    # refer
    data[f'id_{f}_nunique'] = data.groupby(['id'])[f].transform('nunique')
    data[f'id_{f}_count'] = data.groupby(['id'])[f].transform('count')

re.split('[=&]', urlparse(data['url'][0])[4])

['C', 'M%3BO%3DA']

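def get_url_query(s):
    # Split the query string on '=' and '&' and keep the odd-indexed pieces,
    # i.e. the parameter values (assumes every parameter has the form key=value)
    li = re.split('[=&]', urlparse(s)[4])
    return [li[i] for i in range(len(li)) if i % 2 == 1]


def find_max_str_length(x):
    # Length of the longest string in x, or 0 if x is empty
    li = [len(i) for i in x]
    return max(li) if len(li) > 0 else 0


def find_str_length_std(x):
    # Population standard deviation of the string lengths, or -1 if x is empty
    li = [len(i) for i in x]
    return np.std(li) if len(li) > 0 else -1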


data['url_unquote'] = data['url'].apply(unquote)
data['url_query'] = data['url_unquote'].apply(lambda x: get_url_query(x))
data['url_query_num'] = data['url_query'].apply(len)
data['url_query_max_len'] = data['url_query'].apply(find_max_str_length)
data['url_query_len_std'] = data['url_query'].apply(find_str_length_std)
data['url'].apply(unquote)

0                                  /kelev/scripts/?C=M;O=A
1        /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...
2        /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2...
3        /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...
4        /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty...
                               ...                        
37032    /livemsg?ad_type=WL_WK&ty=web&pu=1&openudid=64...
37033                                          /runtime.js
37034                                     /query?493521812
37035              /stats.php?rand=JtmT4wBtrpNy5RJnNX9wCUo
37036    /api/gateway.do?method=qihoo.sdk.user.mobile.l...
Name: url, Length: 37037, dtype: object
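
One caveat of the query-value extraction above: it splits the already-decoded URL on every '=' and '&', so an encoded separator inside a value shifts the odd/even alignment (for the first row the value becomes `M;O` and the trailing `A` is dropped). A possible alternative, sketched below with the standard library only, parses the raw URL first and decodes each value afterwards. `query_values` and `query_stats` are hypothetical names, and this is not the code that produced the scores reported in this post.

```python
from statistics import pstdev
from urllib.parse import unquote, urlparse


def query_values(raw_url):
    """Return the decoded values of all key=value pairs in the query string."""
    query = urlparse(raw_url).query  # parse BEFORE percent-decoding
    values = []
    for pair in query.split("&"):
        key, sep, value = pair.partition("=")
        if sep:  # skip fragments that contain no '=' at all
            values.append(unquote(value))
    return values


def query_stats(raw_url):
    """Number of values, longest value length, and std of value lengths."""
    lengths = [len(v) for v in query_values(raw_url)]
    if not lengths:
        return 0, 0, -1.0  # same fallbacks as the helpers above
    return len(lengths), max(lengths), pstdev(lengths)


print(query_values("/kelev/scripts/?C=M%3BO%3DA"))
print(query_stats("/livemsg?ad_type=WL_WK&oadid=&ty=web"))
```

`pstdev` is the population standard deviation, which matches the default behaviour of `np.std`.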

data.head()

	id	method	user_agent	url	refer	body	0	1	2	3	4	5	label	browser_family	os_family	device_family	device_brand	device_model	user_agent_len	url_len	refer_len	body_len	body_user_agent_len_diff	body_url_len_diff	user_agent_name_tfidf_0	user_agent_name_tfidf_1	user_agent_name_tfidf_2	user_agent_name_tfidf_3	user_agent_name_tfidf_4	user_agent_name_tfidf_5	user_agent_name_tfidf_6	user_agent_name_tfidf_7	user_agent_name_tfidf_8	user_agent_name_tfidf_9	user_agent_name_tfidf_10	user_agent_name_tfidf_11	user_agent_name_tfidf_12	user_agent_name_tfidf_13	user_agent_name_tfidf_14	user_agent_name_tfidf_15	url_name_tfidf_0	url_name_tfidf_1	url_name_tfidf_2	url_name_tfidf_3	url_name_tfidf_4	url_name_tfidf_5	url_name_tfidf_6	url_name_tfidf_7	url_name_tfidf_8	url_name_tfidf_9	url_name_tfidf_10	url_name_tfidf_11	url_name_tfidf_12	url_name_tfidf_13	url_name_tfidf_14	url_name_tfidf_15	body_tfidf_0	body_tfidf_1	body_tfidf_2	body_tfidf_3	body_tfidf_4	body_tfidf_5	body_tfidf_6	body_tfidf_7	body_tfidf_8	body_tfidf_9	body_tfidf_10	body_tfidf_11	body_tfidf_12	body_tfidf_13	body_tfidf_14	body_tfidf_15	body_tfidf_16	body_tfidf_17	body_tfidf_18	body_tfidf_19	body_tfidf_20	body_tfidf_21	body_tfidf_22	body_tfidf_23	body_tfidf_24	body_tfidf_25	body_tfidf_26	body_tfidf_27	body_tfidf_28	body_tfidf_29	body_tfidf_30	body_tfidf_31	id_method_nunique	id_method_count	id_url_nunique	id_url_count	id_refer_nunique	id_refer_count	id_body_nunique	id_body_count	id_browser_family_nunique	id_browser_family_count	id_os_family_nunique	id_os_family_count	id_device_family_nunique	id_device_family_count	id_device_brand_nunique	id_device_brand_count	id_device_model_nunique	id_device_model_count	url_unquote	url_query	url_query_num	url_query_max_len	url_query_len_std
0	13429	GET	'\|\|(select 1 from (select pg_sleep(8))x)\|\|'	/kelev/scripts/?C=M%3BO%3DA	NAN	GET /kelev/scripts/?C=M%3BO%3DA HTTP/1.1 Accep...	0.000238	0.999110	0.000445	0.000058	0.000040	0.000110	1.0	Other	Other	Other	None	None	43	27	3	212	169	185	0.010070	0.009456	0.001205	0.003217	0.021082	0.000999	-0.000847	-0.002107	0.008443	0.002747	0.023997	-0.003526	-0.001894	-0.013000	-0.005918	0.009153	0.066298	0.059683	-0.057310	-0.006595	-0.001159	0.094755	-0.021858	0.023071	-0.001968	0.043890	0.022147	-0.006037	-0.005375	0.014239	0.077494	0.081818	0.000054	0.115960	0.105404	-0.027789	-0.002577	0.064009	-0.039324	0.006577	0.023000	-0.064038	0.021134	-0.058787	0.011353	0.066560	0.010658	0.189591	-0.067039	-0.116538	-0.014128	-0.029329	0.047018	-0.023777	-0.035825	0.005829	0.013030	-0.022777	-0.007482	0.039357	0.046385	-0.007902	-0.029089	-0.051989	1	2	2	2	1	2	2	2	2	2	2	2	1	2	1	2	1	2	/kelev/scripts/?C=M;O=A	[M;O]	1	3	0.000000
1	18125	GET	Dalvik/2.1.0 (Linux; U; Android 11; M2102K1C B...	/livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...	NAN	GET /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&...	0.006598	0.992451	0.000763	0.000086	0.000067	0.000035	1.0	Android	Android	M2102K1C	Generic_Android	M2102K1C	67	1747	3	2016	1949	269	0.035096	0.120188	0.270092	0.655747	-0.042591	-0.024738	-0.019824	-0.000642	-0.015951	-0.183311	-0.026321	-0.061993	-0.018288	0.027496	-0.012776	-0.002250	0.617250	-0.130247	0.079094	-0.023195	0.003702	-0.030486	-0.013201	-0.021136	0.012303	-0.006278	-0.023727	-0.003599	0.027036	0.006405	-0.005268	0.010234	0.000132	0.530101	-0.277845	0.022171	-0.069403	-0.062703	-0.006813	0.001156	-0.028995	-0.009364	-0.013567	0.015499	-0.009968	-0.032960	-0.000036	-0.008127	-0.000813	0.001447	0.009261	-0.017541	-0.000682	-0.003697	0.007340	-0.010968	0.008710	-0.067023	-0.014870	-0.024112	0.011792	0.004538	0.014397	-0.003550	1	2	2	2	1	2	2	2	2	2	2	2	2	2	2	2	2	2	/livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...	[WL_WK, , web, 0, 1, 210810, 116, 1, 8, fa0d30...	23	1324	268.709026
2	14538	GET	Dalvik/2.1.0 (Linux; U; Android 11; M2011K2C B...	/livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2...	NAN	GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi...	0.000783	0.999017	0.000138	0.000031	0.000017	0.000013	1.0	Android	Android	M2011K2C	Generic_Android	M2011K2C	67	1688	3	1986	1919	298	0.034866	0.170330	0.292857	0.713273	-0.047369	-0.025480	-0.014846	-0.002015	-0.095211	-0.199969	0.001352	0.004859	0.026900	-0.000053	-0.045812	-0.003042	0.662307	-0.134895	0.074788	-0.033836	0.004867	-0.012766	-0.014195	-0.019396	0.016613	-0.006143	-0.018700	-0.012640	0.030395	0.004169	-0.005899	0.005060	0.000137	0.557311	-0.298627	0.021799	-0.072395	-0.035369	-0.009561	-0.020158	-0.030540	-0.019107	-0.011824	0.016220	-0.010370	-0.021592	0.002470	0.001666	-0.004027	-0.000666	0.012006	-0.009133	-0.007882	-0.001795	-0.003188	-0.015516	0.010797	-0.083144	-0.021120	-0.029922	0.020777	0.001687	0.006579	-0.001808	1	2	2	2	1	2	2	2	2	2	2	2	2	2	2	2	2	2	/livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d2...	[WL_WK, web, 0, d24c93f6c8de719a00f1676f3a9a53...	29	1154	209.374211
3	7127	GET	Dalvik/2.1.0 (Linux; U; Android 10; MI 9 MIUI/...	/livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...	NAN	NAN	0.007603	0.991491	0.000725	0.000087	0.000062	0.000033	1.0	Android	Android	XiaoMi MI 9	XiaoMi	MI 9	64	1613	3	3	-61	-1610	0.038503	0.058521	0.184916	0.434441	0.026507	-0.024867	-0.016191	0.021308	0.058906	0.048677	-0.015773	-0.017188	0.012508	0.026593	0.009872	0.001220	0.621003	-0.119104	0.071898	-0.014246	-0.005748	-0.025066	-0.015300	-0.006643	0.011277	-0.011099	-0.026949	-0.013011	0.027294	0.006678	0.006156	0.004784	1.000000	-0.000356	-0.000224	-0.000019	0.000078	0.000016	-0.000132	-0.000012	-0.000029	0.000011	-0.000044	-0.000041	-0.000041	0.000027	-0.000017	-0.000027	-0.000017	-0.000096	0.000005	0.000026	0.000013	0.000016	-0.000005	-0.000005	0.000015	-0.000004	0.000011	0.000021	-0.000004	0.000045	0.000019	-0.000019	1	3	3	3	1	3	2	3	2	3	2	3	3	3	2	3	3	3	/livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap...	[WL_WK, , web, 0, 1, 201209, 116, 1, 8, bbe035...	24	1186	235.820461
4	7	GET	Dalvik/2.1.0 (Linux; U; Android 10; ELS-AN00 B...	/livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty...	NAN	GET /livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid...	0.000529	0.999257	0.000153	0.000020	0.000021	0.000019	1.0	Android	Android	ELS-AN00	Huawei	ELS-AN00	66	1467	3	1704	1638	237	0.023016	0.076053	0.211935	0.431614	-0.013015	-0.028123	0.016032	-0.039645	0.129501	0.388460	-0.012173	-0.016553	-0.012854	-0.049835	-0.037182	-0.024569	0.615644	-0.116622	0.066164	-0.019455	-0.007366	-0.022974	-0.016564	-0.006819	-0.000959	-0.011903	-0.022060	-0.011351	0.002437	-0.008505	-0.008668	-0.004333	0.000129	0.535128	-0.294560	0.024614	-0.084538	-0.049889	-0.019040	-0.000086	-0.029467	-0.026259	-0.010166	0.009809	-0.017834	-0.018439	0.007596	-0.017013	-0.005269	0.002957	0.006733	-0.010658	-0.006394	-0.005429	0.011300	-0.024379	0.006489	-0.061012	-0.019155	-0.021446	0.021441	-0.001876	0.002968	-0.005974	2	5	5	5	3	5	5	5	3	5	3	5	3	5	3	5	3	5	/livemsg?sdtfrom=v5004&ad_type=WL_WK&oadid=&ty...	[v5004, WL_WK, , web, 0, 20220209V0BT5X00, 1, ...	27	972	182.378599

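def find_url_filetype(x):
    # Return the first ".ext"-like token (a dot followed by lowercase letters),
    # or '__NaN__' if the path contains none
    m = re.search(r'\.[a-z]+', x)
    return m.group() if m else '__NaN__'

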
data['url_path'] = data['url_unquote'].apply(lambda x: urlparse(x)[2])
data['url_filetype'] = data['url_path'].apply(lambda x: find_url_filetype(x))

data['url_path_len'] = data['url_path'].apply(len)
data['url_path_num'] = data['url_path'].apply(lambda x: len(re.findall('/',  x)))
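
The pattern `\.[a-z]+` is searched over the whole path and returns the first match, so a dot inside a directory name (for example `/api.v2/list`) would be reported as the file type. If that matters, one option is to look only at the last path segment. The sketch below uses the standard library; `last_segment_ext` is a hypothetical helper and was not part of the submitted pipeline.

```python
import posixpath


def last_segment_ext(url_path, missing="__NaN__"):
    """Extension of the final path segment, lowercased, or a placeholder."""
    segment = url_path.rstrip("/").rsplit("/", 1)[-1]
    ext = posixpath.splitext(segment)[1].lower()
    return ext if ext else missing


for path in ["/demo/aisec/upload.php", "/naizau.jsp/", "/api.v2/list", "/runtime.js"]:
    print(path, "->", last_segment_ext(path))
```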

data['ua_short'] = data['user_agent'].apply(lambda x: x.split('/')[0])
data['ua_first'] = data['user_agent'].apply(lambda x: x.split(' ')[0])

# def strs_contains(strs, keyword):
#     return True if re.search(keyword, strs, re.IGNORECASE) else False

# data['url_contains_select'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'select'))
# data['url_contains_select_from'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'select.*from'))
# data['url_contains_union'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'union'))
# data['url_contains_where'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'where'))
# data['url_contains_struts2'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'struts2'))
# data['url_contains_alert'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'alert'))
# data['url_contains_sudo'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'sudo'))
# data['url_contains_etc_passwd'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'etc.*passwd'))
# data['url_contains_dot_dot'] = data['url_unquote'].apply(lambda x: strs_contains(x, '%2e%2e%2f'))
# data['url_contains_dot_dot2'] = data['url_unquote'].apply(lambda x: strs_contains(x, '\.\./'))
# data['url_contains_javascript'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'javascript'))
# data['url_contains_shell'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'shell'))
# data['url_contains_java_lang'] = data['url_unquote'].apply(lambda x: strs_contains(x, 'java.lang'))


for col in tqdm(['method', 'refer', 'browser_family','os_family','device_family', 'device_brand', 'device_model','url_filetype','ua_short','ua_first']):
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 82.97it/s]

len(data.select_dtypes(include=['int','float']).columns.tolist())

col = data.select_dtypes(include=['int','float']).columns.tolist()
data = data[col]
feature_names = [i for i in col if i not in ['id','label']]

train = data[data['label'].notnull()].reset_index(drop = True)
test = data[~data['label'].notnull()].reset_index(drop = True)


x_train = train[feature_names]
y_train = train['label']
x_test = test[feature_names]

x_train.shape

(33037, 107)

def lgb_model(train, target, test, k):
    feats = [f for f in train.columns if f not in ['id','label',  'url', 'url_count']]
    #     feats=import_cols
    print('Current num of features:', len(feats))

    oof_probs = np.zeros((train.shape[0],6))
    output_preds = 0
    offline_score = []
    feature_importance_df = pd.DataFrame()
    parameters = {
        'learning_rate': 0.03,
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'metric': 'multi_error',
        'num_class': 6,
        'num_leaves': 31,
        'feature_fraction': 0.6,
        'bagging_fraction': 0.8,  # only takes effect when bagging_freq > 0, which is not set here
        'min_data_in_leaf': 15,
        'verbose': -1,
        'nthread': -1,
        'max_depth': 7
    }
    
    # parameters = {
    #         'learning_rate': 0.1,
    #         'metric': 'multiclass',
    #         'objective': 'multiclass',
    #         'num_classes': 6,
    #         'feature_fraction': 0.75,
    #         'bagging_fraction': 0.75,
    #         'bagging_freq': 2,
    #         'n_jobs': -1,
    #         'seed': 1029,
    #         'max_depth': 10,
    #         'num_leaves': 100,
    #         'lambda_l1': 0.5,
    #         'lambda_l2': 0.8,
    #         'verbose': -1
    #     }

    seeds = [2020]
    for seed in seeds:
        folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
        for i, (train_index, test_index) in enumerate(folds.split(train, target)):
            train_y, test_y = target.iloc[train_index], target.iloc[test_index]
            train_X, test_X = train[feats].iloc[train_index, :], train[feats].iloc[test_index, :]

            dtrain = lgb.Dataset(train_X,
                                 label=train_y)
            dval = lgb.Dataset(test_X,
                               label=test_y)
            lgb_model = lgb.train(
                parameters,
                dtrain,
                num_boost_round=8000,
                valid_sets=[dval],
                # feval = evalerror,
                callbacks=[early_stopping(100), log_evaluation(100)],
            )
            oof_probs[test_index] = lgb_model.predict(test_X[feats], num_iteration=lgb_model.best_iteration) / len(
                seeds)
            offline_score.append(lgb_model.best_score['valid_0']['multi_error'])
            output_preds += lgb_model.predict(test[feats],
                                              num_iteration=lgb_model.best_iteration) / folds.n_splits / len(seeds)
            print(offline_score)
            # feature importance
            fold_importance_df = pd.DataFrame()
            fold_importance_df["feature"] = feats
            fold_importance_df["importance"] = lgb_model.feature_importance(importance_type='gain')
            fold_importance_df["fold"] = i + 1
            feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
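    # Note: offline_score holds each fold's best multi_error (lower is better), not AUC, despite the label printed below
    print('OOF-MEAN-AUC:%.6f, OOF-STD-AUC:%.6f' % (np.mean(offline_score), np.std(offline_score)))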
    print('feature importance:')
    print(feature_importance_df.groupby(['feature'])['importance'].mean().sort_values(ascending=False).head(50))

    return output_preds, oof_probs, np.mean(offline_score), feature_importance_df

# feature_names = list(
#     filter(
#         lambda x: x not in ['id','label','url', 'url_count','url_query'],
#         train.columns))


print('Start model training: train')
lgb_preds, lgb_oof, lgb_score, feature_importance_df = lgb_model(train=train[feature_names],
                                                                 target=train['label'],
                                                                 test=test[feature_names], k=5)

Start model training: train
Current num of features: 107
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.0145278
[200]	valid_0's multi_error: 0.0136199
[300]	valid_0's multi_error: 0.0134685
Early stopping, best iteration is:
[203]	valid_0's multi_error: 0.0133172
[0.013317191283292978]
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.0119552
Early stopping, best iteration is:
[63]	valid_0's multi_error: 0.0116525
[0.013317191283292978, 0.011652542372881356]
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.011503
Early stopping, best iteration is:
[92]	valid_0's multi_error: 0.011503
[0.013317191283292978, 0.011652542372881356, 0.011502951415165732]
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.011503
Early stopping, best iteration is:
[55]	valid_0's multi_error: 0.0107462
[0.013317191283292978, 0.011652542372881356, 0.011502951415165732, 0.010746178295746934]
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.011503
Early stopping, best iteration is:
[21]	valid_0's multi_error: 0.0112002
[0.013317191283292978, 0.011652542372881356, 0.011502951415165732, 0.010746178295746934, 0.011200242167398214]
OOF-MEAN-AUC:0.011684, OOF-STD-AUC:0.000873
feature importance:
feature
2                           222576.439984
1                           211736.666876
0                           147032.174178
4                           108800.598945
3                            86021.756831
browser_family               62270.501839
5                            39814.736140
body_tfidf_0                 27317.752312
body_tfidf_1                 23303.336642
url_name_tfidf_2             16427.787706
url_name_tfidf_14            15618.236285
url_name_tfidf_3             14344.577968
user_agent_len                9418.270741
body_user_agent_len_diff      9103.063841
user_agent_name_tfidf_2       8463.647873
url_query_max_len             6510.758108
user_agent_name_tfidf_5       4018.171339
body_tfidf_3                  3672.676081
body_tfidf_23                 3355.861338
user_agent_name_tfidf_4       3142.178290
url_name_tfidf_4              3127.492671
url_name_tfidf_13             2886.949569
url_name_tfidf_1              2849.448233
url_name_tfidf_5              2676.723486
user_agent_name_tfidf_7       2330.854639
user_agent_name_tfidf_14      2310.199494
refer_len                     2145.159362
user_agent_name_tfidf_3       2103.610035
body_tfidf_10                 1986.768004
url_name_tfidf_6              1866.483835
body_url_len_diff             1849.085455
body_tfidf_6                  1846.585181
body_len                      1821.060478
user_agent_name_tfidf_0       1743.131918
id_method_count               1659.460621
user_agent_name_tfidf_13      1627.554725
body_tfidf_5                  1539.584926
url_name_tfidf_8              1485.817275
url_name_tfidf_9              1449.335372
url_name_tfidf_12             1429.778973
url_name_tfidf_10             1413.196746
url_name_tfidf_11             1296.268733
user_agent_name_tfidf_9       1290.671890
url_len                       1284.274063
body_tfidf_12                 1279.712368
body_tfidf_9                  1209.730761
body_tfidf_8                  1161.935181
url_name_tfidf_15             1158.114828
url_name_tfidf_7               998.550187
url_name_tfidf_0               998.193206
Name: importance, dtype: float64
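
The training loop above follows the usual out-of-fold (OOF) pattern: each training row is predicted exactly once by the fold model that did not see it, and test predictions are averaged over the fold models. The sketch below shows only that bookkeeping in plain Python. The "model" is a deliberately trivial class-frequency prior standing in for LightGBM, and all function names and labels are made up for illustration.

```python
from collections import Counter, defaultdict
import random


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % k
    return fold_of


def fit_prior(train_labels, n_classes):
    """Toy stand-in model: predicts the class frequencies seen in training."""
    counts = Counter(train_labels)
    total = len(train_labels)
    return [counts[c] / total for c in range(n_classes)]


def oof_and_test_probs(labels, n_test, n_classes, k=5):
    fold_of = stratified_folds(labels, k)
    oof = [None] * len(labels)
    test_probs = [[0.0] * n_classes for _ in range(n_test)]
    for fold in range(k):
        train_labels = [y for y, f in zip(labels, fold_of) if f != fold]
        probs = fit_prior(train_labels, n_classes)
        # Each training row is predicted once, by the model that never saw it.
        for idx, f in enumerate(fold_of):
            if f == fold:
                oof[idx] = list(probs)
        # Test predictions are averaged over the k fold models.
        for row in test_probs:
            for c in range(n_classes):
                row[c] += probs[c] / k
    return oof, test_probs


labels = [0] * 10 + [1] * 10 + [2] * 5  # made-up labels
oof, test_probs = oof_and_test_probs(labels, n_test=3, n_classes=3, k=5)
print(len(oof), test_probs[0])
```

The same idea explains the `0` to `5` columns used as features: they are out-of-fold probabilities for the training rows, so the second-stage model never sees a probability produced by a model that was trained on that very row.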

# xgb_params = {'n_estimators': 10000,
#               'learning_rate': 0.03689407512484644,
#               'max_depth': 8,
#               'objective': 'multi:softproba',
#               'colsample_bytree': 0.3723914688159835,
#               'subsample': 0.780714581166012,
#               'eval_metric': 'mlogloss',
#               'gamma': 0,
#               'nthread': 1,
#               'reg_lambda': 50.0,
#               'random_state': 42}

# cat_params = {'iterations': 8000,
#               'learning_rate': 0.03429054860458741,
#               'reg_lambda': 0.3242286463210283,
#               'subsample': 0.9433911589913944,
#               'random_strength': 22.4849972385133,
#               'depth': 8,
#               'thread_count': 1,
#               #         'min_data_in_leaf': 4,
#               'leaf_estimation_iterations': 8,
#               'task_type': "CPU",
#               'bootstrap_type': 'Bernoulli',
#               'verbose': 50,
#               'early_stopping_rounds': 50,
#               # 'eval_metric': 'AUC',
#               'loss_function':'MultiClass'
#               }
# # lgb = LGBMClassifier(**lgb_params)
# xgb = XGBClassifier(**xgb_params)
# cat = CatBoostClassifier(**cat_params)


# # In[18]:


# def get_oof(feats, target, test, kfold, clf):
#     oof_preds = np.zeros((feats.shape[0],6))
#     sub_preds = np.zeros((test.shape[0],6))
#     for i, (train_idx, valid_idx) in enumerate(kfold.split(feats, target)):
#         train_X, train_y = feats.loc[train_idx], target.loc[train_idx]
#         valid_X, valid_y = feats.loc[valid_idx], target.loc[valid_idx]

#         clf.fit(train_X, train_y, eval_set=[(valid_X, valid_y)], verbose=100, early_stopping_rounds=50, )
#         oof_preds[valid_idx] = clf.predict_proba(valid_X)
#         sub_preds += clf.predict_proba(test)
#         del train_X, train_y, valid_X, valid_y

#     evalution_result = accuracy_score(target, np.argmax(oof_preds,axis=1))
#     print('*' * 10)
#     print('roc auc score:', evalution_result)
#     print('*' * 20)
#     sub_preds_result = sub_preds / kfold.n_splits
#     return oof_preds, sub_preds_result


# kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
# # start from 45

# oof_preds_2, sub_preds_2 = get_oof(train[feature_names], train['label'], test[feature_names], kfold, xgb)

# oof_preds_3, sub_preds_3 = get_oof(train[feature_names], train['label'], test[feature_names], kfold, cat)

sub['predict']=np.argmax(lgb_preds,axis=1)

sub['predict'].value_counts()

2    855
1    828
0    804
3    666
4    447
5    400
Name: predict, dtype: int64

# sub.to_csv('E:/data/DF/Web攻击检测与分类识别/res/9-8-1.csv')

accuracy_score(train['label'],np.argmax(lgb_oof,axis=1))

0.9883161303992494

# lgb (arguments in scikit-learn's order: y_true first, then y_pred)
f1_score(train['label'],np.argmax(lgb_oof,axis=1),average= 'macro')

# xgb
# f1_score(train['label'],np.argmax(oof_preds_2,axis=1),average= 'macro')

# cat
# f1_score(train['label'],np.argmax(oof_preds_3,axis=1),average= 'macro')

0.9650199644033531

print(classification_report(train['label'],np.argmax(lgb_oof,axis=1)))

              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      6489
         1.0       0.99      0.99      0.99     14038
         2.0       0.99      0.99      0.99      9939
         3.0       0.94      0.93      0.94      1215
         4.0       0.94      0.87      0.90       697
         5.0       0.97      0.98      0.98       659

    accuracy                           0.99     33037
   macro avg       0.97      0.96      0.97     33037
weighted avg       0.99      0.99      0.99     33037
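
The two average rows of the report can be reproduced from the per-class rows, which also shows why the macro figure sits below accuracy: classes 3 and 4 have few samples but count as much as the large classes. The snippet below copies the rounded F1 values and supports from the report above, so it only approximates the exact OOF macro F1 of 0.9650.

```python
# Per-class F1 and support, copied from the classification report above.
f1 = [1.00, 0.99, 0.99, 0.94, 0.90, 0.98]
support = [6489, 14038, 9939, 1215, 697, 659]

# Macro average: every class has the same weight.
macro_avg = sum(f1) / len(f1)
# Weighted average: each class is weighted by its number of samples.
weighted_avg = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(f"macro avg    ~ {macro_avg:.2f}")
print(f"weighted avg ~ {weighted_avg:.2f}")
```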