江东的笔记


Code from Scratch: Naive Bayes with Laplace Smoothing

Reimplementing naive Bayes from its underlying logic.

Calculation steps:

score(好瓜) = P(好瓜) · P(色泽|好瓜) · P(根蒂|好瓜) · P(敲声|好瓜) · P(纹理|好瓜) · P(脐部|好瓜) · P(触感|好瓜)
score(坏瓜) = P(坏瓜) · P(色泽|坏瓜) · P(根蒂|坏瓜) · P(敲声|坏瓜) · P(纹理|坏瓜) · P(脐部|坏瓜) · P(触感|坏瓜)

Under the naive independence assumption each score is proportional to the posterior P(class | features); the shared evidence term cancels when the two are compared, so the larger score wins. Each conditional comes from Bayes' rule, e.g. P(色泽|好瓜) = P(好瓜|色泽) · P(色泽) / P(好瓜).
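Every smoothed estimate below has the same shape: add 1 to the count and add the number of possible values to the denominator. A minimal sketch (the helper name `laplace_smooth` is mine, not from the post):

```python
def laplace_smooth(count: int, total: int, num_values: int) -> float:
    """Add-one (Laplace) estimate: (count + 1) / (total + num_values)."""
    return (count + 1) / (total + num_values)

# A value never observed among 8 samples of a 3-valued feature still gets
# a small nonzero probability, so the product over features cannot hit 0.
print(laplace_smooth(0, 8, 3))  # 1/11
print(laplace_smooth(3, 8, 3))  # 4/11
```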

Reading the data:

```python
import pandas as pd

# melon2 = pd.read_csv('E:\\work\ml\\Python_Project_01\\sklearn_week\\week_10\\melon2.0.csv', index_col='编号')

melon2 = pd.DataFrame([["青绿", "蜷缩", "浊响", "清晰", "凹陷", "硬滑", "是"],
                       ["乌黑", "蜷缩", "沉闷", "清晰", "凹陷", "硬滑", "是"],
                       ["乌黑", "蜷缩", "浊响", "清晰", "凹陷", "硬滑", "是"],
                       ["青绿", "蜷缩", "沉闷", "清晰", "凹陷", "硬滑", "是"],
                       ["浅白", "蜷缩", "浊响", "清晰", "凹陷", "硬滑", "是"],
                       ["青绿", "稍蜷", "浊响", "清晰", "稍凹", "软粘", "是"],
                       ["乌黑", "稍蜷", "浊响", "稍糊", "稍凹", "软粘", "是"],
                       ["乌黑", "稍蜷", "浊响", "清晰", "稍凹", "硬滑", "是"],
                       ["乌黑", "稍蜷", "沉闷", "稍糊", "稍凹", "硬滑", "否"],
                       ["青绿", "硬挺", "清脆", "清晰", "平坦", "软粘", "否"],
                       ["浅白", "硬挺", "清脆", "模糊", "平坦", "硬滑", "否"],
                       ["浅白", "蜷缩", "浊响", "模糊", "平坦", "软粘", "否"],
                       ["青绿", "稍蜷", "浊响", "稍糊", "凹陷", "硬滑", "否"],
                       ["浅白", "稍蜷", "沉闷", "稍糊", "凹陷", "硬滑", "否"],
                       ["乌黑", "稍蜷", "浊响", "清晰", "稍凹", "软粘", "否"],
                       ["浅白", "蜷缩", "浊响", "模糊", "平坦", "硬滑", "否"],
                       ["青绿", "蜷缩", "沉闷", "稍糊", "稍凹", "硬滑", "否"]],
                      columns=["色泽", "根蒂", "敲声", "纹理", "脐部", "触感", "好瓜"])
```
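A quick way to sanity-check such a table is `value_counts` on the label column. A toy sketch with a 4-row stand-in frame (not the melon data):

```python
import pandas as pd

# Toy stand-in frame, same layout as melon2 but only 4 rows and 2 columns
toy = pd.DataFrame([["清晰", "是"], ["清晰", "是"], ["模糊", "否"], ["稍糊", "否"]],
                   columns=["纹理", "好瓜"])
print(toy["好瓜"].value_counts().to_dict())  # {'是': 2, '否': 2}
print(toy.shape)  # (4, 2)
```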

Splitting into good and bad melons:

```python
m2_bad = melon2[melon2['好瓜'] == '否']
m2_good = melon2[melon2['好瓜'] == '是']
```

Computing the priors:

```python
# Laplace-smoothed priors for good / bad
p_good_priori = (len(m2_good) + 1) / (len(melon2) + 2)
p_bad_priori = (len(m2_bad) + 1) / (len(melon2) + 2)
```
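With this dataset's 8 good and 9 bad melons, the smoothed priors work out as follows (plain arithmetic, no pandas needed):

```python
# 8 good and 9 bad melons: 17 rows total, 2 classes
n_good, n_bad = 8, 9
n = n_good + n_bad
p_good = (n_good + 1) / (n + 2)  # 9/19
p_bad = (n_bad + 1) / (n + 2)    # 10/19
print(round(p_good, 4), round(p_bad, 4))  # 0.4737 0.5263
```

Each class gains one pseudo-count and the denominator gains the class count, so the smoothed priors still sum to 1.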

Feature extraction:

```python
# Laplace-smoothed statistics per feature and class will live in a list,
# with one dict per feature
# First, count the number of distinct values of each feature
feature_num = melon2.shape[-1] - 1  # feature order is assumed consistent throughout
features_name = []    # value sets per feature, taken over the full dataset so values
                      # absent from the good or bad subset are still covered
features_counts = []  # number of distinct values; the Laplace denominator correction
for ii in range(feature_num):
    features_name.append(set(melon2.iloc[:, ii]))
    features_counts.append(len(set(melon2.iloc[:, ii])))
```

```python
features_name
[{'乌黑', '浅白', '青绿'},
 {'硬挺', '稍蜷', '蜷缩'},
 {'沉闷', '浊响', '清脆'},
 {'模糊', '清晰', '稍糊'},
 {'凹陷', '平坦', '稍凹'},
 {'硬滑', '软粘'}]
```

`features_counts`: `[3, 3, 3, 3, 3, 2]`

Computing P(·|good):

```python
# Good-melon part
ps_feature_good = []
# First count the feature values within the good subset
for ii in range(feature_num):
    ps_feature_good.append(dict(m2_good.iloc[:, ii].value_counts()))  # a Series is essentially a dict here
# Then apply Laplace smoothing to get the conditional probabilities
for ii in range(feature_num):
    for ff in features_name[ii]:  # the .get below guards against unseen values
        ps_feature_good[ii][ff] = (ps_feature_good[ii].get(ff, 0) + 1) / (len(m2_good) + features_counts[ii])
```
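One entry can be verified by hand: among the 8 good melons, 色泽 takes the value 青绿 3 times, and 色泽 has 3 distinct values overall, so the smoothed conditional is:

```python
# P(色泽=青绿 | 好瓜=是) with Laplace smoothing: (3 + 1) / (8 + 3)
p = (3 + 1) / (8 + 3)
print(round(p, 4))  # 0.3636
```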

Computing P(·|bad):

```python
# Bad-melon part
ps_feature_bad = []
# First count the feature values within the bad subset
for ii in range(feature_num):
    ps_feature_bad.append(dict(m2_bad.iloc[:, ii].value_counts()))
# Then apply Laplace smoothing to get the conditional probabilities
for ii in range(feature_num):
    for ff in features_name[ii]:
        ps_feature_bad[ii][ff] = (ps_feature_bad[ii].get(ff, 0) + 1) / (len(m2_bad) + features_counts[ii])
```

The prediction function:

```python
# Predict: score each class by multiplying its prior with the smoothed
# conditionals, then compare the two products
def predict(features):
    p_good = p_good_priori
    for ii in range(feature_num):
        p_good *= ps_feature_good[ii][features[ii]]

    p_bad = p_bad_priori
    for ii in range(feature_num):
        p_bad *= ps_feature_bad[ii][features[ii]]

    return '是' if p_good > p_bad else '否'
```
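Multiplying many probabilities can underflow as the feature count grows. A common variant (my addition, not from the post) compares log-scores instead; `predict_log` below takes the priors and conditional tables as explicit arguments to stay self-contained:

```python
import math

def predict_log(features, prior_good, prior_bad, cond_good, cond_bad):
    # Summing logs is monotone-equivalent to multiplying probabilities
    score_good = math.log(prior_good) + sum(
        math.log(cond_good[ii][f]) for ii, f in enumerate(features))
    score_bad = math.log(prior_bad) + sum(
        math.log(cond_bad[ii][f]) for ii, f in enumerate(features))
    return '是' if score_good > score_bad else '否'

# Toy usage: one binary feature where value 'x' favors the good class
print(predict_log(['x'], 0.5, 0.5, [{'x': 0.8}], [{'x': 0.2}]))  # 是
```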

Checking against the training set:

```python
# Check predictions against the training labels; predict only reads the
# first feature_num entries of each row, so passing the full row is safe
for idx in melon2.index:
    pred = predict(melon2.loc[idx])
    label = melon2.loc[idx, '好瓜']
    print(pred, label, pred == label)
```

Output:

```
是 是 True
是 是 True
是 是 True
是 是 True
是 是 True
是 是 True
否 是 False
是 是 True
否 否 True
否 否 True
否 否 True
否 否 True
是 否 False
否 否 True
是 否 False
否 否 True
否 否 True
```
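Reading off the printout, 3 of the 17 training rows are misclassified (rows 7, 13, and 15), giving a training accuracy of:

```python
# 14 of 17 training rows predicted correctly
correct, total = 14, 17
print(round(correct / total, 4))  # 0.8235
```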

This is the code my instructor provided; the next post will walk through my own implementation. Thanks for reading!