K-means with scikit-learn
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load iris and keep two features (columns 1 and 2: sepal width and petal length)
iris = load_iris()
X = iris.data[:, 1:3]
Y = iris.target

X_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=2021)

# Fit K-means with 3 clusters; the class labels are not used for fitting
estimator = KMeans(n_clusters=3, n_init=10, random_state=2021)
estimator.fit(X_train)

# Cluster ids are arbitrary, so map each cluster to the majority class of its
# training points before comparing predictions with the true labels
cluster_to_class = np.array(
    [np.bincount(y_train[estimator.labels_ == c]).argmax() for c in range(3)]
)
y_pre = cluster_to_class[estimator.predict(x_test)]

print("Model accuracy:", accuracy_score(y_test, y_pre))
```
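Since K-means numbers its clusters arbitrarily, comparing raw cluster ids with class labels can be misleading. The sketch below illustrates this by relabelling the same partition and scoring it twice. The names `X_demo`, `y_demo` and `permuted`, the seed, and the use of `adjusted_rand_score` as a permutation-invariant alternative are choices made for this illustration, not part of the original post.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, adjusted_rand_score

# Same two iris features as in the post (hypothetical variable names)
iris_demo = load_iris()
X_demo, y_demo = iris_demo.data[:, 1:3], iris_demo.target

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# Rename the clusters with a fixed permutation: same partition, different ids
permuted = (labels + 1) % 3

acc_original = accuracy_score(y_demo, labels)
acc_permuted = accuracy_score(y_demo, permuted)
ari_original = adjusted_rand_score(y_demo, labels)
ari_permuted = adjusted_rand_score(y_demo, permuted)

print("accuracy (original ids):", acc_original)
print("accuracy (permuted ids):", acc_permuted)
print("ARI (original ids):     ", ari_original)
print("ARI (permuted ids):     ", ari_permuted)
```

The adjusted Rand index depends only on which samples are grouped together, so it stays the same under renaming, while plain accuracy generally does not.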
Visualization
```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, 1:3]
Y = iris.target

X_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=2021)


def Kmeans_fun(k):
    """Fit K-means with k clusters on the training set and label every sample in X."""
    estimator = KMeans(n_clusters=k, n_init=10, random_state=2021)
    estimator.fit(X_train)
    return estimator.predict(X)


# One subplot per number of centers (1 to 9) on a 3x3 grid
plt.figure(figsize=(20, 8), dpi=100)
for i in range(1, 10):
    y_pre = Kmeans_fun(i)
    plt.subplot(3, 3, i)
    plt.scatter(X[:, 0], X[:, 1], c=y_pre)
    plt.title("Clustering result with {0} centers".format(i))
plt.tight_layout()
plt.show()
```
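Choosing the optimal K
Elbow method
Elbow method: plot the error E (the within-cluster sum of squared errors, SSE) against the number of centers K, inspect the curve by eye, and pick the K at the inflection point (the "elbow"), where adding more centers stops reducing E substantially.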
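To make the quantity E concrete, here is a minimal sketch that computes the SSE by hand and compares it with the `inertia_` attribute that scikit-learn reports. The helper `sse`, the variable `X_demo`, and the choice of k = 3 with a fixed seed are assumptions made for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Same two iris features as in the rest of the post
X_demo = load_iris().data[:, 1:3]


def sse(points, labels, centers):
    """Sum of squared distances from each point to its assigned cluster center."""
    diffs = points - centers[labels]
    return float(np.sum(diffs ** 2))


# k = 3 and random_state = 0 are arbitrary choices for this illustration
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)
manual_sse = sse(X_demo, model.labels_, model.cluster_centers_)

print("manual SSE:", manual_sse)
print("inertia_:  ", model.inertia_)
```

The two numbers should agree up to floating-point error, which is why the elbow plot can simply read `inertia_` for each K.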
```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, 1:3]
Y = iris.target

X_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=2021)

# Record the SSE (inertia_) for each number of centers from 2 to 9
ks = range(2, 10)
SSE = []
for k in ks:
    estimator = KMeans(n_clusters=k, n_init=10, random_state=2021)
    estimator.fit(X_train)  # K-means is unsupervised, so no labels are passed
    SSE.append(estimator.inertia_)

# Plot SSE against k and look for the elbow
plt.scatter(ks, SSE)
plt.plot(ks, SSE)
plt.xlabel("Number of centers k")
plt.ylabel("SSE")
plt.show()
```
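Silhouette coefficient
The silhouette coefficient lies in [-1, 1]; the closer it is to 1, the better both cohesion and separation are. For each sample i, compute the within-cluster dissimilarity a(i), the average distance from i to the other points in its own cluster, which reflects cohesion. Then compute the between-cluster dissimilarity b(i), the smallest average distance from i to the points of any other cluster, which reflects separation. The silhouette coefficient of sample i is s(i) = (b(i) - a(i)) / max(a(i), b(i)).
If s(i) is close to 1, sample i is clustered well; if s(i) is close to -1, sample i would fit better in another cluster; if s(i) is approximately 0, sample i lies on the boundary between two clusters.
Averaging the silhouette coefficients of all points gives the overall silhouette coefficient of the clustering.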
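The definitions above can be checked on a tiny example. The following is a minimal sketch, assuming Euclidean distance and the standard formula s(i) = (b(i) - a(i)) / max(a(i), b(i)); the helper `silhouette_by_hand` and the toy data are hypothetical, and the result is compared against scikit-learn's `silhouette_samples` and `silhouette_score`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score


def silhouette_by_hand(points, labels):
    """Per-sample silhouette values computed directly from a(i) and b(i)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        if not same.any():
            continue  # singleton cluster: s(i) is taken to be 0
        a = dist[i, same].mean()  # cohesion: mean distance within the cluster
        b = min(  # separation: nearest other cluster by mean distance
            dist[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores


# Two well-separated groups of two points each (toy data)
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
toy_labels = np.array([0, 0, 1, 1])

by_hand = silhouette_by_hand(toy_points, toy_labels)
print("per-sample s(i):", by_hand)
print("mean s(i):      ", by_hand.mean())
print("sklearn score:  ", silhouette_score(toy_points, toy_labels))
```

The mean of the per-sample values is the overall score, which is what `silhouette_score` returns.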
```python
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, 1:3]
Y = iris.target

X_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=2021)


def Kmeans_fun(k):
    """Fit K-means with k clusters on the training set and return its labels."""
    estimator = KMeans(n_clusters=k, n_init=10, random_state=2021)
    estimator.fit(X_train)
    return estimator.labels_


# Silhouette coefficient for each number of centers from 2 to 99
ks = list(range(2, 100))
lis = []
for k in ks:
    res_label = Kmeans_fun(k)
    lis.append(metrics.silhouette_score(X_train, res_label, metric='euclidean'))

# The x-axis uses the same k values that produced the scores
plt.figure(figsize=(20, 8))
plt.plot(ks, lis)
plt.xlabel("Number of cluster centers")
plt.ylabel("Silhouette coefficient")
plt.title("Silhouette coefficient vs. number of cluster centers")
plt.show()
```
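One possible way to turn the silhouette curve into a concrete choice is to take the K with the highest average score. This is only a sketch: the search range 2 to 10, the seed, and the names `candidate_ks`, `scores` and `best_k` are assumptions made here, and the highest silhouette score is a heuristic rather than a guarantee of the "right" number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Same two iris features as in the rest of the post
X_demo = load_iris().data[:, 1:3]

# 2..10 is an arbitrary search range for this sketch
candidate_ks = list(range(2, 11))
scores = []
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores.append(silhouette_score(X_demo, labels))

# Pick the k whose clustering has the largest average silhouette coefficient
best_k = candidate_ks[int(np.argmax(scores))]
print("best k by silhouette:", best_k)
```

In practice it is worth reading this number alongside the elbow plot rather than relying on either criterion alone.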
Feel free to follow me!