How do I plot KMeans text clustering results with matplotlib?
Question:
I have the following code that clusters some sample text with scikit-learn.
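import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog", "blue sweater", "red hat", "kitty blue"]
vect = TfidfVectorizer()
X = vect.fit_transform(train)
clf = KMeans(n_clusters=3)
clf.fit(X)
centroids = clf.cluster_centers_
# Plots only the first two tf-idf columns of each centroid
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=80, linewidths=5)
plt.show()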
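What I can't figure out is how to plot the clustering results. X is a csr_matrix. What I want is an (x, y) coordinate for each result so that I can plot it.
Thanks.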
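Answer:
Your centroids live in the full tf-idf feature space: clf.cluster_centers_ has 3 rows (one per cluster) and roughly 17 columns (one per vocabulary term), so you need some kind of projection or dimensionality reduction to get 2-D centroids. You have several options. Here is an example using t-SNE: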
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog",
         "blue sweater", "red hat", "kitty blue"]
vect = TfidfVectorizer()
X = vect.fit_transform(train)
random_state = 1
# n_init is set explicitly because its default has changed across scikit-learn releases
clf = KMeans(n_clusters=3, n_init=10, random_state=random_state)
clf.fit(X)
centroids = clf.cluster_centers_  # dense array, one row per cluster
tsne_init = 'pca'  # could also be 'random'
# The original answer used perplexity 20.0. Recent scikit-learn releases require
# perplexity < n_samples, and only 3 centroids are embedded here.
tsne_perplexity = 2.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 1000.0
model = TSNE(n_components=2, random_state=random_state, init=tsne_init, perplexity=tsne_perplexity,
             early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)
transformed_centroids = model.fit_transform(centroids)
print(transformed_centroids)
plt.scatter(transformed_centroids[:, 0], transformed_centroids[:, 1], marker='x')
plt.show()
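In your example, if you initialize t-SNE with PCA, you get widely spaced centroids. If you use random initialization, you get centroids with tiny coordinate values and an uninteresting picture.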
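The question also asks for an (x, y) coordinate for every clustered text, not only for the centroids. One of the other options is a linear projection. The following is a minimal sketch, assuming TruncatedSVD as the reducer (a swapped-in technique, not the t-SNE approach above); the helper names project_clusters and plot_clusters are hypothetical and exist only for illustration. TruncatedSVD accepts the sparse csr_matrix directly, and reusing the same fitted projection for the centroids keeps them on the same axes as the documents.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def project_clusters(texts, n_clusters=3, random_state=1):
    """Cluster texts with KMeans and return 2-D coordinates for plotting.

    Hypothetical helper: returns (doc_xy, centroid_xy, labels).
    """
    X = TfidfVectorizer().fit_transform(texts)  # sparse csr_matrix
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(X)
    # TruncatedSVD works on sparse input, so X is never densified.
    svd = TruncatedSVD(n_components=2, random_state=random_state)
    doc_xy = svd.fit_transform(X)
    # Apply the same fitted projection so centroids share the documents' axes.
    centroid_xy = svd.transform(km.cluster_centers_)
    return doc_xy, centroid_xy, labels


def plot_clusters(doc_xy, centroid_xy, labels):
    """Scatter the documents colored by cluster, with centroids marked 'x'."""
    plt.scatter(doc_xy[:, 0], doc_xy[:, 1], c=labels, cmap="viridis")
    plt.scatter(centroid_xy[:, 0], centroid_xy[:, 1], marker="x", s=80, c="red")
    plt.show()


if __name__ == "__main__":
    train = ["is this good?", "this is bad", "some other text here", "i am hero",
             "blue jeans", "red carpet", "red dog", "blue sweater", "red hat",
             "kitty blue"]
    plot_clusters(*project_clusters(train))
```

With only two components the projection discards most of the tf-idf variance, so clusters that KMeans separates in the full space may overlap in the plot. Treat the picture as a rough view of the clustering rather than a faithful one.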