Spark中的聚类算法

Spark – Clustering

官方文档://spark.apache.org/docs/2.2.0/ml-clustering.html

这部分介绍MLlib中的聚类算法;

目录:

  • K-means:
    • 输入列;
    • 输出列;
  • Latent Dirichlet allocation(LDA):
  • Bisecting k-means;
  • Gaussian Mixture Model(GMM):
    • 输入列;
    • 输出列;

K-means

k-means是最常用的聚类算法之一,它将数据聚集到预先设定的N个簇中;

KMeans作为一个预测器,生成一个KMeansModel作为基本模型;

输入列

Param name Type(s) Default Description
featuresCol Vector features Feature vector

输出列

Param name Type(s) Default Description
predictionCol Int prediction Predicted cluster center

例子

from pyspark.ml.clustering import KMeans

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Evaluate clustering by computing Within Set Sum of Squared Errors.
wssse = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(wssse))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

LDA

LDA是一个预测器,同时支持EMLDAOptimizer和OnlineLDAOptimizer,生成一个LDAModel作为基本模型,专家使用者如果有需要可以将EMLDAOptimizer生成的LDAModel转为DistributedLDAModel;

from pyspark.ml.clustering import LDA

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

# Trains a LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)

ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

# Shows the result
transformed = model.transform(dataset)
transformed.show(truncate=False)

Bisecting k-means

Bisecting k-means是一种使用分裂方法的层次聚类算法:所有数据点开始都处在一个簇中,递归的对数据进行划分直到簇的个数为指定个数为止;

Bisecting k-means一般比K-means要快,但是它会生成不一样的聚类结果;

BisectingKMeans是一个预测器,并生成BisectingKMeansModel作为基本模型;

与K-means相比,二分K-means的最终结果不依赖于初始簇心的选择,这也是为什么通常二分K-means与K-means结果往往不一样的原因;

from pyspark.ml.clustering import BisectingKMeans

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a bisecting k-means model.
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dataset)

# Evaluate clustering.
cost = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(cost))

# Shows the result.
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)

Gaussian Mixture Model(GMM)

GMM表示一个符合分布,从一个高斯子分布中提取点,每个点都有其自己 的概率,spark.ml基于给定数据通过期望最大化算法来归纳最大似然模型实现算法;

输入列

Param name Type(s) Default Description
featuresCol Vector features Feature vector

输出列

Param name Type(s) Default Description
predictionCol Int prediction Predicted cluster center
probabilityCol Vector probability Probability of each cluster

例子

from pyspark.ml.clustering import GaussianMixture

# loads data
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

gmm = GaussianMixture().setK(2).setSeed(538009335)
model = gmm.fit(dataset)

print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)