Spark – Clustering

Official documentation: https://spark.apache.org/docs/2.2.0/ml-clustering.html

This section introduces the clustering algorithms available in MLlib.

Contents:

  • K-means
    • Input columns
    • Output columns
  • Latent Dirichlet allocation (LDA)
  • Bisecting k-means
  • Gaussian Mixture Model (GMM)
    • Input columns
    • Output columns

K-means

k-means is one of the most commonly used clustering algorithms; it groups the data points into a predefined number of clusters, k.

KMeans is implemented as an Estimator and generates a KMeansModel as the base model.

Input columns

Param name   | Type(s) | Default  | Description
featuresCol  | Vector  | features | Feature vector

Output columns

Param name     | Type(s) | Default    | Description
predictionCol  | Int     | prediction | Predicted cluster center

Example

from pyspark.ml.clustering import KMeans

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

# Evaluate clustering by computing Within Set Sum of Squared Errors.
wssse = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(wssse))

# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

LDA

LDA is implemented as an Estimator that supports both EMLDAOptimizer and OnlineLDAOptimizer, and generates an LDAModel as the base model; expert users may, if needed, cast an LDAModel produced by EMLDAOptimizer to a DistributedLDAModel.

from pyspark.ml.clustering import LDA

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

# Trains an LDA model.
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)

ll = model.logLikelihood(dataset)
lp = model.logPerplexity(dataset)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))

# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)

# Shows the result
transformed = model.transform(dataset)
transformed.show(truncate=False)

Bisecting k-means

Bisecting k-means is a divisive hierarchical clustering algorithm: all data points start in a single cluster, which is split recursively until the specified number of clusters is reached.

Bisecting k-means is often faster than regular k-means, but it usually produces a different clustering.

BisectingKMeans is implemented as an Estimator and generates a BisectingKMeansModel as the base model.

Unlike k-means, the final result of bisecting k-means does not depend on the choice of initial cluster centers, which is also why the two algorithms often produce different results.

from pyspark.ml.clustering import BisectingKMeans

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

# Trains a bisecting k-means model.
bkm = BisectingKMeans().setK(2).setSeed(1)
model = bkm.fit(dataset)

# Evaluate clustering.
cost = model.computeCost(dataset)
print("Within Set Sum of Squared Errors = " + str(cost))

# Shows the result.
print("Cluster Centers: ")
centers = model.clusterCenters()
for center in centers:
    print(center)

Gaussian Mixture Model(GMM)

A GMM represents a composite distribution in which points are drawn from one of k Gaussian sub-distributions, each with its own probability. spark.ml implements this by using the expectation-maximization (EM) algorithm to induce the maximum-likelihood model from the given data.

Input columns

Param name   | Type(s) | Default  | Description
featuresCol  | Vector  | features | Feature vector

Output columns

Param name      | Type(s) | Default     | Description
predictionCol   | Int     | prediction  | Predicted cluster center
probabilityCol  | Vector  | probability | Probability of each cluster

Example

from pyspark.ml.clustering import GaussianMixture

# Loads data.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

gmm = GaussianMixture().setK(2).setSeed(538009335)
model = gmm.fit(dataset)

print("Gaussians shown as a DataFrame: ")
model.gaussiansDF.show(truncate=False)