pyspark.mllib.clustering.
KMeansModel
A clustering model derived from the k-means method.
New in version 0.9.0.
Examples
>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) >>> model = KMeans.train( ... sc.parallelize(data), 2, maxIterations=10, initializationMode="random", ... seed=50, initializationSteps=5, epsilon=1e-4) >>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) True >>> model.predict(array([8.0, 9.0])) == model.predict(array([9.0, 8.0])) True >>> model.k 2 >>> model.computeCost(sc.parallelize(data)) 2.0 >>> model = KMeans.train(sc.parallelize(data), 2) >>> sparse_data = [ ... SparseVector(3, {1: 1.0}), ... SparseVector(3, {1: 1.1}), ... SparseVector(3, {2: 1.0}), ... SparseVector(3, {2: 1.1}) ... ] >>> model = KMeans.train(sc.parallelize(sparse_data), 2, initializationMode="k-means||", ... seed=50, initializationSteps=5, epsilon=1e-4) >>> model.predict(array([0., 1., 0.])) == model.predict(array([0, 1.1, 0.])) True >>> model.predict(array([0., 0., 1.])) == model.predict(array([0, 0, 1.1])) True >>> model.predict(sparse_data[0]) == model.predict(sparse_data[1]) True >>> model.predict(sparse_data[2]) == model.predict(sparse_data[3]) True >>> isinstance(model.clusterCenters, list) True >>> import os, tempfile >>> path = tempfile.mkdtemp() >>> model.save(sc, path) >>> sameModel = KMeansModel.load(sc, path) >>> sameModel.predict(sparse_data[0]) == model.predict(sparse_data[0]) True >>> from shutil import rmtree >>> try: ... rmtree(path) ... except OSError: ... pass
>>> data = array([-383.1,-382.9, 28.7,31.2, 366.2,367.3]).reshape(3, 2) >>> model = KMeans.train(sc.parallelize(data), 3, maxIterations=0, ... initialModel = KMeansModel([(-1000.0,-1000.0),(5.0,5.0),(1000.0,1000.0)])) >>> model.clusterCenters [array([-1000., -1000.]), array([ 5., 5.]), array([ 1000., 1000.])]
Methods
computeCost(rdd)
computeCost
Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.
load(sc, path)
load
Load a model from the given path.
predict(x)
predict
Find the cluster that each of the points belongs to in this model.
save(sc, path)
save
Save this model to the given path.
Attributes
clusterCenters
Get the cluster centers, represented as a list of NumPy arrays.
k
Total number of clusters.
Methods Documentation
New in version 1.4.0.
pyspark.RDD
The RDD of points to compute the cost on.
pyspark.mllib.linalg.Vector
A data point (or RDD of points) to determine cluster index. pyspark.mllib.linalg.Vector can be replaced with equivalent objects (list, tuple, numpy.ndarray).
Predicted cluster index or an RDD of predicted cluster indices if the input is an RDD.
Attributes Documentation
New in version 1.0.0.