
MLlib (RDD-based)

Classification

`LogisticRegressionModel`(weights, intercept, …)	Classification model trained using Multinomial/Binary Logistic Regression.
`LogisticRegressionWithSGD`	Train a classification model for Binary Logistic Regression using Stochastic Gradient Descent.
`LogisticRegressionWithLBFGS`	Train a classification model for Multinomial/Binary Logistic Regression using Limited-memory BFGS.
`SVMModel`(weights, intercept)	Model for Support Vector Machines (SVMs).
`SVMWithSGD`	Train a Support Vector Machine (SVM) using Stochastic Gradient Descent.
`NaiveBayesModel`(labels, pi, theta)	Model for Naive Bayes classifiers.
`NaiveBayes`	Train a Multinomial Naive Bayes model.
`StreamingLogisticRegressionWithSGD`([…])	Train or predict a logistic regression model on streaming data.

Clustering

`BisectingKMeansModel`(java_model)	A clustering model derived from the bisecting k-means method.
`BisectingKMeans`	A bisecting k-means algorithm based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar, with modification to fit Spark.
`KMeansModel`(centers)	A clustering model derived from the k-means method.
`KMeans`	K-means clustering.
`GaussianMixtureModel`(java_model)	A clustering model derived from the Gaussian Mixture Model method.
`GaussianMixture`	Learning algorithm for Gaussian Mixtures using the expectation-maximization algorithm.
`PowerIterationClusteringModel`(java_model)	Model produced by `PowerIterationClustering`.
`PowerIterationClustering`	Power Iteration Clustering (PIC), a scalable graph clustering algorithm.
`StreamingKMeans`([k, decayFactor, timeUnit])	Provides methods to set k, decayFactor, and timeUnit, which configure the KMeans algorithm for fitting and predicting on incoming DStreams.
`StreamingKMeansModel`(clusterCenters, …)	Clustering model which can perform an online update of the centroids.
`LDA`	Train Latent Dirichlet Allocation (LDA) model.
`LDAModel`(java_model)	A clustering model derived from the LDA method.

Evaluation

`BinaryClassificationMetrics`(scoreAndLabels)	Evaluator for binary classification.
`RegressionMetrics`(predictionAndObservations)	Evaluator for regression.
`MulticlassMetrics`(predictionAndLabels)	Evaluator for multiclass classification.
`RankingMetrics`(predictionAndLabels)	Evaluator for ranking algorithms.

Feature

`Normalizer`([p])	Normalizes samples individually to unit L^p norm.
`StandardScalerModel`(java_model)	Represents a StandardScaler model that can transform vectors.
`StandardScaler`([withMean, withStd])	Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
`HashingTF`([numFeatures])	Maps a sequence of terms to their term frequencies using the hashing trick.
`IDFModel`(java_model)	Represents an IDF model that can transform term frequency vectors.
`IDF`([minDocFreq])	Inverse document frequency (IDF).
`Word2Vec`()	Word2Vec creates vector representations of words in a text corpus.
`Word2VecModel`(java_model)	Class for the Word2Vec model.
`ChiSqSelector`([numTopFeatures, …])	Creates a ChiSquared feature selector.
`ChiSqSelectorModel`(java_model)	Represents a Chi Squared selector model.
`ElementwiseProduct`(scalingVector)	Scales each column of the vector, with the supplied weight vector.
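
To make "unit L^p norm" in the `Normalizer` entry concrete, here is a minimal pure-Python sketch of the arithmetic. It does not call PySpark, and the helper name `normalize` and its zero-vector handling are assumptions made for illustration rather than a description of the library's behavior.

```python
def normalize(values, p=2.0):
    """Scale a list of numbers so that its L^p norm equals 1.

    Hypothetical helper for illustration only; not part of pyspark.mllib.
    """
    if p == float("inf"):
        # The L^infinity norm is the largest absolute value.
        norm = max((abs(v) for v in values), default=0.0)
    else:
        norm = sum(abs(v) ** p for v in values) ** (1.0 / p)
    if norm == 0.0:
        # Assumption: leave an all-zero (or empty) vector unchanged.
        return list(values)
    return [v / norm for v in values]


unit_l2 = normalize([3.0, 4.0])          # the input has L^2 norm 5
unit_l1 = normalize([1.0, 3.0], p=1.0)   # the input has L^1 norm 4
unit_inf = normalize([2.0, -8.0], p=float("inf"))  # largest magnitude is 8
```

After scaling, each result has norm 1 under the chosen p, so samples of very different magnitudes become comparable.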

Frequency Pattern Mining

`FPGrowth`	A Parallel FP-growth algorithm to mine frequent itemsets.
`FPGrowthModel`(java_model)	An FP-Growth model for mining frequent itemsets using the Parallel FP-Growth algorithm.
`PrefixSpan`	A parallel PrefixSpan algorithm to mine frequent sequential patterns.
`PrefixSpanModel`(java_model)	Model fitted by PrefixSpan.

Vector and Matrix

`Vector`
`DenseVector`(ar)	A dense vector represented by a value array.
`SparseVector`(size, *args)	A simple sparse vector class for passing data to MLlib.
`Vectors`	Factory methods for working with vectors.
`Matrix`(numRows, numCols[, isTransposed])
`DenseMatrix`(numRows, numCols, values[, …])	Column-major dense matrix.
`SparseMatrix`(numRows, numCols, colPtrs, …)	Sparse Matrix stored in CSC format.
`Matrices`
`QRDecomposition`(Q, R)	Represents QR factors.
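
The "column-major" and "CSC format" descriptions above can be illustrated with a small pure-Python sketch. It does not use PySpark. The function `dense_to_csc` and the names of its three return values are hypothetical, chosen only to show how column pointers, row indices, and nonzero values relate to a column-major value array.

```python
def dense_to_csc(num_rows, num_cols, values):
    """Convert a column-major value list to compressed sparse column arrays.

    Hypothetical helper for illustration only; not part of pyspark.mllib.
    Returns (col_ptrs, row_indices, nonzeros).
    """
    col_ptrs = [0]
    row_indices = []
    nonzeros = []
    for j in range(num_cols):
        for i in range(num_rows):
            # Column-major layout: the entries of column j are contiguous.
            v = values[j * num_rows + i]
            if v != 0:
                row_indices.append(i)
                nonzeros.append(v)
        # col_ptrs[j + 1] marks where column j ends in the nonzero arrays.
        col_ptrs.append(len(nonzeros))
    return col_ptrs, row_indices, nonzeros


# The 3x2 matrix with rows [1, 0], [0, 5], [2, 0], stored column by column.
dense = [1.0, 0.0, 2.0, 0.0, 5.0, 0.0]
col_ptrs, row_indices, nonzeros = dense_to_csc(3, 2, dense)
```

Here column 0 owns the nonzero slots from `col_ptrs[0]` up to `col_ptrs[1]`, and column 1 owns the remainder, which is why CSC needs one more pointer than there are columns.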

Distributed Representation

`BlockMatrix`(blocks, rowsPerBlock, colsPerBlock)	Represents a distributed matrix in blocks of local matrices.
`CoordinateMatrix`(entries[, numRows, numCols])	Represents a matrix in coordinate format.
`DistributedMatrix`	Represents a distributively stored matrix backed by one or more RDDs.
`IndexedRow`(index, vector)	Represents a row of an IndexedRowMatrix.
`IndexedRowMatrix`(rows[, numRows, numCols])	Represents a row-oriented distributed Matrix with indexed rows.
`MatrixEntry`(i, j, value)	Represents an entry of a CoordinateMatrix.
`RowMatrix`(rows[, numRows, numCols])	Represents a row-oriented distributed Matrix with no meaningful row indices.
`SingularValueDecomposition`(java_model)	Represents singular value decomposition (SVD) factors.
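
As a rough picture of what "coordinate format" means for `CoordinateMatrix` and `MatrixEntry`, the following pure-Python sketch expands a list of (i, j, value) entries into dense rows. It is local and does not involve RDDs or PySpark. The `Entry` tuple and `entries_to_rows` function are hypothetical stand-ins, and the sketch assumes no position appears twice.

```python
from collections import namedtuple

# Hypothetical stand-in for an (i, j, value) entry; not the PySpark class.
Entry = namedtuple("Entry", ["i", "j", "value"])


def entries_to_rows(entries, num_rows, num_cols):
    """Expand coordinate-format entries into a dense list of rows.

    Assumes each (i, j) position appears at most once.
    """
    rows = [[0.0] * num_cols for _ in range(num_rows)]
    for e in entries:
        rows[e.i][e.j] = e.value
    return rows


entries = [Entry(0, 0, 1.2), Entry(1, 2, 2.1), Entry(2, 1, 3.7)]
dense_rows = entries_to_rows(entries, num_rows=3, num_cols=3)
```

Only the nonzero positions are stored in the entry list, which is why this representation suits very sparse matrices.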

Random

Generator methods for creating RDDs composed of i.i.d. samples from some distribution.

Recommendation

`MatrixFactorizationModel`(java_model)	A matrix factorization model trained by regularized alternating least-squares.
`ALS`	Alternating Least Squares matrix factorization.
`Rating`	Represents a (user, product, rating) tuple.

Regression

`LabeledPoint`(label, features)	Class that represents the features and labels of a data point.
`LinearModel`(weights, intercept)	A linear model that has a vector of coefficients and an intercept.
`LinearRegressionModel`(weights, intercept)	A linear regression model derived from a least-squares fit.
`LinearRegressionWithSGD`	Train a linear regression model with no regularization using Stochastic Gradient Descent.
`RidgeRegressionModel`(weights, intercept)	A linear regression model derived from a least-squares fit with an l_2 penalty term.
`RidgeRegressionWithSGD`	Train a regression model with L2-regularization using Stochastic Gradient Descent.
`LassoModel`(weights, intercept)	A linear regression model derived from a least-squares fit with an l_1 penalty term.
`LassoWithSGD`	Train a regression model with L1-regularization using Stochastic Gradient Descent.
`IsotonicRegressionModel`(boundaries, …)	Regression model for isotonic regression.
`IsotonicRegression`	Isotonic regression.
`StreamingLinearAlgorithm`(model)	Base class that has to be inherited by any StreamingLinearAlgorithm.
`StreamingLinearRegressionWithSGD`([stepSize, …])	Train or predict a linear regression model on streaming data.

Statistics

`Statistics`
`MultivariateStatisticalSummary`(java_model)	Trait for multivariate statistical summary of a data matrix.
`ChiSqTestResult`(java_model)	Contains test results for the chi-squared hypothesis test.
`MultivariateGaussian`	Represents a (mu, sigma) tuple.
`KernelDensity`()	Estimate probability density at required points given an RDD of samples from the population.
`KolmogorovSmirnovTestResult`(java_model)	Contains test results for the Kolmogorov-Smirnov test.

Tree

`DecisionTreeModel`(java_model)	A decision tree model for classification or regression.
`DecisionTree`	Learning algorithm for a decision tree model for classification or regression.
`RandomForestModel`(java_model)	Represents a random forest model.
`RandomForest`	Learning algorithm for a random forest model for classification or regression.
`GradientBoostedTreesModel`(java_model)	Represents a gradient-boosted tree model.
`GradientBoostedTrees`	Learning algorithm for a gradient boosted trees model for classification or regression.

Utilities

`JavaLoader`	Mixin for classes which can load saved models using its Scala implementation.
`JavaSaveable`	Mixin for models that provide save() through their Scala implementation.
`LinearDataGenerator`	Utils for generating linear data.
`Loader`	Mixin for classes which can load saved models from files.
`MLUtils`	Helper methods to load, save and pre-process data used in MLlib.
`Saveable`	Mixin for models and transformers which may be saved as files.
