MLlib (DataFrame-based) for Spark Connect#

Warning

The namespace for this package may change in a future Spark version.

Pipeline APIs#

`Transformer`()	Abstract class for transformers that transform one dataset into another.
`Estimator`()	Abstract class for estimators that fit models to data.
`Model`()	Abstract class for models that are fitted by estimators.
`Evaluator`()	Base class for evaluators that compute metrics from predictions.
`Pipeline`(*[, stages])	A simple pipeline, which acts as an estimator.
`PipelineModel`([stages])	Represents a compiled pipeline with transformers and fitted models.
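
The relationship between the classes listed above can be illustrated with a small, pure-Python analogy. This is only a sketch of the pattern the descriptions state (transformers map one dataset to another, estimators fit models, a pipeline acts as an estimator); every class name below (`AddConstant`, `MeanCenterer`, `ToyPipeline`, `ToyPipelineModel`) is hypothetical and is not part of the Spark API, and plain lists stand in for datasets.

```python
class AddConstant:
    """Toy transformer: maps one dataset (a list of numbers) to another."""

    def __init__(self, constant):
        self.constant = constant

    def transform(self, data):
        return [x + self.constant for x in data]


class MeanCenterer:
    """Toy estimator: fit() learns the mean and returns a fitted model."""

    def fit(self, data):
        # The fitted "model" is itself a transformer that subtracts the mean.
        return AddConstant(-sum(data) / len(data))


class ToyPipelineModel:
    """Toy fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data


class ToyPipeline:
    """Toy pipeline: behaves like an estimator because it exposes fit()."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the data produced by earlier stages;
            # plain transformers are used as they are.
            if hasattr(stage, "fit"):
                stage = stage.fit(data)
            fitted.append(stage)
            data = stage.transform(data)
        return ToyPipelineModel(fitted)


pipeline = ToyPipeline([AddConstant(10.0), MeanCenterer()])
model = pipeline.fit([1.0, 2.0, 3.0])
result = model.transform([1.0, 2.0, 3.0])
print(result)  # [-1.0, 0.0, 1.0]
```

Fitting returns a separate model object rather than mutating the pipeline, which mirrors the split between `Pipeline` and `PipelineModel` in the listing.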

Feature#

`MaxAbsScaler`(*[, inputCol, outputCol])	Rescales each feature individually to the range [-1, 1] by dividing by the largest absolute value of that feature.
`MaxAbsScalerModel`([max_abs_values, ...])	Model fitted by MaxAbsScaler.
`StandardScaler`([inputCol, outputCol])	Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
`StandardScalerModel`([mean_values, ...])	Model fitted by StandardScaler.
`ArrayAssembler`(*[, inputCols, outputCol, ...])	A feature transformer that merges multiple input columns into an array type column.
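
The arithmetic behind the two scalers can be sketched in plain Python. This is not Spark's implementation: the helper names `max_abs_scale` and `standard_scale` are hypothetical, rows are plain lists rather than DataFrame columns, and two choices are assumptions made for this sketch: the sample standard deviation (n - 1 denominator) is used for "unit variance", and constant features are mapped to 0.0 to avoid division by zero.

```python
from statistics import mean, stdev


def max_abs_scale(rows):
    """Divide each feature by its largest absolute value across all rows."""
    n_features = len(rows[0])
    max_abs = [max(abs(row[j]) for row in rows) for j in range(n_features)]
    # Assumption: an all-zero feature is mapped to 0.0 rather than dividing by zero.
    return [
        [row[j] / max_abs[j] if max_abs[j] else 0.0 for j in range(n_features)]
        for row in rows
    ]


def standard_scale(rows):
    """Subtract each feature's mean, then divide by its standard deviation."""
    n_features = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_features)]
    means = [mean(col) for col in columns]
    # Assumption: sample standard deviation (n - 1 denominator).
    stds = [stdev(col) for col in columns]
    return [
        [(row[j] - means[j]) / stds[j] if stds[j] else 0.0 for j in range(n_features)]
        for row in rows
    ]


rows = [[1.0, -8.0], [2.0, 4.0], [-4.0, 0.0]]
scaled = max_abs_scale(rows)
print(scaled)  # [[0.25, -1.0], [0.5, 0.5], [-1.0, 0.0]]
standardized = standard_scale(rows)
```

In both cases the statistics (maximum absolute value, mean, standard deviation) are computed once from the training rows, which is what the fitted `MaxAbsScalerModel` and `StandardScalerModel` would carry forward to new data.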

Classification#

`LogisticRegression`(*[, featuresCol, ...])	Logistic regression estimator.
`LogisticRegressionModel`([torch_model, ...])	Model fitted by LogisticRegression.

Functions#

`array_to_vector`(col)	Converts a column of numeric arrays into a column of pyspark.ml.linalg.DenseVector instances.
`vector_to_array`(col[, dtype])	Converts a column of MLlib sparse/dense vectors into a column of dense arrays.
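
As a conceptual illustration of what converting a sparse vector to a dense array involves, the sketch below expands a (size, indices, values) triple into a full list. The function `sparse_to_dense` is hypothetical and works on plain Python values; it is not the Spark function and does not operate on columns.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        # Positions not listed in `indices` stay at 0.0.
        dense[index] = float(value)
    return dense


dense = sparse_to_dense(5, [1, 3], [2.5, -1.0])
print(dense)  # [0.0, 2.5, 0.0, -1.0, 0.0]
```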

Tuning#

`CrossValidator`(*[, estimator, ...])	K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping, randomly partitioned folds, which are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing.
`CrossValidatorModel`([bestModel, avgMetrics, ...])	CrossValidatorModel contains the model with the highest average cross-validation metric across folds and uses this model to transform input data.
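
The fold arithmetic described for `CrossValidator` can be checked with a short, pure-Python sketch. The helper `k_fold_splits` is hypothetical and operates on row indices only; how Spark actually assigns rows to folds is not shown here.

```python
import random


def k_fold_splits(n_rows, k, seed=0):
    """Return k (train_indices, test_indices) pairs over non-overlapping folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # random partitioning
    # Every k-th shuffled index goes to the same fold, so folds never overlap.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_splits(n_rows=9, k=3)
for train, test in splits:
    print(len(train), len(test))  # 6 3 on each of the three lines
```

With 9 rows and k=3, each pair trains on 6 rows (2/3) and tests on 3 rows (1/3), and every row appears in exactly one test fold.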

Evaluation#

`RegressionEvaluator`(*[, metricName, ...])	Evaluator for Regression, which expects input columns prediction and label.
`BinaryClassificationEvaluator`(*[, ...])	Evaluator for binary classification, which expects input columns prediction and label.
`MulticlassClassificationEvaluator`([...])	Evaluator for multiclass classification, which expects input columns prediction and label.

Utilities#

`ParamsReadWrite`()	The base interface that Estimator, Transformer, Model, and Evaluator classes must inherit to support saving and loading.
`CoreModelReadWrite`()
`MetaAlgorithmReadWrite`()	Meta-algorithms such as Pipeline and CrossValidator must implement this interface.