ChiSqSelector
- class pyspark.mllib.feature.ChiSqSelector(numTopFeatures=50, selectorType='numTopFeatures', percentile=0.1, fpr=0.05, fdr=0.05, fwe=0.05)
Creates a ChiSquared feature selector. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.
numTopFeatures chooses a fixed number of top features according to a chi-squared test.
percentile is similar but chooses a fraction of all features instead of a fixed number.
fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
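The three threshold-based rules above (fpr, fdr, fwe) can be sketched in plain Python. This is a minimal illustration, assuming the per-feature p-values have already been computed. The function names `select_fpr`, `select_fdr`, and `select_fwe` and the sample p-values are hypothetical and not part of the pyspark API, and the choice between strict and non-strict comparison is an assumption.

```python
def select_fpr(p_values, fpr=0.05):
    """Keep the indices of features whose p-value is below the threshold."""
    return [i for i, p in enumerate(p_values) if p < fpr]


def select_fwe(p_values, fwe=0.05):
    """Same rule as fpr, with the threshold scaled by 1/numFeatures."""
    threshold = fwe / len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold]


def select_fdr(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure over the sorted p-values."""
    n = len(p_values)
    # Feature indices ordered from smallest to largest p-value.
    ranked = sorted(range(n), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(ranked, start=1):
        # Remember the largest rank whose p-value passes its scaled threshold.
        if p_values[i] <= fdr * rank / n:
            cutoff = rank
    return sorted(ranked[:cutoff])


# Hypothetical p-values for four features.
p_values = [0.001, 0.02, 0.04, 0.20]

print(select_fpr(p_values))  # [0, 1, 2]
print(select_fdr(p_values))  # [0, 1]
print(select_fwe(p_values))  # [0]
```

On the same p-values, fwe is the strictest rule and fpr the most permissive, with fdr in between.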
By default, the selection method is numTopFeatures, with the default number of top features set to 50.
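The two count-based methods, numTopFeatures and percentile, can be sketched the same way. This is a minimal sketch, assuming features are ranked by ascending p-value and that the percentile count is truncated to an integer; the second assumption is consistent with the example in this section, where percentile=0.34 over three features keeps one feature. The helper names `select_top_k` and `select_percentile` are hypothetical.

```python
def select_top_k(p_values, num_top_features=50):
    """Keep the indices of the num_top_features smallest p-values."""
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    return sorted(ranked[:num_top_features])


def select_percentile(p_values, percentile=0.1):
    """Keep a fraction of all features instead of a fixed number."""
    # Assumption: the resulting count is truncated toward zero.
    k = int(len(p_values) * percentile)
    return select_top_k(p_values, k)


# Hypothetical p-values for three features.
p_values = [0.30, 0.01, 0.20]

print(select_top_k(p_values, 1))          # [1]
print(select_percentile(p_values, 0.34))  # [1]
print(select_top_k(p_values))             # [0, 1, 2]
```

With the default of 50 and fewer than 50 features, every feature is kept.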
New in version 1.4.0.
Examples
>>> from pyspark.mllib.feature import ChiSqSelector
>>> from pyspark.mllib.linalg import SparseVector, DenseVector
>>> from pyspark.mllib.regression import LabeledPoint
>>> data = sc.parallelize([
...     LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
...     LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
...     LabeledPoint(1.0, [0.0, 9.0, 8.0]),
...     LabeledPoint(2.0, [7.0, 9.0, 5.0]),
...     LabeledPoint(2.0, [8.0, 7.0, 3.0])
... ])
>>> model = ChiSqSelector(numTopFeatures=1).fit(data)
>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
SparseVector(1, {})
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
>>> model = ChiSqSelector(selectorType="fpr", fpr=0.2).fit(data)
>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
SparseVector(1, {})
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
>>> model = ChiSqSelector(selectorType="percentile", percentile=0.34).fit(data)
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
Methods
fit(data): Returns a ChiSquared feature selector.
setFdr(fdr): Sets fdr, a value in [0.0, 1.0], for feature selection by FDR.
setFpr(fpr): Sets fpr, a value in [0.0, 1.0], for feature selection by FPR.
setFwe(fwe): Sets fwe, a value in [0.0, 1.0], for feature selection by FWE.
setNumTopFeatures(numTopFeatures): Sets numTopFeatures for feature selection by number of top features.
setPercentile(percentile): Sets percentile, a value in [0.0, 1.0], for feature selection by percentile.
setSelectorType(selectorType): Sets the selector type of the ChiSqSelector.
Methods Documentation
- fit(data)
Returns a ChiSquared feature selector.
New in version 1.4.0.
- Parameters
- data: pyspark.RDD of pyspark.mllib.regression.LabeledPoint
containing the labeled dataset with categorical features. Real-valued features are treated as categorical, with each distinct value counted as its own category, so apply a feature discretizer before using this function.
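Because each distinct value counts as a separate category, continuous features are usually binned before fitting. The following is a minimal sketch of equal-width binning in plain Python. The helper `discretize` and its parameters are hypothetical and not part of pyspark; in practice a library discretizer would take its place.

```python
def discretize(values, num_bins, low, high):
    """Map each real value to the index of an equal-width bin over [low, high]."""
    width = (high - low) / num_bins
    bins = []
    for v in values:
        index = int((v - low) / width)
        # Clamp so that out-of-range values and v == high land in a valid bin.
        index = min(max(index, 0), num_bins - 1)
        bins.append(float(index))
    return bins


# Five distinct real values collapse into three categories.
print(discretize([0.5, 2.4, 2.6, 9.9, 10.0], num_bins=4, low=0.0, high=10.0))
# [0.0, 0.0, 1.0, 3.0, 3.0]
```

The binned values can then be used as the feature values of each LabeledPoint, so that the chi-squared test sees a small number of categories instead of one per distinct value.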
- setFdr(fdr)
Sets fdr, a value in [0.0, 1.0], for feature selection by FDR. Only applicable when selectorType = "fdr".
New in version 2.2.0.
- setFpr(fpr)
Sets fpr, a value in [0.0, 1.0], for feature selection by FPR. Only applicable when selectorType = "fpr".
New in version 2.1.0.
- setFwe(fwe)
Sets fwe, a value in [0.0, 1.0], for feature selection by FWE. Only applicable when selectorType = "fwe".
New in version 2.2.0.
- setNumTopFeatures(numTopFeatures)
Sets numTopFeatures for feature selection by number of top features. Only applicable when selectorType = "numTopFeatures".
New in version 2.1.0.