pyspark.ml.clustering.LDAModel
Latent Dirichlet Allocation (LDA) model. This abstraction permits different underlying representations, including local and distributed data structures.
New in version 2.0.0.
Methods
clear(param)
Clears a param from the param map if it has been explicitly set.
copy([extra])
Creates a copy of this instance with the same uid and some extra params.
describeTopics([maxTermsPerTopic])
Return the topics described by their top-weighted terms.
estimatedDocConcentration()
Value for LDA.docConcentration estimated from data.
explainParam(param)
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
explainParams()
Returns the documentation of all params with their optional default values and user-supplied values.
extractParamMap([extra])
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
getCheckpointInterval()
Gets the value of checkpointInterval or its default value.
getDocConcentration()
Gets the value of docConcentration or its default value.
getFeaturesCol()
Gets the value of featuresCol or its default value.
getK()
Gets the value of k or its default value.
getKeepLastCheckpoint()
Gets the value of keepLastCheckpoint or its default value.
getLearningDecay()
Gets the value of learningDecay or its default value.
getLearningOffset()
Gets the value of learningOffset or its default value.
getMaxIter()
Gets the value of maxIter or its default value.
getOptimizeDocConcentration()
Gets the value of optimizeDocConcentration or its default value.
getOptimizer()
Gets the value of optimizer or its default value.
getOrDefault(param)
Gets the value of a param in the user-supplied param map or its default value.
getParam(paramName)
Gets a param by its name.
getSeed()
Gets the value of seed or its default value.
getSubsamplingRate()
Gets the value of subsamplingRate or its default value.
getTopicConcentration()
Gets the value of topicConcentration or its default value.
getTopicDistributionCol()
Gets the value of topicDistributionCol or its default value.
hasDefault(param)
Checks whether a param has a default value.
hasParam(paramName)
Tests whether this instance contains a param with a given (string) name.
isDefined(param)
Checks whether a param is explicitly set by user or has a default value.
isDistributed()
Indicates whether this instance is of type DistributedLDAModel.
isSet(param)
Checks whether a param is explicitly set by user.
logLikelihood(dataset)
Calculates a lower bound on the log likelihood of the entire corpus.
logPerplexity(dataset)
Calculates an upper bound on perplexity.
set(param, value)
Sets a parameter in the embedded param map.
setFeaturesCol(value)
Sets the value of featuresCol.
setSeed(value)
Sets the value of seed.
setTopicDistributionCol(value)
Sets the value of topicDistributionCol.
topicsMatrix()
Inferred topics, where each topic is represented by a distribution over terms.
transform(dataset[, params])
Transforms the input dataset with optional parameters.
vocabSize()
Vocabulary size (number of terms or words in the vocabulary).
Attributes
checkpointInterval
maxIter
params
Returns all params ordered by name.
Methods Documentation
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component are copied.
Parameters: extra, the extra parameters to copy to the new instance.
Returns: JavaParams, a copy of this instance.
Value for LDA.docConcentration estimated from data. If Online LDA was used and LDA.optimizeDocConcentration was set to false, then this returns the fixed (given) value for the LDA.docConcentration parameter.
For extractParamMap, the extra argument supplies extra param values, and the return value is the merged param map.
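The precedence that extractParamMap applies (default param values < user-supplied values < extra) can be illustrated with plain dictionaries. This is a minimal sketch of the ordering only, not PySpark's implementation; the helper `merge_param_maps` and all values are hypothetical.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param dictionaries; later sources win on conflict."""
    merged = dict(defaults)         # lowest priority: default values
    merged.update(user_supplied)    # user-supplied values override defaults
    merged.update(extra or {})      # extra values override everything else
    return merged


# Hypothetical values, for illustration only.
defaults = {"k": 10, "maxIter": 20, "optimizer": "online"}
user_supplied = {"k": 5}
extra = {"maxIter": 50, "k": 3}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)  # {'k': 3, 'maxIter': 50, 'optimizer': 'online'}
```

Here `k` is defined in all three sources and the value from `extra` wins, while `optimizer` falls back to its default because nothing overrides it.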
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
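The lookup order behind getOrDefault, and its relationship to isSet, hasDefault, and isDefined, can be sketched in plain Python. The class `ParamLookup` below is a hypothetical stand-in, and raising `KeyError` is an assumption; the text above only says that an error is raised.

```python
class ParamLookup:
    """Hypothetical stand-in for a param holder with defaults and user values."""

    def __init__(self, defaults, user_supplied):
        self._defaults = dict(defaults)
        self._user = dict(user_supplied)

    def is_set(self, name):
        # True only when the user explicitly supplied a value.
        return name in self._user

    def has_default(self, name):
        return name in self._defaults

    def is_defined(self, name):
        # Defined means explicitly set or covered by a default.
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        if name in self._user:
            return self._user[name]
        if name in self._defaults:
            return self._defaults[name]
        # Assumed error type; neither a user value nor a default exists.
        raise KeyError(name)


lookup = ParamLookup(defaults={"maxIter": 20}, user_supplied={"k": 5})
k_value = lookup.get_or_default("k")              # user-supplied value
max_iter_value = lookup.get_or_default("maxIter")  # falls back to the default
```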
Calculates a lower bound on the log likelihood of the entire corpus. See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
Warning
If this model is an instance of DistributedLDAModel (produced when optimizer is set to “em”), this involves collecting a large topicsMatrix() to the driver. This implementation may be changed in the future.
Calculate an upper bound on perplexity. (Lower is better.) See Equation (16) in the Online LDA paper (Hoffman et al., 2010).
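A lower bound on the log likelihood turns into an upper bound on perplexity because of the sign flip. The sketch below assumes that log perplexity is the negated log-likelihood bound divided by the number of tokens in the corpus; that normalization is an assumption for illustration and is not stated in the text above. The function name and numbers are hypothetical.

```python
def log_perplexity_bound(log_likelihood_lower_bound, token_count):
    """Per-token log perplexity bound derived from a log-likelihood lower bound."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    # Negating a lower bound yields an upper bound.
    return -log_likelihood_lower_bound / token_count


# Hypothetical corpus with 400 tokens.
loose = log_perplexity_bound(-1200.0, 400)   # 3.0
tight = log_perplexity_bound(-1000.0, 400)   # 2.5
```

A tighter (larger) log-likelihood bound gives a smaller perplexity bound, which matches the "lower is better" remark.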
New in version 3.0.0.
Inferred topics, where each topic is represented by a distribution over terms. This is a matrix of size vocabSize x k, where each column is a topic. No guarantees are given about the ordering of the topics.
If this model is actually a DistributedLDAModel instance produced by the Expectation-Maximization (“em”) optimizer, then this method could involve collecting a large amount of data to the driver (on the order of vocabSize x k).
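The vocabSize x k layout can be made concrete with a small nested list in which rows are terms and columns are topics. This is a plain-Python sketch with hypothetical weights, not the matrix type PySpark returns; `topic_column` and `top_terms` are illustrative helpers that mimic reading one topic and ranking its top-weighted terms.

```python
# Hypothetical 4-term vocabulary and k = 2 topics.
# Rows are terms, columns are topics.
topics_matrix = [
    [0.50, 0.05],
    [0.30, 0.15],
    [0.15, 0.20],
    [0.05, 0.60],
]


def topic_column(matrix, topic):
    """Return the term weights of one topic (one column of the matrix)."""
    return [row[topic] for row in matrix]


def top_terms(matrix, topic, max_terms):
    """Return (term index, weight) pairs for the heaviest terms of a topic."""
    weights = topic_column(matrix, topic)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(i, weights[i]) for i in ranked[:max_terms]]


vocab_size = len(topics_matrix)      # number of rows
k = len(topics_matrix[0])            # number of columns
top_for_topic_1 = top_terms(topics_matrix, 1, 2)
print(top_for_topic_1)  # [(3, 0.6), (2, 0.2)]
```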
New in version 1.3.0.
Parameters: dataset (pyspark.sql.DataFrame), the input dataset; params, an optional param map that overrides embedded params.
Returns: pyspark.sql.DataFrame, the transformed dataset.
Attributes Documentation
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
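The dir()-based discovery described here can be sketched with stand-in classes. The `Param` and `ToyModel` classes below are hypothetical and are not PySpark's; the sketch only shows how scanning `dir()` for attributes of a given type produces a list ordered by name, since `dir()` returns names in sorted order.

```python
class Param:
    """Hypothetical stand-in for a param descriptor."""

    def __init__(self, name, doc):
        self.name = name
        self.doc = doc


class ToyModel:
    k = Param("k", "number of topics")
    maxIter = Param("maxIter", "maximum number of iterations")
    checkpointInterval = Param("checkpointInterval", "checkpoint interval")
    uid = "ToyModel_0001"  # not a Param, so it is skipped

    @property
    def params(self):
        # Skip "params" itself to avoid infinite recursion through the property.
        return [
            getattr(self, attr)
            for attr in dir(self)
            if attr != "params" and isinstance(getattr(self, attr), Param)
        ]


param_names = [p.name for p in ToyModel().params]
print(param_names)  # ['checkpointInterval', 'k', 'maxIter']
```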