LDA
- class pyspark.mllib.clustering.LDA
Train a Latent Dirichlet Allocation (LDA) model.
New in version 1.5.0.
Methods
train(rdd[, k, maxIterations, ...]): Train an LDA model.
Methods Documentation
- classmethod train(rdd, k=10, maxIterations=20, docConcentration=-1.0, topicConcentration=-1.0, seed=None, checkpointInterval=10, optimizer='em')
Train an LDA model.
New in version 1.5.0.
- Parameters
- rdd
pyspark.RDD
RDD of documents, which are tuples of document IDs and term (word) count vectors. The term count vectors are “bags of words” with a fixed-size vocabulary (where the vocabulary size is the length of the vector). Document IDs must be unique and >= 0.
- k
int, optional
Number of topics to infer, i.e., the number of soft cluster centers. (default: 10)
- maxIterations
int, optional
Maximum number of iterations allowed. (default: 20)
- docConcentration
float, optional
Concentration parameter (commonly named “alpha”) for the prior placed on documents’ distributions over topics (“theta”). (default: -1.0, which selects a value automatically)
- topicConcentration
float, optional
Concentration parameter (commonly named “beta” or “eta”) for the prior placed on topics’ distributions over terms. (default: -1.0, which selects a value automatically)
- seed
int, optional
Random seed for cluster initialization. Set to None to generate a seed from the system time. (default: None)
- checkpointInterval
int, optional
Period (in iterations) between checkpoints. (default: 10)
- optimizer
str, optional
LDAOptimizer used to perform the actual calculation. Currently “em” and “online” are supported. (default: “em”)
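The input format described for rdd can be illustrated with a minimal sketch. The helper names to_term_count_rows and train_lda, and the toy corpus, are hypothetical and not part of PySpark; the sketch assumes an existing SparkContext is passed in as sc, and it only calls LDA.train with the parameters listed above.

```python
from collections import Counter


def to_term_count_rows(tokenized_docs):
    """Turn tokenized documents into (doc_id, counts) rows over a fixed vocabulary.

    Hypothetical helper for illustration; not part of PySpark.
    """
    # Sort the vocabulary so every vector has the same length and term order.
    vocab = sorted({term for doc in tokenized_docs for term in doc})
    rows = []
    for doc_id, doc in enumerate(tokenized_docs):  # IDs are unique and >= 0
        counts = Counter(doc)
        rows.append((doc_id, [float(counts[term]) for term in vocab]))
    return vocab, rows


def train_lda(sc, rows, k=2):
    """Train an LDA model on rows from to_term_count_rows.

    Hypothetical wrapper; `sc` is assumed to be a live SparkContext.
    """
    # Imported lazily so the helper above works without Spark installed.
    from pyspark.mllib.clustering import LDA
    from pyspark.mllib.linalg import Vectors

    # Each document is a [document ID, term count vector] pair.
    rdd = sc.parallelize(
        [[doc_id, Vectors.dense(counts)] for doc_id, counts in rows]
    )
    return LDA.train(rdd, k=k, maxIterations=20, seed=1, optimizer="em")


# Toy corpus, made up for this example.
docs = [["spark", "rdd", "spark"], ["topic", "model", "topic"], ["spark", "topic"]]
vocab, rows = to_term_count_rows(docs)
```

With a running SparkContext, train_lda(sc, rows) would return the trained model; fixing seed is meant to make runs easier to compare, and k=2 is an arbitrary choice for the toy corpus.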