pyspark.mllib.clustering.LDAModel
A clustering model derived from the LDA method.
Latent Dirichlet Allocation (LDA): a topic model designed for text documents.

Terminology
“word” = “term”: an element of the vocabulary
“token”: instance of a term appearing in a document
“topic”: multinomial distribution over words representing some concept
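The terminology above can be illustrated with a minimal, Spark-free sketch: each occurrence of a term in a document is a token, and a document is summarized as a vector of term counts over the vocabulary, which is the input format LDA works on. The document and vocabulary here are invented for illustration.

```python
# Plain-Python illustration of the terminology (no Spark required).
doc = "apple banana apple"    # three tokens
vocab = ["apple", "banana"]   # two terms in the vocabulary

# A document becomes a term-count vector over the vocabulary.
counts = [doc.split().count(term) for term in vocab]
print(counts)  # [2, 1]
```

In the doctest below, these counts are what `Vectors.dense` and `SparseVector` carry for each document.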
New in version 1.5.0.
Notes
See the original LDA paper (journal version) [1].

[1] Blei, D. et al. “Latent Dirichlet Allocation.” J. Mach. Learn. Res. 3 (2003): 993-1022. https://www.jmlr.org/papers/v3/blei03a
Examples
>>> from pyspark.mllib.linalg import Vectors, SparseVector
>>> from numpy.testing import assert_almost_equal, assert_equal
>>> data = [
...     [1, Vectors.dense([0.0, 1.0])],
...     [2, SparseVector(2, {0: 1.0})],
... ]
>>> rdd = sc.parallelize(data)
>>> model = LDA.train(rdd, k=2, seed=1)
>>> model.vocabSize()
2
>>> model.describeTopics()
[([1, 0], [0.5..., 0.49...]), ([0, 1], [0.5..., 0.49...])]
>>> model.describeTopics(1)
[([1], [0.5...]), ([0], [0.5...])]
>>> from numpy import array
>>> topics = model.topicsMatrix()
>>> topics_expect = array([[0.5, 0.5], [0.5, 0.5]])
>>> assert_almost_equal(topics, topics_expect, 1)
>>> import os, tempfile
>>> from shutil import rmtree
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = LDAModel.load(sc, path)
>>> assert_equal(sameModel.topicsMatrix(), model.topicsMatrix())
>>> sameModel.vocabSize() == model.vocabSize()
True
>>> try:
...     rmtree(path)
... except OSError:
...     pass
Methods

call(name, *a)
    Call method of java_model.

describeTopics([maxTermsPerTopic])
    Return the topics described by weighted terms.

load(sc, path)
    Load the LDAModel from disk.

save(sc, path)
    Save this model to the given path.

topicsMatrix()
    Inferred topics, where each topic is represented by a distribution over terms.

vocabSize()
    Vocabulary size (number of terms in the vocabulary).
Methods Documentation

call(name, *a)

Call method of java_model.

describeTopics(maxTermsPerTopic=None)

Return the topics described by weighted terms.

New in version 1.6.0.

Warning: If vocabSize and k are large, this can return a large object!

Parameters
    maxTermsPerTopic : int, optional
        Maximum number of terms to collect for each topic. (default: vocabulary size)

Returns
    list
        Array over topics. Each topic is represented as a pair of matching arrays: (term indices, term weights in topic). Each topic’s terms are sorted in order of decreasing weight.
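As a sketch of how such a result can be read back, the snippet below unpacks a list shaped like the `describeTopics()` output shown in the doctest above and maps term indices to words. The `topics` values and the `vocab` list are invented for illustration, not produced by Spark.

```python
# Each topic is a (term indices, term weights) pair, weights in
# decreasing order; shape mirrors the describeTopics() doctest output.
topics = [([1, 0], [0.51, 0.49]), ([0, 1], [0.51, 0.49])]
vocab = ["apple", "banana"]  # assumed, illustrative vocabulary

for i, (indices, weights) in enumerate(topics):
    terms = [vocab[j] for j in indices]
    print(f"topic {i}: {list(zip(terms, weights))}")
    # topic 0: [('banana', 0.51), ('apple', 0.49)]  (first iteration)
```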
load(sc, path)

Load the LDAModel from disk.

Parameters
    sc : pyspark.SparkContext
    path : str
        Path to where the model is stored.

save(sc, path)

Save this model to the given path.

New in version 1.3.0.