pyspark.RDD.distinct

RDD.distinct(numPartitions: Optional[int] = None) → pyspark.rdd.RDD[T]

Return a new RDD containing the distinct elements in this RDD.

New in version 0.7.0.

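Parameters

    numPartitions : int, optional
        The number of partitions in the new RDD.

Returns

    RDD
        A new RDD containing the distinct elements.

See also

    RDD.countApproxDistinct()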

Examples

>>> sorted(sc.parallelize([1, 1, 2, 3]).distinct().collect())
[1, 2, 3]
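
The doctest above sorts the collected result because distinct() does not promise any particular element order. To illustrate why, here is a minimal pure-Python sketch of hash-based deduplication across partitions. It is not Spark's implementation: the function name `distinct_sketch` and the list-of-lists stand-in for an RDD are assumptions made for illustration only, and elements are assumed to be hashable.

```python
from typing import Hashable, List, Optional


def distinct_sketch(
    partitions: List[List[Hashable]],
    num_partitions: Optional[int] = None,
) -> List[List[Hashable]]:
    """Hypothetical stand-in for distinct(): dedupe a list of partitions."""
    # Assumption: default to the input partition count when none is given.
    if num_partitions is None:
        num_partitions = max(len(partitions), 1)

    # One dict per output partition; dict keys act as an insertion-ordered set.
    buckets = [dict() for _ in range(num_partitions)]

    # "Shuffle": equal elements hash to the same bucket, so duplicates
    # from different input partitions collapse into a single key.
    for part in partitions:
        for x in part:
            buckets[hash(x) % num_partitions].setdefault(x, None)

    return [list(b) for b in buckets]


parts = distinct_sketch([[1, 1, 2], [3, 1]], num_partitions=2)

# Element order depends on hashing and partitioning, so sort before comparing.
flat = sorted(x for p in parts for x in p)
print(flat)
```

In this sketch the output order follows the bucket layout rather than the input order, which mirrors why the result of distinct() is usually sorted or otherwise normalized before it is compared in tests.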
