pyspark.RDD.distinct
RDD.distinct(numPartitions: Optional[int] = None) → pyspark.rdd.RDD[T]
Return a new RDD containing the distinct elements in this RDD.
New in version 0.7.0.
- Parameters
- numPartitions : int, optional
the number of partitions in the new RDD
- Returns
- pyspark.rdd.RDD[T]
a new RDD containing the distinct elements of this RDD
See also
Examples
>>> sorted(sc.parallelize([1, 1, 2, 3]).distinct().collect())
[1, 2, 3]
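The role of numPartitions can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark's implementation: the helper name `distinct_sketch`, the use of Python's built-in `hash` for partitioning, and the default of one partition are all assumptions made for illustration. It shows why hashing each element to a partition is enough to remove duplicates, since equal elements always land in the same partition.

```python
from typing import Dict, Hashable, Iterable, List, Optional, TypeVar

T = TypeVar("T", bound=Hashable)


def distinct_sketch(
    elements: Iterable[T], num_partitions: Optional[int] = None
) -> List[List[T]]:
    """Hypothetical model of distinct(): returns one list per output partition."""
    # Assumption: fall back to a single partition when none is requested.
    # Spark chooses its own default; this sketch does not try to mimic it.
    n = num_partitions if num_partitions is not None else 1
    if n < 1:
        raise ValueError("num_partitions must be a positive integer")

    # One dict per partition: keys are unique, so duplicates are dropped.
    partitions: List[Dict[T, None]] = [dict() for _ in range(n)]
    for x in elements:
        # Equal elements have equal hashes, so every copy of a value
        # is routed to the same partition and collapses into one key.
        partitions[hash(x) % n][x] = None

    return [list(p) for p in partitions]


parts = distinct_sketch([1, 1, 2, 3], num_partitions=2)
# Flatten and sort, mirroring sorted(...collect()) in the doctest.
flat = sorted(x for part in parts for x in part)
```

In this sketch `flat` holds each value once, and `parts` has exactly `num_partitions` entries. The order of elements inside and across partitions depends on the hash function, which is why the doctest sorts the collected result before comparing it.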