pyspark.RDD.repartition

RDD.repartition(numPartitions: int) → pyspark.rdd.RDD[T]

Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
New in version 1.0.0.
Parameters
    numPartitions : int
        the number of partitions in the new RDD

Returns
    RDD
        a new RDD with exactly numPartitions partitions
Examples
>>> rdd = sc.parallelize([1,2,3,4,5,6,7], 4)
>>> sorted(rdd.glom().collect())
[[1], [2, 3], [4, 5], [6, 7]]
>>> len(rdd.repartition(2).glom().collect())
2
>>> len(rdd.repartition(10).glom().collect())
10
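
As noted above, coalesce can avoid a shuffle when only decreasing the number of partitions. A minimal sketch reusing the rdd from the example above (which starts with 4 partitions); without shuffle=True, coalesce cannot increase the partition count:

>>> len(rdd.coalesce(2).glom().collect())
2
>>> len(rdd.coalesce(10).glom().collect())  # capped at the existing 4 partitions
4
>>> len(rdd.coalesce(10, shuffle=True).glom().collect())
10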