pyspark.RDD.repartition

RDD.repartition(numPartitions: int) → pyspark.rdd.RDD[T]

Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
New in version 1.0.0.
Parameters
    numPartitions : int
        the number of partitions in the new RDD

Returns
    RDD
        a new RDD with exactly numPartitions partitions
Examples
>>> rdd = sc.parallelize([1,2,3,4,5,6,7], 4)
>>> sorted(rdd.glom().collect())
[[1], [2, 3], [4, 5], [6, 7]]
>>> len(rdd.repartition(2).glom().collect())
2
>>> len(rdd.repartition(10).glom().collect())
10
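
As noted above, coalesce can avoid a shuffle when only decreasing the number of partitions. A minimal sketch reusing the rdd from the example above (which starts with 4 partitions); without shuffle=True, coalesce cannot increase the partition count:

>>> len(rdd.coalesce(2).glom().collect())
2
>>> len(rdd.coalesce(10).glom().collect())  # capped at the existing 4 partitions
4
>>> len(rdd.coalesce(10, shuffle=True).glom().collect())
10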