pyspark.sql.DataFrame.repartitionByRange
- DataFrame.repartitionByRange(numPartitions, *cols)
Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is range partitioned.
New in version 2.4.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- numPartitions : int
can be an int to specify the target number of partitions or a Column. If it is a Column, it will be used as the first partitioning column. If not specified, the default number of partitions is used (see the sketch below).
- cols : str or Column
partitioning columns.
- Returns
DataFrame
Repartitioned DataFrame.
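When a Column is passed as the first argument, it becomes the first partitioning column and the default number of partitions is used. A minimal sketch, assuming an existing DataFrame df with an age column:

>>> from pyspark.sql import functions as sf
>>> df2 = df.repartitionByRange(sf.col("age"))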
Notes
At least one partition-by expression must be specified. When no explicit sort order is specified, “ascending nulls first” is assumed.
For performance reasons, this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config spark.sql.execution.rangeExchange.sampleSizePerPartition.
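For more stable range boundaries, the per-partition sample size can be raised; a minimal sketch (the value 200 is purely illustrative, not a recommendation):

>>> spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", 200)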
Examples
Repartition the data into 2 partitions by range in the ‘age’ column. For example, the first partition can have (14, "Tom") and (16, "Bob"), and the second partition would have (23, "Alice").

>>> from pyspark.sql import functions as sf
>>> spark.createDataFrame(
...     [(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"]
... ).repartitionByRange(2, "age").select(
...     "age", "name", sf.spark_partition_id()
... ).show()
+---+-----+--------------------+
|age| name|SPARK_PARTITION_ID()|
+---+-----+--------------------+
| 14|  Tom|                   0|
| 16|  Bob|                   0|
| 23|Alice|                   1|
+---+-----+--------------------+
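Because the default sort order is ascending nulls first, a descending range partitioning can be requested by passing a Column expression instead of a column name. A minimal sketch reusing the data above (partition contents are omitted here, since the exact row-to-partition assignment follows from the sampled range boundaries):

>>> df = spark.createDataFrame(
...     [(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"]
... )
>>> df.repartitionByRange(2, sf.col("age").desc()).rdd.getNumPartitions()
2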