pyspark.sql.datasource.DataSourceReader.partitions
- DataSourceReader.partitions()
Returns a sequence of partitions for this data source.
Partitions are used to split data reading operations into parallel tasks. If this method returns N partitions, the query planner will create N tasks. Each task executes DataSourceReader.read() in parallel, using its respective partition value to read the data.
This method is called once during query planning. By default, it returns a single partition with the value None. Subclasses can override this method to return multiple partitions. Overriding it is recommended for better performance when reading large datasets.
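To illustrate the contract, here is a minimal sketch of a reader whose three partition values fan out into three parallel read tasks. The class name ShardedReader and the row layout are illustrative, not part of the API:

>>> from pyspark.sql.datasource import DataSourceReader, InputPartition
>>> class ShardedReader(DataSourceReader):
...     def partitions(self):
...         # Called once during planning; the planner creates one task per value.
...         return [InputPartition(0), InputPartition(1), InputPartition(2)]
...
...     def read(self, partition):
...         # Called once per task, receiving the matching partition value.
...         shard_id = partition.value
...         # Yield rows as tuples matching the data source's schema
...         # (here assumed to be "shard int, payload string").
...         yield (shard_id, f"row-from-shard-{shard_id}")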
- Returns
- sequence of InputPartitions
A sequence of partitions for this data source. Each partition value must be an instance of InputPartition or a subclass of it.
Notes
All partition values must be picklable objects.
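Because partition values are pickled before being shipped to executor tasks, a quick round-trip check such as the following sketch can catch unpicklable state early (the dict payload is illustrative). Resources like open files or database connections should not be stored on a partition; carry a path or connection string instead and open the resource inside read():

>>> import pickle
>>> from pyspark.sql.datasource import InputPartition
>>> part = InputPartition({"path": "/data/part-0"})
>>> restored = pickle.loads(pickle.dumps(part))
>>> assert restored.value == {"path": "/data/part-0"}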
Examples
Returns a list of integers:
>>> def partitions(self):
...     return [InputPartition(1), InputPartition(2), InputPartition(3)]
Returns a list of strings:
>>> def partitions(self): ... return [InputPartition("a"), InputPartition("b"), InputPartition("c")]
Returns a list of ranges:
>>> class RangeInputPartition(InputPartition):
...     def __init__(self, start, end):
...         self.start = start
...         self.end = end
>>> def partitions(self):
...     return [RangeInputPartition(1, 3), RangeInputPartition(5, 10)]
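For context, the range example above can be wired into a complete data source. The following is a minimal sketch, assuming the DataSource and DataSourceReader classes from pyspark.sql.datasource and registration via spark.dataSource.register; the names RangeReader, RangeDataSource, and the format name "range_demo" are illustrative:

>>> from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition
>>> class RangeReader(DataSourceReader):
...     def partitions(self):
...         # Two partitions -> the planner schedules two parallel read tasks.
...         return [RangeInputPartition(1, 3), RangeInputPartition(5, 10)]
...
...     def read(self, partition):
...         # Each task receives one RangeInputPartition and yields its rows.
...         for value in range(partition.start, partition.end + 1):
...             yield (value,)
>>> class RangeDataSource(DataSource):
...     @classmethod
...     def name(cls):
...         return "range_demo"  # illustrative format name
...
...     def schema(self):
...         return "value int"
...
...     def reader(self, schema):
...         return RangeReader()

Usage, assuming an active SparkSession named spark:

>>> spark.dataSource.register(RangeDataSource)
>>> spark.read.format("range_demo").load().show()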