pyspark.sql.datasource.DataSourceReader.partitions

DataSourceReader.partitions()

Returns a sequence of partitions for this data source.

Partitions are used to split data reading operations into parallel tasks. If this method returns N partitions, the query planner will create N tasks. Each task will execute DataSourceReader.read() in parallel, using the respective partition value to read the data.

This method is called once during query planning. By default, it returns a single partition with the value None. Subclasses can override this method to return multiple partitions.

It’s recommended to override this method for better performance when reading large datasets.
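For instance, a reader can parallelize a read by returning one partition per chunk of the data. A minimal sketch (the class name RangeReader and the chunking scheme are illustrative, not part of the API):

>>> from pyspark.sql.datasource import DataSourceReader, InputPartition
>>> class RangeReader(DataSourceReader):
...     def partitions(self):
...         # Three partitions -> the planner schedules three parallel tasks.
...         return [InputPartition(i) for i in range(3)]
...     def read(self, partition):
...         # Each task receives one partition; its value selects the chunk.
...         start = partition.value * 10
...         for i in range(start, start + 10):
...             yield (i,)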

Returns
sequence of InputPartitions

A sequence of partitions for this data source. Each partition value must be an instance of InputPartition or a subclass of it.

Notes

All partition values must be picklable objects.
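For example, a quick way to check picklability is to round-trip a partition value with the standard pickle module (the dict payload here is illustrative):

>>> import pickle
>>> from pyspark.sql.datasource import InputPartition
>>> part = InputPartition({"path": "part-0001", "length": 128})
>>> pickle.loads(pickle.dumps(part)).value
{'path': 'part-0001', 'length': 128}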

Examples

Returns a list of partitions with integer values:

>>> def partitions(self):
...     return [InputPartition(1), InputPartition(2), InputPartition(3)]

Returns a list of partitions with string values:

>>> def partitions(self):
...     return [InputPartition("a"), InputPartition("b"), InputPartition("c")]

Returns a list of partitions representing ranges:

>>> class RangeInputPartition(InputPartition):
...     def __init__(self, start, end):
...         self.start = start
...         self.end = end
>>> def partitions(self):
...     return [RangeInputPartition(1, 3), RangeInputPartition(5, 10)]
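
A matching read method would then consume the bounds stored on each partition (a sketch; the single-column row shape is illustrative):

>>> def read(self, partition):
...     # Iterate over the range carried by this task's partition.
...     for i in range(partition.start, partition.end + 1):
...         yield (i,)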