pyspark.sql.datasource.DataSourceStreamReader.latestOffset#

DataSourceStreamReader.latestOffset(start, limit)[source]#

Returns the most recent offset available given a read limit. The start offset can be used to figure out how much new data should be read given the limit.

The start offset is the return value of initialOffset() for the very first micro-batch; for subsequent micro-batches, it is the ending offset of the previous micro-batch. If there is no new data to process, the source can return the start parameter as is.

ReadLimit can be used by the source to limit the amount of data returned in this call. If the source can limit the amount of data returned based on its options, it should implement getDefaultReadLimit() to provide the proper ReadLimit.

Even if the source produces a different read limit from getDefaultReadLimit(), the engine can still call latestOffset() with ReadAllAvailable to respect the semantics of the trigger. The source must always respect the readLimit provided by the engine; e.g., if the readLimit is ReadAllAvailable, the source must ignore any read limit configured through options.

New in version 4.2.0.

Parameters
start : dict

The start offset of the microbatch to continue reading from.

limit : ReadLimit

The limit on the amount of data to be returned by this call.

Returns
dict

A dict, possibly nested, whose keys and values are of primitive types, i.e. integer, string, and boolean.

Examples

>>> from pyspark.sql.streaming.datasource import ReadAllAvailable, ReadMaxRows
>>> def latestOffset(self, start, limit):
...     # Assume the source has 10 new records between start and the latest offset.
...     if isinstance(limit, ReadAllAvailable):
...         return {"index": start["index"] + 10}
...     else:  # e.g., limit is ReadMaxRows(5)
...         return {"index": start["index"] + min(10, limit.maxRows)}
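
The offset-chaining contract described above (start of each micro-batch = end of the previous one, and returning the start offset unchanged when there is no new data) can be sketched with a hypothetical in-memory source. `SimpleSource` and the two stand-in limit classes below are local illustrations only, not the actual PySpark classes, which live in pyspark.sql.streaming.datasource:

```python
# Local stand-ins for the ReadLimit types, for illustration only.
class ReadAllAvailable:
    pass


class ReadMaxRows:
    def __init__(self, maxRows):
        self.maxRows = maxRows


class SimpleSource:
    """Hypothetical source that knows how many records exist in total."""

    def __init__(self, total_records):
        self.total_records = total_records

    def initialOffset(self):
        return {"index": 0}

    def latestOffset(self, start, limit):
        available = self.total_records - start["index"]
        if available <= 0:
            # No new data to process: return the start offset as is.
            return start
        if isinstance(limit, ReadMaxRows):
            # Honor the per-batch row cap provided by the engine.
            available = min(available, limit.maxRows)
        return {"index": start["index"] + available}


# The engine chains offsets: each micro-batch starts where the last ended.
source = SimpleSource(total_records=12)
offset = source.initialOffset()                            # {"index": 0}
offset = source.latestOffset(offset, ReadMaxRows(5))       # {"index": 5}
offset = source.latestOffset(offset, ReadMaxRows(5))       # {"index": 10}
offset = source.latestOffset(offset, ReadAllAvailable())   # {"index": 12}
offset = source.latestOffset(offset, ReadAllAvailable())   # unchanged: no new data
print(offset)
```

Note that once the source is exhausted, repeated calls keep returning the same offset, so the engine plans empty micro-batches until new data arrives.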