pyspark.sql.datasource.DataSourceStreamReader.latestOffset#
- DataSourceStreamReader.latestOffset(start, limit)[source]#
Returns the most recent offset available given a read limit. The start offset can be used to figure out how much new data should be read given the limit.
The start offset will be the return value of initialOffset() for the very first micro-batch; for subsequent micro-batches, it is the ending offset of the previous micro-batch. The source may return the start parameter unchanged if there is no data to process.
The ReadLimit can be used by the source to limit the amount of data returned in this call. The implementation should override getDefaultReadLimit() to provide the proper ReadLimit if the source can limit the amount of data returned based on the source options.
The engine can still call latestOffset() with ReadAllAvailable even if the source provides a different read limit from getDefaultReadLimit(), to respect the semantics of the trigger. The source must always respect the readLimit provided by the engine; e.g., if the readLimit is ReadAllAvailable, the source must ignore any read limit configured through options.
New in version 4.2.0.
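As a minimal sketch of the getDefaultReadLimit() contract described above, a source whose options cap rows per batch might look like the following. The OptionCappedReader class and the maxRowsPerBatch option are hypothetical, and ReadAllAvailable/ReadMaxRows are stand-ins for the real classes in pyspark.sql.streaming.datasource so the snippet runs without a Spark installation:

```python
class ReadAllAvailable:
    """Stand-in for pyspark's ReadAllAvailable (no limit)."""


class ReadMaxRows:
    """Stand-in for pyspark's ReadMaxRows (cap on rows per micro-batch)."""

    def __init__(self, maxRows):
        self.maxRows = maxRows


class OptionCappedReader:
    """Hypothetical stream reader that derives its default read limit
    from a source option."""

    def __init__(self, options):
        # options is a plain dict here; e.g. {"maxRowsPerBatch": "100"}
        self.options = options

    def getDefaultReadLimit(self):
        # Translate the source option into a ReadLimit; fall back to
        # "read everything" when no cap is configured.
        cap = self.options.get("maxRowsPerBatch")
        return ReadMaxRows(int(cap)) if cap else ReadAllAvailable()


reader = OptionCappedReader({"maxRowsPerBatch": "100"})
limit = reader.getDefaultReadLimit()
print(type(limit).__name__)  # ReadMaxRows
```

The engine would then pass this ReadLimit back into latestOffset(), except when the trigger semantics require ReadAllAvailable instead.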
- Parameters
- startdict
The start offset of the microbatch to continue reading from.
- limitReadLimit
The limit on the amount of data to be returned by this call.
- Returns
- dict
A dict, possibly nested, whose keys and values are primitive types (integer, string, or boolean).
Examples
>>> from pyspark.sql.streaming.datasource import ReadAllAvailable, ReadMaxRows
>>> def latestOffset(self, start, limit):
...     # Assume the source has 10 new records between start and the latest offset
...     if isinstance(limit, ReadAllAvailable):
...         return {"index": start["index"] + 10}
...     else:  # e.g., limit is ReadMaxRows(5)
...         return {"index": start["index"] + min(10, limit.maxRows)}
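To show how the engine drives latestOffset() across micro-batches, here is a runnable sketch with a hypothetical CounterStreamReader; the ReadAllAvailable/ReadMaxRows classes below are simplified stand-ins for the real ones in pyspark.sql.streaming.datasource so the example runs without Spark:

```python
class ReadAllAvailable:
    """Stand-in for pyspark's ReadAllAvailable (no limit)."""


class ReadMaxRows:
    """Stand-in for pyspark's ReadMaxRows (cap on rows per micro-batch)."""

    def __init__(self, maxRows):
        self.maxRows = maxRows


class CounterStreamReader:
    """Hypothetical reader over a source with a known backlog of records."""

    def __init__(self, records_available):
        # Pretend the external source holds this many unread records
        # per micro-batch.
        self.records_available = records_available

    def initialOffset(self):
        # Start offset for the very first micro-batch.
        return {"index": 0}

    def latestOffset(self, start, limit):
        if isinstance(limit, ReadAllAvailable):
            advance = self.records_available
        else:
            advance = min(self.records_available, limit.maxRows)
        # Returning start unchanged (advance == 0) signals no new data.
        return {"index": start["index"] + advance}


# Simulate two micro-batches as the engine would drive them.
reader = CounterStreamReader(records_available=10)
start = reader.initialOffset()
end1 = reader.latestOffset(start, ReadMaxRows(5))     # capped at 5 rows
end2 = reader.latestOffset(end1, ReadAllAvailable())  # read everything
print(end1, end2)  # {'index': 5} {'index': 15}
```

Note how the ending offset of the first micro-batch (end1) becomes the start offset of the second, matching the contract described above.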