pyspark.pandas.DataFrame.spark.persist

spark.persist(storage_level: pyspark.storagelevel.StorageLevel = StorageLevel(True, True, False, False, 1)) → CachedDataFrame

Yields and caches the current DataFrame with a specific StorageLevel. If no StorageLevel is given, MEMORY_AND_DISK is used by default, as in PySpark.

The pandas-on-Spark DataFrame is yielded as a protected resource and its data is cached; the cache is released once execution leaves the context.
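
The context-manager behaviour described above can be illustrated with a small pure-Python sketch. The class `CachedResourceSketch` and its `is_cached` attribute are hypothetical names invented for this illustration; they are not part of pyspark and do not reflect how CachedDataFrame is actually implemented. The sketch only shows the general pattern: the resource is cached on creation and uncached when the `with` block exits.

```python
from contextlib import AbstractContextManager


class CachedResourceSketch(AbstractContextManager):
    """Hypothetical stand-in for a cached DataFrame; not the pyspark class."""

    def __init__(self, name):
        self.name = name
        self.is_cached = True  # assume caching happens when the object is created

    def unpersist(self):
        # Release the cache; calling this more than once is harmless.
        self.is_cached = False

    def __exit__(self, exc_type, exc_value, traceback):
        # Leaving the `with` block uncaches, even if the body raised.
        self.unpersist()
        return False  # do not swallow exceptions


# AbstractContextManager supplies __enter__, which returns the object itself.
with CachedResourceSketch("df") as cached:
    inside = cached.is_cached  # still cached inside the block

after = cached.is_cached  # released once the block has exited
print(inside, after)
```

Under these assumptions, the object reports itself as cached inside the block and as uncached afterwards, which mirrors the lifetime described for the yielded DataFrame.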

See also

DataFrame.spark.cache

Examples

>>> import pyspark
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df
   dogs  cats
0   0.2   0.3
1   0.0   0.6
2   0.6   0.0
3   0.2   0.1

Set the StorageLevel to MEMORY_ONLY.

>>> with df.spark.persist(pyspark.StorageLevel.MEMORY_ONLY) as cached_df:
...     print(cached_df.spark.storage_level)
...     print(cached_df.count())
...
Memory Serialized 1x Replicated
dogs    4
cats    4
dtype: int64

Set the StorageLevel to DISK_ONLY.

>>> with df.spark.persist(pyspark.StorageLevel.DISK_ONLY) as cached_df:
...     print(cached_df.spark.storage_level)
...     print(cached_df.count())
...
Disk Serialized 1x Replicated
dogs    4
cats    4
dtype: int64

If a StorageLevel is not given, it uses MEMORY_AND_DISK by default.

>>> with df.spark.persist() as cached_df:
...     print(cached_df.spark.storage_level)
...     print(cached_df.count())
...
Disk Memory Serialized 1x Replicated
dogs    4
cats    4
dtype: int64
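
The DataFrame can also be persisted without a context manager; it then stays cached until it is explicitly unpersisted.

>>> df = df.spark.persist()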
>>> df.to_pandas().mean(axis=1)
0    0.25
1    0.30
2    0.30
3    0.15
dtype: float64

To uncache the DataFrame, use the unpersist function.

>>> df.spark.unpersist()