pyspark.pandas.DataFrame.spark.apply
spark.apply(func: Callable[[pyspark.sql.dataframe.DataFrame], pyspark.sql.dataframe.DataFrame], index_col: Union[str, List[str], None] = None) → ps.DataFrame
Applies a function that takes and returns a Spark DataFrame. It lets you natively apply Spark functions and column APIs to the Spark columns used internally in Series or Index.
Note
Set index_col and keep a column with that name in the output Spark DataFrame. This avoids falling back to the default index and the performance penalty that comes with it. If you omit index_col, the default index is used, which can be expensive.
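The note above asks for the index column to survive whatever projection func performs. A minimal sketch of one way to do that follows; select_keeping_index is a hypothetical helper (not part of PySpark), and it assumes the index column name needs no backtick quoting.

```python
from typing import List


def select_keeping_index(sdf, exprs: List[str], index_col: str = "index"):
    """Hypothetical helper: project `exprs` and carry the index column along.

    `sdf` is expected to be a pyspark.sql.DataFrame that already contains a
    column named `index_col` (spark.apply exposes the index under that name
    when index_col is set).
    """
    # selectExpr takes SQL expression strings; a bare column name selects
    # that column unchanged, so the index column stays in the output.
    return sdf.selectExpr(*exprs, index_col)
```

With a pandas-on-Spark DataFrame psdf, it could be used as psdf.spark.apply(lambda sdf: select_keeping_index(sdf, ["a + b as c"]), index_col="index").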
Note
This method loses column labels. It is a synonym of
func(psdf.to_spark(index_col)).pandas_api(index_col).
Parameters
    func : function
        Function to apply to the data. It receives a Spark DataFrame and must return one.
Returns
    DataFrame
Raises
    ValueError
        If the output from the function is not a Spark DataFrame.
Examples
>>> psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, columns=["a", "b"])
>>> psdf
   a  b
0  1  4
1  2  5
2  3  6

>>> psdf.spark.apply(
...     lambda sdf: sdf.selectExpr("a + b as c", "index"), index_col="index")
       c
index
0      5
1      7
2      9
The case below ends up with using the default index, which should be avoided if possible.
>>> psdf.spark.apply(lambda sdf: sdf.groupby("a").count().sort("a"))
   a  count
0  1      1
1  2      1
2  3      1
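
The synonym stated in the notes, func(psdf.to_spark(index_col)).pandas_api(index_col), together with the ValueError contract, can be sketched as a plain function. This is an illustration under assumptions, not the library's implementation: spark_apply_equivalent and demo are hypothetical names, and the hasattr check is a duck-typed stand-in for a real Spark DataFrame type check.

```python
from typing import Callable, List, Optional, Union


def spark_apply_equivalent(
    psdf,
    func: Callable,
    index_col: Optional[Union[str, List[str]]] = None,
):
    """Hypothetical helper mirroring the documented synonym of spark.apply."""
    # Expose the index as regular column(s) named by index_col (if given).
    sdf = psdf.to_spark(index_col)
    out = func(sdf)
    # Assumption: anything without pandas_api is treated as "not a Spark
    # DataFrame"; the real method may check the type differently.
    if not hasattr(out, "pandas_api"):
        raise ValueError("func must return a Spark DataFrame")
    # Convert back, restoring index_col as the index when it was kept.
    return out.pandas_api(index_col)


def demo():
    # Requires pyspark and a working Spark runtime; imported lazily on purpose.
    import pyspark.pandas as ps

    psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    def add(sdf):
        return sdf.selectExpr("a + b as c", "index")

    via_helper = spark_apply_equivalent(psdf, add, index_col="index")
    via_api = psdf.spark.apply(add, index_col="index")
    return via_helper, via_api
```

Calling demo() in a Spark-enabled environment returns both results so they can be compared side by side.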