pyspark.sql.DataFrame.groupBy#

DataFrame.groupBy(*cols)[source]#

Groups the DataFrame by the specified columns so that aggregation can be performed on them. See GroupedData for all the available aggregate functions.

groupby() is an alias for groupBy().

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
cols : list, str, int or Column

The columns to group by. Each element can be a column name (string), an expression (Column), a column ordinal (int, 1-based), or a list of them.

Changed in version 4.0.0: Supports column ordinal.

Returns
GroupedData

A GroupedData object representing the grouped data by the specified columns.

Notes

Column ordinals are 1-based, unlike the 0-based __getitem__().

Examples

>>> df = spark.createDataFrame([
...     ("Alice", 2), ("Bob", 2), ("Bob", 2), ("Bob", 5)], schema=["name", "age"])

Example 1: An empty list of grouping columns triggers a global aggregation.

>>> df.groupBy().avg().show()
+--------+
|avg(age)|
+--------+
|    2.75|
+--------+

Example 2: Group by ‘name’, passing a dictionary to compute the sum of ‘age’.

>>> df.groupBy("name").agg({"age": "sum"}).sort("name").show()
+-----+--------+
| name|sum(age)|
+-----+--------+
|Alice|       2|
|  Bob|       9|
+-----+--------+

Example 3: Group by ‘name’ and calculate the maximum of each numeric column.

>>> df.groupBy(df.name).max().sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+

Example 4: Group by ‘name’ again, this time using its column ordinal.

>>> df.groupBy(1).max().sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+

Example 5: Group by ‘name’ and ‘age’, and count the rows in each group.

>>> df.groupBy(["name", df.age]).count().sort("name", "age").show()
+-----+---+-----+
| name|age|count|
+-----+---+-----+
|Alice|  2|    1|
|  Bob|  2|    2|
|  Bob|  5|    1|
+-----+---+-----+

Example 6: Group by ‘name’ and ‘age’ again, mixing a Column with a column ordinal.

>>> df.groupBy([df.name, 2]).count().sort("name", "age").show()
+-----+---+-----+
| name|age|count|
+-----+---+-----+
|Alice|  2|    1|
|  Bob|  2|    2|
|  Bob|  5|    1|
+-----+---+-----+