pyspark.sql.DataFrame.groupBy
- DataFrame.groupBy(*cols)
Groups the DataFrame by the specified columns so that aggregation can be performed on them. See GroupedData for all the available aggregate functions.
groupby() is an alias for groupBy().
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
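- Parameters
cols
list, str, int or Column
The columns to group by. Each element can be a column name (string), an expression (Column), a column ordinal (int, 1-based), or a list of them.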
- Returns
GroupedData
A GroupedData object representing the data grouped by the specified columns.
Notes
A column ordinal starts from 1, which is different from the 0-based __getitem__().
Examples
>>> df = spark.createDataFrame([
...     ("Alice", 2), ("Bob", 2), ("Bob", 2), ("Bob", 5)], schema=["name", "age"])
Example 1: Empty grouping columns trigger a global aggregation.
>>> df.groupBy().avg().show()
+--------+
|avg(age)|
+--------+
|    2.75|
+--------+
Example 2: Group-by ‘name’, and pass a dictionary to calculate the sum of ‘age’.
>>> df.groupBy("name").agg({"age": "sum"}).sort("name").show()
+-----+--------+
| name|sum(age)|
+-----+--------+
|Alice|       2|
|  Bob|       9|
+-----+--------+
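To sanity-check the numbers in Examples 1 and 2 without a Spark session, the same aggregations can be modelled in plain Python. This is only an illustrative sketch: `sum_age_by_name` is a hypothetical helper, not part of PySpark, and it says nothing about how Spark executes the query.

```python
from collections import defaultdict

# Same rows as the example DataFrame: (name, age)
rows = [("Alice", 2), ("Bob", 2), ("Bob", 2), ("Bob", 5)]


def sum_age_by_name(rows):
    """Hypothetical helper: total age per name, sorted by name."""
    totals = defaultdict(int)
    for name, age in rows:
        totals[name] += age
    return dict(sorted(totals.items()))


# Grouping by "name" and summing "age" (cf. Example 2)
per_name = sum_age_by_name(rows)

# No grouping columns: every row falls into one global group (cf. Example 1)
global_avg = sum(age for _, age in rows) / len(rows)

print(per_name)    # {'Alice': 2, 'Bob': 9}
print(global_avg)  # 2.75
```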
Example 3: Group-by ‘name’, and calculate maximum values.
>>> df.groupBy(df.name).max().sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+
Example 4: Also group-by ‘name’, but using the column ordinal.
>>> df.groupBy(1).max().sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+
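The off-by-one between a 1-based column ordinal and 0-based indexing is easy to get wrong, so here is a minimal plain-Python sketch of the mapping. `resolve_ordinal` is a hypothetical helper written for illustration; its range check is an assumption of this sketch and does not describe how PySpark validates ordinals.

```python
def resolve_ordinal(columns, ordinal):
    """Hypothetical helper: map a 1-based column ordinal to a column name."""
    # Assumption of this sketch: reject anything outside 1..len(columns).
    if not 1 <= ordinal <= len(columns):
        raise IndexError(f"ordinal {ordinal} is outside 1..{len(columns)}")
    return columns[ordinal - 1]


columns = ["name", "age"]

by_ordinal = resolve_ordinal(columns, 1)  # ordinal 1 is the first column
by_index = columns[0]                     # index 0 is the first column

print(by_ordinal, by_index)  # name name
```

In other words, the ordinal 1 passed to groupBy in Example 4 refers to the same column that index 0 would select.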
Example 5: Group-by ‘name’ and ‘age’, and calculate the number of rows in each group.
>>> df.groupBy(["name", df.age]).count().sort("name", "age").show()
+-----+---+-----+
| name|age|count|
+-----+---+-----+
|Alice|  2|    1|
|  Bob|  2|    2|
|  Bob|  5|    1|
+-----+---+-----+
Example 6: Also group-by ‘name’ and ‘age’, but using the column ordinal.
>>> df.groupBy([df.name, 2]).count().sort("name", "age").show()
+-----+---+-----+
| name|age|count|
+-----+---+-----+
|Alice|  2|    1|
|  Bob|  2|    2|
|  Bob|  5|    1|
+-----+---+-----+
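As a cross-check on the counts in Examples 5 and 6, grouping by both columns amounts to counting distinct (name, age) pairs. The sketch below models that in plain Python with `collections.Counter`; `count_by_name_and_age` is a hypothetical helper, not a PySpark function.

```python
from collections import Counter

# Same rows as the example DataFrame: (name, age)
rows = [("Alice", 2), ("Bob", 2), ("Bob", 2), ("Bob", 5)]


def count_by_name_and_age(rows):
    """Hypothetical helper: rows per (name, age) pair, sorted by name then age."""
    return sorted(Counter(rows).items())


counts = count_by_name_and_age(rows)

for (name, age), n in counts:
    print(name, age, n)
# Alice 2 1
# Bob 2 2
# Bob 5 1
```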