pyspark.sql.DataFrame.groupingSets#

DataFrame.groupingSets(groupingSets, *cols)[source]#

Create multi-dimensional aggregation for the current class:DataFrame using the specified grouping sets, so we can run aggregation on them.

New in version 4.0.0.

Parameters
groupingSetssequence of sequence of columns or str

Individual set of columns to group on.

colsColumn or str

Addional grouping columns specified by users. Those columns are shown as the output columns after aggregation.

Returns
GroupedData

Grouping sets of the data based on the specified columns.

See also

GroupedData

Examples

Example 1: Group by city and car_model, city, and all, and calculate the sum of quantity.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([
...     (100, 'Fremont', 'Honda Civic', 10),
...     (100, 'Fremont', 'Honda Accord', 15),
...     (100, 'Fremont', 'Honda CRV', 7),
...     (200, 'Dublin', 'Honda Civic', 20),
...     (200, 'Dublin', 'Honda Accord', 10),
...     (200, 'Dublin', 'Honda CRV', 3),
...     (300, 'San Jose', 'Honda Civic', 5),
...     (300, 'San Jose', 'Honda Accord', 8)
... ], schema="id INT, city STRING, car_model STRING, quantity INT")
>>> df.groupingSets(
...     [("city", "car_model"), ("city",), ()],
...     "city", "car_model"
... ).agg(sf.sum(sf.col("quantity")).alias("sum")).sort("city", "car_model").show()
+--------+------------+---+
|    city|   car_model|sum|
+--------+------------+---+
|    NULL|        NULL| 78|
|  Dublin|        NULL| 33|
|  Dublin|Honda Accord| 10|
|  Dublin|   Honda CRV|  3|
|  Dublin| Honda Civic| 20|
| Fremont|        NULL| 32|
| Fremont|Honda Accord| 15|
| Fremont|   Honda CRV|  7|
| Fremont| Honda Civic| 10|
|San Jose|        NULL| 13|
|San Jose|Honda Accord|  8|
|San Jose| Honda Civic|  5|
+--------+------------+---+

Example 2: Group by multiple columns and calculate both average and sum.

>>> df.groupingSets(
...     [("city", "car_model"), ("city",), ()],
...     "city", "car_model"
... ).agg(
...     sf.avg(sf.col("quantity")).alias("avg_quantity"),
...     sf.sum(sf.col("quantity")).alias("sum_quantity")
... ).sort("city", "car_model").show()
+--------+------------+------------------+------------+
|    city|   car_model|      avg_quantity|sum_quantity|
+--------+------------+------------------+------------+
|    NULL|        NULL|              9.75|          78|
|  Dublin|        NULL|              11.0|          33|
|  Dublin|Honda Accord|              10.0|          10|
|  Dublin|   Honda CRV|               3.0|           3|
|  Dublin| Honda Civic|              20.0|          20|
| Fremont|        NULL|10.666666666666666|          32|
| Fremont|Honda Accord|              15.0|          15|
| Fremont|   Honda CRV|               7.0|           7|
| Fremont| Honda Civic|              10.0|          10|
|San Jose|        NULL|               6.5|          13|
|San Jose|Honda Accord|               8.0|           8|
|San Jose| Honda Civic|               5.0|           5|
+--------+------------+------------------+------------+