pyspark.sql.functions.mode#

pyspark.sql.functions.mode(col, deterministic=False)[source]#

Returns the most frequent value in a group.

New in version 3.4.0.

Changed in version 4.0.0: Supports deterministic argument.

Parameters
col : Column or str

target column to compute on.

deterministic : bool, optional

if there are multiple equally-frequent results, return the lowest of them; otherwise any one of them may be returned (defaults to False).

Returns
Column

the most frequent value in a group.

Notes

Supports Spark Connect.

Examples

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([
...     ("Java", 2012, 20000), ("dotNET", 2012, 5000),
...     ("Java", 2012, 20000), ("dotNET", 2012, 5000),
...     ("dotNET", 2013, 48000), ("Java", 2013, 30000)],
...     schema=("course", "year", "earnings"))
>>> df.groupby("course").agg(sf.mode("year")).sort("course").show()
+------+----------+
|course|mode(year)|
+------+----------+
|  Java|      2012|
|dotNET|      2012|
+------+----------+

When multiple values share the greatest frequency, any one of them may be returned if deterministic is false or unspecified, while the lowest value is returned if deterministic is true.

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(-10,), (0,), (10,)], ["col"])
>>> df.select(sf.mode("col", False)).show()  # doctest: +SKIP
+---------+
|mode(col)|
+---------+
|        0|
+---------+
>>> df.select(sf.mode("col", True)).show()
+---------------------------------------+
|mode() WITHIN GROUP (ORDER BY col DESC)|
+---------------------------------------+
|                                    -10|
+---------------------------------------+