pyspark.sql.DataFrame.drop#

DataFrame.drop(*cols)[source]#

Returns a new DataFrame without specified columns. This is a no-op if the schema doesn’t contain the given column name(s).

New in version 1.4.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
cols : str or Column

The name of a column, or a Column, to be dropped. Multiple columns may be passed.

Returns
DataFrame

A new DataFrame without the specified columns.

Notes

  • When the input is a column name, it is treated literally without further interpretation. Otherwise, the column is matched by its equivalent expression. Dropping a column by name, drop(colName), therefore has different semantics from dropping it by Column, drop(col(colName)).

Examples

Example 1: Drop a column by name.

>>> df = spark.createDataFrame(
...     [(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
>>> df.drop('age').show()
+-----+
| name|
+-----+
|  Tom|
|Alice|
|  Bob|
+-----+

Example 2: Drop a column by Column object.

>>> df.drop(df.age).show()
+-----+
| name|
+-----+
|  Tom|
|Alice|
|  Bob|
+-----+

Example 3: Drop the column that both DataFrames were joined on.

>>> df2 = spark.createDataFrame([(80, "Tom"), (85, "Bob")], ["height", "name"])
>>> df.join(df2, df.name == df2.name).drop('name').sort('age').show()
+---+------+
|age|height|
+---+------+
| 14|    80|
| 16|    85|
+---+------+
>>> df3 = df.join(df2)
>>> df3.show()
+---+-----+------+----+
|age| name|height|name|
+---+-----+------+----+
| 14|  Tom|    80| Tom|
| 14|  Tom|    85| Bob|
| 23|Alice|    80| Tom|
| 23|Alice|    85| Bob|
| 16|  Bob|    80| Tom|
| 16|  Bob|    85| Bob|
+---+-----+------+----+

Example 4: Drop two columns with the same name.

>>> df3.drop("name").show()
+---+------+
|age|height|
+---+------+
| 14|    80|
| 14|    85|
| 23|    80|
| 23|    85|
| 16|    80|
| 16|    85|
+---+------+

Example 5: Cannot drop col("name") due to an ambiguous reference.

>>> from pyspark.sql import functions as sf
>>> df3.drop(sf.col("name")).show()
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.AnalysisException: [AMBIGUOUS_REFERENCE] Reference...

Example 6: Cannot find a column matching the expression "a.b.c".

>>> from pyspark.sql import functions as sf
>>> df4 = df.withColumn("a.b.c", sf.lit(1))
>>> df4.show()
+---+-----+-----+
|age| name|a.b.c|
+---+-----+-----+
| 14|  Tom|    1|
| 23|Alice|    1|
| 16|  Bob|    1|
+---+-----+-----+
>>> df4.drop("a.b.c").show()
+---+-----+
|age| name|
+---+-----+
| 14|  Tom|
| 23|Alice|
| 16|  Bob|
+---+-----+
>>> df4.drop(sf.col("a.b.c")).show()
+---+-----+-----+
|age| name|a.b.c|
+---+-----+-----+
| 14|  Tom|    1|
| 23|Alice|    1|
| 16|  Bob|    1|
+---+-----+-----+