pyspark.sql.DataFrame
A distributed collection of data grouped into named columns.
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
Notes
A DataFrame should only be created through SparkSession functions such as createDataFrame, as shown in the examples below. It should not be created directly with the constructor.
Examples
A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession:
>>> people = spark.createDataFrame([
...     {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50},
...     {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100},
...     {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150},
...     {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200}
... ])
Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: DataFrame, Column.
To select a column from the DataFrame, use attribute access (or index by column name, as in people["age"]):
>>> age_col = people.age
A more concrete example:
>>> # To create DataFrame using SparkSession
... department = spark.createDataFrame([
...     {"id": 1, "name": "PySpark"},
...     {"id": 2, "name": "ML"},
...     {"id": 3, "name": "Spark SQL"}
... ])
>>> people.filter(people.age > 30).join(
...     department, people.deptId == department.id).groupBy(
...     department.name, "gender").agg({"salary": "avg", "age": "max"}).show()
+-------+------+-----------+--------+
|   name|gender|avg(salary)|max(age)|
+-------+------+-----------+--------+
|     ML|     F|      150.0|      60|
|PySpark|     M|       75.0|      50|
+-------+------+-----------+--------+
Methods
agg(*exprs)
agg
Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
alias(alias)
alias
Returns a new DataFrame with an alias set.
approxQuantile(col, probabilities, relativeError)
approxQuantile
Calculates the approximate quantiles of numerical columns of a DataFrame.
cache()
cache
Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER).
checkpoint([eager])
checkpoint
Returns a checkpointed version of this DataFrame.
coalesce(numPartitions)
coalesce
Returns a new DataFrame that has exactly numPartitions partitions; if a larger number is requested, the current number of partitions is kept.
colRegex(colName)
colRegex
Selects column based on the column name specified as a regex and returns it as Column.
collect()
collect
Returns all the records as a list of Row.
corr(col1, col2[, method])
corr
Calculates the correlation of two columns of a DataFrame as a double value.
count()
count
Returns the number of rows in this DataFrame.
cov(col1, col2)
cov
Calculate the sample covariance for the given columns, specified by their names, as a double value.
createGlobalTempView(name)
createGlobalTempView
Creates a global temporary view with this DataFrame.
createOrReplaceGlobalTempView(name)
createOrReplaceGlobalTempView
Creates or replaces a global temporary view using the given name.
createOrReplaceTempView(name)
createOrReplaceTempView
Creates or replaces a local temporary view with this DataFrame.
createTempView(name)
createTempView
Creates a local temporary view with this DataFrame.
crossJoin(other)
crossJoin
Returns the cartesian product with another DataFrame.
crosstab(col1, col2)
crosstab
Computes a pair-wise frequency table of the given columns.
cube(*cols)
cube
Create a multi-dimensional cube for the current DataFrame using the specified columns, so we can run aggregations on them.
describe(*cols)
describe
Computes basic statistics for numeric and string columns.
distinct()
distinct
Returns a new DataFrame containing the distinct rows in this DataFrame.
drop(*cols)
drop
Returns a new DataFrame without specified columns.
dropDuplicates([subset])
dropDuplicates
Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.
drop_duplicates([subset])
drop_duplicates
drop_duplicates() is an alias for dropDuplicates().
dropna([how, thresh, subset])
dropna
Returns a new DataFrame omitting rows with null values.
exceptAll(other)
exceptAll
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.
explain([extended, mode])
explain
Prints the (logical and physical) plans to the console for debugging purposes.
fillna(value[, subset])
fillna
Replace null values, alias for na.fill().
filter(condition)
filter
Filters rows using the given condition.
first()
first
Returns the first row as a Row.
foreach(f)
foreach
Applies the f function to every Row of this DataFrame.
foreachPartition(f)
foreachPartition
Applies the f function to each partition of this DataFrame.
freqItems(cols[, support])
freqItems
Finds frequent items for columns, possibly with false positives.
groupBy(*cols)
groupBy
Groups the DataFrame using the specified columns, so we can run aggregation on them.
groupby(*cols)
groupby
groupby() is an alias for groupBy().
head([n])
head
Returns the first n rows.
hint(name, *parameters)
hint
Specifies some hint on the current DataFrame.
inputFiles()
inputFiles
Returns a best-effort snapshot of the files that compose this DataFrame.
intersect(other)
intersect
Return a new DataFrame containing rows only in both this DataFrame and another DataFrame.
intersectAll(other)
intersectAll
Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates.
isEmpty()
isEmpty
Returns True if this DataFrame is empty.
isLocal()
isLocal
Returns True if the collect() and take() methods can be run locally (without any Spark executors).
join(other[, on, how])
join
Joins with another DataFrame, using the given join expression.
limit(num)
limit
Limits the result count to the number specified.
localCheckpoint([eager])
localCheckpoint
Returns a locally checkpointed version of this DataFrame.
mapInArrow(func, schema)
mapInArrow
Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a PyArrow RecordBatch, and returns the result as a DataFrame.
mapInPandas(func, schema)
mapInPandas
Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs a pandas DataFrame, and returns the result as a DataFrame.
melt(ids, values, variableColumnName, …)
melt
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
observe(observation, *exprs)
observe
Define (named) metrics to observe on the DataFrame.
orderBy(*cols, **kwargs)
orderBy
Returns a new DataFrame sorted by the specified column(s).
pandas_api([index_col])
pandas_api
Converts the existing DataFrame into a pandas-on-Spark DataFrame.
persist([storageLevel])
persist
Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed.
printSchema()
printSchema
Prints out the schema in the tree format.
randomSplit(weights[, seed])
randomSplit
Randomly splits this DataFrame with the provided weights.
registerTempTable(name)
registerTempTable
Registers this DataFrame as a temporary table using the given name.
repartition(numPartitions, *cols)
repartition
Returns a new DataFrame partitioned by the given partitioning expressions.
repartitionByRange(numPartitions, *cols)
repartitionByRange
Returns a new DataFrame range-partitioned by the given partitioning expressions.
replace(to_replace[, value, subset])
replace
Returns a new DataFrame replacing a value with another value.
rollup(*cols)
rollup
Create a multi-dimensional rollup for the current DataFrame using the specified columns, so we can run aggregation on them.
sameSemantics(other)
sameSemantics
Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results.
sample([withReplacement, fraction, seed])
sample
Returns a sampled subset of this DataFrame.
sampleBy(col, fractions[, seed])
sampleBy
Returns a stratified sample without replacement based on the fraction given on each stratum.
select(*cols)
select
Projects a set of expressions and returns a new DataFrame.
selectExpr(*expr)
selectExpr
Projects a set of SQL expressions and returns a new DataFrame.
semanticHash()
semanticHash
Returns a hash code of the logical query plan against this DataFrame.
show([n, truncate, vertical])
show
Prints the first n rows to the console.
sort(*cols, **kwargs)
sort
Returns a new DataFrame sorted by the specified column(s).
sortWithinPartitions(*cols, **kwargs)
sortWithinPartitions
Returns a new DataFrame with each partition sorted by the specified column(s).
subtract(other)
subtract
Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.
summary(*statistics)
summary
Computes specified statistics for numeric and string columns.
tail(num)
tail
Returns the last num rows as a list of Row.
take(num)
take
Returns the first num rows as a list of Row.
to(schema)
to
Returns a new DataFrame where each row is reconciled to match the specified schema.
toDF(*cols)
toDF
Returns a new DataFrame with the specified new column names.
toJSON([use_unicode])
toJSON
Converts a DataFrame into an RDD of strings.
toLocalIterator([prefetchPartitions])
toLocalIterator
Returns an iterator that contains all of the rows in this DataFrame.
toPandas()
toPandas
Returns the contents of this DataFrame as a pandas.DataFrame.
to_koalas([index_col])
to_koalas
to_pandas_on_spark([index_col])
to_pandas_on_spark
transform(func, *args, **kwargs)
transform
Returns a new DataFrame by applying func to this DataFrame; a concise syntax for chaining custom transformations.
union(other)
union
Return a new DataFrame containing the union of rows in this and another DataFrame, matching columns by position.
unionAll(other)
unionAll
Return a new DataFrame containing the union of rows in this and another DataFrame; behaves the same as union().
unionByName(other[, allowMissingColumns])
unionByName
Returns a new DataFrame containing the union of rows in this and another DataFrame, matching columns by name rather than by position.
unpersist([blocking])
unpersist
Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk.
unpivot(ids, values, variableColumnName, …)
unpivot
Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
where(condition)
where
where() is an alias for filter().
withColumn(colName, col)
withColumn
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
withColumnRenamed(existing, new)
withColumnRenamed
Returns a new DataFrame by renaming an existing column.
withColumns(*colsMap)
withColumns
Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names.
withColumnsRenamed(colsMap)
withColumnsRenamed
Returns a new DataFrame by renaming multiple columns.
withMetadata(columnName, metadata)
withMetadata
Returns a new DataFrame by updating an existing column with metadata.
withWatermark(eventTime, delayThreshold)
withWatermark
Defines an event time watermark for this DataFrame.
writeTo(table)
writeTo
Create a write configuration builder for v2 sources.
Attributes
columns
Returns all column names as a list.
dtypes
Returns all column names and their data types as a list.
isStreaming
Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.
na
Returns a DataFrameNaFunctions for handling missing values.
rdd
Returns the content as a pyspark.RDD of Row.
schema
Returns the schema of this DataFrame as a pyspark.sql.types.StructType.
sparkSession
Returns the Spark session that created this DataFrame.
sql_ctx
stat
Returns a DataFrameStatFunctions for statistic functions.
storageLevel
Get the DataFrame’s current storage level.
write
Interface for saving the content of the non-streaming DataFrame out into external storage.
writeStream
Interface for saving the content of the streaming DataFrame out into external storage.