SparkSession.createDataFrame
Creates a DataFrame from an RDD, a list, a pandas.DataFrame or a numpy.ndarray.
New in version 2.0.0.
Changed in version 3.4.0: Supports Spark Connect.
Parameters
data : RDD or iterable
    an RDD of any kind of SQL data representation (Row, tuple, int, boolean, etc.), or a list, pandas.DataFrame or numpy.ndarray.
schema : pyspark.sql.types.DataType, str or list, optional
    a pyspark.sql.types.DataType or a datatype string or a list of column names; default is None. The data type string format equals pyspark.sql.types.DataType.simpleString, except that the top-level struct type can omit the struct<> wrapper.
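As a sketch of that relationship, the simpleString of a struct type is the same string you could pass back as a schema, with or without the struct<> wrapper (the field names here are illustrative):

>>> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>>> st = StructType([StructField("name", StringType()), StructField("age", IntegerType())])
>>> st.simpleString()
'struct<name:string,age:int>'
>>> spark.createDataFrame([('Alice', 1)], "name:string,age:int").dtypes
[('name', 'string'), ('age', 'int')]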
When schema is a list of column names, the type of each column will be inferred from data.
When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of either Row, namedtuple, or dict.
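For instance, inference from a namedtuple picks up both the column names and the types (a minimal sketch; a plain list works the same way as an RDD here):

>>> from collections import namedtuple
>>> Person = namedtuple('Person', ['name', 'age'])
>>> spark.createDataFrame([Person('Alice', 1)]).collect()
[Row(name='Alice', age=1)]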
When schema is pyspark.sql.types.DataType or a datatype string, it must match the real data, or an exception will be thrown at runtime. If the given schema is not pyspark.sql.types.StructType, it will be wrapped into a pyspark.sql.types.StructType as its only field, and the field name will be “value”. Each record will also be wrapped into a tuple, which can be converted to row later.
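A minimal sketch of that wrapping, using an atomic "int" schema:

>>> spark.createDataFrame([1, 2], "int").collect()
[Row(value=1), Row(value=2)]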
samplingRatio : float, optional
    the sample ratio of rows used for inferring. The first few rows will be used if samplingRatio is None.
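A minimal sketch of inferring a schema from roughly half of an RDD's rows (the 0.5 ratio and the nums name are arbitrary illustrations):

>>> nums = spark.sparkContext.parallelize([(i, str(i)) for i in range(100)])
>>> spark.createDataFrame(nums, samplingRatio=0.5).dtypes
[('_1', 'bigint'), ('_2', 'string')]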
verifySchema : bool, optional
    verify data types of every row against schema. Enabled by default.
New in version 2.1.0.
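A hedged sketch of the default behavior: with verification enabled, a value that does not fit the declared type raises at creation time, whereas verifySchema=False skips that per-row check:

>>> from pyspark.sql.types import StructType, StructField, IntegerType
>>> schema = StructType([StructField("age", IntegerType(), True)])
>>> spark.createDataFrame([('Alice',)], schema)
Traceback (most recent call last):
    ...
TypeError: ...
>>> spark.createDataFrame([(1,)], schema, verifySchema=False).collect()
[Row(age=1)]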
Notes
Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.
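As a sketch, the Arrow path for pandas conversion is opted into via the session configuration (the column name "a" is illustrative):

>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
>>> import pandas as pd
>>> spark.createDataFrame(pd.DataFrame({"a": [1, 2]})).collect()
[Row(a=1), Row(a=2)]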
Examples
Create a DataFrame from a list of tuples.
>>> spark.createDataFrame([('Alice', 1)]).collect()
[Row(_1='Alice', _2=1)]
>>> spark.createDataFrame([('Alice', 1)], ['name', 'age']).collect()
[Row(name='Alice', age=1)]
Create a DataFrame from a list of dictionaries.
>>> d = [{'name': 'Alice', 'age': 1}]
>>> spark.createDataFrame(d).collect()
[Row(age=1, name='Alice')]
Create a DataFrame from an RDD.
>>> rdd = spark.sparkContext.parallelize([('Alice', 1)])
>>> spark.createDataFrame(rdd).collect()
[Row(_1='Alice', _2=1)]
>>> df = spark.createDataFrame(rdd, ['name', 'age'])
>>> df.collect()
[Row(name='Alice', age=1)]
Create a DataFrame from Row instances.
>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> person = rdd.map(lambda r: Person(*r))
>>> df2 = spark.createDataFrame(person)
>>> df2.collect()
[Row(name='Alice', age=1)]
Create a DataFrame with the explicit schema specified.
>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> df3 = spark.createDataFrame(rdd, schema)
>>> df3.collect()
[Row(name='Alice', age=1)]
Create a DataFrame from a pandas DataFrame.
>>> import pandas
>>> spark.createDataFrame(df.toPandas()).collect()
[Row(name='Alice', age=1)]
>>> spark.createDataFrame(pandas.DataFrame([[1, 2]])).collect()
[Row(0=1, 1=2)]
Create a DataFrame from an RDD with the schema in DDL formatted string.
>>> spark.createDataFrame(rdd, "a: string, b: int").collect() [Row(a='Alice', b=1)] >>> rdd = rdd.map(lambda row: row[1]) >>> spark.createDataFrame(rdd, "int").collect() [Row(value=1)]
When the type does not match the data, an exception is thrown.
>>> spark.createDataFrame(rdd, "boolean").collect() Traceback (most recent call last): ... Py4JJavaError: ...