pyspark.sql.SparkSession.createDataFrame

SparkSession.createDataFrame(data: Union[pyspark.rdd.RDD[Any], Iterable[Any], PandasDataFrameLike, ArrayLike], schema: Union[pyspark.sql.types.AtomicType, pyspark.sql.types.StructType, str, None] = None, samplingRatio: Optional[float] = None, verifySchema: bool = True) → pyspark.sql.dataframe.DataFrame

Creates a DataFrame from an RDD, a list, a pandas.DataFrame or a numpy.ndarray.

New in version 2.0.0.
Changed in version 3.4.0: Supports Spark Connect.
Parameters
    data : RDD or iterable
        an RDD of any kind of SQL data representation (Row, tuple, int, boolean, etc.), or list, pandas.DataFrame or numpy.ndarray.
    schema : pyspark.sql.types.DataType, str or list, optional
        a pyspark.sql.types.DataType or a datatype string or a list of column names; default is None. The data type string format equals pyspark.sql.types.DataType.simpleString, except that the top-level struct type can omit the struct<>.

        When schema is a list of column names, the type of each column will be inferred from data.

        When schema is None, it will try to infer the schema (column names and types) from data, which should be an RDD of either Row, namedtuple, or dict.

        When schema is a pyspark.sql.types.DataType or a datatype string, it must match the real data, or an exception will be thrown at runtime. If the given schema is not a pyspark.sql.types.StructType, it will be wrapped into a pyspark.sql.types.StructType as its only field, and the field name will be "value". Each record will also be wrapped into a tuple, which can be converted to a row later.
    samplingRatio : float, optional
        the sample ratio of rows used for inferring the schema. The first few rows will be used if samplingRatio is None.
    verifySchema : bool, optional
        verify data types of every row against the schema. Enabled by default.

        New in version 2.1.0.

Returns
    DataFrame
Notes
Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.
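For instance, the Arrow-based path for pandas input can be toggled through the runtime config before calling createDataFrame. A minimal sketch, assuming pandas and pyarrow are installed in the environment:

>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
>>> import pandas as pd
>>> spark.createDataFrame(pd.DataFrame({"name": ["Alice"], "age": [1]})).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+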
Examples
Create a DataFrame from a list of tuples.
>>> spark.createDataFrame([('Alice', 1)]).show()
+-----+---+
|   _1| _2|
+-----+---+
|Alice|  1|
+-----+---+
Create a DataFrame from a list of dictionaries.
>>> d = [{'name': 'Alice', 'age': 1}]
>>> spark.createDataFrame(d).show()
+---+-----+
|age| name|
+---+-----+
|  1|Alice|
+---+-----+
Create a DataFrame with column names specified.
>>> spark.createDataFrame([('Alice', 1)], ['name', 'age']).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
Create a DataFrame with the explicit schema specified.
>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> spark.createDataFrame([('Alice', 1)], schema).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
Create a DataFrame with the schema in DDL formatted string.
>>> spark.createDataFrame([('Alice', 1)], "name: string, age: int").show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
Create an empty DataFrame. When initializing an empty DataFrame in PySpark, it’s mandatory to specify its schema, as the DataFrame lacks data from which the schema can be inferred.
>>> spark.createDataFrame([], "name: string, age: int").show()
+----+---+
|name|age|
+----+---+
+----+---+
Create a DataFrame from Row objects.
>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> df = spark.createDataFrame([Person("Alice", 1)])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
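Create a DataFrame from an RDD. A short sketch, assuming a classic (non-Connect) session where spark.sparkContext is available:

>>> rdd = spark.sparkContext.parallelize([('Alice', 1)])
>>> spark.createDataFrame(rdd, ['name', 'age']).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+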
Create a DataFrame from a pandas DataFrame.
>>> spark.createDataFrame(df.toPandas()).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
>>> import pandas
>>> spark.createDataFrame(pandas.DataFrame([[1, 2]])).show()
+---+---+
|  0|  1|
+---+---+
|  1|  2|
+---+---+
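Create a DataFrame from a plain list of values with an atomic data type as the schema. As described for the schema parameter above, a non-struct type is wrapped into a struct whose single field is named "value"; a short sketch of that behavior:

>>> from pyspark.sql.types import IntegerType
>>> spark.createDataFrame([1, 2, 3], IntegerType()).show()
+-----+
|value|
+-----+
|    1|
|    2|
|    3|
+-----+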