SparkSession.sql(sqlQuery, **kwargs)
Returns a DataFrame representing the result of the given query. When kwargs is specified, this method formats the given string by using the Python standard formatter.
New in version 2.0.0.

Parameters
sqlQuery : str
    SQL query string.
kwargs : dict
    Other variables that the user wants to set that can be referenced in the query.
    Changed in version 3.3.0: Added optional argument kwargs to specify the mapping of variables in the query. This feature is experimental and unstable.

Returns
DataFrame
Examples
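The examples below assume an active SparkSession bound to the name spark. A minimal sketch of creating one (the application name is illustrative only):

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("sql-example").getOrCreate()  # appName is illustrative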
Executing a SQL query.
>>> spark.sql("SELECT * FROM range(10) where id > 7").show() +---+ | id| +---+ | 8| | 9| +---+
Executing a SQL query with variables, using the Python standard formatter.
>>> spark.sql(
...     "SELECT * FROM range(10) WHERE id > {bound1} AND id < {bound2}", bound1=7, bound2=9
... ).show()
+---+
| id|
+---+
|  8|
+---+
>>> mydf = spark.range(10)
>>> spark.sql(
...     "SELECT {col} FROM {mydf} WHERE id IN {x}",
...     col=mydf.id, mydf=mydf, x=tuple(range(4))).show()
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
+---+
>>> spark.sql('''
...   SELECT m1.a, m2.b
...   FROM {table1} m1 INNER JOIN {table2} m2
...   ON m1.key = m2.key
...   ORDER BY m1.a, m2.b''',
...   table1=spark.createDataFrame([(1, "a"), (2, "b")], ["a", "key"]),
...   table2=spark.createDataFrame([(3, "a"), (4, "b"), (5, "b")], ["b", "key"])).show()
+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  2|  4|
|  2|  5|
+---+---+
Also, it is possible to query using a Column from a DataFrame.
>>> mydf = spark.createDataFrame([(1, 4), (2, 4), (3, 6)], ["A", "B"])
>>> spark.sql("SELECT {df.A}, {df[B]} FROM {df}", df=mydf).show()
+---+---+
|  A|  B|
+---+---+
|  1|  4|
|  2|  4|
|  3|  6|
+---+---+
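The same formatted-query style also works against a registered temporary view. A small sketch reusing mydf from the previous example (the view name "people" is an assumption for illustration):

>>> mydf.createOrReplaceTempView("people")  # the view name is illustrative
>>> spark.sql("SELECT A, B FROM people WHERE B = {b}", b=4).show()
+---+---+
|  A|  B|
+---+---+
|  1|  4|
|  2|  4|
+---+---+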