pyspark.pandas.DataFrame.describe

DataFrame.describe(percentiles: Optional[List[float]] = None) → pyspark.pandas.frame.DataFrame

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.

Parameters

percentiles: list of float in range [0.0, 1.0], default [0.25, 0.5, 0.75]. A list of percentiles to be computed.
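To make the parameter concrete, here is a minimal plain-Python sketch of how a list of percentiles could map to the row labels seen in the examples on this page. It does not call pandas-on-Spark; `percentile_labels` is a hypothetical helper, and the range check with `ValueError` is an assumption based only on the documented range [0.0, 1.0].

```python
def percentile_labels(percentiles=None):
    """Hypothetical helper: map requested percentiles to row labels such as '25%'."""
    if percentiles is None:
        percentiles = [0.25, 0.5, 0.75]  # documented default
    if any(p < 0.0 or p > 1.0 for p in percentiles):
        # Assumption: out-of-range input is rejected; the exception type is a guess.
        raise ValueError("percentiles must lie in [0.0, 1.0]")
    # The documented examples show a 50% row even when only 0.15 and 0.85
    # are requested, and the rows appear in ascending order.
    ordered = sorted(set(percentiles) | {0.5})
    return [f"{p:.0%}" for p in ordered]


print(percentile_labels())              # ['25%', '50%', '75%']
print(percentile_labels([0.85, 0.15]))  # ['15%', '50%', '85%']
```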

Returns

DataFrame: Summary statistics of the DataFrame provided.

See also

DataFrame.count: Count number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.mean: Mean of the values.
DataFrame.std: Standard deviation of the observations.

Notes

For numeric data, the result’s index will include count, mean, std, min, 25%, 50%, 75%, max.
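As a rough illustration of what these rows contain, the following plain-Python sketch rebuilds the numeric summary for a small list. It is not the pandas-on-Spark implementation: `numeric_summary` and `nearest_rank` are hypothetical helpers, the standard deviation is assumed to be the sample one (n - 1), and the percentile rule is a simple nearest-rank pick without interpolation, chosen because it reproduces the values shown in the examples on this page.

```python
import math
import statistics


def nearest_rank(sorted_values, p):
    """Pick the element at rank ceil(p * n), with rank at least 1 (no interpolation)."""
    rank = max(1, math.ceil(p * len(sorted_values)))
    return float(sorted_values[rank - 1])


def numeric_summary(values, percentiles=(0.25, 0.5, 0.75)):
    """Hypothetical helper: build the count/mean/std/min/percentile/max rows."""
    # NaN and None are excluded, matching "excluding NaN values" in the description.
    data = sorted(v for v in values if v is not None and not math.isnan(v))
    summary = {
        "count": float(len(data)),
        "mean": float(statistics.mean(data)),
        "std": float(statistics.stdev(data)),  # assumed sample std; needs >= 2 values
        "min": float(data[0]),
    }
    for p in sorted(percentiles):
        summary[f"{p:.0%}"] = nearest_rank(data, p)
    summary["max"] = float(data[-1])
    return summary


result = numeric_summary([1, 2, 3])
print(result)
```

For `[1, 2, 3]` this yields 1.0, 2.0 and 3.0 for the 25%, 50% and 75% rows, the same values as the numeric Series example on this page.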

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
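The object-data rows can be sketched the same way in plain Python. `object_summary` is a hypothetical helper, not part of pandas-on-Spark, and how ties for the most common value are broken is an assumption here (the first value encountered wins).

```python
from collections import Counter


def object_summary(values):
    """Hypothetical helper: build the count/unique/top/freq rows."""
    data = [v for v in values if v is not None]  # skip missing entries
    counts = Counter(data)
    # most_common(1) returns the value with the highest count and that count.
    top, freq = counts.most_common(1)[0]
    return {"count": len(data), "unique": len(counts), "top": top, "freq": freq}


summary = object_summary(["a", "b", "a", None])
print(summary)  # {'count': 3, 'unique': 2, 'top': 'a', 'freq': 2}
```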

Examples

Describing a numeric Series.

>>> import pyspark.pandas as ps
>>> s = ps.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
dtype: float64

Describing a DataFrame. Only numeric fields are returned.

>>> df = ps.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0],
...                    'object': ['a', 'b', 'c']
...                   },
...                   columns=['numeric1', 'numeric2', 'object'])
>>> df.describe()
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
25%         1.0       4.0
50%         2.0       5.0
75%         3.0       6.0
max         3.0       6.0

For multi-index columns:

>>> df.columns = [('num', 'a'), ('num', 'b'), ('obj', 'c')]
>>> df.describe()
       num
         a    b
count  3.0  3.0
mean   2.0  5.0
std    1.0  1.0
min    1.0  4.0
25%    1.0  4.0
50%    2.0  5.0
75%    3.0  6.0
max    3.0  6.0

>>> df[('num', 'b')].describe()
count    3.0
mean     5.0
std      1.0
min      4.0
25%      4.0
50%      5.0
75%      6.0
max      6.0
Name: (num, b), dtype: float64

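Describing a DataFrame and selecting custom percentiles. The 50% row appears in the result even though it was not requested, and the percentile rows are listed in ascending order.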

>>> df = ps.DataFrame({'numeric1': [1, 2, 3],
...                    'numeric2': [4.0, 5.0, 6.0]
...                   },
...                   columns=['numeric1', 'numeric2'])
>>> df.describe(percentiles=[0.85, 0.15])
       numeric1  numeric2
count       3.0       3.0
mean        2.0       5.0
std         1.0       1.0
min         1.0       4.0
15%         1.0       4.0
50%         2.0       5.0
85%         3.0       6.0
max         3.0       6.0

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric1.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.0
50%      2.0
75%      3.0
max      3.0
Name: numeric1, dtype: float64

Describing a column from a DataFrame by accessing it as an attribute and selecting custom percentiles.

>>> df.numeric1.describe(percentiles=[0.85, 0.15])
count    3.0
mean     2.0
std      1.0
min      1.0
15%      1.0
50%      2.0
85%      3.0
max      3.0
Name: numeric1, dtype: float64
