pyspark.RDD.countApproxDistinct
- RDD.countApproxDistinct(relativeSD=0.05)
Return approximate number of distinct elements in the RDD.
New in version 1.2.0.
- Parameters
- relativeSD : float, optional
Relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017.
- Returns
- int
approximate number of distinct elements
See also
Notes
The algorithm used is based on streamlib’s implementation of “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm”.
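To illustrate why a smaller relativeSD needs more space, here is a minimal pure-Python sketch of a plain HyperLogLog counter. It is not Spark's code: the helper names `precision_for` and `approx_distinct`, the use of SHA-1 as the hash, and the textbook error bound of roughly 1.04 / sqrt(m) for m registers are all assumptions made for illustration, and the bias corrections of the "HyperLogLog in Practice" variant are omitted.

```python
import hashlib
import math


def precision_for(relative_sd):
    """Pick the number of index bits p so that 1.04 / sqrt(2**p) <= relative_sd.

    Textbook HyperLogLog bound, used here as an assumption; the lower limit
    of 7 keeps m >= 128, where the alpha constant below is valid.
    """
    return max(7, math.ceil(2 * math.log2(1.04 / relative_sd)))


def approx_distinct(items, relative_sd=0.05):
    """Hypothetical helper: estimate the number of distinct items."""
    p = precision_for(relative_sd)
    m = 1 << p                      # number of registers; grows as relative_sd shrinks
    registers = [0] * m
    for item in items:
        digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        idx = h >> (64 - p)                          # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)             # remaining 64 - p bits
        rank = (64 - p) - rest.bit_length() + 1      # position of the leftmost 1 bit
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting while registers are empty.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return round(m * math.log(m / zeros))
    return round(raw)


if __name__ == "__main__":
    print(precision_for(0.05), precision_for(0.01))
    print(approx_distinct(str(i) for i in range(1000)))
    print(approx_distinct(i % 20 for i in range(1000)))
```

Under this bound, halving relativeSD roughly quadruples the number of registers, which is the space cost the relativeSD parameter description refers to.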
Examples
>>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
>>> 900 < n < 1100
True
>>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
>>> 16 < n < 24
True