pyspark.RDD.sampleByKey

RDD.sampleByKey(withReplacement: bool, fractions: Dict[K, Union[float, int]], seed: Optional[int] = None) → pyspark.rdd.RDD[Tuple[K, V]]

Return a subset of this RDD sampled by key (via stratified sampling). Each key is sampled at its own rate, as specified by fractions, a map from key to sampling rate.

New in version 0.7.0.

Parameters

withReplacement (bool): whether to sample with or without replacement
fractions (dict): map of specific keys to sampling rates
seed (int, optional): seed for the random number generator

Returns

RDD: an RDD containing the stratified sampling result

See also

RDD.sample()

Examples

>>> fractions = {"a": 0.2, "b": 0.1}
>>> rdd = sc.parallelize(fractions.keys()).cartesian(sc.parallelize(range(0, 1000)))
>>> sample = dict(rdd.sampleByKey(False, fractions, 2).groupByKey().collect())
>>> 100 < len(sample["a"]) < 300 and 50 < len(sample["b"]) < 150
True
>>> max(sample["a"]) <= 999 and min(sample["a"]) >= 0
True
>>> max(sample["b"]) <= 999 and min(sample["b"]) >= 0
True
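
To make the idea of per-key sampling rates concrete, here is a minimal pure-Python sketch of stratified sampling without replacement on a local list of pairs. It is an illustration only and does not reproduce Spark's distributed implementation; the helper name sample_by_key_local is hypothetical, and the sketch assumes every key in the data has an entry in fractions.

```python
import random
from typing import Dict, Hashable, Iterable, Iterator, Optional, Tuple, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def sample_by_key_local(
    pairs: Iterable[Tuple[K, V]],
    fractions: Dict[K, float],
    seed: Optional[int] = None,
) -> Iterator[Tuple[K, V]]:
    """Hypothetical local helper: keep each (key, value) pair independently
    with probability fractions[key]. Assumes every key appears in fractions."""
    rng = random.Random(seed)
    for key, value in pairs:
        # One uniform draw per element; the threshold depends on the key.
        if rng.random() < fractions[key]:
            yield key, value


fractions = {"a": 0.2, "b": 0.1}
# Same shape as the doctest data: every key paired with 0..999.
pairs = [(k, v) for k in fractions for v in range(1000)]
sample = list(sample_by_key_local(pairs, fractions, seed=2))

count_a = sum(1 for k, _ in sample if k == "a")
count_b = sum(1 for k, _ in sample if k == "b")
```

Because each element is kept or dropped independently, the per-key sample sizes are only approximately fraction × count (about 200 for "a" and 100 for "b" here), which is why the doctest above checks ranges rather than exact sizes.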
