pyspark.RDD.cogroup

RDD.cogroup(other: pyspark.rdd.RDD[Tuple[K, U]], numPartitions: Optional[int] = None) → pyspark.rdd.RDD[Tuple[K, Tuple[pyspark.resultiterable.ResultIterable[V], pyspark.resultiterable.ResultIterable[U]]]]

For each key k in self or other, return an RDD that contains a tuple with the values for that key in self and the values for that key in other. A key present in only one of the two RDDs is paired with an empty iterable for the other side.

New in version 0.7.0.

Parameters

other : RDD
    another RDD
numPartitions : int, optional
    the number of partitions in the new RDD

Returns

RDD: an RDD containing the keys and cogrouped values

See also

RDD.groupWith()
RDD.join()

Examples

>>> rdd1 = sc.parallelize([("a", 1), ("b", 4)])
>>> rdd2 = sc.parallelize([("a", 2)])
>>> [(x, tuple(map(list, y))) for x, y in sorted(list(rdd1.cogroup(rdd2).collect()))]
[('a', ([1], [2])), ('b', ([4], []))]
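
The grouping semantics can also be modeled without a Spark cluster. The following is a minimal pure-Python sketch, not how Spark implements cogroup: `cogroup_local` is a hypothetical helper introduced here for illustration, it works on plain lists of key-value pairs, and it returns lists rather than `ResultIterable` objects.

```python
from collections import defaultdict


def cogroup_local(left, right):
    """Model of cogroup semantics on plain (key, value) pairs.

    Hypothetical helper for illustration only; it is not part of PySpark
    and does no partitioning or shuffling.
    """
    # Each key maps to a pair of lists: (values from left, values from right).
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)


result = cogroup_local([("a", 1), ("b", 4)], [("a", 2)])
print(sorted(result.items()))
# [('a', ([1], [2])), ('b', ([4], []))]
```

As in the example above, the key "b" appears only in the first input, so its second list is empty.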
