PySpark supports custom serializers for transferring data; this can improve performance.
By default, PySpark uses PickleSerializer, which relies on Python's cPickle module and can handle nearly any Python object. Other serializers, such as MarshalSerializer, support fewer datatypes but can be faster.
The serializer is chosen when creating SparkContext:
>>> from pyspark.context import SparkContext
>>> from pyspark.serializers import MarshalSerializer
>>> sc = SparkContext('local', 'test', serializer=MarshalSerializer())
>>> sc.parallelize(list(range(1000))).map(lambda x: 2 * x).take(10)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
>>> sc.stop()
By default, PySpark serializes objects in batches; the batch size can be controlled through SparkContext's batchSize parameter (the default size is 1024 objects):
>>> sc = SparkContext('local', 'test', batchSize=2)
>>> rdd = sc.parallelize(range(16), 4).map(lambda x: x)
Behind the scenes, this creates a JavaRDD with four partitions, each of which contains two batches of two objects:
>>> rdd.glom().collect()
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
>>> rdd._jrdd.count()
8L
>>> sc.stop()
A batch size of -1 uses an unlimited batch size, and a size of 1 disables batching:
>>> sc = SparkContext('local', 'test', batchSize=1)
>>> rdd = sc.parallelize(range(16), 4).map(lambda x: x)
>>> rdd.glom().collect()
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
>>> rdd._jrdd.count()
16L
Classes:

    PickleSerializer: Serializes objects using Python's cPickle serializer.
    MarshalSerializer: Serializes objects using Python's Marshal serializer.
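Both serializer classes can also be exercised directly, outside of a SparkContext, to compare what each can handle. This is a minimal sketch, assuming the dumps()/loads() methods of the FramedSerializer interface that these classes implement in pyspark.serializers:

>>> # Sketch only: assumes PickleSerializer and MarshalSerializer expose
>>> # dumps()/loads() via the FramedSerializer interface.
>>> from pyspark.serializers import PickleSerializer, MarshalSerializer
>>> data = {"key": [1, 2, 3]}
>>> PickleSerializer().loads(PickleSerializer().dumps(data)) == data
True
>>> MarshalSerializer().loads(MarshalSerializer().dumps(data)) == data
True

MarshalSerializer is limited to built-in types (numbers, strings, lists, dicts, and so on); round-tripping an instance of a user-defined class would fail with it but succeed with PickleSerializer.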