pyspark.SparkContext.binaryFiles
SparkContext.binaryFiles(path: str, minPartitions: Optional[int] = None) → pyspark.rdd.RDD[Tuple[str, bytes]]

Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file and the value is the content of each file.
New in version 1.3.0.
Parameters
    path : str
        directory of the input data files; the path can be a comma-separated list of paths to multiple inputs (a short sketch follows the Examples below)
    minPartitions : int, optional
        suggested minimum number of partitions for the resulting RDD

Returns
    RDD
        RDD representing path-content pairs from the file(s)
Notes
Small files are preferred; large files are also allowed, but may cause poor performance, since each file is read as a single in-memory record.
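For instance, as a rough sketch (the directory "/data/blobs" is a hypothetical path, not part of the example below), per-file sizes can be computed on the executors so that only (path, size) pairs reach the driver:

>>> # Map each file's raw bytes to its length; the input path here is hypothetical.
>>> sizes = sc.binaryFiles("/data/blobs").mapValues(len)  # doctest: +SKIP
>>> sizes.take(3)  # doctest: +SKIP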
Examples
>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     # Write a temporary binary file
...     with open(os.path.join(d, "1.bin"), "wb") as f1:
...         _ = f1.write(b"binary data I")
...
...     # Write another temporary binary file
...     with open(os.path.join(d, "2.bin"), "wb") as f2:
...         _ = f2.write(b"binary data II")
...
...     collected = sorted(sc.binaryFiles(d).collect())

>>> collected
[('.../1.bin', b'binary data I'), ('.../2.bin', b'binary data II')]
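As noted under Parameters, path may also be a comma-separated list of inputs. A minimal sketch, assuming two hypothetical directories that are not part of the example above:

>>> # Read two input directories in one call; keys are the full file paths.
>>> pairs = sc.binaryFiles("/data/part1,/data/part2")  # doctest: +SKIP
>>> pairs.keys().collect()  # doctest: +SKIP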