
@huasanyelao
Forked from msukmanowsky/spark_gzip.py
Created December 11, 2015 09:29
Example of how to save Spark RDDs to disk using GZip compression in response to https://twitter.com/rjurney/status/533061960128929793.
from pyspark import SparkContext


def main():
    sc = SparkContext(appName="Test Compression")

    # saveAsHadoopFile requires an RDD of (key, value) pairs.
    data = sc.parallelize([
        ("key1", "value1"),
        ("key2", "value2"),
        ("key3", "value3"),
    ])

    # TextOutputFormat writes each pair as a "key<TAB>value" line, and
    # GzipCodec compresses each part file (part-00000.gz, part-00001.gz, ...).
    data.saveAsHadoopFile(
        "/tmp/spark_compressed",
        "org.apache.hadoop.mapred.TextOutputFormat",
        compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")

    sc.stop()


if __name__ == "__main__":
    main()
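The part files GzipCodec produces are standard gzip, and TextOutputFormat emits one tab-separated "key\tvalue" line per record, so the output can be inspected without any Hadoop libraries. A minimal sketch of that round trip, using a hypothetical part file written with Python's stdlib gzip module to stand in for Spark's output:

```python
import gzip
import os
import tempfile

# Sample records mirroring the RDD above.
records = [("key1", "value1"), ("key2", "value2"), ("key3", "value3")]

# Write them the way a gzip-compressed TextOutputFormat part file looks:
# one "key<TAB>value" line per record, gzip-compressed.
out_dir = tempfile.mkdtemp()
part_file = os.path.join(out_dir, "part-00000.gz")
with gzip.open(part_file, "wt") as f:
    for key, value in records:
        f.write(f"{key}\t{value}\n")

# Read it back with plain gzip -- no Spark or Hadoop needed.
with gzip.open(part_file, "rt") as f:
    pairs = [line.rstrip("\n").split("\t") for line in f]

print(pairs)  # [['key1', 'value1'], ['key2', 'value2'], ['key3', 'value3']]
```

If the RDD holds plain strings rather than pairs, `rdd.saveAsTextFile(path, compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")` gives the same compression without requiring key/value records.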