@jovianlin
Created April 12, 2017 05:31
PySpark Quick Codes
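# Hypothetical setup so the snippets below can run end-to-end (a sketch assuming
# Spark 2.x+, where SQLContext is a legacy wrapper around the SparkSession; the
# toy DataFrame and its column names stand in for the real data):
from pyspark.sql import SparkSession, SQLContext
spark = SparkSession.builder.appName('quick_codes').getOrCreate()
sqlContext = SQLContext(spark.sparkContext)  # legacy handle used in the snippets below
spark_df = spark.createDataFrame(
    [(1, 'r1'), (1, 'r1'), (2, 'r2')],
    ['restaurant_id', 'name'],
)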
# Write DataFrame to disk
# coalesce(1) merges all partitions so the output folder contains a single CSV part file
spark_df.coalesce(1).write.csv('<saved_output/YOUR_FOLDER_NAME>', header=True, mode='overwrite')
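# Note: the write above produces a directory, not a single file; the CSV itself is a
# 'part-00000-*.csv' inside it (plus a _SUCCESS marker). A minimal sketch for locating
# it on a local filesystem (the folder name 'saved_output/my_folder' is an assumption):
import glob
part_files = glob.glob('saved_output/my_folder/part-*.csv')
print(part_files)  # e.g. ['saved_output/my_folder/part-00000-....csv']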
# Read from disk into a DataFrame
new_spark_df = sqlContext.read.csv(s3_path, header=True, inferSchema=False)       # from S3 (s3_path is an S3 URI string)
new_spark_df = sqlContext.read.csv('<LOCATION>', header=True, inferSchema=False)  # optional: mode='FAILFAST' to raise on malformed records
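# With inferSchema=False every column is read as a string. A minimal sketch of passing
# an explicit schema instead (the column names and types here are assumptions):
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField('restaurant_id', IntegerType(), True),
    StructField('name', StringType(), True),
])
typed_df = sqlContext.read.csv('<LOCATION>', header=True, schema=schema)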
# SORTING
# Count rows per restaurant_id, keep groups with at least 99 rows, and sort by count descending
from pyspark.sql.functions import col
col_name = 'restaurant_id'
spark_df.groupBy(col_name).count().filter("count >= 99").sort(col("count").desc())  # optional: append ".toPandas()"
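# The chain above is lazy and returns a DataFrame; nothing is computed until an
# action runs. A minimal usage sketch (variable names are illustrative):
top_counts = spark_df.groupBy(col_name).count().filter("count >= 99").sort(col("count").desc())
top_counts.show(10)       # trigger the computation and print the first 10 groups
# top_counts.toPandas()   # or collect the result into a pandas DataFrame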