Created April 12, 2017 05:31
PySpark Quick Codes
# Write DataFrame to Disk
spark_df.coalesce(1).write.csv('<saved_output/YOUR_FOLDER_NAME>', header=True, mode='overwrite')
# Read from Disk to DataFrame
new_spark_df = sqlContext.read.csv(s3_path, header=True, inferSchema=False)  # For S3
new_spark_df = sqlContext.read.csv('<LOCATION>', header=True, inferSchema=False)  # mode='FAILFAST'
# SORTING
from pyspark.sql.functions import col
col_name = 'restaurant_id'
spark_df.groupBy(col_name).count().filter("count >= 99").sort(col("count").desc())  # optional: ".toPandas()"