Yi Ai yai333
yai333 / combine_hadoop_files.py
Created June 29, 2020 11:51 — forked from mappingvermont/combine_hadoop_files.py
Use distcp to concatenate multi-file CSV output into a single file on S3
import os
import sys
import subprocess
# Point Spark at its installation directory
os.environ["SPARK_HOME"] = r"/usr/lib/spark"

# Add Spark's Python bindings and the bundled py4j archive to the import path
for path in [r'/usr/lib/spark/python/', r'/usr/lib/spark/python/lib/py4j-src.zip']:
    sys.path.append(path)