@edisplay
Forked from mneedham/parquet-cli.sh
Created April 26, 2025 20:51
An intro to Apache Parquet
# The NYC Taxis Dataset - https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
pip install parquet-cli  # provides the parq command
parq data/yellow_tripdata_2022-01.parquet            # file metadata
parq data/yellow_tripdata_2022-01.parquet --schema   # column names and types
parq data/yellow_tripdata_2022-01.parquet --head 10  # first 10 rows
parq data/yellow_tripdata_2022-01.parquet --tail 10  # last 10 rows
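# Python: inspect the same file with PyArrow (to_pandas also needs pandas installed)
import pyarrow.parquet as pq
file = pq.ParquetFile("data/yellow_tripdata_2022-01.parquet")
file.metadata  # row count, column count, row groups, format version
file.schema    # column names and Parquet types
file.read().to_pandas()  # in a REPL or notebook, displays the DataFrame
df = file.read().to_pandas()
# Export to row-oriented text formats for the size comparison
df.to_csv("trips.csv")  # also writes the index column; pass index=False to omit it
df.to_json("trips.json", orient="records", lines=True)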
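# Compare file sizes. stat -f %z is the BSD/macOS form; on GNU/Linux use stat -c %s.
# numfmt is part of GNU coreutils.
stat -f %z data/yellow_tripdata_2022-01.parquet | numfmt --to=iec
stat -f %z trips.csv | numfmt --to=iec
stat -f %z trips.json | numfmt --to=iec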
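The size comparison above depends on platform-specific `stat` flags. As a portable alternative, here is a small sketch using only the Python standard library. The helper `human_size` is a hypothetical name introduced for illustration, and its one-decimal formatting only approximates what `numfmt --to=iec` prints.

```python
import os


def human_size(num_bytes: int) -> str:
    """Format a byte count with binary (1024-based) unit suffixes."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024


# Paths taken from the commands above; missing files are reported, not fatal.
for path in ("data/yellow_tripdata_2022-01.parquet", "trips.csv", "trips.json"):
    if os.path.exists(path):
        print(path, human_size(os.path.getsize(path)))
    else:
        print(path, "not found")
```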