import gcsfs
import pyarrow.parquet


def read_parquet(gs_directory_path, to_pandas=True):
    """
    Reads multiple (partitioned) parquet files from a GS directory
    e.g. 'gs://<bucket>/<directory>' (without ending /)
    """
    gs = gcsfs.GCSFileSystem()
    arrow_df = pyarrow.parquet.ParquetDataset(gs_directory_path, filesystem=gs)
    if to_pandas:
        return arrow_df.read_pandas().to_pandas()
    return arrow_df
Hey @lpillmann!
Yeah, if I set this path I can reach the files: gs://bucket/folder/DATA_PART=201801
But how do I access paths like gs://bucket/folder/*?
I ask because in other environments I can usually read from this kind of path.
Thank you for your help!!
Got it @felipejardimf.
I'd expect PyArrow to be able to read from that path if you pass gs://bucket/folder as gs_directory_path.
However, I'm not able to test it right now. You might want to take a look at the pyarrow.parquet.ParquetDataset documentation and see if you need to tweak any of the parameters for that to work.
Hi everyone!
Unfortunately, I got errors like below.
OSError: Passed non-file path: gs://<bucket>/<folder>
or
ArrowInvalid: Parquet file size is 0 bytes
I found another way here to achieve the same, which will hopefully help someone.
Note that pandas does not support this.
cool, thank you
It worked perfectly for me! Thanks a bunch!
Hi @felipejardimf and @rjurney!
It has been a while since I've worked with GS - I'm not currently able to reproduce it.
Just to be sure, please note that the gs_directory_path argument is the path of the "folder" in which the files are stored, without the ending /.
For the first structure, the argument gs_directory_path would be gs://bucket/folder/DATA_PART=201801.
For the second structure, the argument gs_directory_path would be gs://bar-foo/derived/2021_01/person_company_examples.parquet.
Can you try with these arguments and see if it works? If it doesn't, please share the error message or behavior and I can try to debug with you.