Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others



python - A comparison between fastparquet and pyarrow?

After some searching I failed to find a thorough comparison of fastparquet and pyarrow.

I found this blog post (a basic comparison of speeds).

and a GitHub discussion claiming that files created with fastparquet are not supported by AWS Athena (by the way, is that still the case?).

When and why would I use one over the other? What are the major advantages and disadvantages?


My specific use case is processing data with Dask, writing it to S3, and then reading/analyzing it with AWS Athena.

question from:https://stackoverflow.com/questions/51361356/a-comparison-between-fastparquet-and-pyarrow


1 Reply


I used both fastparquet and pyarrow to convert protobuf data to Parquet and to query it in S3 using Athena. Both worked. However, my use case is a Lambda function, where the package zip file has to be lightweight, so I went ahead with fastparquet. (The fastparquet library was only about 1.1 MB, while the pyarrow library was 176 MB, and the Lambda package limit is 250 MB.)
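To make the size argument concrete, here is a rough sketch of how one might check a dependency against that limit before packaging. The helper names (`directory_size_mb`, `remaining_budget_mb`) and the example `site-packages` path are hypothetical, and the 250 MB figure is simply the limit quoted above, so treat this as an illustration rather than an official AWS check.

```python
import os

# Package limit quoted in the answer above (assumption: unzipped size, in MB).
LAMBDA_LIMIT_MB = 250


def directory_size_mb(root):
    """Return the total size of all regular files under root, in MB."""
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / (1024 * 1024)


def remaining_budget_mb(sizes_mb, limit_mb=LAMBDA_LIMIT_MB):
    """Return how many MB are left after the given packages; negative means over the limit."""
    return limit_mb - sum(sizes_mb)


# Using the sizes reported in the answer:
print(remaining_budget_mb([1.1]))   # headroom left with fastparquet
print(remaining_budget_mb([176]))   # headroom left with pyarrow
```

In practice you would measure the installed package directly, for example `directory_size_mb("build/site-packages/pyarrow")` (a hypothetical path), and add the sizes of your other dependencies before comparing against the limit.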

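I used the following to store a DataFrame as a Parquet file:

from os import path

from fastparquet import write

# filename is the output name without an extension; df_data is a pandas DataFrame
parquet_file = path.join(filename + '.parq')
write(parquet_file, df_data)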
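If you want to try both libraries on the same data, one option is to go through pandas, whose `DataFrame.to_parquet` and `read_parquet` accept an `engine` argument. The following is only a sketch under a few assumptions: pandas is installed, the sample DataFrame and the helper names (`installed_engines`, `roundtrip`) are made up for illustration, and only the engines actually present in the environment are exercised.

```python
import importlib.util
import os
import tempfile

import pandas as pd


def installed_engines(candidates=("fastparquet", "pyarrow")):
    """Return the candidate module names that can be imported in this environment."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


def roundtrip(df, engine, directory):
    """Write df to a Parquet file with the given engine and read it back."""
    target = os.path.join(directory, f"sample_{engine}.parq")
    df.to_parquet(target, engine=engine, index=False)
    return pd.read_parquet(target, engine=engine)


# Hypothetical sample data standing in for the real DataFrame.
df_data = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})

with tempfile.TemporaryDirectory() as tmp:
    for engine in installed_engines():
        restored = roundtrip(df_data, engine, tmp)
        print(engine, restored.shape)
```

Because the engine is just a parameter here, the rest of the pipeline stays the same when you switch libraries, which makes it easier to compare file sizes, speed, and Athena compatibility for your own data.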
