Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
910 views
in Technique[技术] by (71.8m points)

python - How to import a gzip file larger than RAM limit into a Pandas DataFrame? "Kill 9" Use HDF5?

I have a gzip which is approximately 90 GB. This is well within disk space, but far larger than RAM.

How can I import this into a pandas dataframe? I tried the following in the command line:

# start with Python 3.4.5
import pandas as pd
filename = 'filename.gzip'   # size 90 GB
df = read_table(filename, compression='gzip')

However, after several minutes, Python shuts down with Kill 9.

After defining the database object df, I was planning to save it into HDF5.

What is the correct way to do this? How can I use pandas.read_table() to do this?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

I'd do it this way:

filename = 'filename.gzip'      # size 90 GB
hdf_fn = 'result.h5'
hdf_key = 'my_huge_df'
cols = ['colA','colB','colC','ColZ'] # put here a list of all your columns
cols_to_index = ['colA','colZ'] # put here the list of YOUR columns, that you want to index
chunksize = 10**6               # you may want to adjust it ... 

store = pd.HDFStore(hdf_fn)

for chunk in pd.read_table(filename, compression='gzip', header=None, names=cols, chunksize=chunksize):
    # don't index data columns in each iteration - we'll do it later
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)

# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...