
python - Best way to join two large datasets in Pandas

I'm downloading two datasets from two different databases that need to be joined. Each of them is around 500MB when stored as CSV. Separately they fit into memory, but when I load both I sometimes get a memory error, and I definitely run into trouble when I try to merge them with pandas.

What is the best way to do an outer join on them so that I don't get a memory error? I don't have any database servers at hand, but I can install any kind of open source software on my computer if that helps. Ideally I would still like to solve it in pandas alone, but I'm not sure whether that is possible at all.

To clarify: by merging I mean an outer join. Each table has two columns, product and version. I want to check which product/version combinations are in the left table only, the right table only, and both tables. I do that with:

pd.merge(df1, df2, on=['product', 'version'], how='outer')
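
For the "left only / right only / both" check itself, pandas' merge also accepts indicator=True, which adds a _merge column labelling each row; a minimal sketch of that check (column names as above):

import pandas as pd

# indicator=True adds a categorical '_merge' column with the values
# 'left_only', 'right_only' and 'both'.
merged = pd.merge(df1, df2, on=['product', 'version'], how='outer', indicator=True)
print(merged['_merge'].value_counts())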


1 Reply


This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.

import dask.dataframe as dd

# Read in the csv files.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Merge the csv files.
df = dd.merge(df1, df2, how='outer', on=['product','version'])

# Write the output to a single file (by default dask writes one CSV per partition).
df.to_csv('file3.csv', index=False, single_file=True)
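
Note that dask evaluates lazily: the lines above only build a task graph, and nothing runs until a result is requested. If memory is still tight, the partition size can also be tuned at read time; a small sketch (the 64MB value is an assumption to experiment with, not a recommendation):

# Smaller partitions lower the peak memory footprint at some speed cost.
df1 = dd.read_csv('file1.csv', blocksize='64MB')

# head() is a cheap sanity check: it computes only the first partition.
print(df1.head())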

Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:

df = dd.concat([df1, df2]).drop_duplicates()

I'm not entirely sure if that will be better, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
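
If the full two-column merge is needed repeatedly, one way to get an index-based (and therefore faster) join is to build a single synthetic key from the two columns and set it as the index. This is only a sketch under that assumption; set_index itself triggers a shuffle, so it pays off mainly when the join runs more than once:

# Build a combined key (the '|' separator is an arbitrary choice) and
# index both frames on it, then join on the index.
df1 = df1.assign(key=df1['product'].astype(str) + '|' + df1['version'].astype(str)).set_index('key')
df2 = df2.assign(key=df2['product'].astype(str) + '|' + df2['version'].astype(str)).set_index('key')
df = df1.join(df2, how='outer', lsuffix='_left', rsuffix='_right')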

