This seems like a task that dask was designed for. Essentially, dask can do pandas operations out-of-core, so you can work with datasets that don't fit into memory. The dask.dataframe API is a subset of the pandas API, so there shouldn't be much of a learning curve. See the Dask DataFrame Overview page for some additional DataFrame-specific details.
import dask.dataframe as dd

# Read in the csv files.
df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# Merge the csv files on the shared key columns.
df = dd.merge(df1, df2, how='outer', on=['product', 'version'])

# Write the output. Dask normally writes one file per partition,
# so pass single_file=True to get a single csv.
df.to_csv('file3.csv', index=False, single_file=True)
Assuming that 'product' and 'version' are the only columns, it may be more efficient to replace the merge with:
df = dd.concat([df1, df2]).drop_duplicates()
An outer merge on every shared column is effectively a set union of the rows, so concatenating and dropping duplicates should give the same result. I'm not entirely sure it will actually be faster, but apparently merges that aren't done on the index are "slow-ish" in dask, so it could be worth a try.
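For what it's worth, the usual dask workaround for that slow path is to set the join column as the index first and merge on the index. A minimal sketch of that technique, assuming you were joining on 'product' alone; it is not a drop-in replacement here, since dask's set_index takes a single column and the merge above uses two:

import dask.dataframe as dd

df1 = dd.read_csv('file1.csv')
df2 = dd.read_csv('file2.csv')

# set_index shuffles the data so rows with the same key land in the
# same partition, which makes index-aligned merges much cheaper.
df1 = df1.set_index('product')
df2 = df2.set_index('product')

# Merge on the now-shared index instead of a plain column. Any
# overlapping non-key columns get the usual _x/_y suffixes.
df = dd.merge(df1, df2, left_index=True, right_index=True, how='outer')

The set_index calls are themselves expensive shuffles, so this mainly pays off when you reuse the indexed dataframes for several joins.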