I have data in C:\script\data\YYYY\MM\data.feather
To understand Dask better, I am trying to optimize a simple script which gets the row count from each of those files and sums them up. There are almost 100 million rows across 200 files.
import dask.dataframe as dd
import feather
from dask.distributed import Client, LocalCluster
from dask import delayed

counts = []
with LocalCluster() as cluster, Client(cluster) as client:
    for f in dates:  # dates holds one date per monthly file
        # lazily read the two columns I will eventually need
        df = delayed(feather.read_feather)(
            rf'data\{f.year}\{f.month:02}\data.feather',
            columns=['colA', 'colB'])
        counts.append(df.shape[0])  # row count per file, still lazy
    tot = sum(counts)
    print(dd.compute(tot))
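(dates above is just one timestamp per monthly file, built roughly like this; the start date below is only a placeholder:)

import pandas as pd

# placeholder range: one month-start timestamp per file, ~200 files in total
dates = pd.date_range('2005-01-01', periods=200, freq='MS')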
I've included both colA and colB in read_feather because I eventually want to move on to counting distinct values over different timespans. In the task stream I see read_feather tasks taking about 2 seconds each, plus a number of disk-write-read_feather and disk-read-getattr tasks taking 2-4 seconds each. If I run read_feather in its own script on a single file, it completes in about 0.4 seconds. Is the extra read_feather time due to the parallelism? Also, do the disk-* tasks mean Dask is reading the file, writing it to disk, and then reading it again? I'd thought Dask would read in each file, count the rows, and just keep that number around for summing later.
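In other words, the mental model I had is closer to this sketch, where the counting happens inside the delayed function so each task only hands a single integer back to the scheduler (count_rows is just an illustrative name, not something in my real script):

import feather
from dask import delayed

def count_rows(path):
    # read the two columns and return only the row count,
    # so the task result is a single int rather than a DataFrame
    df = feather.read_feather(path, columns=['colA', 'colB'])
    return len(df)

counts = [delayed(count_rows)(rf'data\{f.year}\{f.month:02}\data.feather')
          for f in dates]
tot = delayed(sum)(counts)
print(tot.compute())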