Dask provides multi-core execution on larger-than-memory datasets.
We can think of Dask at a high and a low level:
High-level collections: Dask provides high-level Array, Bag, and DataFrame
collections that mimic NumPy, lists, and Pandas but can operate in parallel on
datasets that don't fit into main memory. Dask's high-level collections are
alternatives to NumPy and Pandas for large datasets.
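As a rough illustration of the high-level collections, the sketch below builds a chunked array with dask.array and reduces it. The array size and chunk shape are arbitrary values chosen for this example, not taken from the tutorial, and the array is small enough to fit in memory so that it runs quickly; the same code pattern applies when the data does not fit.

```python
import dask.array as da

# A 2000 x 2000 array of ones, split into 16 blocks of 500 x 500.
# (Sizes are illustrative assumptions.)
x = da.ones((2000, 2000), chunks=(500, 500))

# NumPy-style expressions only build a lazy task graph here;
# no block has been computed yet.
total = (x + x.T).sum()

# compute() hands the graph to a scheduler, which evaluates the
# blocks in parallel and combines the partial sums.
result = total.compute()
print(result)
```

Because each block is processed independently, only a few blocks need to be held in memory at any one time.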
Low-level schedulers: Dask provides dynamic task schedulers that
execute task graphs in parallel. These execution engines power the
high-level collections mentioned above but can also power custom,
user-defined workloads. These schedulers are low-latency (around 1ms) and
work hard to run computations in a small memory footprint. Dask's
schedulers are an alternative to direct use of the threading or
multiprocessing libraries in complex cases, and to other task scheduling
systems like Luigi or IPython parallel.
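To make the idea of a task graph concrete, here is a minimal sketch of a hand-written graph run by the threaded scheduler. A Dask graph can be written as a plain dictionary that maps keys to values or to tuples of the form (function, arguments...). The key names and the arithmetic below are made up for illustration.

```python
from operator import add, mul

from dask.threaded import get

# Keys map either to literal values or to tasks.
# A task is a tuple: (callable, arg1, arg2, ...), where arguments
# that match other keys are replaced by those keys' results.
dsk = {
    "a": 1,
    "b": 2,
    "sum": (add, "a", "b"),          # 1 + 2
    "prod": (mul, "a", "b"),         # 1 * 2
    "total": (add, "sum", "prod"),   # 3 + 2
}

# The scheduler resolves dependencies and may run independent
# tasks ("sum" and "prod") in parallel threads.
result = get(dsk, "total")
print(result)
```

The high-level collections generate graphs of this kind automatically; writing one by hand is mainly useful for custom workloads that do not fit an array or dataframe model.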
Different users operate at different levels, but it is useful to understand
both. This tutorial alternates between high-level use of dask.array and
dask.dataframe (even sections) and low-level use of Dask graphs and
schedulers (odd sections).
Prepare
1. You should clone this repository
git clone http://github.com/dask/dask-tutorial
and then install necessary packages.
There are three different ways to achieve this. Pick the one that best suits you, and follow only that one.
They are, in order of preference:
You may find the following libraries helpful for some exercises:
conda install python-graphviz -c conda-forge
Note that these options will alter your existing environment, potentially changing the versions of packages you already
have installed.
2c) Use Dockerfile
You can build a docker image from the provided Dockerfile.
$ docker build . # This will build using the same env as in a)
Run a container, replacing <container_id_or_tag> with the image ID printed by the previous command:
$ docker run -it -p 8888:8888 -p 8787:8787 <container_id_or_tag>
The above command will print a URL (like http://(container_id or 127.0.0.1):8888/?token=<sometoken>) which
can be used to access the notebook from a browser. You may need to replace the given hostname with "localhost" or
"127.0.0.1".
You should follow only one of the options above!
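After following one of the options, a quick check along these lines can confirm that the environment is usable before launching the notebook. The package list is an assumption based on the libraries named in this document (the python-graphviz package is imported as graphviz); it is not the tutorial's official requirements list.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Assumed list for illustration; adjust to the environment you installed.
assumed = ["dask", "numpy", "pandas", "graphviz"]

missing = missing_packages(assumed)
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All assumed packages are importable.")
```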
Launch notebook
From the repo directory
jupyter notebook
Or
jupyter lab
This was already done for option 2c) and does not need repeating.