Adding to other answers.
- Is it necessary that spark is installed on all the nodes in the yarn
cluster?
No, If the spark job is scheduling in YARN(either client
or cluster
mode). Spark installation is needed in many nodes only for standalone mode.
These are the visualizations of spark app deployment modes.
Spark Standalone Cluster
In cluster
mode driver will be sitting in one of the Spark Worker node whereas in client
mode it will be within the machine which launched the job.
YARN cluster mode
YARN client mode
This table offers a concise list of differences between these modes:
pics source
- It says in the documentation "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side)
configuration files for the Hadoop cluster". Why does the client node have
to install Hadoop when it is sending the job to cluster?
Hadoop installation is not mandatory but configurations(not all) are!. We can call them Gateway nodes. It's for two main reasons.
- The configuration contained in
HADOOP_CONF_DIR
directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration.
- In YARN mode the ResourceManager’s address is picked up from the
Hadoop configuration(
yarn-default.xml
). Thus, the --master
parameter is yarn
.
Update: (2017-01-04)
Spark 2.0+ no longer requires a fat assembly jar for production
deployment. source
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…