Found a nice article on an answer to another posting. However nothing new learned. The poster does make a good remark on the difference between Local driver (yarn-client) and Remote Driver (yarn-cluster). Definitely important to keep in mind.
True ... it has been discussed quite a lot.
However there is a lot of ambiguity and some of the answers provided ... including duplicating jar references in the jars/executor/driver configuration or options.
Following ambiguity, unclear, and/or omitted details should be clarified for each option:
I am aware where I can find the main spark documentation, and specifically about how to submit, the options available, and also the JavaDoc. However that left for me still quite some holes, although it answered partially too.
I hope that it is not all that complex, and that someone can give me a clear and concise answer.
If I were to guess from documentation, it seems that --jars
, and the SparkContext
addJar
and addFile
methods are the ones that will automatically distribute files, while the other options merely modify the ClassPath.
Would it be safe to assume that for simplicity, I can add additional application jar files using the 3 main options at the same time:
I am aware where I can find the main spark documentation, and specifically about how to submit, the options available, and also the JavaDoc. However that left for me still quite some holes, although it answered partially too.
I hope that it is not all that complex, and that someone can give me a clear and concise answer.
If I were to guess from documentation, it seems that --jars
, and the SparkContext
addJar
and addFile
methods are the ones that will automatically distribute files, while the other options merely modify the ClassPath.
Would it be safe to assume that for simplicity, I can add additional application jar files using the 3 main options at the same time:
spark-submit --jar additional1.jar,additional2.jar \
--driver-library-path additional1.jar:additional2.jar \
--conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
--class MyClass main-application.jar
Found a nice article on an answer to another posting. However nothing new learned. The poster does make a good remark on the difference between Local driver (yarn-client) and Remote Driver (yarn-cluster). Definitely important to keep in mind.
Other configurable Spark option relating to jars and classpath, in case of yarn as deploy mode are as follows
From the spark documentation,
spark.yarn.jars
List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
spark.yarn.archive
An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.
Users can configure this parameter to specify their jars, which inturn gets included in Spark driver's classpath.
When using spark-submit with --master yarn-cluster, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included in the driver and executor classpaths
Example :
spark-submit --master yarn-cluster --jars ../lib/misc.jar, ../lib/test.jar --class MainClass MainApp.jar
https://spark.apache.org/docs/latest/submitting-applications.html
While we submit spark jobs using spark-submit utility, there is an option --jars . Using this option, we can pass jar file to spark applications.