Add jars to a Spark Job - spark-submit


Tags : apache-spark java scala spark-submit jar

Viaan Prakash

461
True ... it has been discussed quite a lot.

However, there is a lot of ambiguity, and some of the answers provided ... including duplicating jar references in the jars/executor/driver configuration or options.

The ambiguous and/or omitted details

The following ambiguous, unclear, and/or omitted details should be clarified for each option:
- How the ClassPath is affected
  - Driver
  - Executor (for running tasks)
  - Both
  - Not at all
- Separation character: comma, colon, semicolon
- Whether the provided files are automatically distributed
  - for the tasks (to each executor)
  - for the remote Driver (if run in cluster mode)
- Type of URI accepted: local file, hdfs, http, etc.
- If copied into a common location, where that location is (hdfs, local?)

The options to which this applies:
- --jars
- SparkContext.addJar(...) method
- SparkContext.addFile(...) method
- --conf spark.driver.extraClassPath=... or --driver-class-path ...
- --conf spark.driver.extraLibraryPath=... or --driver-library-path ...
- --conf spark.executor.extraClassPath=...
- --conf spark.executor.extraLibraryPath=...

Not to forget, the last parameter of spark-submit is also a .jar file.

I know where to find the main Spark documentation, specifically the pages on how to submit applications and the options available, as well as the JavaDoc. However, that still left quite a few holes for me, although it answered part of my question.

I hope that it is not all that complex, and that someone can give me a clear and concise answer.

If I were to guess from the documentation, it seems that --jars and the SparkContext addJar and addFile methods are the ones that automatically distribute files, while the other options merely modify the ClassPath.

Would it be safe to assume that for simplicity, I can add additional application jar files using the 3 main options at the same time:

```
spark-submit --jars additional1.jar,additional2.jar \
  --driver-library-path additional1.jar:additional2.jar \
  --conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
  --class MyClass main-application.jar
```
I found a nice article in an answer to another posting. However, I learned nothing new from it. The poster does make a good remark on the difference between a local driver (yarn-client) and a remote driver (yarn-cluster), which is definitely important to keep in mind.
December 3, 2021 12:54 PM IST

0
Advika Banerjee

319 1

Other configurable Spark options relating to jars and the classpath, when YARN is the cluster manager, are as follows.
From the Spark documentation:

spark.yarn.jars

List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.

spark.yarn.archive

An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.

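Users can set these parameters to point at the jars to ship; those jars are in turn included in the classpath of the YARN containers, which covers the Spark driver when it runs in cluster mode. Note that, per the descriptions above, both settings are meant for the Spark runtime jars rather than for application dependencies.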
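
To make the two settings concrete, here is a minimal sketch of how they could be passed on the command line. The archive name spark-libs.zip, the glob under hdfs:///some/path, and the class and jar names are placeholders, not values from the documentation, and the files are assumed to have been uploaded to HDFS beforehand. The commands are printed rather than executed so the sketch runs without a Spark installation.

```shell
#!/bin/sh
# Hypothetical HDFS locations; upload the Spark jars or archive there first.
SPARK_JARS_GLOB="hdfs:///some/path/*.jar"
SPARK_ARCHIVE="hdfs:///some/path/spark-libs.zip"

# Option 1: point YARN at individual Spark jars (globs are allowed).
CONF_JARS="spark.yarn.jars=$SPARK_JARS_GLOB"

# Option 2: point YARN at a single archive; if set, it replaces spark.yarn.jars.
CONF_ARCHIVE="spark.yarn.archive=$SPARK_ARCHIVE"

# Printed, not executed: remove the leading "echo" to submit for real.
echo spark-submit --master yarn --conf "$CONF_JARS" \
  --class MyClass main-application.jar
echo spark-submit --master yarn --conf "$CONF_ARCHIVE" \
  --class MyClass main-application.jar
```

Only one of the two is needed in practice, since the archive setting takes precedence when both are present.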

December 6, 2021 2:00 PM IST

0
Vaibhav Mali

259

When using spark-submit with --master yarn --deploy-mode cluster (the older spelling --master yarn-cluster is deprecated since Spark 2.0), the application jar, along with any jars included with the --jars option, will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas, with no spaces in between. That list is included in the driver and executor classpaths.

Example:

spark-submit --master yarn --deploy-mode cluster --jars ../lib/misc.jar,../lib/test.jar --class MainClass MainApp.jar

https://spark.apache.org/docs/latest/submitting-applications.html
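
Since a stray space in the list is an easy mistake to make, a small helper can build the comma-separated value for --jars. This is only a sketch: join_jars is a made-up helper name, and the jar paths are the ones from the example above. The final command is printed rather than run, so it works without Spark installed.

```shell
#!/bin/sh
# Join all arguments with commas, the separator that --jars expects.
join_jars() {
  list=""
  for jar in "$@"; do
    # ${list:+...} inserts the comma only once the list is non-empty.
    list="${list:+$list,}$jar"
  done
  printf '%s' "$list"
}

# Example dependency paths; replace them with real files.
JARS=$(join_jars ../lib/misc.jar ../lib/test.jar)

# Printed, not executed: remove the leading "echo" to submit for real.
echo spark-submit --master yarn --deploy-mode cluster \
  --jars "$JARS" \
  --class MainClass MainApp.jar
```

To pick up every jar in a directory, the helper could be called as join_jars ../lib/*.jar, assuming none of the file names contain commas.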

December 8, 2021 10:13 AM IST

0
Maryam Bains

317

When we submit Spark jobs with the spark-submit utility, there is a --jars option. Using this option, we can pass jar files to Spark applications.
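
As a rough illustration, the entries given to --jars do not all have to be plain local paths. According to the Spark "Submitting Applications" page, file:, hdfs:, http:, https:, ftp: and local: URIs are accepted. The sketch below mixes three of them; every path and name in it is a placeholder, and the command is printed rather than executed.

```shell
#!/bin/sh
# All three locations are hypothetical placeholders.
LOCAL_JAR="local:/opt/libs/on-every-node.jar"    # expected to exist on each worker node already
DRIVER_JAR="file:/home/me/libs/from-driver.jar"  # executors pull it from the driver
HDFS_JAR="hdfs:///libs/from-hdfs.jar"            # pulled down from HDFS

# Comma-separated, with no spaces between entries.
JARS="$LOCAL_JAR,$DRIVER_JAR,$HDFS_JAR"

# Printed, not executed: remove the leading "echo" to submit for real.
echo spark-submit --jars "$JARS" --class MyClass main-application.jar
```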

December 20, 2021 12:19 PM IST

0
