
Add jars to a Spark Job - spark-submit

  • True ... it has been discussed quite a lot.

    However, there is a lot of ambiguity, and some of the answers provided ... including duplicating jar references in the jars/executor/driver configuration or options.

    The ambiguous and/or omitted details

    The following ambiguous, unclear, or omitted details should be clarified for each option:

    • How ClassPath is affected
      • Driver
      • Executor (for running tasks)
      • Both
      • Not at all
    • Separation character: comma, colon, semicolon
    • If provided files are automatically distributed
      • for the tasks (to each executor)
      • for the remote Driver (if run in cluster mode)
    • Type of URI accepted: local file, HDFS, HTTP, etc.
    • If copied into a common location, where that location is (hdfs, local?)

    The options this affects:

    --jars
    SparkContext.addJar(...) method
    SparkContext.addFile(...) method
    --conf spark.driver.extraClassPath=... or --driver-class-path ...
    --conf spark.driver.extraLibraryPath=..., or --driver-library-path ...
    --conf spark.executor.extraClassPath=...
    --conf spark.executor.extraLibraryPath=...
    And not to forget, the last parameter of spark-submit is also a .jar file.

    I am aware of where to find the main Spark documentation, specifically about how to submit, the options available, and also the JavaDoc. However, that still left me with quite a few holes, although it did answer parts of my question.

    I hope that it is not all that complex, and that someone can give me a clear and concise answer.

    If I were to guess from the documentation, it seems that --jars and the SparkContext addJar and addFile methods are the ones that will automatically distribute files, while the other options merely modify the ClassPath.

    Would it be safe to assume that, for simplicity, I can add additional application jar files using the three main options at the same time:

    spark-submit --jars additional1.jar,additional2.jar \
      --driver-library-path additional1.jar:additional2.jar \
      --conf spark.executor.extraLibraryPath=additional1.jar:additional2.jar \
      --class MyClass main-application.jar

     

    I found a nice article in an answer to another posting, though I learned nothing new from it. The poster does make a good remark on the difference between a local driver (yarn-client) and a remote driver (yarn-cluster), which is definitely important to keep in mind.

     
      December 3, 2021 12:54 PM IST
    0
  • Other configurable Spark options relating to jars and the classpath, in the case of YARN as the deploy mode, are as follows.
    From the Spark documentation:

    spark.yarn.jars

    List of libraries containing Spark code to distribute to YARN containers. By default, Spark on YARN will use Spark jars installed locally, but the Spark jars can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to jars on HDFS, for example, set this configuration to hdfs:///some/path. Globs are allowed.
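
    For example, a rough sketch of using this setting (the HDFS path here is hypothetical, and the class and application jar names are reused from the question above): upload the Spark runtime jars to HDFS once, then point spark.yarn.jars at them for each job.

    # one-time upload of the Spark runtime jars to a hypothetical HDFS directory
    hdfs dfs -mkdir -p /spark/jars
    hdfs dfs -put $SPARK_HOME/jars/*.jar /spark/jars/

    # reference the cached jars on every submission (globs are allowed)
    spark-submit --master yarn \
      --conf spark.yarn.jars="hdfs:///spark/jars/*.jar" \
      --class MyClass main-application.jar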

    spark.yarn.archive

    An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars and the archive is used in all the application's containers. The archive should contain jar files in its root directory. Like with the previous option, the archive can also be hosted on HDFS to speed up file distribution.

    Users can configure these parameters to specify their jars, which in turn get included in the Spark driver's classpath.
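
    For spark.yarn.archive, a rough sketch of the typical workflow (the HDFS paths are hypothetical; per the documentation quoted above, the jar files must sit at the root of the archive, so it is built uncompressed from the jars directory):

    # package the Spark runtime jars at the root of an uncompressed archive
    jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
    hdfs dfs -mkdir -p /spark/archive
    hdfs dfs -put spark-libs.jar /spark/archive/

    # this setting replaces spark.yarn.jars if both are present
    spark-submit --master yarn \
      --conf spark.yarn.archive=hdfs:///spark/archive/spark-libs.jar \
      --class MyClass main-application.jar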

      December 6, 2021 2:00 PM IST
    0
  • When using spark-submit with --master yarn-cluster, the application jar, along with any jars included with the --jars option, will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas (with no spaces). That list is included in the driver and executor classpaths.

    Example :

    spark-submit --master yarn-cluster --jars ../lib/misc.jar,../lib/test.jar --class MainClass MainApp.jar

    https://spark.apache.org/docs/latest/submitting-applications.html

      December 8, 2021 10:13 AM IST
    0
  • When we submit Spark jobs using the spark-submit utility, there is an option, --jars. Using this option, we can pass jar files to Spark applications.
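
    For example, a minimal sketch (the jar paths and class name here are hypothetical):

    spark-submit --jars /path/to/dep1.jar,/path/to/dep2.jar \
      --class com.example.MyApp my-application.jar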

     
      December 20, 2021 12:19 PM IST
    0