
Spark performance for Scala vs Python

  • I prefer Python over Scala. But since Spark is natively written in Scala, I was expecting my code to run faster in Scala than in Python, for obvious reasons.

    With that assumption, I decided to learn and write the Scala version of some very common preprocessing code for about 1 GB of data, taken from the SpringLeaf competition on Kaggle. To give an overview of the data: it contains 1936 dimensions and 145232 rows, with columns of various types, e.g. int, float, string, boolean. I am using 6 of 8 cores for Spark processing, which is why I set minPartitions=6 so that every core has something to process.

    Scala Code

    // Read the CSV with one partition per available core
    val input = sc.textFile("train.csv", minPartitions = 6)
    
    // Drop the header row (the first line of partition 0)
    val input2 = input.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter }
    val delim1 = "\001"
    
    // Turn one CSV row into one "VAR_xxxx<delim>value" string per column
    def separateCols(line: String): Array[String] = {
      val line2 = line.replaceAll("true", "1")
      val line3 = line2.replaceAll("false", "0")
      val vals: Array[String] = line3.split(",")
    
      for ((x, i) <- vals.view.zipWithIndex) {
        vals(i) = "VAR_%04d".format(i) + delim1 + x
      }
      vals
    }
    
    val input3 = input2.flatMap(separateCols)
    
    // Split "VAR_xxxx<delim>value" back into a (key, value) pair
    def toKeyVal(line: String): (String, String) = {
      val vals = line.split(delim1)
      (vals(0), vals(1))
    }
    
    val input4 = input3.map(toKeyVal)
    
    // Concatenate all values that share a dimension id
    def valsConcat(val1: String, val2: String): String = {
      val1 + "," + val2
    }
    
    val input5 = input4.reduceByKey(valsConcat)
    
    input5.saveAsTextFile("output")

    Python Code
    
    # Read the CSV with one partition per available core
    input = sc.textFile('train.csv', minPartitions=6)
    DELIM_1 = '\001'
    
    
    # Drop the header row (the first line of partition 0)
    def drop_first_line(index, itr):
      if index == 0:
        return iter(list(itr)[1:])
      else:
        return itr
    
    input2 = input.mapPartitionsWithIndex(drop_first_line)
    
    # Turn one CSV row into one 'VAR_xxxx<delim>value' string per column
    def separate_cols(line):
      line = line.replace('true', '1').replace('false', '0')
      vals = line.split(',')
      vals2 = ['VAR_%04d%s%s' % (e, DELIM_1, val.strip('"'))
               for e, val in enumerate(vals)]
      return vals2
    
    
    input3 = input2.flatMap(separate_cols)
    
    # Split 'VAR_xxxx<delim>value' back into a (key, value) pair
    def to_key_val(kv):
      key, val = kv.split(DELIM_1)
      return (key, val)
    
    input4 = input3.map(to_key_val)
    
    # Concatenate all values that share a dimension id
    def vals_concat(v1, v2):
      return v1 + ',' + v2
    
    input5 = input4.reduceByKey(vals_concat)
    input5.saveAsTextFile('output')
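    (Both versions, by the way, replace "true"/"false" as raw substrings, so any field that merely contains those words would be rewritten too; a minimal plain-Python sketch of that hazard, with a made-up value:)

```python
# 'untrue' merely contains the substring 'true', yet it is still rewritten
line = 'VAR_A,untrue,false'
cleaned = line.replace('true', '1').replace('false', '0')
print(cleaned)  # -> VAR_A,un1,0
```

    It does not affect this dataset's columns, but it is worth knowing when reusing the code.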

    Scala performance: Stage 0 (38 min), Stage 1 (18 s) [Spark UI screenshot]

    Python performance: Stage 0 (11 min), Stage 1 (7 s) [Spark UI screenshot]



    The two versions produce different DAG visualizations (which is why the two screenshots show different stage-0 operations: map for Scala and reduceByKey for Python).

    But essentially, both versions try to transform the data into a (dimension_id, string of comma-separated values) RDD and save it to disk. The output will be used to compute various statistics for each dimension.
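    On toy data, and without Spark, the intended reshape looks like this minimal Python sketch (sample rows made up):

```python
from collections import defaultdict

# toy stand-in for train.csv after the header has been dropped
rows = ['1,true,a', '0,false,b']

grouped = defaultdict(list)
for line in rows:
    line = line.replace('true', '1').replace('false', '0')
    for i, val in enumerate(line.split(',')):   # flatMap + map to (key, value)
        grouped['VAR_%04d' % i].append(val)

# reduceByKey: concatenate all values that share a dimension id
result = {k: ','.join(v) for k, v in grouped.items()}
print(result)  # {'VAR_0000': '1,0', 'VAR_0001': '1,0', 'VAR_0002': 'a,b'}
```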

    Performance-wise, the Scala code for real data like this seems to run 4 times slower than the Python version. The good news for me is that this gives me good motivation to stay with Python; the bad news is that I don't quite understand why.
      December 30, 2021 1:24 PM IST
  • In general, I agree with this overview, but a few things bother me and make me feel that the real problem is that the author hasn't actually worked with Scala or tried to learn it from a good resource.

    First, the infographic makes two claims that don't work together: that Scala has an arcane syntax, and that it is verbose. You can write verbose Scala, and it will look just like Java. If you write good Scala, though, it won't look like Java; indeed, I would argue it looks more like Python. So if you use the full power of Scala's syntax, it will look arcane to those who aren't used to it (granted, the same is true of list comprehensions in Python), but then it will be very concise, at least on par with Python. And if you make it verbose, you clearly aren't using anything arcane.

    They also say that Python is easier for a Java developer to pick up than Scala. That's just plain silly, given that the primary version of Scala runs on the JVM and has the same memory model as Java. As a result, you really can translate Java into Scala pretty much line for line. The same isn't true for Python.

    Lastly, they call Python functional. I feel confident in saying that Python is just as functional as current versions of Java and far less functional than Scala. People often like to say that any language with lambdas is functional, but that ignores how significant immutability is to being functional, and Python doesn't do much of anything on that front.
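    A small illustration of that last point, in plain Python: the core collections are mutable by default, and an immutable, functional style is something you must opt into:

```python
xs = [1, 2, 3]
xs.append(4)        # the idiomatic Python default: mutate in place

ys = (1, 2, 3)      # tuples are immutable, so "updating" one
zs = ys + (4,)      # means building a new value -- the functional idiom

print(xs, zs)  # [1, 2, 3, 4] (1, 2, 3, 4)
```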

      January 6, 2022 1:05 PM IST
  • Scala proves faster than Python in many ways, but there are some valid reasons why Python is becoming more popular than Scala. Let's look at a few of them:

    Python for Apache Spark is pretty easy to learn and use. However, this is not the only reason why PySpark is a better choice than Scala. There's more.

    The Python API for Spark may be slower on the cluster, but in the end data scientists can do a lot more with it compared to Scala. The complexity of Scala is absent; the interface is simple and comprehensive.

    As for readability, maintenance, and familiarity of code, the Python API for Apache Spark is far better than Scala's.

    Python comes with several libraries for machine learning and natural language processing that aid data analysis, and its statistics tooling is mature and time-tested: for instance, NumPy, pandas, scikit-learn, seaborn, and matplotlib.
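    Even the standard library covers the basics: on toy per-dimension strings shaped like the output of the question's job (values made up here), a minimal sketch with the stdlib statistics module:

```python
import statistics

# toy stand-in for the question's (dimension_id, "v1,v2,...") output
dim_values = {'VAR_0000': '1,0,1', 'VAR_0001': '3,5,7'}

# per-dimension mean; the third-party libraries above go far beyond this
stats = {k: statistics.mean(int(x) for x in v.split(','))
         for k, v in dim_values.items()}
print(stats)
```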

    Note: most data scientists use a hybrid approach, drawing on the best of both APIs.

    Lastly, the Scala community often turns out to be a lot less helpful to programmers, which makes Python the more valuable thing to learn. And if you have enough experience with any statically typed programming language like Java, you can stop worrying about not using Scala altogether.

      January 11, 2022 3:54 PM IST
  • Scala is frequently over 10 times faster than Python. Scala runs on the Java Virtual Machine (JVM), which gives it some speed over Python in most cases: Python is dynamically typed, which reduces speed, and compiled languages are generally faster than interpreted ones. In the case of Python, calling into the Spark libraries requires a lot of extra processing, hence slower performance. In this scenario, Scala works well even with limited cores. Moreover, Scala is native to Hadoop, as it is based on the JVM. Hadoop matters because Spark was built on top of Hadoop's filesystem, HDFS. Python interacts with Hadoop services poorly, so developers have to use third-party libraries (like hadoopy), whereas Scala interacts with Hadoop through Hadoop's native Java API. That's why it's very easy to write native Hadoop applications in Scala.
      January 17, 2022 1:49 PM IST