
Spark performance for Scala vs Python

  • I prefer Python over Scala. But since Spark is natively written in Scala, I was expecting my code to run faster in Scala than in Python, for obvious reasons.

    With that assumption, I decided to learn and write the Scala version of some very common preprocessing code for about 1 GB of data, taken from the SpringLeaf competition on Kaggle. To give an overview of the data: it contains 1936 dimensions and 145232 rows, with columns of various types, e.g. int, float, string, boolean. I am using 6 of 8 cores for Spark processing, which is why I set minPartitions=6 so that every core has something to process.

    Scala Code

    // Read the CSV with one partition per available core
    val input = sc.textFile("train.csv", minPartitions = 6)
    
    // Drop the header row (the first line of partition 0)
    val input2 = input.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter }
    val delim1 = "\001"
    
    // Turn one CSV row into one "VAR_xxxx<delim>value" string per column
    def separateCols(line: String): Array[String] = {
      val line2 = line.replaceAll("true", "1")
      val line3 = line2.replaceAll("false", "0")
      val vals: Array[String] = line3.split(",")
    
      for ((x, i) <- vals.view.zipWithIndex) {
        vals(i) = "VAR_%04d".format(i) + delim1 + x
      }
      vals
    }
    
    val input3 = input2.flatMap(separateCols)
    
    // Split "VAR_xxxx<delim>value" back into a (key, value) pair
    def toKeyVal(line: String): (String, String) = {
      val vals = line.split(delim1)
      (vals(0), vals(1))
    }
    
    val input4 = input3.map(toKeyVal)
    
    // Concatenate all values that share a dimension id
    def valsConcat(val1: String, val2: String): String = {
      val1 + "," + val2
    }
    
    val input5 = input4.reduceByKey(valsConcat)
    
    input5.saveAsTextFile("output")

    Python Code
    
    # Read the CSV with one partition per available core
    input = sc.textFile('train.csv', minPartitions=6)
    DELIM_1 = '\001'
    
    
    # Drop the header row (the first line of partition 0)
    def drop_first_line(index, itr):
      if index == 0:
        return iter(list(itr)[1:])
      else:
        return itr
    
    input2 = input.mapPartitionsWithIndex(drop_first_line)
    
    # Turn one CSV row into one 'VAR_xxxx<delim>value' string per column
    def separate_cols(line):
      line = line.replace('true', '1').replace('false', '0')
      vals = line.split(',')
      vals2 = ['VAR_%04d%s%s' % (e, DELIM_1, val.strip('"'))
               for e, val in enumerate(vals)]
      return vals2
    
    
    input3 = input2.flatMap(separate_cols)
    
    # Split 'VAR_xxxx<delim>value' back into a (key, value) pair
    def to_key_val(kv):
      key, val = kv.split(DELIM_1)
      return (key, val)
    
    input4 = input3.map(to_key_val)
    
    # Concatenate all values that share a dimension id
    def vals_concat(v1, v2):
      return v1 + ',' + v2
    
    input5 = input4.reduceByKey(vals_concat)
    input5.saveAsTextFile('output')
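    (Both versions, by the way, replace "true"/"false" as raw substrings, so any field that merely contains those words would be rewritten too; a minimal plain-Python sketch of that hazard, with a made-up value:)

```python
# 'untrue' merely contains the substring 'true', yet it is still rewritten
line = 'VAR_A,untrue,false'
cleaned = line.replace('true', '1').replace('false', '0')
print(cleaned)  # -> VAR_A,un1,0
```

    It does not affect this dataset's columns, but it is worth knowing when reusing the code.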

    Scala performance: Stage 0 (38 min), Stage 1 (18 s) [Spark UI screenshot]

    Python performance: Stage 0 (11 min), Stage 1 (7 s) [Spark UI screenshot]



    The two versions produce different DAG visualizations (which is why the two screenshots show different stage-0 operations: map for Scala and reduceByKey for Python).

    But essentially, both versions try to transform the data into a (dimension_id, string of comma-separated values) RDD and save it to disk. The output will be used to compute various statistics for each dimension.
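    On toy data, and without Spark, the intended reshape looks like this minimal Python sketch (sample rows made up):

```python
from collections import defaultdict

# toy stand-in for train.csv after the header has been dropped
rows = ['1,true,a', '0,false,b']

grouped = defaultdict(list)
for line in rows:
    line = line.replace('true', '1').replace('false', '0')
    for i, val in enumerate(line.split(',')):   # flatMap + map to (key, value)
        grouped['VAR_%04d' % i].append(val)

# reduceByKey: concatenate all values that share a dimension id
result = {k: ','.join(v) for k, v in grouped.items()}
print(result)  # {'VAR_0000': '1,0', 'VAR_0001': '1,0', 'VAR_0002': 'a,b'}
```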

    Performance-wise, the Scala code for real data like this seems to run 4 times slower than the Python version. The good news for me is that this gives me good motivation to stay with Python; the bad news is that I don't quite understand why.
      December 30, 2021 1:24 PM IST
  • In general, I agree with this overview, but a few things bother me and make me feel that the real problem is that the author hasn't actually worked with Scala or tried to learn it from a good resource.

    First, the infographic makes two claims that don't work together: that Scala has an arcane syntax, and that it is verbose. You can write verbose Scala, and it will look just like Java. If you write good Scala, though, it won't look like Java; indeed, I would argue it looks more like Python. So if you use the full power of Scala's syntax, it will look arcane to those who aren't used to it (granted, the same is true of list comprehensions in Python), but then it will be very concise, at least on par with Python. And if you make it verbose, you clearly aren't using anything arcane.

    They also say that Python is easier for a Java developer to pick up than Scala. That's just plain silly, given that the primary version of Scala runs on the JVM and has the same memory model as Java. As a result, you really can translate Java into Scala pretty much line for line. The same isn't true for Python.

    Lastly, they call Python functional. I feel confident in saying that Python is just as functional as current versions of Java and far less functional than Scala. People often like to say that any language with lambdas is functional, but that ignores how significant immutability is to being functional, and Python doesn't do much of anything on that front.
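    A small illustration of that last point, in plain Python: the core collections are mutable by default, and an immutable, functional style is something you must opt into:

```python
xs = [1, 2, 3]
xs.append(4)        # the idiomatic Python default: mutate in place

ys = (1, 2, 3)      # tuples are immutable, so "updating" one
zs = ys + (4,)      # means building a new value -- the functional idiom

print(xs, zs)  # [1, 2, 3, 4] (1, 2, 3, 4)
```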

      January 6, 2022 1:05 PM IST
  • Scala proves faster than Python in many ways, but there are some valid reasons why Python is becoming more popular than Scala. Let's look at a few of them:

    Python for Apache Spark is pretty easy to learn and use. However, this is not the only reason why PySpark is a better choice than Scala. There's more.

    The Python API for Spark may be slower on the cluster, but in the end data scientists can do a lot more with it compared to Scala. The complexity of Scala is absent; the interface is simple and comprehensive.

    As for readability, maintenance, and familiarity of code, the Python API for Apache Spark is far better than Scala's.

    Python comes with several libraries for machine learning and natural language processing that aid data analysis, and its statistics tooling is mature and time-tested: for instance, NumPy, pandas, scikit-learn, seaborn, and matplotlib.
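    Even the standard library covers the basics: on toy per-dimension strings shaped like the output of the question's job (values made up here), a minimal sketch with the stdlib statistics module:

```python
import statistics

# toy stand-in for the question's (dimension_id, "v1,v2,...") output
dim_values = {'VAR_0000': '1,0,1', 'VAR_0001': '3,5,7'}

# per-dimension mean; the third-party libraries above go far beyond this
stats = {k: statistics.mean(int(x) for x in v.split(','))
         for k, v in dim_values.items()}
print(stats)
```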

    Note: most data scientists use a hybrid approach, drawing on the best of both APIs.

    Lastly, the Scala community often turns out to be a lot less helpful to programmers, which makes Python the more valuable thing to learn. And if you have enough experience with any statically typed programming language like Java, you can stop worrying about not using Scala altogether.

      January 11, 2022 3:54 PM IST
  • Scala is frequently over 10 times faster than Python. Scala runs on the Java Virtual Machine (JVM), which gives it some speed over Python in most cases: Python is dynamically typed, which reduces speed, and compiled languages are generally faster than interpreted ones. In the case of Python, calling into the Spark libraries requires a lot of extra processing, hence slower performance. In this scenario, Scala works well even with limited cores. Moreover, Scala is native to Hadoop, as it is based on the JVM. Hadoop matters because Spark was built on top of Hadoop's filesystem, HDFS. Python interacts with Hadoop services poorly, so developers have to use third-party libraries (like hadoopy), whereas Scala interacts with Hadoop through Hadoop's native Java API. That's why it's very easy to write native Hadoop applications in Scala.
      January 17, 2022 1:49 PM IST