Let's assume for the following that only one Spark job is running at any point in time.
What I get so far
Here is what I understand of what happens in Spark:
When a SparkContext is created, each worker node starts an executor. Executors are separate processes (JVMs) that connect back to the driver program. Each executor has the jar of the driver program. Quitting the driver shuts down the executors. Each executor can hold some partitions.
When a job is executed, an execution plan is created according to the lineage graph.
The job is split into stages, where each stage contains as many neighbouring (in the lineage graph) transformations and actions as possible, but no shuffles. Thus stages are separated by shuffles.
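For illustration, a minimal PySpark sketch (the input file and its contents are made up) where the reduceByKey forces a shuffle and therefore a stage boundary:

# Stage 1: textFile and map are pipelined together in one stage
rdd = sc.textFile("events.txt")  # hypothetical input file
pairs = rdd.map(lambda line: (line.split(",")[0], 1))
# reduceByKey requires a shuffle, so a new stage starts here
counts = pairs.reduceByKey(lambda a, b: a + b)
counts.collect()  # the action triggers the job; each stage runs one task per partition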
I understand that
A task is a command sent from the driver to an executor by serializing the Function object.
The executor deserializes (with the driver jar) the command (task) and executes it on a partition.
but
Question(s)
How is the stage split into those tasks?
Specifically:
Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. the Scala/Java/Python API. Is there an idiomatic way to determine whether the two data frames are equivalent (equal, isomorphic), where equivalence is determined by the data (column names and column values for each row) being identical save for the ordering of rows & columns?

The motivation for the question is that there are often many ways to compute some big data result, each with its own trade-offs. As one explores these trade-offs, it is important to maintain correctness and hence the need to check for equivalence/equality on a meaningful test data set.
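For what it's worth, a minimal PySpark sketch of one possible check (an assumption of mine, not an established idiom; note that subtract is set-based, so duplicate-row counts are ignored):

def dataframes_equivalent(df1, df2):
    # Normalize column order, since column ordering should not matter
    cols = sorted(df1.columns)
    if cols != sorted(df2.columns):
        return False
    a = df1.select(cols)
    b = df2.select(cols)
    # Two-way set difference; empty in both directions means the same distinct rows
    return a.subtract(b).count() == 0 and b.subtract(a).count() == 0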
I prefer Python over Scala. But, as Spark is natively written in Scala, I was expecting my code to run faster in Scala than in Python for obvious reasons.

With that assumption, I thought to learn & write the Scala version of some very common preprocessing code for some 1 GB of data. The data is picked from the SpringLeaf competition on Kaggle. Just to give an overview of the data: it contains 1936 dimensions and 145232 rows, composed of various types, e.g. int, float, string, boolean. I am using 6 cores out of 8 for Spark processing; that's why I used minPartitions=6 so that every core has something to process.

Scala Code
val input = sc.textFile("train.csv", minPartitions = 6)
// Drop the header row from the first partition
val input2 = input.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
val delim1 = "\u0001"
// Replace boolean literals with numeric flags, then split the line into columns
def separateCols(line: String): Array[String] = {
  val line2 = line.replaceAll("true", "1")
  val line3 = line2.replaceAll("false", "0")
  val vals: Array[String] = line3.split(",")
  vals
}
I am using CDH 5.2. I am able to use spark-shell to run the commands. How can I run the file (file.spark) which contains Spark commands? Is there any way to run/compile the Scala programs in CDH 5.2 without sbt? Thanks in advance.
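One minimal approach (a sketch; file.spark is assumed to contain ordinary spark-shell statements) is to let the shell load the script, so no sbt build is needed:

spark-shell -i file.spark

Or, from inside an already running spark-shell session:

:load file.spark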
I am using spark-csv to load data into a DataFrame. I want to do a simple query and display the content:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("my.csv")
df.registerTempTable("tasks")
val results = sqlContext.sql("select col from tasks")
results.show()
I'm new to Spark and I'm trying to read CSV data from a file with Spark. Here's what I am doing:

sc.textFile('file.csv')
  .map(lambda line: (line.split(',')[0], line.split(',')[1]))
  .collect()

I would expect this call to give me a list of the two first columns of my file, but I'm getting this error:

IndexError: list index out of range

although my CSV file has more than one column.
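A common cause of that error (an assumption here, not confirmed by the question) is a blank or short line somewhere in the file; a guarded version of the same read:

rows = (sc.textFile('file.csv')
          .map(lambda line: line.split(','))
          .filter(lambda parts: len(parts) >= 2)  # skip blank/short lines that trigger IndexError
          .map(lambda parts: (parts[0], parts[1]))
          .collect())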
I am new at this concept, and still learning. I have 10 TB of JSON files in total in AWS S3, and 4 instances (m3.xlarge) in AWS EC2 (1 master, 3 workers). I am currently using Spark with Python on Apache Zeppelin. I am reading files with the following command:

hcData = sqlContext.read.option("inferSchema", "true").json(path)

In Zeppelin interpreter settings:
master = yarn-client
spark.driver.memory = 10g
spark.executor.memory = 10g
spark.cores.max = 4
It takes approximately 1 minute to read 1 GB. What more can I do to read big data efficiently? (One possibility is sketched after the questions below.)
Should I do more on coding?
Should I increase instances?
Should I use another notebook platform?
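One hedged suggestion (assuming the JSON structure is stable): inferSchema forces an extra pass over the data, so supplying an explicit schema can avoid roughly half of the read work. A sketch with made-up field names:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema; replace the fields with the ones actually present in the JSON
schema = StructType([
    StructField("user_id", StringType()),
    StructField("value", DoubleType()),
])
hcData = sqlContext.read.schema(schema).json(path)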
I'm trying to implement a Lambda Architecture using the following tools: Apache Kafka to receive all the datapoints, Spark for batch processing (Big Data), Spark Streaming for real time (Fast Data), and Cassandra to store the results.
Also, all the datapoints I receive are related to a user session, and therefore, for the batch processing I'm only interested in processing the datapoints once the session finishes. So, since I'm using Kafka, the only way to solve this (assuming that all the datapoints are stored in the same topic) is for the batch job to fetch all the messages in the topic, and then ignore those that correspond to sessions that have not yet finished.
So, what I'd like to ask is:
Is this a good approach to implement the Lambda Architecture? Or should I use Hadoop and Storm instead? (I can't find information about people using Kafka and Apache Spark for batch processing / MapReduce.)
Is there a better approach to solve the user sessions problem?
I am using https://github.com/databricks/spark-csv , and I am trying to write a single CSV file, but I am not able to: it creates a folder instead.
I need a Scala function which will take parameters like path and file name and write that CSV file.
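A common workaround (a sketch, shown in PySpark for illustration; the equivalent coalesce/write calls exist in the Scala API, and the output path is made up):

# Collapse to a single partition so only one part-file is written
(df.coalesce(1)
   .write
   .format("com.databricks.spark.csv")
   .option("header", "true")
   .save("/tmp/out_dir"))

The writer still produces a directory, but it contains exactly one part-*.csv, which can then be moved or renamed to the desired file name (e.g. with hadoop fs -getmerge or a plain filesystem rename).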
Hi. At university, in the data science area, we learned that if we want to work with small data we should use pandas, and if we work with Big Data we should use Spark; in the case of Python programmers, PySpark.
Recently I saw at a hackathon in the cloud (Azure Synapse, which runs on Spark) people importing pandas in the notebook (I suppose the code is good because it was made by Microsoft people):
import pandas
from azureml.core import Dataset
training_pd = training_data.toPandas().to_csv('training_pd.csv', index=False)
I already have a cluster of 3 machines (ubuntu1, ubuntu2, ubuntu3, VirtualBox VMs) running Hadoop 1.0.0. I installed Spark on each of these machines. ub1 is my master node and the other nodes are working as slaves. My question is: what exactly is a Spark driver? Should we set an IP and port for the Spark driver via spark.driver.host, and where will it be executed and located (master or slave)?
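For reference, a minimal sketch of the driver-address settings the question mentions (hostname and port values are made up; both properties exist in Spark's configuration):

# spark-defaults.conf on the machine that launches the application (hypothetical values)
# spark.driver.host: the address executors use to connect back to the driver
spark.driver.host   ubuntu1
# spark.driver.port: optional fixed port; by default Spark picks a random free port
spark.driver.port   51000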
How can I convert an RDD (org.apache.spark.rdd.RDD) to a DataFrame (org.apache.spark.sql.DataFrame)? I converted a DataFrame to an RDD using .rdd. After processing it I want it back in a DataFrame. How can I do this?
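In PySpark, for illustration (the Scala API has analogous toDF/createDataFrame calls; the column names here are made up):

from pyspark.sql import Row

# Round-trip: DataFrame -> RDD of Rows -> process -> back to a DataFrame
rdd = df.rdd
processed = rdd.map(lambda row: Row(name=row.name, total=row.total * 2))  # hypothetical fields
df2 = sqlContext.createDataFrame(processed)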
Having read this question, I would like to ask additional questions:
The Cluster Manager is a long-running service; on which node is it running?
Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?
In case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly? i.e., how will the Master node, Cluster Manager, and Worker nodes get involved (if they do), and in which order?
Similarly to the previous question: in case the Master node fails, what will happen exactly, and who is responsible for recovering from the failure?
True ... it has been discussed quite a lot.
However there is a lot of ambiguity and some of the answers provided ... including duplicating jar references in the jars/executor/driver configuration or options.
The ambiguous and/or omitted details
The following ambiguous, unclear, and/or omitted details should be clarified for each option:
How ClassPath is affected
Driver
Executor (for tasks running)
Both
Not at all
Separation character: comma, colon, semicolon
If provided files are automatically distributed
for the tasks (to each executor)
for the remote Driver (if run in cluster mode)
Type of URI accepted: local file, HDFS, HTTP, etc.
If copied into a common location, where that location is (HDFS, local?)
The options which it affects:
--jars
SparkContext.addJar(...) method
SparkContext.addFile(...) method
--conf spark.driver.extraClassPath=... or --driver-class-path ...
--conf spark.driver.extraLibraryPath=..., or --driver-library-path ...
--conf spark.executor.extraClassPath=...
--conf spark.executor.extraLibraryPath=...
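For concreteness, a hedged example invocation (jar paths, class name, and application jar are all made up) combining the comma-separated --jars list with the colon-separated extraClassPath entries:

# Hypothetical paths and class name, for illustration only.
# --jars takes a comma-separated list and ships the jars to the executors;
# extraClassPath entries use the platform classpath separator (colon on Linux).
spark-submit \
  --class com.example.Main \
  --master yarn \
  --deploy-mode cluster \
  --jars /opt/libs/a.jar,/opt/libs/b.jar \
  --conf spark.driver.extraClassPath=/opt/libs/a.jar:/opt/libs/b.jar \
  --conf spark.executor.extraClassPath=/opt/libs/a.jar:/opt/libs/b.jar \
  app.jar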