I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset) in Apache Spark. Can you convert one to the other?
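Yes, you can convert in both directions. A minimal sketch, assuming Spark 2.x with a SparkSession named spark (as in the shell); the Person case class is only an illustrative example:
import spark.implicits._
case class Person(name: String, age: Int)
// RDD -> DataFrame (relies on the implicit encoders imported above)
val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
val df = rdd.toDF()
// DataFrame -> RDD (an RDD[Row]; use df.as[Person].rdd for typed objects)
val backToRdd = df.rdd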
I did come across a mini tutorial for data preprocessing using Spark here: http://ampcamp.berkeley.edu/big-data-mini-course/featurization.html
However, it only discusses text-file parsing. Is there a way to parse XML files with Spark?
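Core Spark SQL has no built-in XML data source; a common approach is the third-party spark-xml package (com.databricks:spark-xml). A hedged sketch, assuming Spark 2.x with a SparkSession named spark (on 1.x, use sqlContext.read instead); the package version, path, and rowTag value are placeholders:
// Launch with e.g.: spark-shell --packages com.databricks:spark-xml_2.11:0.4.1
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")        // the XML element that becomes one row
  .load("hdfs:///path/to/data.xml")
df.printSchema()
df.show(5)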
I would like to read a CSV in Spark, convert it to a DataFrame, and store it in HDFS with df.registerTempTable("table_name"). I have tried:
scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv")
Error which I got:
java.lang.RuntimeException: hdfs:///csv/file/dir/file.csv is not a Parquet file. expected magic number at tail but found
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:277)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:276)
at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at...
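The error appears because sqlContext.load defaults to the Parquet data source. A sketch of reading CSV instead, assuming Spark 1.4+ with the external spark-csv package and that the file has a header row (adjust the options to your data):
// Launch with e.g.: spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs:///csv/file/dir/file.csv")
df.registerTempTable("table_name")
// On Spark 2.x the CSV source is built in:
// val df = spark.read.option("header", "true").csv("hdfs:///csv/file/dir/file.csv")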
I have a Spark Streaming application which produces a dataset every minute. I need to save/overwrite the results of the processed data.
When I try to overwrite the dataset, an org.apache.hadoop.mapred.FileAlreadyExistsException stops the execution.
I set the Spark property set("spark.files.overwrite", "true"), but no luck.
How can I overwrite or pre-delete the files from Spark?
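For what it's worth, spark.files.overwrite only affects files distributed via SparkContext.addFile, not job output. A sketch of the usual alternatives; the result name and output path below are placeholders:
import org.apache.spark.sql.SaveMode
// DataFrame/Dataset output: tell the writer to overwrite the target directory.
result.write.mode(SaveMode.Overwrite).parquet("hdfs:///output/minute-batch")
// For RDD APIs such as saveAsTextFile, either delete the directory first with the
// Hadoop FileSystem API, or relax the pre-existing-output check:
// sparkConf.set("spark.hadoop.validateOutputSpecs", "false")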
I'd like to stop the various messages that appear on the spark shell. I tried to edit the log4j.properties file in order to stop these messages. Here are the contents of log4j.properties:
# Define the root logger with appender file
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
But messages are still getting displayed on the console.
Here are some example messages
15/01/05 15:11:45 INFO SparkEnv: Registering BlockManagerMaster
15/01/05 15:11:45 INFO DiskBlockManager: Created local...
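If the edited log4j.properties is not the copy Spark actually picks up (it must sit in the conf directory or on the classpath), a quick alternative is to raise the level programmatically from the shell. A minimal sketch:
// log4j 1.2 API, works regardless of which properties file is loaded
import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN)
// On Spark 1.4+ the shorthand is:
// sc.setLogLevel("WARN")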
I'm trying to understand the relationship between the number of cores and the number of executors when running a Spark job on YARN.
The test environment is as follows (the relevant Spark settings are sketched after this list):
Number of data nodes: 3
Data node machine spec:
CPU: Core i7-4790 (# of cores: 4, # of threads: 8)
RAM: 32GB (8GB x 4)
HDD: 8TB (2TB x 4)
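For reference, these are the settings that interact when sizing executors on YARN; the values below are placeholders, not a recommendation for this hardware:
val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.instances", "3")   // --num-executors
  .set("spark.executor.cores", "4")       // --executor-cores: concurrent tasks per executor
  .set("spark.executor.memory", "8g")     // --executor-memory: heap per executor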
What are the differences between Apache Spark's SQLContext and HiveContext?
Some sources say that since HiveContext is a superset of SQLContext, developers should always use HiveContext, which has more features than SQLContext. But the current APIs of the two contexts are mostly the same.
In what scenarios is SQLContext or HiveContext more useful?
Is HiveContext more useful only when working with Hive?
Or is SQLContext all that is needed when implementing a Big Data app using Apache Spark?
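For reference, both contexts are built on top of a SparkContext; a minimal Spark 1.x sketch, assuming an existing sc:
// Plain Spark SQL: standard parser, no Hive dependency.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Superset: HiveQL parser, Hive UDFs, and access to the Hive metastore
// (requires the Hive classes on the classpath).
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)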
Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL’s in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?
Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in an in-memory database, but I was not convinced. The example there discusses a scan of a 10,000,000-record table, but that's not really big data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z"...
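A common substitute for indexes in Spark SQL is to partition the data on the column you filter by, so a selective query reads only the matching directories (partition pruning). A sketch under the assumption that y is such a column; the path and names are illustrative, and spark is a SparkSession:
import org.apache.spark.sql.functions.col
// Write the data partitioned by the frequently filtered column.
df.write.partitionBy("y").parquet("hdfs:///warehouse/events")
// A predicate on y then reads only the matching partition directories
// instead of scanning the whole table.
val hits = spark.read.parquet("hdfs:///warehouse/events").where(col("y") === "z")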
How can I increase the memory available to Apache Spark executor nodes?
I have a 2 GB file that is suitable for loading into Apache Spark. I am running Apache Spark for the moment on one machine, so the driver and executor are on the same machine. The machine has 8 GB of memory.
When I try to count the lines of the file after setting the file to be cached in memory, I get these errors:
2014-10-25 22:25:12 WARN CacheManager:71 - Not enough space to cache partition rdd_1_1 in memory! Free memory is 278099801 bytes.
I looked at the documentation here and set spark.executor.memory to 4g in $spark.home.
The UI shows this variable is set in the Spark Environment. You can find a screenshot here.
However, when I go to the Executor tab, the memory limit for my single Executor is still set to 265.4 MB. I also still get the same error.
I tried various things mentioned here but I still get the error and don't have a clear idea where I should change the setting.
I am running my code interactively from the spark-shell.
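If you are running in local mode (master = local[*]), the executor lives inside the driver JVM, so spark.executor.memory does not enlarge the storage space shown; the driver heap has to be sized when the shell is launched. A sketch, where 4g is just an example value:
./bin/spark-shell --driver-memory 4g
or equivalently in conf/spark-defaults.conf:
spark.driver.memory 4g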
I tried to start Spark 1.6.0 (spark-1.6.0-bin-hadoop2.4) on Mac OS Yosemite 10.10.5 using "./bin/spark-shell".
It gives the error below. I also tried installing different versions of Spark, but all have the same error. This is the second time I'm running Spark; my previous run worked fine.
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
16/01/04 13:49:40 WARN Utils: Service...
I am trying to get live JSON data from RabbitMQ into Apache Spark using Java and do some real-time analytics on it.
I am able to get the data and run some basic SQL queries on it, but I am not able to figure out the grouping part.
Below is the JSON I have:
{"DeviceId":"MAC-101","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-101","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-102","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017 16:43:41","FR":10,"ASSP":20,"Mode":1,"EMode":2,"ProgramNo":2,"Status":3,"Timeinmillisecs":636340922213668165}}
{"DeviceId":"MAC-102","DeviceType":"Simulator-1","data":{"TimeStamp":"26-06-2017... less
I'm new to big data processing and I'm reading about tools for stream processing and building data pipelines. I found Apache Spark and Spring Cloud Data Flow. I want to know the main differences between them and their pros and cons. Could anybody help me?
I've got a big RDD (1 GB) in a YARN cluster. On the local machine that uses this cluster I have only 512 MB. I'd like to iterate over the values of the RDD on my local machine. I can't use collect(), because it would create an array locally that is bigger than my heap. I need some iterative way. There is the method iterator(), but it requires additional information that I can't provide.
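One API worth noting is RDD.toLocalIterator, which streams one partition at a time to the driver, so the driver only ever needs to hold the largest partition rather than the whole RDD. A minimal sketch (rdd stands for the existing RDD):
// Memory use on the driver is bounded by the largest partition;
// repartition first if individual partitions are still too big.
rdd.toLocalIterator.foreach { value =>
  println(value)
}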
I am confused as to where Talend and Apache Spark fit in the big data ecosystem, as both can be used for ETL.
Could someone please explain this with an example?
I read the Cluster Mode Overview and I still can't understand the different processes in a Spark Standalone cluster and the parallelism.
Is the worker a JVM process or not? I ran bin\start-slave.sh and found that it spawned the worker, which is actually a JVM.
As per the above link, an executor is a process launched for an application on a worker node that runs tasks. An executor is also a JVM.
These are my questions:
Executors are per application. Then what is the role of a worker? Does it coordinate with the executor and communicate the result back to the driver? Or does the driver talk directly to the executor? If so, what is the worker's purpose?
How do I control the number of executors for an application?
Can tasks be made to run in parallel inside an executor? If so, how do I configure the number of threads for an executor?
What is the relation between a worker, executors, and executor cores (--total-executor-cores)? (See the sketch after these questions.)
What does it mean to have more workers per...
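On the last two questions, a sketch of the standalone-mode settings involved (the values are placeholders): the application's total core budget, the cores each executor may use, and the resulting task parallelism (one task per core):
val conf = new org.apache.spark.SparkConf()
  .set("spark.cores.max", "8")        // --total-executor-cores: cores across the whole app
  .set("spark.executor.cores", "2")   // cores (= concurrent task threads) per executor
  .set("spark.executor.memory", "2g") // heap per executor
// With 8 cores in total and 2 per executor, up to 4 executors can be launched,
// each running 2 tasks in parallel.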