I am using https://github.com/databricks/spark-csv and I am trying to write a single CSV file, but I am not able to: it creates a folder instead.
I need a Scala function that takes parameters such as the path and file name and writes that CSV file.
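Not part of the question, but a common workaround is to coalesce the DataFrame to a single partition, write it to a temporary directory, and then rename the lone part file Hadoop produces. A rough sketch, assuming a Spark 1.x SQLContext with spark-csv on the classpath; writeSingleCsv, destPath, and the temporary-directory convention are all illustrative:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

// Sketch: write `df` as exactly one CSV file at `destPath`.
// Assumes spark-csv is on the classpath.
def writeSingleCsv(df: DataFrame, destPath: String): Unit = {
  val tmpDir = destPath + "_tmp"

  // coalesce(1) forces a single output partition, hence a single part file
  df.coalesce(1)
    .write
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .save(tmpDir)

  val fs = FileSystem.get(df.sqlContext.sparkContext.hadoopConfiguration)
  // pick up the lone part-* file and move it to the requested file name
  val partFile = fs.globStatus(new Path(tmpDir + "/part-*"))(0).getPath
  fs.rename(partFile, new Path(destPath))
  fs.delete(new Path(tmpDir), true)
}

Note that coalesce(1) funnels all the data through one task, so this only makes sense when the result comfortably fits on a single executor.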
True ... it has been discussed quite a lot.
However, there is a lot of ambiguity, and some of the answers provided ... including duplicating jar references in the jars/executor/driver configuration or options.
The ambiguous and/or omitted details
The following ambiguous, unclear, and/or omitted details should be clarified for each option:
- How the ClassPath is affected:
  - Driver
  - Executor (for running tasks)
  - Both
  - Not at all
- Separation character: comma, colon, semicolon
- Whether provided files are automatically distributed:
  - for the tasks (to each executor)
  - for the remote driver (if run in cluster mode)
- Type of URI accepted: local file, HDFS, HTTP, etc.
- If copied into a common location, where that location is (HDFS, local?)
The options it affects:
- --jars
- SparkContext.addJar(...) method
- SparkContext.addFile(...) method
- --conf spark.driver.extraClassPath=... or --driver-class-path ...
- --conf spark.driver.extraLibraryPath=..., or --driver-library-path ...
- --conf spark.executor.extraClassPath=...
- --conf spark.executor.extraLibraryPath=...
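For orientation only (not part of the original question), here is a sketch of the programmatic counterparts of a few of these options. The jar and file paths are placeholders; note that spark.driver.extraClassPath usually has to be set before the driver JVM starts (via spark-submit or spark-defaults.conf), so only the executor-side key is shown being set in code.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the paths below are illustrative placeholders.
val conf = new SparkConf()
  .setAppName("classpath-demo")
  // prepended to each executor's classpath (same key as --conf spark.executor.extraClassPath=...)
  .set("spark.executor.extraClassPath", "/opt/libs/my-lib.jar")

val sc = new SparkContext(conf)

// Ships the jar to executors and puts it on their classpath (similar to --jars)
sc.addJar("/opt/libs/my-lib.jar")

// Distributes an arbitrary file to every node; it does NOT modify any classpath
sc.addFile("hdfs:///data/lookup-table.csv")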
I would like to read a CSV in Spark, convert it to a DataFrame, and store it in HDFS with df.registerTempTable("table_name"). I have tried:
scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv")
The error I got:
java.lang.RuntimeException: hdfs:///csv/file/dir/file.csv is not a Parquet file. expected magic number at tail but found
at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:418)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:277)
at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:276)
at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
...
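For context (not part of the question): sqlContext.load(path) with no data source specified falls back to the default source, which is Parquet, hence the "magic number" error. A hedged sketch of reading the file through the spark-csv package instead, assuming Spark 1.4+ and the package on the classpath; the header/inferSchema options are assumptions:

// Sketch: read the CSV explicitly through spark-csv so the Parquet default is not used.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // assumption: the file has a header row
  .option("inferSchema", "true")  // assumption: let spark-csv guess column types
  .load("hdfs:///csv/file/dir/file.csv")

df.registerTempTable("table_name")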
Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. the Scala/Java/Python API.

Is there an idiomatic way to determine whether the two data frames are equivalent (equal, isomorphic), where equivalence is determined by the data (column names and column values for each row) being identical save for the ordering of rows and columns?

The motivation for the question is that there are often many ways to compute some big data result, each with its own trade-offs. As one explores these trade-offs, it is important to maintain correctness, and hence the need to check for equivalence/equality on a meaningful test data set.
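One possible shape of such a check (a sketch, not an idiom taken from the question): compare the sorted column names, then compare row multisets by pairing each distinct row with its occurrence count, since except has set semantics. sameData and row_count are made-up names:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, lit}

// Sketch: order-insensitive equality on column names and row contents.
// Duplicate rows are handled by comparing (row, occurrence count) pairs.
def sameData(df1: DataFrame, df2: DataFrame): Boolean = {
  val cols1 = df1.columns.sorted
  val cols2 = df2.columns.sorted

  // project to a common column order, then count occurrences of each row
  def normalized(df: DataFrame, cols: Array[String]) =
    df.select(cols.head, cols.tail: _*)
      .groupBy(cols.head, cols.tail: _*)
      .agg(count(lit(1)).as("row_count"))

  cols1.sameElements(cols2) && {
    val n1 = normalized(df1, cols1)
    val n2 = normalized(df2, cols1)
    // equal iff neither side has (row, count) pairs the other lacks
    n1.except(n2).count() == 0 && n2.except(n1).count() == 0
  }
}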
I installed Spark using the AWS EC2 guide, and I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and can also complete the Quick Start guide successfully.

However, I cannot for the life of me figure out how to stop all of the verbose INFO logging after each command.

I have tried nearly every possible scenario in the code below (commenting out, setting to OFF) in my log4j.properties file in the conf folder where I launch the application from, as well as on each node, and nothing does anything. I still get the INFO logging statements printing after executing each statement.

I am very confused about how this is supposed to work.
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
...
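A hedged aside, not part of the question: two usual remedies are editing conf/log4j.properties so the root category reads log4j.rootCategory=WARN, console, or (on Spark 1.4 and later) lowering the level programmatically from the running shell, as sketched below:

// Sketch: lower Spark's console log level from the running context.
// Assumes Spark 1.4+ where SparkContext.setLogLevel exists; `sc` is the
// context the shell already created (pyspark exposes the same call).
sc.setLogLevel("WARN")   // accepted levels include ALL, DEBUG, INFO, WARN, ERROR, OFF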
Hi :) I have this code in Spark/Scala that partitions big data (more than 50 GB) by category into CSV files.
df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("CATEGORY_ID")
  .format("csv")
  .option("header", "true")
  .option("sep", "|")
  .option("quoteAll", true)
  .csv("output/inventory_backup")
The DataFrame df is the result of aggregations on data imported from a CSV file:
df.groupBy("PRODUCT_ID","LOC_ID","DAY_ID")
.agg(
functions.sum("ASSORTED_STOCK_UNIT").as("ASSORTED_STOCK_UNIT_sum"),
functions.sum("SOLID_STOCK_UNIT").as("SOLID_STOCK_UNIT_sum")
)
I would like to tune the performance of this program. Through the Spark UI, I was able to see that the performance bottleneck occurs at the stage that exports the data into CSV files.
More details: I'm using an instance with 16 cores and 120 GB of RAM.
Do you guys have any ideas on how to tune the performance? (It currently takes more than 17 minutes.) Any help will be much appreciated. Thank you.
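Not part of the question, but one commonly tried adjustment is sketched below: repartition by the same column used in partitionBy before writing, so that each CATEGORY_ID directory is written by far fewer tasks and small files. Whether it helps depends on the number of categories and their skew; this is an assumption to test, not a guaranteed fix.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Sketch: cluster the rows by the partition column first, so each
// CATEGORY_ID directory is produced by far fewer tasks / files.
df.repartition(col("CATEGORY_ID"))
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy("CATEGORY_ID")
  .option("header", "true")
  .option("sep", "|")
  .option("quoteAll", true)
  .csv("output/inventory_backup")   // .csv(...) already implies the csv format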
I'm getting strange behavior when calling a function outside of a closure:

- when the function is in an object, everything works
- when the function is in a class, I get:
Task not serializable: java.io.NotSerializableException: testing
The problem is that I need my code in a class and not an object. Any idea why this is happening? Is a Scala object serializable by default?
This is a working code example:
object working extends App {
  val list = List(1, 2, 3)
  val rddList = Spark.ctx.parallelize(list)

  // calling the function outside the closure
  val after = rddList.map(someFunc(_))

  def someFunc(a: Int) = a + 1

  after.collect().map(println(_))
}
This is the non-working example:

object NOTworking extends App {
  new testing().doIT
}

// adding extends Serializable won't help
class testing {
  val list = List(1, 2, 3)
  val rddList = Spark.ctx.parallelize(list)

  def doIT = {
    // again calling the function someFunc
    val after = rddList.map(someFunc(_))
    // this will crash (Spark is lazy, so the failure surfaces at the action)
    after.collect().map(println(_))
  }

  def someFunc(a: Int) = a + 1
}
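A hedged sketch of one common fix (an assumption, not quoted from the question): keep the mapped function out of the class instance, e.g., in a standalone object, so the task closure no longer needs to serialize the testing instance. The Helpers name is made up; Spark.ctx is kept from the question.

// Sketch: move someFunc into an object so the task closure captures
// only the function, not the (non-serializable) `testing` instance.
object Helpers extends Serializable {
  def someFunc(a: Int): Int = a + 1
}

class testing {
  val list = List(1, 2, 3)
  val rddList = Spark.ctx.parallelize(list)

  def doIT = {
    // referring to Helpers.someFunc keeps `this` out of the serialized closure
    val after = rddList.map(Helpers.someFunc)
    after.collect().foreach(println)
  }
}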
I found many options recently, and I am interested in comparing them, primarily by maturity and stability.

Crunch - https://github.com/cloudera/crunch
Scrunch - ...
How can I convert an RDD (org.apache.spark.rdd.RDD) to a DataFrame (org.apache.spark.sql.DataFrame)? I converted a DataFrame to an RDD using .rdd. After processing it, I want it back in a DataFrame. How can I do this?
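A minimal sketch of one way back, assuming a SQLContext is in scope and that the processing keeps the original column layout; originalDf, processed, and the filter step are placeholders:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Sketch: original DataFrame -> RDD[Row] -> processed RDD[Row] -> DataFrame again.
// Assumes the processing preserves the column layout described by `schema`.
val schema: StructType = originalDf.schema
val rdd = originalDf.rdd                       // RDD[Row]
val processed = rdd.filter(_ != null)          // placeholder for the real processing
val backToDf = sqlContext.createDataFrame(processed, schema)

// Alternative, when the RDD holds tuples or case classes:
// import sqlContext.implicits._
// val df2 = someRdd.toDF("colA", "colB")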