
How to read XML files from apache spark framework?

  • I came across a mini tutorial on data preprocessing with Spark here: http://ampcamp.berkeley.edu/big-data-mini-course/featurization.html

    However, it only discusses text file parsing. Is there a way to parse XML files with Spark?

      October 20, 2021 1:12 PM IST
  • I have not used it myself, but the approach would be the same as for Hadoop. For example, you can use StreamXmlRecordReader to process the XML. You need a record reader because you want to control the record boundaries for each element processed; otherwise the default would produce one record per line, since it uses LineRecordReader. It would be helpful to familiarize yourself with the concept of a RecordReader in Hadoop.

    And of course you will have to use SparkContext's hadoopRDD or hadoopFile methods, which take the InputFormat class as a parameter; a rough sketch is below. In case Java is your preferred language, similar alternatives exist.
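
    I haven't tested this, but the Scala side might look like the following, assuming the hadoop-streaming jar (which contains StreamInputFormat and StreamXmlRecordReader) is on the classpath; "book" is a hypothetical row tag:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.streaming.StreamInputFormat
    
    // Tell StreamInputFormat to use StreamXmlRecordReader, and give it the
    // tags that open and close one record ("book" is a placeholder).
    sc.hadoopConfiguration.set("stream.recordreader.class",
      "org.apache.hadoop.streaming.StreamXmlRecordReader")
    sc.hadoopConfiguration.set("stream.recordreader.begin", "<book>")
    sc.hadoopConfiguration.set("stream.recordreader.end", "</book>")
    
    val records = sc.hadoopFile("books.xml",
      classOf[StreamInputFormat], classOf[Text], classOf[Text])
    
    // Each key holds the raw XML of one <book>...</book> element;
    // parse it with any XML library, e.g. scala.xml.
    val xml = records.map { case (k, _) => scala.xml.XML.loadString(k.toString) }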

      December 9, 2021 12:33 PM IST
  • It looks like somebody made an XML data source for Apache Spark.

    https://github.com/databricks/spark-xml

    It supports reading XML files by specifying the row tag and can infer the schema, e.g.:

    import org.apache.spark.sql.SQLContext
    
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
        .format("com.databricks.spark.xml")
        .option("rowTag", "book")
        .load("books.xml")


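    Once loaded, the result is an ordinary DataFrame. For instance, assuming the books.xml sample from the spark-xml README (which has author and title elements):

    df.printSchema()                       // schema inferred from the XML
    df.select("author", "title").show()    // fields assumed from the sample file
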
    You can also use it with spark-shell as below:

    $ bin/spark-shell --packages com.databricks:spark-xml_2.11:0.3.0
    
      October 21, 2021 2:27 PM IST
  • If you are looking to pull individual sub-records out of an XML file, you can use XmlInputFormat to achieve this; a sketch is below. I have written a blog post on the same: http://baahu.in/spark-read-xml-files-using-xmlinputformat/
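
    Untested, but the pattern is roughly the following, assuming Mahout's XmlInputFormat (or a copy of it, as the package has moved between Mahout versions) is on the classpath; "book" is a hypothetical record tag:

    import org.apache.hadoop.io.{LongWritable, Text}
    // Package varies by Mahout version; many projects simply copy the class.
    import org.apache.mahout.classifier.bayes.XmlInputFormat
    
    // Start/end tags that delimit one record ("book" is a placeholder).
    sc.hadoopConfiguration.set("xmlinput.start", "<book>")
    sc.hadoopConfiguration.set("xmlinput.end", "</book>")
    
    val records = sc.newAPIHadoopFile(
      "books.xml",
      classOf[XmlInputFormat],
      classOf[LongWritable],   // key: byte offset of the record
      classOf[Text])           // value: raw XML of one <book> element
    
    val xmlStrings = records.map { case (_, v) => v.toString }
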
      October 23, 2021 4:14 PM IST