
How to convert an RDD object to a DataFrame in Spark

  • How can I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) to a DataFrame (org.apache.spark.sql.DataFrame)? I converted a DataFrame to an RDD using .rdd. After processing it, I want it back as a DataFrame. How can I do this?
      August 7, 2020 4:58 PM IST
  • Assuming your RDD[Row] is called rdd, you can use:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    rdd.toDF()
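
    Note that the implicit toDF() needs an element type Spark can derive a schema for, such as a case class or a tuple; a plain RDD[Row] instead needs createDataFrame with an explicit schema, as a later answer shows. Below is a minimal, self-contained sketch of the toDF() route, assuming an existing SparkContext sc; the Person case class and the sample data are purely illustrative:

    import org.apache.spark.sql.SQLContext

    // Illustrative case class; toDF() infers column names and types from its fields
    case class Person(id: Long, name: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._   // brings rdd.toDF() into scope

    val peopleRdd = sc.parallelize(Seq(Person(1L, "alice"), Person(2L, "bob")))

    // RDD[Person] -> DataFrame with columns "id" and "name"
    val peopleDF = peopleRdd.toDF()
    peopleDF.printSchema()
    peopleDF.show()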
      December 22, 2021 1:33 PM IST
  • You can specify whole directories, use wildcards, and even a comma-separated list of directories and wildcards. E.g.:

    sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

    As Nick Chammas points out, this is an exposure of Hadoop's FileInputFormat, and therefore it also works with Hadoop (and Scalding).
      August 7, 2020 4:59 PM IST
  • Suppose you have a DataFrame and you want to modify some of its field data by converting it to an RDD[Row].

    import org.apache.spark.sql.Row

    val aRdd = aDF.rdd.map(x => Row(x.getAs[Long]("id"), x.getAs[Seq[String]]("role").head))

    To convert the RDD[Row] back to a DataFrame, we need to define the schema (StructType) for it.

    If the datatype is Long, it becomes LongType in the schema; if String, then StringType.

    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    val aStruct = new StructType(Array(StructField("id", LongType, nullable = true), StructField("role", StringType, nullable = true)))


    Now you can convert the RDD to a DataFrame using the createDataFrame method.

    val aNamedDF = sqlContext.createDataFrame(aRdd, aStruct)
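
    Putting the steps of this answer together, a minimal runnable sketch of the round trip; the sample data, the id/role columns and the existing sqlContext are assumptions for illustration:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    // Illustrative DataFrame with an id column and an array-of-string role column
    val aDF = sqlContext.createDataFrame(Seq(
      (1L, Seq("admin", "user")),
      (2L, Seq("user"))
    )).toDF("id", "role")

    // DataFrame -> RDD[Row]: keep the id and only the first role
    val aRdd = aDF.rdd.map(x => Row(x.getAs[Long]("id"), x.getAs[Seq[String]]("role").head))

    // Schema describing the rows just built (Long -> LongType, String -> StringType)
    val aStruct = StructType(Array(
      StructField("id", LongType, nullable = true),
      StructField("role", StringType, nullable = true)
    ))

    // RDD[Row] + schema -> DataFrame again
    val aNamedDF = sqlContext.createDataFrame(aRdd, aStruct)
    aNamedDF.show()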
    
      August 28, 2021 1:41 PM IST