Big Data - Spark
  • Vaibhav Mali
    Hi :) I have this code in Spark/Scala that partitions big data (more than 50 GB) by category into CSV files.
    df.write
    .mode(SaveMode.Overwrite)
    ...
    Last post by Advika Banerjee - October 18, 2021
  • Sai Anirudh
    Getting strange behavior when calling a function outside of a closure:

    when the function is in an object, everything works
    when the function is in a class, get...
    Last post by Advika Banerjee - October 18, 2021
  • Jasmine Chacko
    I'm just wondering what the difference is between an RDD and a DataFrame (in Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]) in Apache Spark.
    Can you convert one to the other?
    Last post by Advika Banerjee - October 16, 2021
  • Vaibhav Mali
    I installed Spark using the AWS EC2 guide, and I can launch the program fine using the bin/pyspark script to get to the Spark prompt and can also do the Quick Start guide...
    Last post by Advika Banerjee - October 9, 2021
  • Vaibhav Mali
    I am doing a PoC on Spark's MapReduce performance for calculating a weighted average over 5,000 to 200,000 data points, and it appears to be very slow. So, I just wanted to check whether I am...
    Last post by Samar Patil - September 11, 2021
  • Raji Reddy A
     

    I am using Apache Spark to perform sentiment analysis. I am using the Naive Bayes algorithm to classify the text. I don't know how to find out the probability of labels. I would be...
    Last post by Samar Patil - September 11, 2021
  • Shivakumar Kota
    I have a big data file loaded in Spark but wish to work on a small portion of it to run the analysis. Is there any way to do that? I tried repartitioning, but it brings a...
    Last post by Vaibhav Mali - August 28, 2021
  • Vaibhav Mali
    I need to implement a big data storage + processing system.
    The data grows on a daily basis (at most about 50 million rows/day); the data consists of a very simple JSON document...
    Last post by Samar Patil - August 13, 2021
  • Viaan Prakash
    According to Learning Spark: Keep in mind that repartitioning your data is a fairly expensive operation. Spark also has an optimized version of repartition() called coalesce() that...
    Last post by Tarun Reddy - December 23, 2020
