
Big Data Process and Analysis in R

  • I know this is not a new concept by any stretch in R, and I have browsed the High Performance and Parallel Computing Task View. With that said, I am asking this question from a point of ignorance, as I have no formal training in Computer Science and am entirely self-taught.

    Recently I collected data from the Twitter Streaming API and currently the raw JSON sits in a 10 GB text file. I know there have been great strides in adapting R to handle big data, so how would you go about this problem? Here are just a handful of the tasks that I am looking to do:

    1. Read and process the data into a data frame
    2. Basic descriptive analysis, including text mining (frequent terms, etc.)
    3. Plotting

    Is it possible to do this entirely in R, or will I have to write some Python to parse the data and load it into a database, so that I can take random samples small enough to fit into R?

    Simply put, any tips or pointers you can provide will be greatly appreciated, and I won't take offense if you describe solutions at a third-grade level.

    Thanks in advance.

      September 3, 2021 1:46 PM IST
  • If you need to operate on the entire 10 GB file at once, then I second @Chase's point about getting a larger, possibly cloud-based computer.
    (The Twitter Streaming API returns a pretty rich object: a single 140-character tweet can weigh in at a couple of kilobytes of data. You might reduce memory overhead by preprocessing the data to extract only the content you need, such as the author name and tweet text; a sketch of one way to do this streaming step in R itself follows at the end of this answer.)
    On the other hand, if your analysis is amenable to segmenting the data (for example, if you first want to group the tweets by author or by date/time), you could consider using Hadoop to drive R.
    Granted, Hadoop will incur some overhead (both cluster setup and learning about the underlying MapReduce model); but if you plan to do a lot of big-data work, you probably want Hadoop in your toolbox anyway.
    A couple of pointers:
    • an example in chapter 7 of Parallel R shows how to set up R and Hadoop for large-scale tweet analysis. The example uses the RHIPE package, but the concepts apply to any Hadoop/MapReduce work.
    • you can also get a Hadoop cluster via AWS/EC2. Check out Elastic MapReduce for an on-demand cluster, or use Whirr if you need more control over your Hadoop deployment.
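    To make the preprocessing point concrete, here is a minimal sketch that does the extraction in R itself with the jsonlite package, streaming the file in chunks so the full 10 GB never sits in memory at once. It assumes one tweet object per line and the classic v1.1 field names (user.screen_name and text); the file names are placeholders.

        # A sketch, assuming one tweet object per line and v1.1 field names;
        # "tweets.json" and "tweets_slim.csv" are placeholder file names.
        library(jsonlite)

        con_in  <- file("tweets.json", open = "r")
        con_out <- file("tweets_slim.csv", open = "w")
        writeLines("author,text", con_out)  # header row

        stream_in(con_in, handler = function(page) {
          page <- flatten(page)  # nested user object becomes user.screen_name, etc.
          slim <- data.frame(author = page[["user.screen_name"]],
                             text   = page[["text"]],
                             stringsAsFactors = FALSE)
          write.table(slim, con_out, sep = ",",
                      row.names = FALSE, col.names = FALSE)
        }, pagesize = 10000)

        close(con_in)
        close(con_out)

    The slimmed-down CSV that comes out the other end is usually small enough to load and analyze in R directly.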
      September 3, 2021 5:25 PM IST
  • There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:
    The read.table function remains the main data-import function in R. It is memory-inefficient and, according to some estimates, requires three times as much memory as the size of the dataset in order to read it into R.
    The reason for this inefficiency is that R stores data.frames in memory as columns (a data.frame is no more than a list of equal-length vectors), whereas text files consist of rows of records. Therefore, read.table has to read whole lines, process them individually to break them into tokens, and then transpose those tokens into column-oriented data structures.
    The ColByCol approach is memory-efficient. Using Java code, it reads the input text file and writes it out as several text files, each holding an individual column of the original dataset. These files are then read into R one at a time, thus avoiding R's memory bottleneck.
    The approach works best for big files divided into many columns, especially when those columns can be converted into memory-efficient types and data structures: R's representation of numbers (in some cases), and character vectors with repeated levels stored as factors, occupy much less space than their raw character representation.
    The colbycol package has been used successfully to read multi-GB datasets on a laptop with 2 GB of RAM.
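    If colbycol proves hard to install, the column-selective idea it implements is easy to demonstrate with data.table::fread, which can likewise read only the columns you ask for. This is a sketch of fread's API, not colbycol's, and the file and column names are assumptions carried over from the preprocessing sketch above:

        # Read only the requested columns from a delimited file.
        library(data.table)

        dt <- fread("tweets_slim.csv", select = c("author", "text"))

        # As noted above, repeated strings stored as factors take far less
        # memory than their raw character representation:
        dt[, author := as.factor(author)]
        print(object.size(dt), units = "MB")

        # A quick frequent-terms count on the text column (task 2 in the
        # original question), using only base R:
        words <- unlist(strsplit(tolower(dt$text), "[^a-z]+"))
        words <- words[words != ""]
        head(sort(table(words), decreasing = TRUE), 20)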
      September 12, 2021 12:57 AM IST
  • R analytics is data analytics using the R programming language, an open-source language for statistical computing and graphics. R is often used in statistical analysis and data mining, and it can be applied to identify patterns and build practical models. It can not only help analyze an organization's data but can also be used in the creation and development of software applications that perform statistical analysis.

    R Analytics

    With a graphical user interface for developing programs, R supports a variety of analytical modeling techniques: classical statistical tests, clustering, time-series analysis, linear and nonlinear modeling, and more (see the short sketch at the end of this answer). A typical interface has four windows: the script window, the console window, the workspace-and-history window, and a window with tabs for help, packages, plots, and files. R produces publication-ready plots and graphics and lets you store analyses as reusable scripts to run against future data.

    R has grown increasingly popular over many years and remains a top analytics language at many universities and colleges. It is well established in academia, as well as among corporations around the world, for delivering robust, reliable, and accurate analytics. While R was originally seen as difficult for non-statisticians to learn, the tooling has become more user-friendly in recent years: the RStudio IDE and the RExcel add-in, among others, make the learning process easier and faster for new business analysts and other users. It has become a standard choice for statistical analysis and data mining projects, and its use is likely to keep growing as more R-trained analysts enter the workforce.
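
    For a flavor of those techniques, here is the short sketch promised above: a minimal, self-contained example in base R. The data are synthetic and exist only to exercise a classical test, a linear model, clustering, and a plot.

        # Minimal tour of base-R analytics on synthetic data.
        set.seed(42)
        x <- rnorm(100, mean = 5)       # fake predictor
        y <- 2 * x + rnorm(100)         # fake response with noise

        t.test(x, mu = 5)               # classical statistical test
        fit <- lm(y ~ x)                # linear modeling
        summary(fit)

        cl <- kmeans(cbind(x, y), centers = 3)   # clustering
        plot(x, y, col = cl$cluster,
             main = "k-means clusters with fitted line")
        abline(fit)                     # publication-ready base graphics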

      September 17, 2021 1:06 PM IST