
Recommended package for very large dataset processing and machine learning

  •  

    It seems like R is really designed to handle datasets that it can pull entirely into memory. What R packages are recommended for signal processing and machine learning on very large datasets that cannot be pulled into memory?

    If R is simply the wrong way to do this, I am open to other robust, free suggestions (e.g. scipy, if there is some nice way to handle very large datasets).

      May 31, 2019 11:40 AM IST
    0
  • Have a look at the "Large memory and out-of-memory data" subsection of the high performance computing task view on CRAN. bigmemory and ff are two popular packages. For bigmemory (and the related biganalytics and bigtabulate packages), the bigmemory website has a few very good presentations, vignettes, and overviews from Jay Emerson. For ff, I recommend reading Adler, Oehlschlägel, and colleagues' excellent slide presentations on the ff website.

    Also, consider storing the data in a database and reading it in smaller batches for analysis. There are any number of approaches worth considering. To get started, consider looking through some of the examples in the biglm package, as well as this presentation from Thomas Lumley.

    And do investigate the other packages on the high-performance computing task view and those mentioned in the other answers. The packages above are simply the ones I happen to have more experience with.
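
    As a minimal sketch of the bigmemory approach (the file names and dimensions below are made up purely for illustration): a file-backed matrix lives on disk, so only the parts you touch are paged into RAM.

      library(bigmemory)
      library(biganalytics)

      # Create a file-backed matrix; this does not load the data into memory.
      x <- filebacked.big.matrix(nrow = 1e6, ncol = 5, type = "double",
                                 backingfile = "big_data.bin",
                                 descriptorfile = "big_data.desc")

      # Fill a small slice so there is something to summarise; real data would
      # come from read.big.matrix() or another process writing the backing file.
      x[1:1000, ] <- matrix(rnorm(5000), ncol = 5)

      # biganalytics computes column summaries without pulling the whole
      # matrix into RAM.
      colmean(x, na.rm = TRUE)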

      May 31, 2019 11:43 AM IST
    0
  • I think the amount of data you can process is limited more by one's programming skills than anything else. Although a lot of the standard functionality is focused on in-memory analysis, cutting your data into chunks already helps a lot. Of course, this takes more time to program than picking up standard R code, but often it is quite possible.

    Cutting up the data can, for example, be done with read.table or readBin, both of which support reading only a subset of the data. Alternatively, you can take a look at the high performance computing task view for packages that deliver out-of-memory functionality out of the box. You could also put your data in a database. For spatial raster data, the excellent raster package provides out-of-memory analysis.
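
    As a rough sketch of the chunking idea (the file name "big_file.csv" and its numeric column "value" are hypothetical), read.table can be called repeatedly on an open connection so each call only reads the next block of rows:

      chunk_size <- 1e5
      con <- file("big_file.csv", open = "r")

      # Read the header once so later chunks can reuse the column names.
      header <- strsplit(readLines(con, n = 1), ",")[[1]]

      total <- 0; n <- 0
      repeat {
        chunk <- tryCatch(
          read.table(con, sep = ",", nrows = chunk_size,
                     col.names = header, header = FALSE),
          error = function(e) NULL)   # end of file: nothing left to read
        if (is.null(chunk) || nrow(chunk) == 0) break
        total <- total + sum(chunk$value)
        n <- n + nrow(chunk)
      }
      close(con)
      total / n   # running mean computed without loading the whole file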

      June 14, 2019 12:18 PM IST
    0
  • For machine learning tasks I can recommend the biglm package, which does "Regression for data too large to fit in memory". For using R with really big data, you can use Hadoop as a backend and then use the rmr package to perform statistical (or other) analysis via MapReduce on a Hadoop cluster.
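
    A small sketch of how biglm's incremental fitting works: fit on a first chunk, then call update() with further chunks, so only one chunk needs to be in memory at a time (here the chunks come from splitting a small built-in data set, purely for illustration):

      library(biglm)

      # Split a small built-in data set into four chunks to mimic streaming.
      chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))

      # Fit on the first chunk, then feed the remaining chunks one at a time.
      fit <- biglm(mpg ~ wt + hp, data = chunks[[1]])
      for (ch in chunks[-1]) {
        fit <- update(fit, moredata = ch)
      }
      summary(fit)
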
      June 14, 2019 12:23 PM IST
    0
  • You should have a look at the "Large memory and out-of-memory data" subsection of the high performance computing task view on CRAN. bigmemory and ff are two popular packages. The bigmemory website has quite good presentations, vignettes, and overviews from Jay Emerson.

    Also consider storing the data in a database and reading it in smaller batches for analysis. There are many approaches to this problem. You can go through some of the examples in the biglm package.
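
    A rough sketch of the "database plus smaller batches" idea, using DBI with an SQLite file and a table "measurements" with a numeric column "value" (the file, table, and column names are hypothetical):

      library(DBI)
      library(RSQLite)

      con <- dbConnect(SQLite(), "big_data.sqlite")
      res <- dbSendQuery(con, "SELECT value FROM measurements")

      total <- 0; n <- 0
      while (!dbHasCompleted(res)) {
        batch <- dbFetch(res, n = 10000)   # pull 10,000 rows at a time
        total <- total + sum(batch$value)
        n <- n + nrow(batch)
      }
      dbClearResult(res)
      dbDisconnect(con)

      total / n   # mean computed without ever holding the full table in memory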

      September 2, 2021 1:42 PM IST
    0