Recommended package for very large dataset processing and machine learning in R [closed]

  • It seems like R is really designed to handle datasets that it can pull entirely into memory. What R packages are recommended for signal processing and machine learning on very large datasets that cannot be pulled into memory?

    If R is simply the wrong tool for this, I am open to other robust, free alternatives (e.g., SciPy, if it offers a good way to handle very large datasets).

      October 14, 2021 1:12 PM IST
    0
  • It all depends on the algorithms you need. If they can be expressed in incremental form (where only a small part of the data is needed at any given moment; e.g., for Naive Bayes you only need to hold in memory the model itself and the observation currently being processed), then the best approach is to perform machine learning incrementally, reading new batches of data from disk.
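
    The incremental idea above can be sketched as follows. This is a minimal illustration in Python rather than R, and it assumes numeric features modeled as per-class Gaussians; the class name `IncrementalGaussianNB` and its methods are hypothetical, not taken from any package. Only running counts, means, and squared deviations are kept in memory, so each batch can be discarded once it has been processed.

```python
import math
from collections import defaultdict


class IncrementalGaussianNB:
    """Gaussian Naive Bayes that learns one observation at a time.

    Hypothetical illustration: only per-class running statistics are
    kept in memory, never the observations themselves.
    """

    def __init__(self, var_floor=1e-9):
        self.var_floor = var_floor     # keeps variances strictly positive
        self.count = defaultdict(int)  # class -> number of observations seen
        self.mean = {}                 # class -> per-feature running mean
        self.m2 = {}                   # class -> per-feature sum of squared deviations

    def update(self, x, y):
        """Fold a single observation (features x, label y) into the model."""
        if y not in self.mean:
            self.mean[y] = [0.0] * len(x)
            self.m2[y] = [0.0] * len(x)
        self.count[y] += 1
        n = self.count[y]
        for j, value in enumerate(x):
            # Welford's online update of mean and squared deviations.
            delta = value - self.mean[y][j]
            self.mean[y][j] += delta / n
            self.m2[y][j] += delta * (value - self.mean[y][j])

    def partial_fit(self, batch):
        """Process one batch of (features, label) pairs, then forget it."""
        for x, y in batch:
            self.update(x, y)

    def predict(self, x):
        """Return the class with the highest log posterior for x."""
        total = sum(self.count.values())
        best_label, best_score = None, -math.inf
        for y, n in self.count.items():
            score = math.log(n / total)  # log prior
            for j, value in enumerate(x):
                var = max(self.m2[y][j] / n, self.var_floor)
                score += (-0.5 * math.log(2 * math.pi * var)
                          - (value - self.mean[y][j]) ** 2 / (2 * var))
            if score > best_score:
                best_label, best_score = y, score
        return best_label


# Two batches arriving separately, as if read from disk one after another.
model = IncrementalGaussianNB()
model.partial_fit([([1.0, 1.1], "a"), ([0.9, 1.0], "a")])
model.partial_fit([([5.0, 5.2], "b"), ([5.1, 4.9], "b")])
label_near_a = model.predict([1.0, 1.0])
label_near_b = model.predict([5.0, 5.0])
```

    In practice each call to `partial_fit` would receive a batch read from a file or database, so memory use depends on the number of classes and features, not on the number of rows.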

    However, many algorithms, and especially their implementations, really do require the whole dataset. If the dataset fits on your disk (and within your file system's limits), you can use the mmap package, which lets you map a file on disk into memory and use it from your program. Note, however, that disk reads and writes are expensive, and R sometimes moves data back and forth frequently, so be careful.

    If your data can't be stored even on your hard drive, you will need to use a distributed machine learning system. One such R-based system is Revolution R, which is designed to handle really large datasets. Unfortunately, it is not open source and costs quite a lot of money, but you may be able to get a free academic license. As an alternative, you may be interested in the Java-based Apache Mahout: not as elegant, but a very efficient solution, built on Hadoop and including many important algorithms.
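
    To illustrate what memory mapping looks like, here is a minimal sketch using Python's standard `mmap` module rather than the R mmap package mentioned above. It assumes a binary file of native 64-bit floats; the helper names `write_doubles` and `mapped_sum` are hypothetical. Only the requested slice is copied out of the mapping, so the file as a whole never has to be loaded.

```python
import mmap
import os
import struct
import tempfile
from array import array

ITEM_SIZE = struct.calcsize("d")  # bytes per native float64


def write_doubles(path, values):
    """Write values to path as raw native float64 (demo helper)."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def mapped_sum(path, start, stop):
    """Sum elements [start, stop) of a float64 file through a memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing copies only the requested bytes out of the mapping.
            chunk = mm[start * ITEM_SIZE:stop * ITEM_SIZE]
    return sum(value for (value,) in struct.iter_unpack("d", chunk))


# Demo on a small temporary file; a real file could be far larger than RAM.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.bin")
    write_doubles(path, range(1000))
    partial = mapped_sum(path, 10, 20)
    full = mapped_sum(path, 0, 1000)
```

    The same caution applies here as in R: access patterns that jump around the file force many disk reads, so sequential or blockwise access is usually the safer choice.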

      November 1, 2021 2:29 PM IST
    0
  • Have a look at the "Large memory and out-of-memory data" subsection of the high-performance computing task view on CRAN. bigmemory and ff are two popular packages. For bigmemory (and the related biganalytics and bigtabulate), the bigmemory website has a few very good presentations, vignettes, and overviews from Jay Emerson. For ff, I recommend reading the excellent slide presentations by Adler, Oehlschlägel, and colleagues on the ff website.
    Also, consider storing the data in a database and reading it in smaller batches for analysis. There are likely any number of approaches to consider. To get started, consider looking through some of the examples in the biglm package, as well as this presentation from Thomas Lumley.
    Do also investigate the other packages listed in the high-performance computing task view and those mentioned in the other answers. The packages I mention above are simply the ones I happen to have more experience with.
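
    The suggestion to keep data in a database and read it in batches can be sketched as follows, in Python with the standard `sqlite3` module. The table name `samples`, its columns, and the function `fit_line_in_batches` are hypothetical. The fit accumulates simple running sums for a one-variable least-squares line; this is only in the spirit of biglm, not its actual algorithm, and running sums can lose precision on badly scaled data.

```python
import sqlite3


def fit_line_in_batches(conn, batch_size=1000):
    """Least-squares fit of y = intercept + slope * x, read in batches.

    Assumes a hypothetical table samples(x REAL, y REAL). Only five
    running sums are kept in memory, regardless of the table size.
    """
    n = sx = sy = sxx = sxy = 0.0
    cursor = conn.execute("SELECT x, y FROM samples")
    while True:
        rows = cursor.fetchmany(batch_size)  # at most batch_size rows at once
        if not rows:
            break
        for x, y in rows:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Demo with an in-memory database holding points on the line y = 3 + 2x.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    ((float(i), 3.0 + 2.0 * i) for i in range(5000)),
)
intercept, slope = fit_line_in_batches(conn, batch_size=512)
conn.close()
```

    With a real on-disk database the pattern is the same; only the connection string and the query change.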
      October 16, 2021 2:41 PM IST
    0
  • I think the amount of data you can process is limited more by one's programming skills than by anything else. Although a lot of standard functionality is focused on in-memory analysis, cutting your data into chunks already helps a lot. Of course, this takes more time to program than picking up standard R code, but it is often quite possible.
    Cutting up data can, for example, be done using read.table or readBin, both of which support reading only a subset of the data. Alternatively, you can take a look at the high-performance computing task view for packages that deliver out-of-memory functionality out of the box. You could also put your data in a database. For spatial raster data, the excellent raster package provides out-of-memory analysis.
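
    As a rough illustration of cutting data into chunks, here is a minimal Python sketch of the same pattern that reading subsets with read.table enables in R. It assumes a headerless CSV stream with numeric columns; `iter_chunks` and `column_mean` are hypothetical helper names. Each chunk is reduced to a partial result before the next one is read.

```python
import csv
import io
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size parsed CSV rows from a text stream."""
    reader = csv.reader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def column_mean(lines, column, chunk_size=10_000):
    """Mean of one numeric column, holding only one chunk in memory at a time."""
    total, count = 0.0, 0
    for chunk in iter_chunks(lines, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count


# Demo on an in-memory stream; an open file object works the same way.
data = io.StringIO("\n".join(f"{i},{i * 2}" for i in range(100)))
mean_second = column_mean(data, column=1, chunk_size=7)
chunk_sizes = [len(c) for c in iter_chunks(io.StringIO("a\nb\nc\n"), 2)]
```

    Statistics that cannot be combined from partial results this easily (medians, for instance) need more care, which is where the extra programming effort mentioned above comes in.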
      October 20, 2021 3:33 PM IST
    0