Work in R with very large data set

  • I am working with a very large data set which I am downloading from an Oracle database. The data frame has about 21 million rows and 15 columns. My OS is Windows XP (32-bit) and I have 2 GB of RAM. Short-term I cannot upgrade my RAM or my OS (it is at work; it will take months before I get a decent PC).
    library(RODBC)
    df <- sqlQuery(Channel1, "SELECT * FROM table1", stringsAsFactors = FALSE)
    Here I already get stuck with the usual "cannot allocate vector of size x Mb" error. I found some suggestions about using the ff package, and I would appreciate it if anybody familiar with ff could tell me whether it would help in my case. Do you know another way to get around the memory problem? Would a 64-bit solution help? Thanks for your suggestions.
      October 16, 2021 3:10 PM IST
  • If you are working with package ff and have your data in SQL, you can easily get it into ff using package ETLUtils; see its documentation for an example using ROracle.

    In my experience, ff is perfectly suited for the type of dataset you are working with (21 million rows and 15 columns) - in fact your setup is rather small for ff, unless your columns contain a lot of character data that will be converted to factors (meaning all your factor levels need to fit in RAM). The packages ETLUtils, ff and ffbase allow you to get your data into R as ff objects and to do some basic statistics on them. Depending on what you will do with your data and on your hardware, you might have to consider sampling when you build models. I prefer having my data in R, building a model on a sample, and scoring with the chunking tools from ff or from package ffbase.

    The drawback is that you have to get used to the fact that your data are ffdf objects, and that might take some time - especially if you are new to R. A minimal sketch of the ETLUtils route is below.
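
    For illustration, this is a rough sketch of importing the table with read.odbc.ffdf from ETLUtils; the DSN name, credentials and chunk sizes are placeholders for your own Oracle/ODBC setup:

    library(ff)
    library(ffbase)
    library(ETLUtils)

    ## Read the table into an on-disk ffdf in chunks instead of one big in-memory data frame
    bigdata <- read.odbc.ffdf(
      query = "SELECT * FROM table1",
      odbcConnect.args = list(dsn = "my_oracle_dsn", uid = "user", pwd = "pwd"),
      first.rows = 100000,   # rows fetched in the first chunk
      next.rows  = 100000,   # rows fetched in each subsequent chunk
      VERBOSE = TRUE)

    dim(bigdata)   # the 21 million rows stay on disk as an ffdf rather than in RAM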

      November 1, 2021 2:28 PM IST
  • You could try specifying the data types in the read.csv call using colClasses, so that R does not have to guess them.
    data <- read.csv("UserDailyStats.csv", sep = ",", header = TRUE, na.strings = "-",
                     stringsAsFactors = FALSE, colClasses = c("character", "character", "factor", rep("numeric", 6)))
    Though with a dataset of this size it may still be problematic, and there will not be much memory left for any analysis you want to do; the rough calculation below illustrates why. Adding RAM and moving to 64-bit computing would provide more flexibility.
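
    As a back-of-envelope check (assuming, for illustration, that all 15 columns end up stored as 8-byte numeric values):

    ## Approximate memory needed just to hold the data frame
    rows <- 21e6
    cols <- 15
    rows * cols * 8 / 1024^3   # about 2.3 GB, already more than the 2 GB of RAM available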
      October 18, 2021 1:57 PM IST
  • In my experience, processing your data in chunks can almost always help greatly with big data. For example, if you calculate a temporal mean, only one time step needs to be in memory at any given time. You already have your data in a database, so obtaining a subset is easy; a sketch of chunked fetching with RODBC follows below. Alternatively, if you cannot easily process in chunks, you could take a subset of your data and repeat the analysis a few times to see whether your results are sensitive to which subset you take. The bottom line is that some smart thinking can get you a long way with 2 GB of RAM. If you need more specific advice, ask more specific questions.
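
    For illustration, a minimal sketch of chunked fetching with RODBC, assuming Channel1 is your open ODBC channel and amount is a hypothetical numeric column; only one batch of rows is ever held in memory while a running sum is accumulated:

    library(RODBC)

    odbcQuery(Channel1, "SELECT amount FROM table1")     # send the query, fetch rows later
    total <- 0
    n <- 0
    repeat {
      chunk <- sqlGetResults(Channel1, max = 100000)     # fetch the next batch of 100,000 rows
      if (!is.data.frame(chunk) || nrow(chunk) == 0) break   # stop when the result set is exhausted
      total <- total + sum(chunk$amount, na.rm = TRUE)
      n <- n + sum(!is.na(chunk$amount))
    }
    total / n   # overall mean computed without holding all 21 million rows at once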

     
      October 22, 2021 2:34 PM IST