Opening a 20GB file for analysis with pandas

  •  

    I am currently trying to open a file with pandas and Python for machine learning purposes, and it would be ideal for me to have it all in a DataFrame. The file is 18GB and my RAM is 32GB, but I keep getting memory errors.

    From your experience, is it possible? If not, do you know of a better way to go about this? (A Hive table? Increasing my RAM to 64GB? Creating a database and accessing it from Python?)

      June 11, 2019 4:55 PM IST
    0
  • There are two possibilities: either you need to have all your data in memory for processing (e.g. your machine learning algorithm would want to consume all of it at once), or you can do without it (e.g. your algorithm only needs samples of rows or columns at once).

    In the first case, you'll need to solve a memory problem: increase your memory size, rent a high-memory cloud machine, use in-place operations, provide information about the types of the data you are reading in, make sure to delete all unused variables and collect garbage, etc.
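
    For the "delete unused variables and collect garbage" part, here is a minimal sketch; the file name and column names are made up:

    import gc
    import pandas as pd

    # Hypothetical example: keep only the columns you need and drop the rest.
    raw = pd.read_csv("data.csv")                    # assumed input file
    df = raw[["col_a", "col_b"]].astype("float32")   # smaller copy of what you need

    del raw        # drop the reference to the full frame...
    gc.collect()   # ...so Python can reclaim that memory right away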

    It is very probable that 32GB of RAM would not be enough for Pandas to handle your data. Note that the integer "1" is just one byte when stored as text but 8 bytes when represented as int64 (the default when Pandas reads it in from text). The same goes for a floating point number: "1.0" expands from a 3-byte string to an 8-byte float64 by default. You may save some space by telling Pandas precisely which type to use for each column and forcing the smallest possible representations, and that is before we even start speaking of Python's data structure overhead, which can easily add an extra pointer or two here and there, each pointer being 8 bytes on a 64-bit machine.
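
    For instance, a minimal sketch of forcing compact dtypes at read time (the column names here are made up):

    import pandas as pd

    # Force compact representations instead of the default int64/float64/object.
    dtypes = {"user_id": "int32", "price": "float32", "country": "category"}
    df = pd.read_csv("data.csv", dtype=dtypes)

    print(df.memory_usage(deep=True))   # check how much each column actually takes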

    To summarize: no, 32GB RAM is probably not enough for Pandas to handle a 20GB file.

    In the second case (which is more realistic and probably applies to you), you need to solve a data management problem. Indeed, having to load all of the data when you really only need parts of it for processing may be a sign of bad data management. There are multiple options here:

    Use an SQL database. If you can, it is nearly always the first choice and a decently comfortable solution. 20GB sounds like the size most SQL databases would handle well without the need to go distributed even on a (higher-end) laptop. You'll be able to index columns, do basic aggregations via SQL, and get the needed subsamples into Pandas for more complex processing using a simple pd.read_sql. Moving the data to a database will also provide you with an opportunity to think about the actual data types and sizes of your columns.
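
    For example, a rough sketch with SQLite (table, file, and column names are made up); any other SQL database would work the same way through its Python driver:

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("mydata.db")

    # One-time load from CSV into the database, done in chunks to stay within RAM
    for chunk in pd.read_csv("huge_file.csv", chunksize=100_000):
        chunk.to_sql("measurements", conn, if_exists="append", index=False)

    # Later: pull only the subsample you actually need into Pandas
    subset = pd.read_sql("SELECT col_a, col_b FROM measurements WHERE col_a > 100", conn)
    conn.close()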

    If your data is mostly numeric (i.e. arrays or tensors), you may consider holding it in HDF5 format (see PyTables), which lets you conveniently read only the necessary slices of huge arrays from disk. Basic numpy.save and numpy.load (with mmap_mode) achieve a similar effect by memory-mapping the arrays on disk. For GIS and related raster data there are dedicated databases, which might not connect to pandas as directly as SQL, but should also let you do slices and queries reasonably conveniently.
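
    A minimal sketch of the numpy memory-mapping route (the file name and array shape are made up):

    import numpy as np

    # Write once (stand-in data; your real array would come from your pipeline)
    np.save("big_array.npy", np.random.rand(100_000, 50))

    # Load lazily: nothing is read into RAM until you slice it
    arr = np.load("big_array.npy", mmap_mode="r")
    window = arr[5_000:6_000, :10]   # only this slice is pulled from disk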

    Pandas does not support such "partial" memory-mapping of HDF5 or numpy arrays, as far as I know. If you still want a kind of "pure-pandas" solution, you can try to work around this by "sharding": either storing the columns of your huge table separately (e.g. in separate files or in separate "tables" of a single HDF5 file) and only loading the necessary ones on demand, or storing the chunks of rows separately. However, you'd then need to implement the logic for loading the necessary chunks, thus reinventing the wheels already implemented in most SQL databases, so perhaps option 1 would still be easier here. If your data comes in a CSV, though, you can process it in chunks by specifying the chunksize parameter to pd.read_csv.
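
    A rough sketch of the column-sharding idea using pandas' HDFStore (requires PyTables; the file, key, and column names below are assumptions):

    import pandas as pd

    # Write each column group to its own key, filling it chunk by chunk
    with pd.HDFStore("sharded.h5") as store:
        for chunk in pd.read_csv("huge_file.csv", chunksize=100_000):
            store.append("ids", chunk[["user_id", "session_id"]])
            store.append("features", chunk[["f1", "f2", "f3"]])

    # Later, load only the column group you actually need
    features = pd.read_hdf("sharded.h5", "features")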
      September 11, 2021 1:46 PM IST
    0
  • If it's a CSV file and you do not need to access all of the data at once when training your algorithm, you can read it in chunks. The pandas.read_csv method accepts a chunksize parameter for this:

    import pandas as pd

    # Placeholder file name and chunk size; substitute your own values.
    for chunk in pd.read_csv("your_file.csv", chunksize=100_000):
        do_processing(chunk)    # e.g. clean or feature-engineer this chunk
        train_algorithm(chunk)  # e.g. an algorithm that supports incremental fitting
      June 11, 2019 5:00 PM IST
    0
  • In my experience, calling read_csv() with low_memory=False tends to help when reading in large files. I don't think you mentioned the file type you are reading, though, so I am not sure how applicable this is to your situation.
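
    For reference, that just means passing the flag at read time (the file name is a placeholder):

    import pandas as pd

    # low_memory=False makes pandas infer dtypes from the whole file in one pass
    # instead of in internal chunks; whether it helps depends on your file.
    df = pd.read_csv("large_file.csv", low_memory=False)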
      September 15, 2021 3:00 PM IST
    0