QBoard » Artificial Intelligence & ML » AI and ML - Python » How to reversibly store and load a Pandas dataframe to/from disk

How to reversibly store and load a Pandas dataframe to/from disk

  • Right now I'm importing a fairly large CSV as a dataframe every time I run the script. Is there a good solution for keeping that dataframe constantly available in between runs so I don't have to spend all that time waiting for the script to run?
      August 31, 2021 3:37 PM IST
    0
  • You can use the feather file format. It is extremely fast.

    df.to_feather('filename.ft')

     

      September 2, 2021 10:01 PM IST
    0
  • As already mentioned, there are different options and file formats (HDF5, JSON, CSV, parquet, SQL) to store a data frame. However, pickle is not a first-class citizen (depending on your setup), because:

    pickle is a potential security risk. From the Python documentation for pickle:
    Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

    pickle is slow. Benchmarks can be found here and here.
    Depending on your setup/usage, these limitations may not apply, but I would not recommend pickle as the default persistence format for pandas data frames.
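    For cases where those caveats are acceptable (e.g. files you wrote yourself, never data received from untrusted sources), the pandas API is one call each way; a minimal sketch:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# to_pickle uses a binary pickle protocol by default.
path = os.path.join(tempfile.mkdtemp(), "df.pkl")
df.to_pickle(path)

# Only unpickle files you trust -- see the warning above.
restored = pd.read_pickle(path)
```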
      September 29, 2021 2:34 PM IST
    0
  • Although there are already some answers, I found a nice comparison in which they tried several ways to serialize Pandas DataFrames: Efficiently Store Pandas DataFrames.

    They compare:

    • pickle: original ASCII data format
    • cPickle: a C implementation of pickle
    • pickle-p2: uses the newer binary format
    • json: standard-library json module
    • json-no-index: like json, but without index
    • msgpack: binary JSON alternative
    • CSV
    • hdfstore: HDF5 storage format

    In their experiment, they serialize a DataFrame of 1,000,000 rows with the two columns tested separately: one with text data, the other with numbers. Their disclaimer says:

    You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself.

    The source code for the test they refer to is available online. Since this code did not work directly, I made some minor changes, which you can get here: serialize.py. I got the following results:

    [Figure: time comparison results]


    They also mention that converting text data to categorical data makes serialization much faster: in their test, about 10 times as fast (also see the test code).
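    The categorical conversion they describe is a single astype call; a sketch (the column name and data are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"text": ["spam", "ham", "spam", "eggs"] * 1000})

# Each distinct string is stored once, plus a small integer code per
# row, which shrinks the frame and speeds up serialization when values
# repeat a lot.
df["text"] = df["text"].astype("category")
```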

    Edit: The higher times for pickle than for CSV can be explained by the data format used. By default, pickle uses a printable ASCII representation, which generates larger data sets. As the graph shows, however, pickle with the newer binary data format (version 2, pickle-p2) has much lower load times.

    Some other references:

    In the question Fastest Python library to read a CSV file there is a very detailed answer that benchmarks different libraries for reading csv files. The result is that numpy.fromfile is the fastest for reading csv files.
    Another serialization test shows msgpack, ujson, and cPickle to be the quickest in serializing.

      October 2, 2021 2:22 PM IST
    0
  • If I understand correctly, you're already using pandas.read_csv() but would like to speed up the development process so that you don't have to load the file every time you edit your script, is that right? I have a few recommendations:

    You could load only part of the CSV file using pandas.read_csv(..., nrows=1000) to read just the top of the table while you're doing development.

    Use IPython for an interactive session, so that the pandas table stays in memory as you edit and reload your script.

    Convert the CSV to an HDF5 table.

    Updated: use DataFrame.to_feather() and pd.read_feather() to store data in the R-compatible feather binary format, which is super fast (in my hands, slightly faster than pandas.to_pickle() on numeric data and much faster on string data).
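    The nrows idea above, sketched with an in-memory CSV standing in for your file:

```python
import io

import pandas as pd

# A stand-in for your large CSV file on disk.
csv = io.StringIO("a,b\n1,x\n2,y\n3,z\n4,w\n")

# While developing, parse only the first rows instead of the whole file.
head = pd.read_csv(csv, nrows=2)
```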
      September 1, 2021 1:43 PM IST
    0
  • https://docs.python.org/3/library/pickle.html

    The pickle protocol formats:

    Protocol version 0 is the original “human-readable” protocol and is backwards compatible with earlier versions of Python.

    Protocol version 1 is an old binary format which is also compatible with earlier versions of Python.

    Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes. Refer to PEP 307 for information about improvements brought by protocol 2.

    Protocol version 3 was added in Python 3.0. It has explicit support for bytes objects and cannot be unpickled by Python 2.x. This is the default protocol, and the recommended protocol when compatibility with other Python 3 versions is required.

    Protocol version 4 was added in Python 3.4. It adds support for very large objects, pickling more kinds of objects, and some data format optimizations. Refer to PEP 3154 for information about improvements brought by protocol 4.
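    The size difference between the protocols is easy to see with the stdlib pickle module itself; a small sketch:

```python
import pickle

data = {"values": list(range(100))}

# Protocol 0 is the original printable-ASCII format; higher protocols
# are binary and typically more compact and faster.
ascii_bytes = pickle.dumps(data, protocol=0)
binary_bytes = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

# Both deserialize to the same object; the ASCII form is just larger.
```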

      October 7, 2021 1:10 PM IST
    0