
What are the standard, stable file formats used in Python for Data Science?

  • I often want to quickly save some Python data, but I would also like to save it in a stable file format in case the data lingers for a long time. So my question is: how should I save my data?

    In data science, there are three kinds of data I want to store: arbitrary Python objects, numpy arrays, and Pandas dataframes. What are the stable ways of storing these?

      January 4, 2022 1:27 PM IST
  • Arbitrary Python objects can be stored in the .pkl pickle format. Note that functions and classes are pickled by reference, so the code that defines them must still be importable when the file is loaded. Pickle files carry a security risk because loading one can execute arbitrary code, but if you trust the source of a pickle file, it is a stable format. From the Python standard library's pickle page:

    "The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and pickling and unpickling code deals with Python 2 to Python 3 type differences if your data is crossing that unique breaking change language boundary."

    Most plain Python data (dicts, lists, strings, numbers) can also be stored in the JSON format. I haven't used this format much myself, but dawg recommends it. Like the CSV and tab-delimited formats I recommend for Pandas, JSON is a plain-text format that is very stable.

    Numpy arrays can be stored in the .npy or .npz numpy formats. The npy format is a very simple format that stores a single numpy array; I imagine it would be easy to read in any language. The npz format stores multiple arrays in the same file. Adapted from the docs,

    import numpy as np

    x = np.arange(10)             # array of the integers 0..9
    np.save('example.npy', x)     # write a single array to disk
    y = np.load('example.npy')    # read it back


    If you cannot vouch for the integrity of the file being loaded, pass allow_pickle=False to np.load to avoid arbitrary code execution (this has been the default since NumPy 1.16.3).
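    For the multi-array .npz case, here is a minimal sketch of the same round trip. The array names and the temporary path are made up for illustration; treat it as one way to do it rather than the canonical recipe.

```python
import os
import tempfile

import numpy as np

a = np.arange(10)
b = np.linspace(0.0, 1.0, 5)

# Write to a temporary directory so the sketch leaves no files behind.
path = os.path.join(tempfile.mkdtemp(), "example.npz")

# The keyword names ("a", "b") become the keys of the archive.
np.savez(path, a=a, b=b)

# allow_pickle=False refuses object arrays, which would need pickle to load.
with np.load(path, allow_pickle=False) as archive:
    loaded_names = sorted(archive.files)
    a_loaded = archive["a"]
    b_loaded = archive["b"]
```

    Plain numeric arrays like these load fine with allow_pickle=False; only object-dtype arrays need pickle.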

    Pandas dataframes can be stored in a wide variety of formats, as I wrote in a previous answer. For small datasets, I find plain-text formats such as CSV and tab-delimited work well for most purposes. They are readable from many languages, and I have had no issues working in a mixed R and Python environment where both sides read the same files.

    Format Type Data Description     Reader         Writer
    text        CSV                  read_csv       to_csv
    text        JSON                 read_json      to_json
    text        HTML                 read_html      to_html
    text        Local clipboard      read_clipboard to_clipboard
    binary      MS Excel             read_excel     to_excel
    binary      HDF5 Format          read_hdf       to_hdf
    binary      Feather Format       read_feather   to_feather
    binary      Parquet Format       read_parquet   to_parquet
    binary      Msgpack              read_msgpack   to_msgpack   (removed in pandas 1.0)
    binary      Stata                read_stata     to_stata
    binary      SAS                  read_sas    
    binary      Python Pickle Format read_pickle    to_pickle
    SQL         SQL                  read_sql       to_sql
    SQL         Google Big Query     read_gbq       to_gbq


    When writing CSV and tab-delimited files from pandas, I often pass index=False to avoid saving the index, which otherwise loads back as an oddly-named column ("Unnamed: 0") by default.
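    To make the index behaviour concrete, here is a small sketch that writes the same frame with and without the index. The frame contents are invented, and it uses in-memory buffers instead of real files.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# Default: the index is written as an unlabeled first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
roundtrip = pd.read_csv(without_index)

# sep="\t" gives the tab-delimited variant; with no target, to_csv returns a string.
tsv = df.to_csv(sep="\t", index=False)
```

    If you do want to keep the index, reading with index_col=0 is the other way to avoid the extra column.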

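    Going back to the first case, arbitrary Python objects: a rough sketch of the pickle and JSON round trips using only the standard library. The record is made up. Pinning protocol 4 is my own assumption about what counts as a "compatible pickle protocol" here; it is readable by Python 3.4 and later.

```python
import json
import pickle

record = {"name": "run-1", "scores": [0.5, 0.75], "tags": ("a", "b")}

# Pin the protocol explicitly so older Python 3 versions can still read the bytes.
blob = pickle.dumps(record, protocol=4)
from_pickle = pickle.loads(blob)  # only unpickle data you trust

# JSON is plain text, but it has no tuple type: tuples come back as lists.
text = json.dumps(record)
from_json = json.loads(text)
```

    The tuple-to-list change is the kind of small type loss to expect from JSON, while pickle returns the object as it was.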
     
      January 5, 2022 2:04 PM IST