I often want to quickly save some Python data, but I would also like to save it in a stable file format in case the data lingers for a long time. So the question is: how can I save my data?
In data science, there are three kinds of data I want to store: arbitrary Python objects, numpy arrays, and Pandas dataframes. What are the stable ways of storing these?
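import numpy as np

x = np.arange(10)
np.save('example.npy', x)   # write the array in NumPy's binary .npy format
y = np.load('example.npy')  # read it back as a numpy array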
If the integrity of the file being loaded is not guaranteed, be sure to pass allow_pickle=False to np.load
to avoid arbitrary code execution. Recent NumPy releases already default to this, but older ones did not.
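As a rough illustration of the point above, here is a minimal sketch of what allow_pickle=False does in practice. The file names and the temporary directory are made up for the example. A plain numeric array loads fine without pickle, while an array of dtype object can only be stored via pickle, so a pickle-free load refuses it rather than unpickling untrusted content.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch directory so the example does not touch real files.
tmpdir = tempfile.mkdtemp()

# A plain numeric array round-trips without needing pickle at all.
x = np.arange(10)
numeric_path = os.path.join(tmpdir, "example.npy")
np.save(numeric_path, x)
y = np.load(numeric_path, allow_pickle=False)
roundtrip_ok = np.array_equal(x, y)

# An object array is written using pickle, so a pickle-free load rejects it.
obj = np.array([{"a": 1}, [1, 2, 3]], dtype=object)
object_path = os.path.join(tmpdir, "objects.npy")
np.save(object_path, obj)
try:
    np.load(object_path, allow_pickle=False)
    refused = False
except ValueError:
    refused = True
```

In other words, keeping arrays to plain numeric or string dtypes means they can always be loaded with pickle disabled.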
Pandas dataframes can be stored in a wide variety of formats, as I wrote in a previous answer. For small datasets, I find plaintext formats such as CSV and tab-delimited files work well for most purposes. These formats are readable from many languages, and I have had no issues working in a bilingual R and Python environment where both languages read from the same files.
| Format Type | Data Description      | Reader         | Writer       |
|-------------|-----------------------|----------------|--------------|
| text        | CSV                   | read_csv       | to_csv       |
| text        | JSON                  | read_json      | to_json      |
| text        | HTML                  | read_html      | to_html      |
| text        | Local clipboard       | read_clipboard | to_clipboard |
| binary      | MS Excel              | read_excel     | to_excel     |
| binary      | HDF5 Format           | read_hdf       | to_hdf       |
| binary      | Feather Format        | read_feather   | to_feather   |
| binary      | Parquet Format        | read_parquet   | to_parquet   |
| binary      | Msgpack               | read_msgpack   | to_msgpack   |
| binary      | Stata                 | read_stata     | to_stata     |
| binary      | SAS                   | read_sas       |              |
| binary      | Python Pickle Format  | read_pickle    | to_pickle    |
| SQL         | SQL                   | read_sql       | to_sql       |
| SQL         | Google Big Query      | read_gbq       | to_gbq       |
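The reader and writer functions come in pairs, and the tab-delimited case mentioned earlier reuses the CSV pair with a different separator. The sketch below is only illustrative: the dataframe contents are invented, and an in-memory buffer stands in for a real file path.

```python
import io

import pandas as pd

# Invented sample data for the illustration.
df = pd.DataFrame({"name": ["a", "b"], "value": [1.5, 2.5]})

# There is no dedicated tab-delimited writer; to_csv accepts a separator.
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=False)
text = buffer.getvalue()

# Read it back with the matching reader and the same separator.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")
```

With a real file, the buffer would simply be replaced by a path such as a .tsv file name, which R can then read with its own tab-delimited reader.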
When writing CSV and tab-delimited files from pandas, I often pass index=False to avoid saving the index, which otherwise loads back as an oddly named column (Unnamed: 0) by default.
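To make the effect of the index option concrete, here is a small sketch comparing the two behaviours. The dataframe is invented for the example, and in-memory buffers are used in place of files.

```python
import io

import pandas as pd

# Invented sample data with the default integer index.
df = pd.DataFrame({"value": [10, 20, 30]})

# Default behaviour: the index is written as an unlabeled first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the real data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'value']
print(cols_without_index)  # ['value']
```

If the index does carry meaning, an alternative is to keep it when writing and pass index_col=0 to read_csv when loading.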