
“Large data” workflows using pandas

  • I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work, and it is great for its out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

    One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard-drive.

    My first thought is to use HDFStore to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier-to-use alternative. My question is this:

    What are some best-practice workflows for accomplishing the following:

    1. Loading flat files into a permanent, on-disk database structure
    2. Querying that database to retrieve data to feed into a pandas data structure
    3. Updating the database after manipulating pieces in pandas

    Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".

    Edit -- an example of how I would like this to work:

    1. Iteratively import a large flat-file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
    2. In order to use Pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
    3. I would create new columns by performing various operations on the selected columns.
    4. I would then have to append these new columns into the database structure.

    I am trying to find a best-practice way of performing these steps. Reading links about pandas and PyTables, it seems that appending a new column could be a problem.
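
    A rough sketch of what I imagine, purely as illustration (the file, store, and column names are placeholders, and I don't know whether this is the right pattern):

    import pandas as pd

    # 1. iteratively import the flat file into a permanent, on-disk HDF5 store
    store = pd.HDFStore('consumer.h5')
    for chunk in pd.read_csv('big_flat_file.csv', chunksize=50000):
        store.append('data', chunk, data_columns=['line_of_business'])

    # 2. pull only the columns I need into memory
    df = store.select('data', columns=['prop_value_1', 'prop_value_2'])

    # 3. create a new column from the selected columns
    df['prop_ratio'] = df['prop_value_1'] / df['prop_value_2']

    # 4. the part I'm unsure about: appending the new column to the existing
    #    table doesn't seem to be supported, so presumably it has to go into a
    #    separate table that shares the same row index
    store.append('derived', df[['prop_ratio']])
    store.close()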

    Edit -- Responding to Jeff's questions specifically:

    1. I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc. The datasets I use every day typically have 1,000 to 2,000 fields of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
    2. Typical operations involve combining several columns using conditional logic into a new, compound column. The result of these operations is a new column for every record in my dataset (a small sketch of what I mean follows at the end of this post).
    3. Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics trying to find interesting, intuitive relationships to model.
    4. A typical project file is usually about 1GB. Files are organized such that each row is a record of consumer data, and every row has the same number of columns. This will always be the case.
    5. It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.
    6. The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of say 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.

    It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).
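
    For concreteness, the kind of column creation and reporting I mean looks roughly like this (the column names are made up):

    import numpy as np
    import pandas as pd

    # toy frame standing in for a column subset pulled from the on-disk store
    df = pd.DataFrame({
        'num_bankruptcies': [0, 2, 1, 0],
        'property_value':   [120000, 40000, 30000, 80000],
        'line_of_business': ['retail', 'retail', 'auto', 'retail'],
        'defaulted':        [0, 1, 1, 0],
    })

    # compound column built with conditional logic on several existing columns
    df['high_risk_flag'] = np.where(
        (df['num_bankruptcies'] > 0) & (df['property_value'] < 50000), 1, 0)

    # row-subset report: a crosstab for the retail line of business only
    retail = df[df['line_of_business'] == 'retail']
    print(pd.crosstab(retail['high_risk_flag'], retail['defaulted']))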

      August 30, 2021 12:41 PM IST
  • I'd like to point out the Vaex package.

    Vaex is a Python library for lazy, out-of-core DataFrames (similar to pandas) for visualizing and exploring big tabular datasets. It can calculate statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at up to a billion (10⁹) objects/rows per second. Visualization is done using histograms, density plots and 3D volume rendering, allowing interactive exploration of big data. Vaex uses memory mapping, a zero-memory-copy policy and lazy computations for best performance (no memory wasted).

    Have a look at the documentation (https://vaex.readthedocs.io/en/latest/); the API is very close to that of pandas.
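
    A minimal sketch of what usage looks like (the file path and column name are hypothetical; Vaex memory-maps HDF5/Arrow files):

    import vaex

    # open a file lazily via memory mapping -- nothing is loaded into RAM yet
    df = vaex.open('big_data.hdf5')

    # statistics are computed out-of-core from lazy expressions
    print(df.mean(df.price), df.std(df.price))

    # filters are lazy views as well; no copy of the data is made
    expensive = df[df.price > 1000]
    print(expensive.count())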

      October 12, 2021 1:15 PM IST
  • There is now, two years after the question, an 'out-of-core' pandas equivalent: dask. It is excellent! Though it does not support all of pandas' functionality, you can get really far with it. Update: in the two years since, it has been consistently maintained and there is a substantial user community working with Dask.
    And now, four years after the question, there is another high-performance 'out-of-core' pandas equivalent: Vaex. It "uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted)." It can handle data sets of billions of rows without loading them into memory (making it even possible to do analysis on suboptimal hardware).
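
    A minimal dask sketch, assuming a directory of CSVs and made-up column names:

    import dask.dataframe as dd

    # lazily define a dataframe over many files; nothing is read yet
    df = dd.read_csv('trades_2021-*.csv')

    # familiar pandas-style operations only build a task graph...
    daily_mean = df.groupby('symbol')['price'].mean()

    # ...which runs out-of-core and in parallel when you call compute()
    result = daily_mean.compute()   # a regular in-memory pandas Series
    print(result)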
      October 12, 2021 4:59 PM IST
  • I think the answers above are missing a simple approach that I've found very useful.

    When I have a file that is too large to load into memory, I break it up into multiple smaller files (either by rows or by columns).

    Example: for 30 days' worth of trading data (~30GB), I break it into one file per day (~1GB each). I then process each file separately and aggregate the results at the end.

    One of the biggest advantages is that it allows parallel processing of the files (using either multiple threads or processes).

    The other advantage is that file manipulation (like adding/removing dates in the example) can be accomplished with regular shell commands, which is not possible with more advanced/complicated file formats.

    This approach doesn't cover all scenarios, but is very useful in a lot of them
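
    A rough sketch of that pattern (the file layout and column names are illustrative):

    import glob
    from multiprocessing import Pool

    import pandas as pd

    def process_one_day(path):
        # each ~1GB daily file comfortably fits in memory on its own
        day = pd.read_csv(path)
        return day.groupby('symbol')['volume'].sum()

    if __name__ == '__main__':
        files = sorted(glob.glob('trades_2021-06-*.csv'))
        with Pool(processes=4) as pool:        # process the files in parallel
            partials = pool.map(process_one_day, files)
        # aggregate the per-file results at the end
        total = pd.concat(partials).groupby(level=0).sum()
        print(total)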

      August 31, 2021 12:27 PM IST
  • This is the case for pymongo. I have also prototyped using SQL Server, SQLite, HDF and an ORM (SQLAlchemy) in Python. First and foremost, pymongo is a document-based DB, so each person would be a document (a dict of attributes). Many people form a collection, and you can have many collections (people, stock market, income).

    pd.DataFrame -> pymongo. Note: I use the chunksize argument in read_csv to keep it to 5 to 10k records per batch (pymongo drops the socket if the insert is larger).

    aCollection.insert((a[1].to_dict() for a in df.iterrows()))
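
    Spelled out with imports and the newer insert_many API (the connection, database and collection names are just examples):

    import pandas as pd
    from pymongo import MongoClient

    client = MongoClient()                      # local mongod
    aCollection = client['risk_db']['people']   # one document per person

    # load the flat file in 5-10k row chunks so pymongo doesn't drop the socket
    for chunk in pd.read_csv('big_flat_file.csv', chunksize=5000):
        aCollection.insert_many(chunk.to_dict('records'))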

    Querying (gt = greater than):

    pd.DataFrame(list(mongoCollection.find({'anAttribute':{'$gt':2887000, '$lt':2889000}})))
    .find() returns an iterator so I commonly use ichunked to chop into smaller iterators.

    How about a join since I normally get 10 data sources to paste together:

    aJoinDF = pandas.DataFrame(list(mongoCollection.find({'anAttribute':{'$in':Att_Keys}})))
    Then (in my case I sometimes have to aggregate on aJoinDF first before it is "mergeable"):

    df = pandas.merge(df, aJoinDF, on=aKey, how='left')
    And you can then write the new info back to your main collection via the update method below (a logical collection vs. physical data sources).

    collection.update({primarykey: foo}, {'$set': {key: change}})  # $set keeps the rest of the document intact
    For smaller lookups, just denormalize. For example, if you have a code in the document, you just add the code's text as a field and do a dict lookup as you create the documents.

    Now you have a nice dataset based around a person, and you can unleash your logic on each case and make more attributes. Finally, you can read your key indicators (from 3 up to as many as fit in memory) into pandas and do pivots/aggregations/data exploration. This works for me for 3 million records with numbers/big text/categories/codes/floats/...

    You can also use the two methods built into MongoDB (MapReduce and the aggregation framework). See here for more info about the aggregation framework; it seems to be easier than MapReduce and looks handy for quick aggregate work. Notice I didn't need to define my fields or relations, and I can add items to a document. At the current state of the rapidly changing numpy, pandas and Python toolset, MongoDB helps me just get to work :)
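
    For example, a quick group-by with the aggregation framework (reusing the collection from above; the field names are made up) looks something like:

    pipeline = [
        {'$match': {'line_of_business': 'retail'}},
        {'$group': {'_id': '$state', 'avg_balance': {'$avg': '$balance'}}},
    ]
    summary = pd.DataFrame(list(mongoCollection.aggregate(pipeline)))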
      September 1, 2021 1:48 PM IST