QBoard » Artificial Intelligence & ML » AI and ML - Python » "Large data" workflows using pandas

"Large data" workflows using pandas

  • I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for it's out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.

    One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard-drive.

    My first thought is to use HDFS store to hold large datasets on disk and pull only the pieces I need into dataframes for analysis. Others have mentioned MongoDB as an easier to use alternative. My question is this:

    What are some best-practice workflows for accomplishing the following:

    1. Loading flat files into a permanent, on-disk database structure
    2. Querying that database to retrieve data to feed into a pandas data structure
    3. Updating the database after manipulating pieces in pandas

    Real-world examples would be much appreciated, especially from anyone who uses pandas on "large data".

    Edit -- an example of how I would like this to work:

    1. Iteratively import a large flat-file and store it in a permanent, on-disk database structure. These files are typically too large to fit in memory.
    2. In order to use Pandas, I would like to read subsets of this data (usually just a few columns at a time) that can fit in memory.
    3. I would create new columns by performing various operations on the selected columns.
    4. I would then have to append these new columns into the database structure.

    I am trying to find a best-practice way of performing these steps. Reading links about pandas and pytables it seems that appending a new column could be a problem.

    Edit -- Responding to Jeff's questions specifically:

    1. I am building consumer credit risk models. The kinds of data include phone, SSN and address characteristics; property values; derogatory information like criminal records, bankruptcies, etc... The datasets I use every day have nearly 1,000 to 2,000 fields on average of mixed data types: continuous, nominal and ordinal variables of both numeric and character data. I rarely append rows, but I do perform many operations that create new columns.
    2. Typical operations involve combining several columns using conditional logic into a new, compound column. For example, if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'. The result of these operations is a new column for every record in my dataset.
    3. Finally, I would like to append these new columns into the on-disk data structure. I would repeat step 2, exploring the data with crosstabs and descriptive statistics trying to find interesting, intuitive relationships to model.
    4. A typical project file is usually about 1GB. Files are organized into such a manner where a row consists of a record of consumer data. Each row has the same number of columns for every record. This will always be the case.
    5. It's pretty rare that I would subset by rows when creating a new column. However, it's pretty common for me to subset on rows when creating reports or generating descriptive statistics. For example, I might want to create a simple frequency for a specific line of business, say Retail credit cards. To do this, I would select only those records where the line of business = retail in addition to whichever columns I want to report on. When creating new columns, however, I would pull all rows of data and only the columns I need for the operations.
    6. The modeling process requires that I analyze every column, look for interesting relationships with some outcome variable, and create new compound columns that describe those relationships. The columns that I explore are usually done in small sets. For example, I will focus on a set of say 20 columns just dealing with property values and observe how they relate to defaulting on a loan. Once those are explored and new columns are created, I then move on to another group of columns, say college education, and repeat the process. What I'm doing is creating candidate variables that explain the relationship between my data and some outcome. At the very end of this process, I apply some learning techniques that create an equation out of those compound columns.

    It is rare that I would ever add rows to the dataset. I will nearly always be creating new columns (variables or features in statistics/machine learning parlance).

      December 6, 2021 2:22 PM IST
    0
  • I think the answers above are missing a simple approach that I've found very useful.

    When I have a file that is too large to load in memory, I break up the file into multiple smaller files (either by row or cols)

    Example: In case of 30 days worth of trading data of ~30GB size, I break it into a file per day of ~1GB size. I subsequently process each file separately and aggregate results at the end

    One of the biggest advantages is that it allows parallel processing of the files (either multiple threads or processes)

    The other advantage is that file manipulation (like adding/removing dates in the example) can be accomplished by regular shell commands, which is not be possible in more advanced/complicated file formats

    This approach doesn't cover all scenarios, but is very useful in a lot of them

      December 7, 2021 1:43 PM IST
    0
  • There is now, two years after the question, an 'out-of-core' pandas equivalent: dask. It is excellent! Though it does not support all of pandas functionality, you can get really far with it. Update: in the past two years it has been consistently maintained and there is substantial user community working with Dask.
    And now, four years after the question, there is another high-performance 'out-of-core' pandas equivalent in Vaex. It "uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted)." It can handle data sets of billions of rows and does not store them into memory (making it even possible to do analysis on suboptimal hardware).
      December 7, 2021 5:25 PM IST
    0
  • If your datasets are between 1 and 20GB, you should get a workstation with 48GB of RAM. Then Pandas can hold the entire dataset in RAM. I know its not the answer you're looking for here, but doing scientific computing on a notebook with 4GB of RAM isn't reasonable.

     
      December 8, 2021 10:20 AM IST
    0
  • One trick I found helpful for large data use cases is to reduce the volume of the data by reducing float precision to 32-bit. It's not applicable in all cases, but in many applications 64-bit precision is overkill and the 2x memory savings are worth it. To make an obvious point even more obvious:

    >>> df = pd.DataFrame(np.random.randn(int(1e8), 5))
    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100000000 entries, 0 to 99999999
    Data columns (total 5 columns):
    ...
    dtypes: float64(5)
    memory usage: 3.7 GB
    
    >>> df.astype(np.float32).info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 100000000 entries, 0 to 99999999
    Data columns (total 5 columns):
    ...
    dtypes: float32(5)
    memory usage: 1.9 GB
      December 9, 2021 12:56 PM IST
    0
  • As noted by others, after some years an 'out-of-core' pandas equivalent has emerged: dask. Though dask is not a drop-in replacement of pandas and all of its functionality it stands out for several reasons:

    Dask is a flexible parallel computing library for analytic computing that is optimized for dynamic task scheduling for interactive computational workloads of “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments and scales from laptops to clusters.

    Dask emphasizes the following virtues:

    • Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
    • Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
    • Native: Enables distributed computing in Pure Python with access to the PyData stack.
    • Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
    • Scales up: Runs resiliently on clusters with 1000s of cores Scales down: Trivial to set up and run on a laptop in a single process
    • Responsive: Designed with interactive computing in mind it provides rapid feedback and diagnostics to aid humans

    and to add a simple code sample:

    import dask.dataframe as dd
    df = dd.read_csv('2015-*-*.csv')
    df.groupby(df.user_id).value.mean().compute()

     

    replaces some pandas code like this:

    import pandas as pd
    df = pd.read_csv('2015-01-01.csv')
    df.groupby(df.user_id).value.mean()

     

    and, especially noteworthy, provides through the concurrent.futures interface a general infrastructure for the submission of custom tasks:

    from dask.distributed import Client
    client = Client('scheduler:port')
    
    futures = []
    for fn in filenames:
        future = client.submit(load, fn)
        futures.append(future)
    
    summary = client.submit(summarize, futures)
    summary.result()
      December 10, 2021 11:01 AM IST
    0