
Structure within staging area of data warehouse

  • We are working on a data warehouse for a bank and have pretty much followed the standard Kimball model: staging tables, a star schema and an ETL process to pull the data through.

    Kimball talks about using the staging area for import, cleaning, processing and everything else until you are ready to put the data into the star schema. In practice this typically means uploading data from the sources into a set of tables with little or no modification, then optionally taking the data through intermediate tables until it is ready to go into the star schema. That is a lot of work for a single area; there is no single responsibility here.

    Previous systems I have worked on have made a distinction between the different sets of tables, to the extent of having:

    • Upload tables: raw source system data, unmodified
    • Staging tables: intermediate processing, typed and cleansed
    • Warehouse tables: the star schema itself

    You can put these in separate schemas and then apply different policies for archiving, backup, security and so on. One of the other guys has worked on a warehouse with a StagingInput and a StagingOutput; similar story. The team as a whole has a lot of experience, both with data warehouses and otherwise.
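
    To make the idea concrete, here is a minimal sketch of what that separation might look like, using SQLite attached databases (via Python's standard sqlite3 module) to stand in for separate schemas. All of the schema, table and column names are invented for the example:

        import sqlite3

        # Emulate the three layers with attached databases; in a full RDBMS
        # these would simply be separate schemas or databases.
        conn = sqlite3.connect(":memory:")
        conn.execute("ATTACH ':memory:' AS upload")     # raw source data, unmodified
        conn.execute("ATTACH ':memory:' AS staging")    # typed and cleansed
        conn.execute("ATTACH ':memory:' AS warehouse")  # the star schema proper

        # Upload layer: everything lands as text, exactly as the source sent it.
        conn.execute(
            "CREATE TABLE upload.accounts (account_id TEXT, opened TEXT, balance TEXT)"
        )

        # Staging layer: the same entity, now typed and ready for cleansing rules.
        conn.execute(
            "CREATE TABLE staging.accounts "
            "(account_id INTEGER NOT NULL, opened DATE, balance NUMERIC)"
        )

        # Warehouse layer: the dimension the star schema actually exposes.
        conn.execute(
            "CREATE TABLE warehouse.dim_account "
            "(account_key INTEGER PRIMARY KEY, account_id INTEGER NOT NULL, opened DATE)"
        )

    Because each layer then lives in its own schema, the differing archive/backup/security policies can be applied per schema rather than per table.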

    However, despite all this, looking through Kimball and the web, there seems to be absolutely nothing in writing about giving any kind of structure to the staging database. One would be forgiven for believing that Mr Kimball would have us all work with staging as one big, deep, dark, unstructured pool of data.

    Whilst it is of course fairly obvious how to go about adding more structure to the staging area if we want to, it seems very odd that nothing appears to have been written about it.

    So, what is everyone else out there doing? Is staging just this big unstructured mess, or do folk have some interesting designs for it?

      August 19, 2021 2:08 PM IST
    0
  • There can be sub-areas within staging, called staging1 and staging2, for example.

    Staging1 can be a direct pull from the data sources with no transformation, and it keeps only the latest data.

    Staging2 keeps the data transformed and ready to go to the warehouse, and it keeps all historical data.
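
    A minimal sketch of those two retention policies, again assuming Python's sqlite3 and invented table names:

        import sqlite3
        from datetime import date

        conn = sqlite3.connect(":memory:")
        # staging1: latest extract only; staging2: full history with a load date.
        conn.execute("CREATE TABLE staging1_accounts (account_id INTEGER, balance NUMERIC)")
        conn.execute(
            "CREATE TABLE staging2_accounts (account_id INTEGER, balance NUMERIC, load_date DATE)"
        )

        def load(rows):
            # staging1 keeps only the latest data: truncate, then reload.
            conn.execute("DELETE FROM staging1_accounts")
            conn.executemany("INSERT INTO staging1_accounts VALUES (?, ?)", rows)
            # staging2 accumulates history: append each load, stamped with its date.
            conn.executemany(
                "INSERT INTO staging2_accounts VALUES (?, ?, ?)",
                [(a, b, date.today().isoformat()) for a, b in rows],
            )

        load([(1, 100.0), (2, 250.5)])  # first extract
        load([(1, 110.0), (2, 250.5)])  # replaces staging1, appends to staging2

    The truncate-and-reload on staging1 is what keeps it "latest only", while the load date on staging2 is what makes the history queryable.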

      October 9, 2021 1:20 PM IST
    0
  • Personally, I don't go looking for trouble, in Kimball or elsewhere.

    What kind of "structure" are you looking for? What kind of "structure" do you feel is needed? What problems are you seeing from the lack of "structure" you have today?

    I may be leaving you with the impression that I don't think much of Kimball. Not so - I haven't read Kimball. I just don't think much of changing things for no reason beyond fitting some pattern. Change to solve some real-world problem would be fine. For instance, if you find you're backing up staging tables because a lack of structure caused the staging and warehouse tables to be treated the same, then this would be a reason to change the structure. But if that's the sort of thing you had in mind, then you should edit your question to indicate it.

      December 28, 2021 12:31 PM IST
    0
  • I have experienced the same problem. We have a large HR data warehouse and I am pulling data from systems all over the enterprise. I've got a nice collection of fact and dimension tables, but the staging area is a mess, and I don't know of any standard for designing it. I would follow the same path you are on and come up with a standard set of names to keep things in order. Your naming suggestion is pretty good; I'd keep working with that.
      September 23, 2021 1:50 PM IST
    0
  • A staging area, or landing zone, is an intermediate storage area used for data processing during the extract, transform and load (ETL) process. The data staging area sits between the data source(s) and the data target(s), which are often data warehouses, data marts, or other data repositories.
      August 21, 2021 5:29 PM IST
    0
  • We are working on a large insurance DWH project at the moment. It's slightly complicated, but each source system's tables are put into a separate schema in a STAGING database; ETL then moves, cleanses and conforms (MDM) the data from the STAGING database into a STAGINGCLEAN database, and further ETL moves the data from there into a Kimball DWH.

    We find the separation of the STAGING and STAGINGCLEAN databases very helpful in diagnosing issues, particularly around data quality, as we keep the dirty staged data alongside the cleaned version before it is transformed into the DWH proper.
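
    As a rough sketch of that split (database, table and column names are invented; SQLite's ATTACH, driven from Python's sqlite3, stands in for the two databases):

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("ATTACH ':memory:' AS staging")       # dirty data, as landed
        conn.execute("ATTACH ':memory:' AS stagingclean")  # cleansed, conformed data

        # In practice each source system would get its own schema inside STAGING.
        conn.execute("CREATE TABLE staging.policies (policy_no TEXT, premium TEXT)")
        conn.execute(
            "CREATE TABLE stagingclean.policies "
            "(policy_no TEXT NOT NULL, premium NUMERIC NOT NULL)"
        )

        conn.executemany(
            "INSERT INTO staging.policies VALUES (?, ?)",
            [("P001", "120.50"), ("P002", "not-a-number")],
        )

        # Cleansing step: only rows that pass the rules move on; the dirty
        # originals stay behind in staging for data quality diagnosis.
        conn.execute("""
            INSERT INTO stagingclean.policies
            SELECT policy_no, CAST(premium AS NUMERIC)
            FROM staging.policies
            WHERE premium GLOB '[0-9]*'
        """)

    Keeping both copies makes it cheap to answer 'what did the source actually send us?' when a cleansing rule rejects a row.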

      November 19, 2021 12:10 PM IST
    0