
Handling Big Data in a Data Warehouse

  •  

    I am a learner in Big Data concepts. As I understand it, Big Data is critical for handling unstructured data and high volumes. In the big data architecture for a data warehouse (DW), data from the source is extracted through Hadoop (HDFS and MapReduce), the relevant unstructured information is converted into valid business information, and finally the data is ingested into the DW or data mart through ETL processing (alongside the existing structured data processing).

    However, I would like to know what new techniques, dimensional models, or storage requirements an organization's DW needs as a result of Big Data, as most of the tutorials/resources I have found only talk about Hadoop at the source, not at the target. How does the introduction of Big Data impact an organization's predefined reports and ad hoc analysis, given this high volume of data?

    Appreciate your response

      May 24, 2019 12:22 PM IST
  • That is a very broad question, but I'll try to give some answers.

    Hadoop can be a data source, a data warehouse, or a "data lake", being a repository of data from which warehouses and marts may be drawn.

    The line between Hadoop and RDBMS-based data warehouses is increasingly blurred. As SQL-on-Hadoop becomes a reality, interacting with Hadoop-based data becomes increasingly easy. To be effective, though, there must be structure in the data.
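
    As a rough sketch of what this looks like in practice (assuming a Spark cluster with Hive support; the web_logs table is hypothetical and would already be registered in the Hive metastore over files in HDFS):

        # Querying Hadoop-resident data with plain SQL via Spark.
        # The "web_logs" table is a hypothetical example.
        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("sql-on-hadoop-sketch")
            .enableHiveSupport()   # read table definitions from the Hive metastore
            .getOrCreate()
        )

        # Once the raw files have a schema in the metastore, they can be
        # queried like any relational table.
        daily_hits = spark.sql("""
            SELECT event_date, COUNT(*) AS hits
            FROM web_logs
            GROUP BY event_date
            ORDER BY event_date
        """)
        daily_hits.show()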

    Some examples of Hadoop/DW interactions:

    • Microsoft Analytics Platform System, with PolyBase interaction between SQL Server and Hadoop
    • Impala (Cloudera), Stinger (Hortonworks), and others providing SQL-on-Hadoop
    • Actian and Vertica (HP) providing RDBMS-compatible MPP on Hadoop

    That said, Hadoop-based DW is still immature. It is not as performant as an RDBMS-based DW, and it lacks many security and operational features as well as full SQL capability. Think carefully about your needs before taking this path.

    Another question you should ask is whether you actually need a platform of this type. Any RDBMS can handle 3-5 TB of data; SQL Server and PostgreSQL are two examples of platforms that would handle a DW of that size on commodity hardware, with negligible administration.

    Those same RDBMSs can handle 100 TB workloads, but they require much more care and feeding at that scale.

    MPP RDBMS appliances handle data workloads into the petabyte range, with lower administrative and operational overhead as they scale. I doubt you will get to that scale; very few companies do :) You might still choose an MPP appliance for a much smaller data volume if the speed of complex queries is your most important factor. I've seen MPP appliances deployed on data volumes as small as 5 TB for this reason.

    Depending on the load technique, you will probably find that an RDBMS-based DW is faster to load than Hadoop. For example, I load hundreds of thousands of rows per second into PostgreSQL, and slightly fewer than that into SQL Server. It takes substantially longer to achieve the same result in Hadoop, as I have to ingest the file, establish it in Hive, and move it to Parquet to get a similar level of query performance. Over time I expect this to change in Hadoop's favour, but it isn't quite there yet.
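
    To make the contrast concrete, here is a minimal sketch of each path. The connection string, file names, and table names are hypothetical; the PostgreSQL side uses psycopg2's COPY support, and the Hadoop side uses Spark to register and convert the file:

        # Bulk load into PostgreSQL: a single COPY stream into a staging table.
        import psycopg2

        conn = psycopg2.connect("dbname=dw user=etl")   # hypothetical DSN
        with conn, conn.cursor() as cur, open("events.csv") as f:
            cur.copy_expert(
                "COPY staging.events FROM STDIN WITH (FORMAT csv, HEADER true)",
                f,
            )

        # The Hadoop path has more steps: land the file, register it as a
        # table, and rewrite it as Parquet for comparable query performance.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.enableHiveSupport().getOrCreate()
        raw = spark.read.csv("hdfs:///landing/events.csv",
                             header=True, inferSchema=True)
        raw.write.mode("overwrite").format("parquet").saveAsTable("dw.events")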

    You mentioned dimensional modelling. If your star schema comprises transactional fact tables and SCD0/SCD1 dimensions, and thus needs insert-only processing, you might have success with SQL-on-Hadoop. If you need to update the facts (accumulating snapshots) or dimensions (SCD2, SCD3), you might struggle with both capability and performance: a lot of implementations don't yet support UPDATE queries, and those that do are slow. The sketch below illustrates the workaround this forces.
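
    As a sketch of that pain point (table and column names are hypothetical), an SCD2 change on an engine without UPDATE support cannot expire the old dimension row in place; the whole dimension has to be rewritten:

        # SCD2 on an insert-only store: expiring changed rows means
        # rewriting the dimension table. Assumes dim_customer carries
        # valid_from/valid_to/is_current and the staging table carries
        # the same business columns (all names hypothetical).
        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.enableHiveSupport().getOrCreate()

        dim = spark.table("dw.dim_customer")
        changes = spark.table("staging.customer_changes")
        changed_ids = [r.customer_id
                       for r in changes.select("customer_id").distinct().collect()]

        current_and_changed = F.col("customer_id").isin(changed_ids) & F.col("is_current")

        # Close off the current versions of the customers that changed.
        expired = (dim.filter(current_and_changed)
                      .withColumn("valid_to", F.current_date())
                      .withColumn("is_current", F.lit(False)))
        untouched = dim.filter(~current_and_changed)

        # Append the new versions as the current rows.
        new_rows = (changes.withColumn("valid_from", F.current_date())
                           .withColumn("valid_to", F.lit(None).cast("date"))
                           .withColumn("is_current", F.lit(True)))

        # Rewrite the whole dimension to a new table (then swap it in);
        # this full rewrite is exactly why SCD2 is slow without UPDATE.
        (untouched.unionByName(expired)
                  .unionByName(new_rows)
                  .write.mode("overwrite")
                  .saveAsTable("dw.dim_customer_next"))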

    Sorry that there isn't a simple "Do this!" answer, but this is a complex topic in an immature field. I hope these comments help your thinking.
      May 24, 2019 12:23 PM IST
  • Big data refers to the volume, variety, and velocity of data: how big the data is, how fast it arrives, and the variety of forms it takes determine so-called “Big Data”. The three V's of big data were articulated by industry analyst Doug Laney in the early 2000s.

    • Volume. Organizations collect data from a variety of sources, including business transactions, social media, and information from sensor or machine-to-machine data. In the past, storing it would’ve been a problem – but new technologies (such as Hadoop) have eased the burden.
    • Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors, and smart metering are driving the need to deal with torrents of data in near-real-time.
    • Variety. Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data, and financial transactions.

    WHY DOES ANY ORGANIZATION WANT BIG DATA OR DATA WAREHOUSES?

    • Big Data: Organizations want a big data solution because many corporations hold large amounts of data, and that data, if unlocked properly, can contain valuable information that leads to better decisions, which in turn can lead to more revenue, more profitability, and more customers. And that is what most corporations want.
    • Data Warehouse: Organizations need a data warehouse in order to make informed decisions. To really know what is going on in your corporation, you need data that is reliable, believable, and accessible to everyone.

    Both of the above look similar, but there is a clear difference: big data is a repository that holds lots of data without a settled idea of what to do with it, whereas a data warehouse is designed with the clear intention of enabling informed decisions. Further, big data can be used for data warehousing purposes.

      September 11, 2021 1:41 PM IST
  • A data warehouse is mainly an architecture, not a technology. It extracts data from a variety of SQL-based data sources (mainly relational databases) and helps in generating analytic reports. By definition, the data repository used for generating analytic reports is nothing but a data warehouse.

    In short, a data warehouse is an architecture used to organize the data.
      January 7, 2022 12:37 PM IST
  • The data lake and data warehouse processes are not the same. Dimensional modeling in the traditional sense starts with business process identification and star schema design, whereas on a data lake you don't commit to any assumption about the business process. The data lake collects data at as granular a level as possible; you then explore it and discover the business processes. You can read more about data lakes in An Introduction to enterprise data lake - The myths and miracles.
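
    As a small illustration of that exploration step (the path and field name are hypothetical), profiling raw events on the lake can reveal which business processes the data actually contains:

        # Profile raw, granular events on the data lake to discover the
        # business processes before committing to any star schema.
        from pyspark.sql import SparkSession, functions as F

        spark = SparkSession.builder.getOrCreate()

        # Hypothetical landing area of raw clickstream JSON.
        events = spark.read.json("hdfs:///lake/raw/clickstream/")

        # A simple frequency profile of event types suggests which
        # business processes are represented in the data.
        (events.groupBy("event_type")
               .count()
               .orderBy(F.desc("count"))
               .show())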
      September 15, 2021 3:04 PM IST
  • Big Data: Big data refers to data that is large in volume and made up of complex data sets. It can be structured, semi-structured, or unstructured, and it cannot be processed by traditional data processing software and databases. Operations such as analysis, manipulation, and transformation are performed on the data, which companies then use for intelligent decision making. Big data is a very powerful asset in today's world and can be used to tackle business problems through intelligent decision making.

    Data Warehouse: A data warehouse is a collection of data from various heterogeneous sources. It is the main component of a business intelligence system, where data is analyzed and managed and then used to improve decision making. It involves extraction, transformation, and loading (ETL) processes to provide data for analysis. Data warehouses are also used to run queries on large amounts of data, using data from various relational databases and application log files.
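
    A minimal sketch of that extract-transform-load flow (file, table, and column names are hypothetical; sqlite3 stands in for any relational warehouse target):

        import csv
        import sqlite3

        # Extract: read raw order records from an application export.
        with open("orders_export.csv", newline="") as f:
            rows = list(csv.DictReader(f))

        # Transform: clean and reshape the records for analysis.
        cleaned = [
            (r["order_id"], r["customer"].strip().upper(), float(r["amount"]))
            for r in rows
            if r["amount"]          # drop records with no amount
        ]

        # Load: write into the warehouse table that reports query.
        conn = sqlite3.connect("warehouse.db")
        conn.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                     "(order_id TEXT, customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", cleaned)
        conn.commit()
        conn.close()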

      November 20, 2021 12:36 PM IST