
Successful ETL Automation: Libraries, Review papers, Use Cases

  • I'm curious whether anyone can point to some successful extract, transform, load (ETL) automation libraries, papers, or use cases for somewhat inhomogeneous data?

    I would be interested to see any existing libraries dealing with scalable ETL solutions. Ideally these would be capable of ingesting 1-5 petabytes of data containing 50 billion records from 100 inhomogeneous data sets in tens or hundreds of hours, running on 4196 cores (256 i2.8xlarge AWS machines). I really do mean ideally, as I would be interested to hear about a system with even 10% of this functionality to help reduce our team's ETL load.

    Otherwise, I would be interested in any books, review articles, or high-quality research papers on the subject. I have done a literature review and have found only lower-quality conference proceedings with dubious claims.

    I've seen a few commercial products advertised, but again, these make dubious claims without much evidence of their efficacy.

    The datasets are rectangular and can take the form of fixed-width files, CSV, TSV, or PSV. The number of fields ranges from 6 to 150, and the fields contain mostly text-based information about entities. Cardinality is high for individual details such as addresses, but lower for categorical details like car type (van, SUV, sedan).

    Mappings from abbreviated data to human-readable formats are commonly needed, as is transformation of records to first normal form.
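
    To make those transformations concrete, here is a minimal pandas sketch of the deterministic version we write today (the file, column names, and code table are made-up examples, not our real data):

```python
import pandas as pd

# Hypothetical pipe-separated input file with made-up column names.
df = pd.read_csv("vehicles.psv", sep="|", dtype=str)

# 1. Map abbreviated codes to human-readable labels (hand-built today);
#    unknown codes are left untouched.
car_type_map = {"VN": "van", "SUV": "suv", "SD": "sedan"}
df["car_type"] = df["car_type"].map(car_type_map).fillna(df["car_type"])

# 2. Move toward first normal form: explode a multi-valued field
#    (e.g. "phone1;phone2") into one row per value.
df["phone"] = df["phone"].str.split(";")
df = df.explode("phone")
```

    It is exactly this kind of mapping and reshaping that I would like to learn across datasets rather than hand-write each time.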

    As is likely obvious to the cognoscenti, I am looking for techniques that move beyond deterministic methods, using some sort of semi-supervised or supervised learning model.

    I know this is a tall order, but I'm curious to assess the state-of-the-art before embarking on some ETL automation tasks to help guide how far to set our sights.

    Thanks for your help!

      June 12, 2019 12:00 PM IST
    0
  • I don't think you'll find anything that checks all of your requirements, but here are some things to look at:

    • Automated ETL mapping: There is a tool called Karma started by a team at USC's Information Sciences Institute. It learns from your ETL mappings and helps automate future mappings. It's the only open source tool I'm aware of that helps automate the ETL process, but I would be very interested if there are others out there.
    • Large-scale ETL: There are many, many tools you could look at for the scalability you are after. I can personally recommend looking at Storm and Spark. Storm is excellent for stringing a series of processing steps together that, given enough resources, can compute in near real time on streaming data. Similarly, Spark has a streaming component with a comparable use case, but standard batch Spark may fit your needs better if the data you need to ETL is a fixed set to be processed once (see the sketch after this list).
    • Data storage: You may also need to consider where all this data will live during the ETL lifetime. You may need something like Kafka to deal with large streams of data, or perhaps HDFS to store a static collection of files.
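
    As a rough illustration of the batch-Spark route (the paths, column names, and code table below are hypothetical, not anything from the question), ingesting one of the delimited datasets and mapping abbreviated codes to human-readable labels might look like this:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical pipe-separated input sitting on HDFS (or S3).
raw = (spark.read
       .option("sep", "|")
       .option("header", True)
       .csv("hdfs:///raw/vehicles/*.psv")
       .withColumnRenamed("car_type", "car_type_code"))

# Hand-built code-to-label mapping; this is the step the question
# hopes to learn automatically rather than write by hand.
code_map = spark.createDataFrame(
    [("VN", "van"), ("SUV", "suv"), ("SD", "sedan")],
    ["car_type_code", "car_type"])

clean = (raw
         .join(F.broadcast(code_map), "car_type_code", "left")
         .withColumn("address", F.trim(F.col("address"))))

# Land the cleaned records in a columnar format for downstream loading.
clean.write.mode("overwrite").parquet("hdfs:///clean/vehicles/")
```

    A job like this scales out by adding executors; the hard part the question is really asking about, inferring the mappings and schemas in the first place, is what a tool like Karma tries to help with.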
      June 12, 2019 12:02 PM IST
    0
  • As is usual in business, we often deal with large amounts of data for very diverse purposes. Building robust pipelines can get quite complex, especially when data is scarce and the transformation processes involve a lot of technicalities.

    Efficient data transportation relies on three important blocks:

    1. Data Producers

    Data source points where the raw data lies, ready to be fetched.

    2. Transformation and Transportation Workflows

    ETL sub-processes that involve stacks of extraction, transformation, and data-loading layers to route the data to its corresponding endpoints.

    3. Data Consumers

    Final endpoints that use the clean, preprocessed information to perform high-end tasks.

    [Figure: ETL Schema | Source: InfoLibrarian]

    Pipelines are very generic and serve different purposes depending on the business plan. They usually share the same general concepts, but the implementation differs case by case.

    In our case, the data set is already prepared. What we need is to engineer an ETL process that transforms our data according to what we intend to do with it.
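
    A minimal Python sketch of the three blocks above, just to fix the vocabulary (the CSV file, the name column, and the SQLite target are hypothetical stand-ins for real producers and consumers):

```python
import csv
import sqlite3

def extract(path):
    """Producer side: pull raw rows from the source file."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Workflow: clean and reshape each record on the way through."""
    for row in rows:
        row["name"] = row["name"].strip().title()
        yield row

def load(rows, db_path):
    """Consumer side: persist the cleaned records at the endpoint."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS people (name TEXT)")
    con.executemany("INSERT INTO people (name) VALUES (?)",
                    [(r["name"],) for r in rows])
    con.commit()
    con.close()

# Wire the three blocks together: producer -> workflow -> consumer.
load(transform(extract("raw.csv")), "warehouse.db")
```

    Real pipelines swap each block for heavier machinery (message queues, Spark jobs, warehouses), but the overall shape stays the same.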

      December 28, 2021 12:29 PM IST
    0
  • Just about any ETL tool can manage fixed-width, CSV, TSV, or PSV input, and just about any tool should be able to manage 100B records. The limiting part of the question really has to do with what your destination format is and what disk throughput you need.

    Expected throughput on an i2.4xlarge is about 250 Mb/s (megabits per second). If an 8xlarge is double that, times 32 machines, you are looking at the ability to write a petabyte in ~138 hours, not to mention the time and bandwidth of bringing in the source data in the first place. Unless my math is completely off, that means 30 petabytes can be written to disk in about 6 months.
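
    Spelled out (assuming, as above, 250 megabits per second per i2.4xlarge, doubled for an 8xlarge, across 32 machines):

```python
# Back-of-the-envelope check of the write-time estimate above,
# assuming 250 megabits/s per i2.4xlarge, doubled for an i2.8xlarge.
per_node_mb_per_s = 250 / 8 * 2              # ~62.5 MB/s per i2.8xlarge
cluster_mb_per_s = per_node_mb_per_s * 32    # 32 machines -> ~2,000 MB/s

petabyte_mb = 1e9                            # 1 PB = 1e9 MB (decimal)
hours_per_pb = petabyte_mb / cluster_mb_per_s / 3600
print(round(hours_per_pb))                   # ~139 hours per petabyte

months_for_30_pb = 30 * hours_per_pb / (24 * 30)
print(round(months_for_30_pb, 1))            # ~5.8 months for 30 PB
```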

    It seems odd that you are looking to normalize that much data or turn it into a human-readable format (it's only going to get bigger), and even odder that you'd want to leverage machine learning as part of a transformation/load of that size.

    Your solution will need to be on local hardware in order to keep costs reasonable.

    I couldn't recommend a system (commercial or open source) that would scale to the degree necessary to perform this kind of ETL on 30 petabytes in a matter of days. At that scale, I'd be looking into lots of memory, RAM-backed/fronted SSDs, and custom development on FPGAs for the actual transformations. Of course, if my math on the write timing is wrong, this whole answer is invalid.

      June 14, 2019 12:09 PM IST
    0
  • Research on the ETL process has mainly focused on modeling and design at the conceptual [13], logical [37], and physical [33] levels. An ontology-based conceptual model for automatic data extraction was designed by Embley et al. [18]. Another semantic-web-based ETL design with a high level of automation is found in [30]. SysML-based conceptual ETL process modeling has been proposed in [13]. Automatic data loading into the warehouse [17] is performed following any business event from any application. An automated architecture is designed to optimize ETL throughput in [31]. Using a model-driven approach, [12, 24] designed an ETL process that automatically generates code from the conceptual model. A model-driven framework using the BPMN language is practiced in [9]: the model-to-text transformation can automatically produce code suitable for any commercial ETL tool, and the model-to-model transformations can automatically update it for maintenance purposes. An empirical analysis of such programmable ETL tools has been done in
      January 7, 2022 12:35 PM IST
    0
  • ETL is one of the most critical and time-consuming parts of data warehousing. One way that companies have been able to reduce the amount of time and resources spent on ETL workloads is through the use of ETL automation, most commonly available in the form of Data Warehouse Automation.

    Developers, data analysts, and IT teams can leverage a BI ETL tool or data integration software to automatically generate ETL code and more quickly and easily execute and manage ETL processes. As the volume, variety, and velocity of data flows increase, IT teams are finding that their existing ETL tools are unable to provide the data delivery speeds, flexibility, and extensibility necessary for delivering successful BI projects. What they need are real-time data warehousing solutions that help them overcome the limits of traditional ETL and accelerate high-volume data acquisition.

    ETL automation tools are appealing for several reasons. First, writing ETL scripts is complicated and prone to error, and over time, modifying and troubleshooting ETL processes becomes increasingly difficult. An automated ETL solution allows IT teams or data integration specialists to design, execute, and monitor the performance of ETL integration workflows through a simple point-and-click graphical interface. An ETL automation tool eliminates the need for manual coding and provides a convenient and easy way to manage data flows and implement basic data transformations.

    In addition to providing a visual overview of data processing structures, ETL automation tools may offer built-in connectors optimized for common data sources and targets as well as advanced data profiling and cleansing capabilities.

    Implement Real-Time Data Warehousing with Qlik Compose (formerly Attunity Compose)

    Qlik (Attunity) offers innovative data integration and big data management solutions that help our clients stay agile. Qlik Compose (formerly Attunity Compose), our data warehouse automation solution, provides ETL automation for the agile data warehouse. In addition to supporting GUI-driven design and creation of data warehouses and data marts, Qlik Compose integrates seamlessly with Qlik Replicate (formerly Attunity Replicate), a powerful data replication and data ingestion platform that supports real-time ETL and allows you to deliver the freshest data possible to your data warehouse.

    Equipped with our unique CDC (change data capture) technology and in-memory stream processing, Qlik Replicate lets you extract data from the broadest range of platforms, including all major databases, mainframes, and business applications, and deliver it efficiently to a wide range of databases, data warehouses, and big data platforms. CDC minimizes the need for bulk data transfer and enables continuous loading of changed data for real-time data warehousing. And for your most resource-intensive ETL workloads, Qlik Replicate can help you facilitate and accelerate ETL offload to Hadoop environments.
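
    The core CDC idea is easy to sketch. The following is a generic, watermark-based illustration of incremental loading, not Qlik's API; table and column names are hypothetical, and log-based CDC avoids even this polling query:

```python
import sqlite3

# Generic illustration of change-based loading: only rows modified since
# the last run are moved, instead of a full bulk reload. Table and column
# names are hypothetical; real CDC tools read the database log rather than
# polling an updated_at column.
def load_changes(source_db, target_db, last_watermark):
    src = sqlite3.connect(source_db)
    tgt = sqlite3.connect(target_db)
    rows = src.execute(
        "SELECT id, payload, updated_at FROM orders WHERE updated_at > ?",
        (last_watermark,)).fetchall()
    tgt.executemany(
        "INSERT OR REPLACE INTO orders (id, payload, updated_at) "
        "VALUES (?, ?, ?)", rows)
    tgt.commit()
    src.close()
    tgt.close()
    # The new watermark becomes the starting point for the next run.
    return max((r[2] for r in rows), default=last_watermark)
```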

      December 14, 2021 11:44 AM IST
    0