I'm curious if anyone can point to some successful extract, transform, load (ETL) automation libraries, papers, or use cases for somewhat inhomogeneous data?
I would be interested to see any existing libraries dealing with scalable ETL solutions. Ideally these would be capable of ingesting 1-5 petabytes of data containing 50 billion records from 100 inhomogeneous data sets in tens or hundreds of hours running on 4196 cores (256 i2.8xlarge AWS machines). I really do mean ideally, as I would be interested to hear about a system with even 10% of this functionality to help reduce our team's ETL load.
Otherwise, I would be interested to see any books, review articles, or high-quality research papers on the subject. I have done a literature review and have only found lower-quality conference proceedings with dubious claims.
I've seen a few commercial products advertised, but again, these make dubious claims without much evidence of their efficacy.
The datasets are rectangular and can take the form of fixed-width files, CSV, TSV, or PSV. The number of fields ranges from 6 to 150, and the fields contain mostly text-based information about entities. Cardinality is large for individual information (e.g., address), but smaller for specific details like car type (van, SUV, sedan).
Mappings from abbreviated data to human-readable formats are commonly needed, as is transformation of records to first normal form.
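To make the kind of transformation I mean concrete, here is a minimal pandas sketch; the column names, abbreviation table, and sample records are invented for illustration, not taken from our actual feeds:

```python
# Hypothetical sketch: normalizing one small pipe-separated feed with pandas.
import io
import pandas as pd

raw = io.StringIO(
    "id|name|car_type|phones\n"
    "1|Ada|sdn|555-0100;555-0101\n"
    "2|Bob|suv|555-0102\n"
)

# PSV ingestion; sep="|" handles pipe-separated values.
df = pd.read_csv(raw, sep="|")

# Map abbreviated codes to human-readable labels.
car_types = {"sdn": "sedan", "suv": "SUV", "van": "van"}
df["car_type"] = df["car_type"].map(car_types)

# First normal form: one phone number per row instead of a ';'-packed list.
df["phones"] = df["phones"].str.split(";")
df = df.explode("phones").rename(columns={"phones": "phone"})

print(df.to_string(index=False))
```

The pain point, of course, is that every one of the 100 feeds needs its own abbreviation table and repeating-group rules, which is exactly the hand work I'd like to automate away.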
As is likely obvious to the cognoscenti, I am looking for techniques that move beyond deterministic methods, using some sort of semi-supervised or supervised learning model.
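By way of illustration, the sort of learned component I have in mind might classify a column's semantic type from sample values, so new feeds can be mapped onto a canonical schema without hand-written rules. This toy scikit-learn sketch uses invented labels and training values; a real system would obviously need far more data and careful evaluation:

```python
# Hypothetical sketch: infer a column's semantic type from its values.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set: sample values paired with their semantic type.
train_values = ["12 Oak St", "99 Elm Ave", "van", "suv", "sedan", "7 Pine Rd"]
train_labels = ["address", "address", "car_type", "car_type", "car_type", "address"]

# Character n-grams cope with abbreviations and fixed-width padding artifacts.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_values, train_labels)

print(clf.predict(["41 Birch Blvd", "suv"]))
```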
I know this is a tall order, but I'm curious to assess the state-of-the-art before embarking on some ETL automation tasks to help guide how far to set our sights.
Thanks for your help!
I don't think you'll find anything that checks all of your requirements, but here are some things to look at:
As is usual in business, we often deal with large amounts of data for very diverse purposes. Building robust pipelines can get quite complex, especially when data is scarce and transformation processes involve a lot of technicalities.
Efficient data transportation relies on three important blocks:
Data source points where raw data lies ready to be fetched.
ETL sub-processes that involve stacked extraction, transformation, and loading layers to route the data to its corresponding end points.
Final end points that consume the clean, preprocessed information to perform high-end tasks.
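The three blocks above can be sketched as composable stages. Everything in this sketch (the sample input, the table name, the transformation) is a placeholder, not a real system:

```python
# Minimal sketch of the three pipeline blocks as Python callables.
import csv
import io
import sqlite3

def extract(stream):
    """Source point: yield raw records from a CSV stream."""
    yield from csv.DictReader(stream)

def transform(rows):
    """ETL sub-process: clean and reshape each record."""
    for row in rows:
        row["name"] = row["name"].strip().title()
        yield row

def load(rows, conn):
    """End point: persist clean records for downstream use."""
    conn.execute("CREATE TABLE people (id TEXT, name TEXT)")
    conn.executemany("INSERT INTO people VALUES (:id, :name)", rows)

raw = io.StringIO("id,name\n1,  ada lovelace\n2,grace hopper \n")
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
print(conn.execute("SELECT name FROM people").fetchall())
```

Because each stage is a generator, records stream through one at a time rather than being materialized in memory, which is the property you want when the same shape has to scale out across many machines.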
Pipelines are very generic and serve different purposes depending on the business plan. They usually share these general concepts, but the implementation differs case by case.
In our case, the data set is already prepared. What we need is to engineer an ETL process that transforms our data according to what we intend to do with it.
Just about any ETL tool can manage fixed-width, CSV, TSV, or PSV input, and just about any tool should be able to manage 100B records. The limiting part of the question is really your destination format and the disk throughput you need.
Expected write throughput on an i2.4xlarge is 250 MB/s. If an 8xlarge is double that, then 32 machines give you roughly 16 GB/s in aggregate, which is the ability to write a petabyte in about 17 hours, not to mention the time and bandwidth of bringing in the source data in the first place. Unless my math is completely off, that means 30 petabytes can get written to disk in about three weeks of sustained writes.
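For what it's worth, the arithmetic can be checked directly. The 250 MB/s per-i2.4xlarge figure and the doubling for an 8xlarge are my assumptions from above; the rest is division:

```python
# Back-of-the-envelope check of the write-throughput estimate.
per_machine = 2 * 250e6              # bytes/s, assuming an 8xlarge doubles a 4xlarge
machines = 32
aggregate = per_machine * machines   # ~16 GB/s across the fleet

petabyte = 1e15                      # bytes
hours = petabyte / aggregate / 3600
print(f"{hours:.1f} hours per petabyte")
```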
It seems odd that you want to normalize or convert that much data into human-readable form (it's only going to get bigger), and odder still that you'd want to leverage machine learning as part of a transformation/load of that size.
Your solution will need to be on local hardware in order to keep costs reasonable.
I couldn't recommend a system (commercial or open source) that would scale to the degree necessary to perform this kind of ETL on 30 petabytes in a matter of days. At that scale, I'd be looking into lots of memory, RAM-backed/fronted SSDs, and custom development on FPGAs for the actual transformations. Of course, if my math on the write timing is wrong, this whole answer is invalid.
ETL is one of the most critical and time-consuming parts of data warehousing. One way that companies have been able to reduce the amount of time and resources spent on ETL workloads is through the use of ETL automation, most commonly available in the form of data warehouse automation.
Developers, data analysts, and IT teams can leverage a BI ETL tool or data integration software to automatically generate ETL code and more quickly and easily execute and manage ETL processes. As the volume, variety, and velocity of data flows increase, IT teams are finding that their existing ETL tools are unable to provide the data delivery speeds, flexibility, and extensibility necessary for delivering successful BI projects. What they need are real-time data warehousing solutions that help them overcome the limits of ETL automation and accelerate high-volume data acquisition.
ETL automation tools are appealing for several reasons. First, writing ETL scripts is complicated and prone to error, and over time, modifying and troubleshooting ETL processes becomes increasingly difficult. An automated ETL solution allows IT teams or data integration specialists to design, execute, and monitor the performance of ETL integration workflows through a simple point-and-click graphical interface. An ETL automation tool eliminates the need for manual coding and provides a convenient and easy way to manage data flows and implement basic data transformations.
In addition to providing a visual overview of data processing structures, ETL automation tools may offer built-in connectors optimized for common data sources and targets as well as advanced data profiling and cleansing capabilities.
Qlik (Attunity) offers data integration and big data management solutions that help our clients stay agile. Qlik Compose (formerly Attunity Compose), our data warehouse automation solution, provides ETL automation for the agile data warehouse. Along with supporting GUI-driven design and creation of data warehouses and data marts, Qlik Compose integrates seamlessly with Qlik Replicate (formerly Attunity Replicate), a data replication and ingestion platform that supports real-time ETL and lets you deliver the freshest data possible to your data warehouse.
Equipped with our CDC (change data capture) technology and in-memory stream processing, Qlik Replicate lets you extract data from the broadest range of platforms, including all major databases, mainframes, and business applications, and deliver it efficiently to a wide range of databases, data warehouses, and big data platforms. CDC minimizes the need for bulk data transfer and enables continuous loading of changed data for real-time data warehousing. And for your most resource-intensive ETL workloads, Qlik Replicate can help you facilitate and accelerate ETL offload to Hadoop environments.