I am confused as to where Talend and Apache Spark fit in the big data ecosystem, since both can be used for ETL.
Could someone please explain this with an example?
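Not from the original thread, but a minimal sketch may help frame the question: Talend is a GUI-driven tool that generates ETL jobs, while Spark is a distributed compute engine you program directly (Talend can even generate Spark jobs as its execution layer). The extract/transform/load steps themselves look the same either way; here they are in plain Python with made-up data, standing in for what Spark would run across a cluster:

```python
# Illustrative extract-transform-load pipeline in plain Python.
# Spark (e.g. via pyspark) would distribute these same steps across a
# cluster; Talend would generate an equivalent job from a GUI design.
# All data below is invented for the example.

raw_rows = [
    "alice,NY,100",
    "bob,CA,250",
    "carol,NY,175",
]

def extract(rows):
    """Parse raw CSV-like strings (extract)."""
    return [r.split(",") for r in rows]

def transform(records):
    """Normalize names and cast amounts to int (transform)."""
    return [(name.title(), state, int(amount)) for name, state, amount in records]

def load(records, target):
    """Append into the 'warehouse' (load) -- here just a list."""
    target.extend(records)
    return target

warehouse = []
load(transform(extract(raw_rows)), warehouse)
print(warehouse)  # [('Alice', 'NY', 100), ('Bob', 'CA', 250), ('Carol', 'NY', 175)]
```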
For some reason my MDF file is 154 GB, even though I only loaded about 7 GB of data from flat files. Why is the MDF file so much larger than the actual source data?
More info:
Only a few tables with ~25 million rows. No large varchar fields (the biggest is varchar(300); most are varchar(50) or smaller). The tables are not very wide (< 20 columns). Also, none of the large tables are indexed yet; the tables with indexes have fewer than 1 million rows. I don't use char, only varchar, for strings, so the datatype is not the issue.
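A rough back-of-envelope check (my assumed row width, not a figure from the post) shows why ~25 million narrow rows should only come to a few GB of data pages, so a 154 GB file has to contain something other than table data:

```python
# Back-of-envelope estimate of on-disk table size in SQL Server.
# avg_row_bytes is an assumption; the real value depends on the schema.
rows = 25_000_000
avg_row_bytes = 200           # < 20 columns, mostly small varchars (assumed)
page_bytes = 8192             # SQL Server stores rows in 8 KB data pages
rows_per_page = page_bytes // avg_row_bytes
pages = -(-rows // rows_per_page)        # ceiling division
size_gb = pages * page_bytes / 1024**3
print(f"~{size_gb:.1f} GB of data pages")  # ~4.8 GB
```

Even doubling the assumed row width only reaches ~10 GB, nowhere near 154 GB.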
It turned out to be the log (LDF) file, not the MDF file. The MDF file is actually 24 GB, which seems more reasonable, though still big IMHO.
UPDATE:
I fixed the problem with the LDF (log) file by changing the recovery model from FULL to SIMPLE. This is okay because this server is only used for internal development and ETL processing. In addition, before changing to SIMPLE I had to shrink the log file. Shrinking is not recommended in most cases, but this was one of those cases where the log file should never have grown that large in the first place.
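For reference, the two steps described above can be sketched in T-SQL (the database and logical file names below are placeholders; check yours with `sp_helpfile`):

```sql
-- Shrink the log file first (target size in MB), then switch the
-- recovery model, as described above. Names are placeholders.
USE MyEtlDb;
DBCC SHRINKFILE (MyEtlDb_log, 1024);
ALTER DATABASE MyEtlDb SET RECOVERY SIMPLE;
```

Note that under SIMPLE recovery you lose point-in-time restore, which is why this is only reasonable for a dev/ETL box like the one described.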
I know that ETL stands for Extract, Transform, and Load data into a new target database. But at what scope does it still count as ETL? For example, if I want to move a contact database with 7000 records into a CRM system, does this process count as ETL as well?
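Even a small migration like the one in the question exercises all three steps; here is a toy sketch (the source fields, CRM field names, and data are all invented for illustration):

```python
# Toy migration of contact records into a hypothetical CRM schema.
# Source format, target schema, and data are invented for the example.

source_contacts = [
    {"fname": "ada", "lname": "lovelace", "mail": "ADA@EXAMPLE.COM"},
    {"fname": "alan", "lname": "turing", "mail": "alan@example.com"},
]

def to_crm_record(c):
    # Transform: merge name fields and lowercase the e-mail address,
    # matching what the (hypothetical) CRM expects.
    return {
        "full_name": f"{c['fname'].title()} {c['lname'].title()}",
        "email": c["mail"].lower(),
    }

crm = []  # Load target -- a stand-in for the CRM's import API
crm.extend(to_crm_record(c) for c in source_contacts)
print(crm[0])  # {'full_name': 'Ada Lovelace', 'email': 'ada@example.com'}
```

The record count changes the engineering (batching, restartability), not whether the process is ETL.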
I'm curious if anyone can point to some successful extract, transform, load (ETL) automation libraries, papers, or use cases for somewhat inhomogeneous data.
I would be interested to see any existing libraries dealing with scalable ETL solutions. Ideally these would be capable of ingesting 1-5 petabytes of data, containing 50 billion records from 100 inhomogeneous data sets, in tens or hundreds of hours running on 4196 cores (256 i2.8xlarge AWS machines). I really do mean ideally; I would be interested to hear about a system with even 10% of this functionality to help reduce our team's ETL load.
Otherwise, I would be interested in any books, review articles, or high-quality research papers on the subject. I have done a literature review and have found only lower-quality conference proceedings with dubious claims.
I've seen a few commercial products advertised, but again, these make dubious claims without much evidence of their efficacy.
The datasets are rectangular and can take the form of fixed...