I have a simple ETL process in an Azure environment: blob storage > Data Factory > data lake raw > Databricks > data lake curated > data warehouse (main ETL). The datasets for this project are not very big (~1 million rows, 20 columns, give or take), but I would like to keep them properly partitioned in my data lake as Parquet files. Currently I run some simple logic to figure out where in my lake each file should sit, based on business calendars. The files vaguely look like this:
Year Week Data
2019 01 XXX
2019 02 XXX
I then partition a given file into the following format, replacing data that already exists and creating new folders for new data.
curated
  dataset
    Year 2019
      - Week 01 - file.pq + metadata
      - Week 02 - file.pq + metadata
      - Week 03 - file.pq + metadata   # (pre-existing file)
The metadata are the auto-generated success and commit files.
To this end I use the following query in...
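For comparison, a minimal PySpark sketch of writing the Year/Week layout described above might look like this; the mount paths and app name are placeholders, not the asker's actual query:

```python
# Minimal sketch, assuming Spark on Databricks with the lake mounted under /mnt.
# Paths and the Year/Week column names are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("curate-weekly-dataset")
    # Overwrite only the Year/Week partitions present in the incoming data,
    # leaving pre-existing weeks (e.g. Week 03) untouched.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

df = spark.read.parquet("/mnt/datalake/raw/dataset/")   # raw zone

(
    df.write
    .mode("overwrite")
    .partitionBy("Year", "Week")                        # Year=2019/Week=01/...
    .parquet("/mnt/datalake/curated/dataset/")          # curated zone
)
```

Note that `partitionBy` produces Hive-style folder names such as `Year=2019/Week=01` rather than the bare `Year 2019 / Week 01` layout shown above, and the `_SUCCESS`/commit files it writes presumably correspond to the auto-generated metadata mentioned in the post.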
I am a learner of Big Data concepts. Based on my understanding, Big Data is critical for handling unstructured data and high volume. When we look at the big data architecture for a data warehouse (DW), the data from the source is extracted through Hadoop (HDFS and MapReduce), the relevant unstructured information is converted into valid business information, and finally the data is loaded into the DW or data mart through ETL processing (along with the existing structured data processing).
However, I would like to know what new techniques, dimensional models, or storage requirements Big Data introduces at the DW for an organization, since most of the tutorials/resources I try to learn from only talk about Hadoop at the source but not at the target. How does the introduction of Big Data impact an organization's predefined reports and ad-hoc analysis, given this high volume of data?
Appreciate your response.
I'm curious if anyone can point to some successful extract, transform, load (ETL) automation libraries, papers, or use cases for somewhat inhomogeneous data?
I would be interested to see any existing libraries dealing with scalable ETL solutions. Ideally these would be capable of ingesting 1-5 petabytes of data containing 50 billion records from 100 inhomogeneous data sets in tens or hundreds of hours, running on 4196 cores (256 i2.8xlarge AWS machines). I really do mean ideally, as I would be interested to hear about a system with 10% of this functionality to help reduce our team's ETL load.
Otherwise, I would be interested to see any books or review articles on the subject or high quality research papers. I have done a literature review and have only found lower quality conference proceedings with dubious claims.
I've seen a few commercial products advertised, but again, these make dubious claims without much evidence of their efficacy.
The datasets are rectangular and can take the form of fixed...
I have a bunch of client point of sale (POS) systems that periodically send new sales data to one centralized database, which stores the data in one big database for report generation.
The client POS is based on PHPPOS, and I have implemented a module that uses the standard XML-RPC library to send sales data to the service. The server system is built on CodeIgniter, and uses the XML-RPC and XML-RPCS libraries for the webservice component. Whenever I send a lot of sales data (as little as 50 rows from the sales table, and individual rows from sales_items pertaining to each item within the sale) I get the following error:
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 54 bytes)
128M is the default value in php.ini, but I assumed that would be a hard limit to hit. In fact, I have even tried setting this value to 1024M, and all that does is take longer to error out.
As for steps I've taken, I've tried disabling all processing on the server-side, and have rigged it...
We are working on a data warehouse for a bank and have pretty much followed the standard Kimball model of staging tables, a star schema, and an ETL to pull the data through the process.
Kimball talks about using the staging area for import, cleaning, processing and everything until you are ready to put the data into the star schema. In practice this typically means uploading data from the sources into a set of tables with little or no modification, followed by optionally taking the data through intermediate tables until it is ready to go into the star schema. That's a lot of work for a single entity; there is no single responsibility here.
Previous systems I have worked on have made a distinction between the different sets of tables, to the extent of having:
Upload tables: raw source system data, unmodified
Staging tables: intermediate processing, typed and cleansed
Warehouse tables
You can stick these in separate schemas and then apply differing policies for archive/backup/security, etc. One of the other guys...
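As a hedged sketch of the schema separation described above (schema, table, and DSN names are made up for illustration, assuming a SQL Server warehouse reachable via pyodbc), the layered layout might look like this:

```python
# Hypothetical sketch of the upload/staging/warehouse separation described
# above; schema, table, and DSN names are illustrative, not from the post.
import pyodbc

statements = [
    "CREATE SCHEMA upload",     # raw source-system data, unmodified
    "CREATE SCHEMA staging",    # typed, cleansed, intermediate processing
    "CREATE SCHEMA warehouse",  # star-schema facts and dimensions
    "CREATE TABLE upload.customer (raw_row NVARCHAR(MAX))",
    """CREATE TABLE staging.customer (
           customer_id INT, name NVARCHAR(200), valid_from DATE)""",
    """CREATE TABLE warehouse.dim_customer (
           customer_key INT IDENTITY PRIMARY KEY,
           customer_id  INT, name NVARCHAR(200))""",
]

conn = pyodbc.connect("DSN=bank_dw", autocommit=True)
for stmt in statements:
    conn.execute(stmt)  # each schema can now carry its own backup/security policy
```

Keeping the layers in separate schemas also makes it easy to grant the ETL account write access to upload and staging while report users only see the warehouse schema.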
For some reason my MDF file is 154 GB; however, I only loaded 7 GB worth of data from flat files. Why is the MDF file so much larger than the actual source data?
More info:
Only a few tables, with ~25 million rows. No large varchar fields (the biggest is varchar(300); most are less than varchar(50)). The tables are not very wide (< 20 columns). Also, none of the large tables are indexed yet; tables with indexes have less than 1 million rows. I don't use char, only varchar, for strings. Data type is not the issue.
It turned out it was the log file, not the MDF file. The MDF file is actually 24 GB, which seems more reasonable, though still big IMHO.
UPDATE:
I fixed the problem with the LDF (log) file by changing the recovery model from FULL to SIMPLE. This is okay because this server is only used for internal development and ETL processing. In addition, before changing to SIMPLE I had to shrink the log file. Shrinking is not recommended in most cases; however, this was one of those cases where the log file should have never...
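For reference, the fix described above can be scripted; here is a minimal sketch, assuming a SQL Server dev box reachable via pyodbc, where the database name, logical log file name, and DSN are placeholders:

```python
# Hypothetical sketch of the fix described above: switch to the SIMPLE
# recovery model, then shrink the already-bloated log file once.
# The database name, logical log file name, and DSN are placeholders.
import pyodbc

conn = pyodbc.connect("DSN=etl_dev", autocommit=True)   # DDL/DBCC need autocommit
conn.execute("ALTER DATABASE MyEtlDb SET RECOVERY SIMPLE")
conn.execute("USE MyEtlDb")
# Shrinking is generally discouraged, but acceptable on a dev/ETL-only server
# where the log should never have grown this large in the first place.
conn.execute("DBCC SHRINKFILE (MyEtlDb_log, 1024)")     # target size in MB
```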
I am concerned about extracting data from MongoDB, since my application transacts most of its data through MongoDB.
I have worked with Sqoop to extract data and found that RDBMSs integrate well with HDFS via Sqoop. However, I have found no clear direction on extracting data from a NoSQL DB with Sqoop and dumping it into HDFS for processing large chunks of data. Please share your suggestions and findings.
I have extracted static information and transaction data from MySQL: I simply used Sqoop to store the data in HDFS and processed it there. Now I have live transactions for about 1 million unique email IDs per day, modelled in MongoDB. I need to move this data from MongoDB to HDFS for processing/ETL. How can I achieve this goal using Sqoop? I know I can schedule the task, but what is the best approach for getting data out of MongoDB via Sqoop?
Consider a 5-DataNode cluster with 2 TB of capacity. Data size varies from 1 GB to 2 GB during peak hours.
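Sqoop itself only talks to relational databases over JDBC, so it cannot pull from MongoDB directly; the usual options are the MongoDB Connector for Hadoop or a small export script. As a hedged sketch of the script route (host, database, collection, and path names are placeholders, not from the post):

```python
# Hypothetical daily extract: pull yesterday's transactions with pymongo,
# write them as line-delimited JSON, and push the file into HDFS.
import json
import subprocess
from datetime import datetime, timedelta

from pymongo import MongoClient

client = MongoClient("mongodb://mongo-host:27017/")       # placeholder host
coll = client["appdb"]["email_transactions"]               # placeholder names

since = datetime.utcnow() - timedelta(days=1)
local_path = "/tmp/email_transactions.json"

with open(local_path, "w") as out:
    # One JSON document per line, easy to read back with Hive or Spark.
    for doc in coll.find({"created_at": {"$gte": since}}):
        out.write(json.dumps(doc, default=str) + "\n")     # default=str handles ObjectId/dates

# Put the daily extract into a dated HDFS directory for downstream ETL.
hdfs_dir = "/data/raw/email_transactions/dt=" + since.strftime("%Y-%m-%d")
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir], check=True)
```

From there the files can be processed with Hive, Pig, or Spark just like the MySQL extracts.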
I know that ETL stands for Extract, Transform and Load data into a new target database. But in what scope does it still count as ETL? For example, if I want to move a contact database with 7000 records into a CRM software, does this process count as ETL as well?
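To make the scope concrete, a contact migration like the one described involves the same three steps, just at small scale; a minimal hypothetical sketch (the file name, column names, and CRM endpoint are made up for illustration):

```python
# Minimal hypothetical ETL: extract contacts from a CSV export, normalise a
# couple of fields, and load them into a CRM's REST API.
import csv
import requests

CRM_ENDPOINT = "https://crm.example.com/api/contacts"   # hypothetical API

def transform(row):
    # Transform: trim whitespace, normalise email case, split the name.
    first, _, last = row["full_name"].strip().partition(" ")
    return {
        "firstName": first,
        "lastName": last,
        "email": row["email"].strip().lower(),
    }

with open("contacts_export.csv", newline="") as f:                      # Extract
    for row in csv.DictReader(f):
        payload = transform(row)                                         # Transform
        requests.post(CRM_ENDPOINT, json=payload).raise_for_status()     # Load
```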
I can't wrap my head around the basic theoretical concept of 'Operational and Analytical Big Data'.
According to me:
Operational Big Data: the branch where we can perform read/write operations on big data using specially designed databases (NoSQL). Somewhat similar to ETL in an RDBMS.
Analytical Big Data: the branch where we analyse data in retrospect and draw predictions using techniques like MPP and MapReduce. Somewhat similar to reporting in an RDBMS.
(Please feel free to correct wherever I'm wrong, it's just my understanding.)
So according to me, Hadoop is used for Analytical Big Data, where we just process data for analysis but don't tamper with the original data, and hence it is not an ideal choice for ETL. But recently I came across this article, which advocates using Hadoop for ETL: https://www.datanami.com/2014/09/01/five-steps-to-running-etl-on-hadoop-for-web-companies/
This is kind of a naive question, but I am new to the NoSQL paradigm and don't know much about it. So could somebody help me clearly understand the difference between HBase and Hadoop, or give some pointers that might help me understand the difference?
So far, I have done some research, and according to my understanding, Hadoop provides a framework to work with raw chunks of data (files) in HDFS, while HBase is a database engine on top of Hadoop which basically works with structured data instead of raw data chunks. HBase provides a logical layer over HDFS, just as SQL does. Is that correct?
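To make that contrast concrete, here is a hedged sketch (host, table, and file names are placeholders; happybase is a common Python Thrift client for HBase): HDFS is written to and read as whole files, while HBase gives keyed, random read/write access to individual rows stored on top of HDFS.

```python
# Hypothetical sketch contrasting the two layers. Assumes an HBase Thrift
# server is running and a 'users' table with column family 'info' exists.
import subprocess
import happybase   # Thrift-based HBase client

# HDFS: files are written and read as raw chunks of data.
subprocess.run(["hdfs", "dfs", "-put", "-f", "events.log", "/data/raw/"], check=True)

# HBase: structured rows addressed by key, readable and updatable in place.
conn = happybase.Connection("hbase-thrift-host")          # placeholder host
table = conn.table("users")
table.put(b"user:42", {b"info:email": b"someone@example.com"})
print(table.row(b"user:42"))                               # random read by key
```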