Will sqoop export create duplicates when the number of mappers is higher than the number of blocks in the source hdfs location?
My source HDFS directory has 24 million records, and when I do a Sqoop export to a Postgres table, it somehow creates duplicate records. I have set the number of mappers to 24, but there are only 12 blocks in the source location.
Any idea why Sqoop is creating duplicates?
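For reference, an export like the one described might look like the sketch below; the JDBC URL, table name, and directory are placeholders, not taken from the question. Note that when `--num-mappers` exceeds the number of input splits, Sqoop simply runs fewer tasks, so the mapper count alone should not duplicate rows; a more common cause is a failed-and-retried export, since `sqoop export` is not idempotent unless a `--staging-table` is used.

```shell
# Hedged sketch of the export described above; connection string,
# table, and paths are hypothetical.
sqoop export \
  --connect jdbc:postgresql://db-host:5432/mydb \
  --username etl_user -P \
  --table target_table \
  --export-dir /data/source_dir \
  --num-mappers 24 \
  --staging-table target_table_stg \
  --clear-staging-table \
  --input-fields-terminated-by ','
```

With a staging table, rows are inserted into `target_table_stg` first and moved to the target in a single transaction, so task retries cannot leave duplicates behind.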
I have an option of using Sqoop or Informatica Big Data Edition to source data into HDFS. The source systems are Teradata and Oracle.
I would like to know which one is better, and the reasoning behind it.
Note: My current utility is able to pull data into HDFS using Sqoop, create a Hive staging table, and archive to an external table.
Informatica is the ETL tool used in the organization.
Regards, Sanjeeb
What are the differences between Apache Spark SQLContext and HiveContext?
Some sources say that since HiveContext is a superset of SQLContext, developers should always use HiveContext, which has more features than SQLContext. But the current APIs of the two contexts are mostly the same.
In what scenarios is SQLContext or HiveContext more useful?
Is HiveContext only useful when working with Hive?
Or is SQLContext all that is needed when implementing a big data app with Apache Spark?
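As a minimal sketch of the distinction (assuming a Spark 1.x installation, where both context classes exist; in Spark 2.x they are subsumed by SparkSession), the two contexts are created the same way but differ in what they can reach:

```python
# Sketch for Spark 1.x; requires a Spark installation (assumption).
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext

sc = SparkContext("local", "context-demo")

# SQLContext: basic DataFrame/SQL support, no Hive dependency.
sql_ctx = SQLContext(sc)

# HiveContext: adds HiveQL parsing, Hive UDFs, and access to tables
# registered in the Hive metastore (needs Hive classes on the classpath).
hive_ctx = HiveContext(sc)
```

In practice, HiveContext is needed for HiveQL-only features (window functions in older releases, Hive UDFs, metastore tables); for a self-contained app that only queries its own DataFrames, SQLContext suffices.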
From the official Hive documentation: "Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets or test queries."
I'm not an expert in database architecture, and I would like to know if there is an alternative when the assumption above does not hold, that is, when queries are made over a big data set.
I am trying to understand what would be the best big data solution for reporting purposes.
Currently I narrowed it down to HBase vs Hive.
Currently I narrowed it down to HBase vs Hive.
The use case is that we have hundreds of terabytes of data across hundreds of different files. The data is live and gets updated all the time. We need to provide the most efficient way to do reporting. We have dozens of different report pages, where each report consists of different types of numeric and graph data. For instance:
Show all users that logged in to the system in the last hour whose origin is the US.
Show a graph ordering games from most played to least played.
Of all users in the system, show the percentage of paying vs. non-paying users.
For a given user, show his entire history: how many games he played, what kind of games he played, and what his score was in each and every game.
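To give a sense of scale, the first report above could be expressed in HiveQL roughly as follows; the table and column names (`user_logins`, `user_id`, `login_ts`, `country`) are hypothetical:

```sql
-- Hypothetical schema: user_logins(user_id, login_ts, country)
SELECT DISTINCT user_id
FROM user_logins
WHERE country = 'US'
  AND login_ts >= unix_timestamp() - 3600;  -- logins within the last hour
```

On hundreds of terabytes this runs as a batch scan, which is the crux of the Hive-vs-HBase question: Hive makes such ad hoc queries easy to write but not interactive, while HBase serves point lookups quickly but needs the access pattern baked into the row key.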
The way I see it, there are 3 solutions:
Store all data in Hadoop and do the queries in Hive. This might work, but I am not sure about the performance. How will it perform when the...
What are the benefits of using either Hadoop or HBase or Hive?
From my understanding, HBase avoids using MapReduce and has column-oriented storage on top of HDFS. Hive is an SQL-like interface for Hadoop and HBase.
I would also like to know how Hive compares with Pig.
I have an HBase table with 750GB of data. All the data in HBase is time-series sensor data, and my row key design is like this:
deviceID,sensorID,timestamp
I want to prepare all the data in HBase for batch processing (for example, CSV format on HDFS). But there is a lot of data in HBase. Can I prepare the data using Hive without fetching it piecemeal? If I fetch data by sensor ID (a scan query with start and end rows), I have to specify a start and end row every time, and I don't want to do that.
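One common approach, sketched here with hypothetical table and column-family names, is to map the HBase table into Hive via the HBase storage handler and then dump it to HDFS in a single statement, with no per-sensor scans:

```sql
-- Hypothetical names: adjust "sensor_table" and the column family "cf"
-- to the real HBase schema.
CREATE EXTERNAL TABLE sensor_data (rowkey STRING, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:value")
TBLPROPERTIES ("hbase.table.name" = "sensor_table");

-- Export the whole table to CSV on HDFS in one pass.
INSERT OVERWRITE DIRECTORY '/export/sensor_csv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM sensor_data;
```

Hive turns this into a full-table scan executed as a batch job, so the `deviceID,sensorID,timestamp` row key never needs to be enumerated by hand.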
I have a multinode Hadoop cluster set up with two nodes (one master node and one slave node), each with 8GB of RAM.
I have also configured Hive on the master node. Everything is up and working.
NodeManager and DataNode are running on the slave node.
ResourceManager, NameNode, and SecondaryNameNode are running on the master node.
I am able to access the Hive terminal as well, but I am not able to drop a database with the drop database databaseName; command. It shows no error but has been stuck for more than an hour. Three of the tables have size 10000 x 20; I thought these might be causing the slowness, so I wanted to delete the database. Since the drop database command does not work, is there any way to do it directly by deleting files?
I have tried to look under hive.metastore.warehouse.dir to delete the database directly, but that directory is completely empty.
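For reference, a manual cleanup would look roughly like the following, assuming the default warehouse path /user/hive/warehouse (an assumption; your configuration clearly differs, since that directory is empty). Note that deleting the HDFS files alone leaves stale entries in the metastore, so the metastore side must be dropped as well:

```shell
# Locate where the database actually lives (path is an assumption).
hdfs dfs -ls /user/hive/warehouse

# Remove the database directory on HDFS.
hdfs dfs -rm -r /user/hive/warehouse/databaseName.db

# The metastore still references the tables; drop them from Hive too.
# CASCADE drops the contained tables along with the database.
hive -e "DROP DATABASE IF EXISTS databaseName CASCADE;"
```

If the plain drop database hangs, a common culprit is a metastore lock or an unreachable metastore database rather than table size, which would also explain other Hive commands being slow.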
Similar slow behavior can be observed with other Hive commands as well. I am just able to run one...