Will sqoop export create duplicates when the number of mappers is higher than the number of blocks in the source hdfs location?
My source HDFS directory has 24 million records, and when I run a Sqoop export to a Postgres table, it somehow creates duplicate records. I have set the number of mappers to 24, and there are 12 blocks in the source location.
Any idea why Sqoop is creating duplicates?
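For context, the export invocation looks roughly like the sketch below (connection string, table, and paths are hypothetical stand-ins); -m is what controls the mapper count:

# hypothetical hosts and paths; -m 24 with only 12 blocks is the setup in question
sqoop export \
  --connect jdbc:postgresql://db-host:5432/mydb \
  --username dbuser \
  --table target_table \
  --export-dir /data/source \
  -m 24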
I am trying to read messages on a Kafka topic, but I am unable to read them. The process gets killed after some time, without reading any messages.
Here is the rebalancing error which I get:
ERROR Error processing message, stopping consumer: (kafka.consumer.ConsoleConsumer$)
kafka.common.ConsumerRebalanceFailedException: topic-1395414642817-47bb4df2 can't rebalance after 4 retries
at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:428)
at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:718)
at kafka.consumer.ZookeeperConsumerConnector$WildcardStreamsHandler.<init>(ZookeeperConsumerConnector.scala:752)
at kafka.consumer.ZookeeperConsumerConnector.createMessageStreamsByFilter(ZookeeperConsumerConnector.scala:142)
at kafka.consumer.ConsoleConsumer$.main(ConsoleConsumer.scala:196)
at ...
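With the old ZooKeeper-based consumer, one commonly suggested mitigation is to give the rebalance more headroom in the consumer properties; the values below are illustrative, not prescriptive:

# consumer.properties -- illustrative values, tune for your cluster
# (a common rule of thumb: rebalance.max.retries * rebalance.backoff.ms
#  should comfortably exceed zookeeper.session.timeout.ms)
rebalance.max.retries=10
rebalance.backoff.ms=2000
zookeeper.session.timeout.ms=10000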
I have the option of using Sqoop or Informatica Big Data Edition to source data into HDFS. The source systems are Teradata and Oracle.
I would like to know which one is better, and the reasoning behind it.
Note: my current utility can already pull data into HDFS using Sqoop, create a Hive staging table, and archive to an external table.
Informatica is the ETL tool used in the organization.
Regards, Sanjeeb
I have recently deployed a Big Data cluster that uses Apache Kafka and ZooKeeper, but I still don't understand their roles in the cluster. When are both required, and for what purpose?
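Concretely, the dependency is visible in the broker's own configuration: a Kafka broker must be pointed at a running ZooKeeper ensemble, which it uses to store cluster metadata (the address below is a placeholder):

# server.properties -- ZooKeeper must be up before the broker starts
zookeeper.connect=zk-host:2181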
I found many options recently, and I am interested in comparing them, primarily by maturity and stability.
Crunch - https://github.com/cloudera/crunch
Scrunch - ...
I am a newbie to Flume and Hadoop. We are developing a BI module where we can store all the logs from different servers in HDFS.
For this I am using Flume, and I have just started trying it out. I successfully created a node, but now I want to set up an HTTP source and a sink that writes incoming HTTP requests to a local file.
Any suggestions?
Thanks in advance.
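A minimal agent configuration along those lines, assuming Flume's built-in http source and file_roll sink (the agent name, port, and directory are placeholders):

# flume-http.conf -- names, port, and paths are illustrative
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# HTTP source: accepts events POSTed as JSON on port 8080
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 8080
a1.sources.r1.channels = c1

# in-memory channel buffering events between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# file_roll sink: writes events to rolling files in a local directory
a1.sinks.k1.type = file_roll
a1.sinks.k1.sink.directory = /var/log/flume
a1.sinks.k1.channel = c1

Start it with bin/flume-ng agent --conf conf --conf-file flume-http.conf --name a1; the http source's default JSON handler then accepts events such as curl -X POST -d '[{"headers":{},"body":"hello"}]' http://localhost:8080.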
One of the first things I think about when using a new service (such as a non-RDBMS data store or a message queue) is: "How should I structure my data?".
I've read and watched some introductory materials. In particular, take the paper Kafka: a Distributed Messaging System for Log Processing, which writes:
"a Topic is the container with which messages are associated"
"the smallest unit of parallelism is the partition of a topic. This implies that all messages that ... belong to a particular partition of a topic will be consumed by a consumer in a consumer group."
Knowing this, what would be a good example that illustrates how to use topics and partitions? When should something be a topic? When should something be a partition?
As an example, let's say my (Clojure) data looks like:
{:user-id 101 :viewed "/page1.html" :at #inst "2013-04-12T23:20:50.22Z"}
{:user-id 102 :viewed "/page2.html" :at #inst "2013-04-12T23:20:55.50Z"}
Should the topic be based on user-id? viewed? at? What about the ...
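One common pattern, sketched here with made-up names: one topic per event type (say, page-views), with partitions used purely for parallelism and messages keyed by user-id so each user's events stay ordered within a single partition. With 0.8.x-era tooling, creating such a topic looks like:

# hypothetical topic name and counts
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --topic page-views --partitions 12 --replication-factor 2

A producer that keys each message by user-id will hash every user onto a fixed partition, preserving per-user order while still spreading load across consumers in a group.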
Is there a way to purge a topic in Kafka?
I pushed a message that was too big into a Kafka topic on my local machine, and now I'm getting an error:
kafka.common.InvalidMessageSizeException: invalid message size
Increasing the fetch.size is not ideal here, because I don't actually want to accept messages that big.
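One widely used workaround, assuming your Kafka version supports per-topic config overrides (the flags moved to kafka-configs.sh in later releases, and the topic name here is a placeholder): temporarily shrink the topic's retention so the broker discards its log segments, then restore it:

# shrink retention so existing segments (including the oversized message) get deleted
bin/kafka-topics.sh --zookeeper localhost:2181 --alter \
  --topic my-topic --config retention.ms=1000
# wait for the cleanup to run, then remove the override
bin/kafka-topics.sh --zookeeper localhost:2181 --alter \
  --topic my-topic --delete-config retention.ms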
I am using hadoop-1.2.1, and the Sqoop version is 1.4.4.
I am trying to run the following command.
sqoop import --connect jdbc:mysql://IP:3306/database_name --table clients --target-dir /data/clients --username root --password-file /sqoop.password -m 1
sqoop.password is a file kept on HDFS at /sqoop.password with permissions 400.
It gives me this error:
Access denied for user 'root'@'IP' (using password: YES)
Can anyone provide a solution for this?
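One frequent culprit worth checking, offered here as an assumption rather than a diagnosis: a trailing newline in the password file makes the database see a different password. Writing the file from stdin with echo -n avoids that:

# write the password with no trailing newline, then lock down permissions
echo -n 'root_password' | hadoop fs -put - /sqoop.password
hadoop fs -chmod 400 /sqoop.password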
Can I import RDBMS table data (the table doesn't have a primary key) into Hive using Sqoop? If yes, can you please give the sqoop import command?
I have tried the general sqoop import command, but it failed.
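For reference: without a primary key, Sqoop cannot choose a split column on its own, so the usual options are a single mapper or an explicit --split-by. A sketch with placeholder connection details:

# option 1: single mapper, no split column needed
sqoop import --connect jdbc:mysql://host:3306/db --username user -P \
  --table mytable --hive-import -m 1

# option 2: parallel import, splitting on a column you choose
sqoop import --connect jdbc:mysql://host:3306/db --username user -P \
  --table mytable --hive-import --split-by some_indexed_column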
I am trying to use Kafka. All configurations are done properly, but when I try to produce a message from the console I keep getting the following error:
WARN Error while fetching metadata with correlation id 39 :
{4-3-16-topic1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
Kafka version: 2.11-0.9.0.0
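LEADER_NOT_AVAILABLE from the console producer often means the client cannot reach the broker at the address the broker advertises. One hedged check for a 0.9-era broker: set the advertised address in server.properties to a name the client can actually resolve (the hostname below is a placeholder), then restart the broker:

# server.properties -- use a hostname/IP your clients can reach
advertised.host.name=your.broker.hostname
port=9092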
I am exploring a few options to set up Kafka, and I know that ZooKeeper has to be up and running before Kafka can start.
I would like to know how I can find the following.
1) The hostname and port for my ZooKeeper instance. I checked zoo.cfg and could only find clientPort, not the hostname; will the hostname just be the hostname of my box?
2) How to check whether ZooKeeper is up and running. I tried ps -ef | grep "zoo" and could not find anything; maybe I am searching for the wrong keyword?
Any help would be really appreciated.
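A couple of hedged checks, with default ports and placeholder hosts. ZooKeeper answers four-letter commands on its client port, and its JVM's main class is QuorumPeerMain, which jps can find:

# 1) clientPort (2181 by default) lives on whichever host runs ZooKeeper;
#    with no server.N entries in zoo.cfg it runs standalone on that box
echo ruok | nc localhost 2181    # prints "imok" if the server is up
echo stat | nc localhost 2181    # prints version, mode, and connection counts

# 2) the server runs as a JVM, so search by main class or full name
jps -l | grep -i quorum
ps -ef | grep -i [z]ookeeper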
Can anybody explain Apache Flume to me in plain language? I'd appreciate an explanation with a practical example rather than abstract theoretical definitions, so I can understand it better.
What is it used for? At which stage of a Big Data analysis is it used?
And what are the prerequisites for learning it?
Please explain it as you would for a non-technical person.