Sqoop export duplicates

  • Will Sqoop export create duplicates when the number of mappers is higher than the number of blocks in the source HDFS location?

    My source HDFS directory has 24 million records, and when I run a Sqoop export to a Postgres table, it somehow creates duplicate records. I have set the number of mappers to 24, and there are 12 blocks in the source location.

    Any idea why Sqoop is creating duplicates?

    • Sqoop Version: 1.4.5.2.2.9.2-1
    • Hadoop Version: 2.6.0.2.2.9.2-1

    Sqoop command used:

    sqoop export -Dmapred.job.queue.name=queuename \
    --connect jdbc:postgresql://ServerName/database_name \
    --username USER --password PWD \
    --table Tablename \
    --input-fields-terminated-by "\001" --input-null-string "\\\\N" --input-null-non-string "\\\\N" \
    --num-mappers 24 \
    --export-dir $3/penet_baseline.txt -- --schema public;
      January 8, 2022 3:21 PM IST
  • If you used Sqoop incremental mode, there may be duplicate records on HDFS. Before running the export to Postgres, collect the unique records into one table, keeping the latest row per key based on max(date or timestamp column), and then export from that table. I think that should work; a sketch of the deduplication step is below.
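
    A minimal sketch of that step, assuming the data sits in a Hive table. The table names (penet_baseline, penet_baseline_dedup) and the columns (id, payload, updated_at) are hypothetical placeholders; substitute your own key and timestamp columns.

    # Keep only the newest row per key before exporting (hypothetical table/column names).
    hive -e "
    CREATE TABLE penet_baseline_dedup AS
    SELECT id, payload, updated_at
    FROM (
        SELECT id, payload, updated_at,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM penet_baseline
    ) ranked
    WHERE rn = 1;"

    Then point --export-dir at the HDFS location of the deduplicated table instead of the original directory.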

     
      February 11, 2022 12:40 PM IST
  • @bagavathi, you mentioned that duplicate rows appear in the target table, that adding a PK constraint fails due to a PK violation, and that the source has no duplicate rows. One possible scenario is that the target table already contained records, perhaps left over from a previous incomplete Sqoop job. Check whether the target table already has keys that also exist in the source; a quick way to list duplicated keys is sketched below.
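
    Something like this against Postgres will show key values that occur more than once in the target table (Tablename and the connection details come from the question; the key column id is a hypothetical placeholder):

    # List key values that occur more than once in the target table.
    psql -h ServerName -U USER -d database_name \
      -c "SELECT id, COUNT(*) AS cnt FROM Tablename GROUP BY id HAVING COUNT(*) > 1;"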

    One workaround for this scenario is to use "--update-mode allowinsert". In your command, add --update-key <key_column> --update-mode allowinsert; note that --update-key takes the name of the key column. This ensures that if the key is already present in the table, the record gets updated, and if it is not, Sqoop does an insert. An example is sketched below.
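
    For example, the command from the question with the upsert parameters added. The key column id is a hypothetical placeholder, and whether allowinsert actually works depends on the JDBC connector in use; not every connector implements upsert, so verify this against your Sqoop and Postgres versions.

    sqoop export -Dmapred.job.queue.name=queuename \
    --connect jdbc:postgresql://ServerName/database_name \
    --username USER --password PWD \
    --table Tablename \
    --update-key id \
    --update-mode allowinsert \
    --input-fields-terminated-by "\001" --input-null-string "\\\\N" --input-null-non-string "\\\\N" \
    --num-mappers 24 \
    --export-dir $3/penet_baseline.txt -- --schema public;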

      February 2, 2022 1:45 PM IST