Sqoop export duplicates

  • Will Sqoop export create duplicates when the number of mappers is higher than the number of blocks in the source HDFS location?

    My source HDFS directory has 24 million records, and when I run a Sqoop export to a Postgres table, it somehow creates duplicate records. I have set the number of mappers to 24, and there are 12 blocks in the source location.

    Any idea why Sqoop is creating duplicates?

    • Sqoop Version: 1.4.5.2.2.9.2-1
    • Hadoop Version: 2.6.0.2.2.9.2-1

    Sqoop command used:

    sqoop export -Dmapred.job.queue.name=queuename \
    --connect jdbc:postgresql://ServerName/database_name \
    --username USER --password PWD \
    --table Tablename \
    --input-fields-terminated-by "\001" --input-null-string "\\\\N" --input-null-non-string "\\\\N" \
    --num-mappers 24 \
    --export-dir $3/penet_baseline.txt -- --schema public;
      January 8, 2022 3:21 PM IST
  • If you used Sqoop incremental mode, there may be duplicate records on HDFS. Before running the export to Postgres, collect the unique records into one table, keeping the latest row per key based on max(date or timestamp column), and then export from that table. I think that should work; a sketch of the deduplication step is below.
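
    A minimal sketch of that step, assuming the data sits in a Hive table. The table names (penet_baseline, penet_baseline_dedup) and the columns (id, payload, updated_at) are hypothetical placeholders; substitute your own key and timestamp columns.

    # Keep only the newest row per key before exporting (hypothetical table/column names).
    hive -e "
    CREATE TABLE penet_baseline_dedup AS
    SELECT id, payload, updated_at
    FROM (
        SELECT id, payload, updated_at,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
        FROM penet_baseline
    ) ranked
    WHERE rn = 1;"

    Then point --export-dir at the HDFS location of the deduplicated table instead of the original directory.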

     
      February 11, 2022 12:40 PM IST
  • @bagavathi, you mentioned that duplicate rows appear in the target table, that adding a PK constraint fails due to a PK violation, and that the source has no duplicate rows. One possible scenario is that the target table already contained records, perhaps left over from a previous incomplete Sqoop job. Check whether the target table already has keys that also exist in the source; a quick way to list duplicated keys is sketched below.
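
    Something like this against Postgres will show key values that occur more than once in the target table (Tablename and the connection details come from the question; the key column id is a hypothetical placeholder):

    # List key values that occur more than once in the target table.
    psql -h ServerName -U USER -d database_name \
      -c "SELECT id, COUNT(*) AS cnt FROM Tablename GROUP BY id HAVING COUNT(*) > 1;"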

    One workaround for this scenario is to use "--update-mode allowinsert". In your command, add --update-key <key_column> --update-mode allowinsert; note that --update-key takes the name of the key column. This ensures that if the key is already present in the table, the record gets updated, and if it is not, Sqoop does an insert. An example is sketched below.
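
    For example, the command from the question with the upsert parameters added. The key column id is a hypothetical placeholder, and whether allowinsert actually works depends on the JDBC connector in use; not every connector implements upsert, so verify this against your Sqoop and Postgres versions.

    sqoop export -Dmapred.job.queue.name=queuename \
    --connect jdbc:postgresql://ServerName/database_name \
    --username USER --password PWD \
    --table Tablename \
    --update-key id \
    --update-mode allowinsert \
    --input-fields-terminated-by "\001" --input-null-string "\\\\N" --input-null-non-string "\\\\N" \
    --num-mappers 24 \
    --export-dir $3/penet_baseline.txt -- --schema public;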

      February 2, 2022 1:45 PM IST