
Hadoop replication factor precedence

  • I have only this in my NameNode:

    <property> <name>dfs.replication</name> <value>3</value> </property>

    On my DataNodes, I have this:

    <property> <name>dfs.replication</name> <value>1</value> </property>

    Now my question is, will the replication factor be 3 or 1?

    At the moment, the output of hdfs dfs -ls hdfs:///user/hadoop-user/data/0/0/0 shows a replication factor of 1:

    -rw-r--r-- 1 hadoop-user supergroup 68313 2015-11-06 19:32 hdfs:///user/hadoop-user/data/0/0/0/00099954tnemhcatta.bin

    Appreciate your answer.

      June 11, 2019 4:28 PM IST
  • The units of HDFS replication are the blocks composing a file. When a new non-empty file is written, its blocks are replicated if the configuration says so. A file's replication factor can be set at several different points:

    • configuration entries - several properties configure replication:
      • dfs.replication - default replication factor
      • dfs.replication.max - maximum replication factor
      • dfs.namenode.replication.min - minimum block replication factor
    • at creation time - when a new file is created, we can define its replication factor. If this value is defined, it is used instead of the default dfs.replication
    • after file creation - we can also modify a previously set replication factor with the appropriate shell command (setrep).
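    The configuration points above can be exercised from the shell. A minimal sketch, assuming a running HDFS cluster; the file and directory names are made-up examples:

    ```shell
    # 1. Cluster default: dfs.replication in hdfs-site.xml (see the snippet above).

    # 2. At creation time: override the default for this write only.
    hdfs dfs -D dfs.replication=2 -put localfile.bin /user/hadoop-user/data/file.bin

    # 3. After creation: change the replication factor of the existing file.
    hdfs dfs -setrep 3 /user/hadoop-user/data/file.bin
    ```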

    A block is replicated under different conditions:

    • file creation - as already explained, file blocks are replicated with the default or a custom replication factor.
    • replication factor change - if the replication factor changes, block replication is triggered. This applies to both increases and decreases: in the first case, new replicas are added; in the second, extra replicas are removed. For the removal, the NameNode tries to keep the same number of racks holding replicas.
    • block in corrupted state - when a block is corrupted (at least 1 corrupt replica with at least 1 live replica), HDFS triggers replication for this block to guarantee its availability.
    • block is misreplicated - this means the block is not fault-tolerant, for example when all of its replicas are placed on the nodes of a single rack.
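    You can check whether a path contains under-replicated, mis-replicated or corrupt blocks with fsck (the path below is an example):

    ```shell
    # Report block health for a path; the summary includes counters for
    # under-replicated, mis-replicated and corrupt blocks, plus rack placement.
    hdfs fsck /user/hadoop-user/data -files -blocks -locations -racks
    ```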

    Thus, replica placement is rack-aware: the NameNode ensures that at least 1 replica is stored on a different rack than the other replicas. The choice of placement is made according to the DataNodes' state (available capacity, total capacity), and the NameNode tries to balance the work between DataNodes. For example, if a block is replicated twice, the two replicas are stored on DataNodes in different racks. If we then increase the number of replicas to 3, the 3rd replica is written to a node located in one of these 2 racks. Conversely, if we have 3 replicas and the replication factor decreases to 2, the extra replica is removed from the rack holding 2 replicas.

    Replication is executed through a replication pipeline. When a replica is written to the first DataNode, that DataNode forwards the block to the 2nd DataNode. When the 2nd DataNode finishes writing the block, it forwards it to the 3rd DataNode, and so on, until the last DataNode supposed to hold the block is reached.

    A good rule of thumb for choosing the number of replicas is to specify more replicas for frequently read or important files. This increases not only fault-tolerance but also read performance.

      September 6, 2021 1:41 PM IST
  • You can change the replication factor of a file using command:

    hdfs dfs -setrep -w 3 /user/hdfs/file.txt
    

    You can also change the replication factor of a directory using command:

    hdfs dfs -setrep -R 2 /user/hdfs/test
    


    But changing the replication factor of a directory only affects the files that exist at that moment; new files created under the directory afterwards get the cluster's default replication factor (dfs.replication from hdfs-site.xml).
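    A quick way to observe this behaviour (directory and file names are made up):

    ```shell
    # Existing files under the directory are changed to replication factor 2:
    hdfs dfs -setrep -R 2 /user/hdfs/test

    # A file uploaded afterwards still gets the cluster default (dfs.replication):
    hdfs dfs -put newfile.txt /user/hdfs/test/

    # The second column of the listing shows each file's replication factor:
    hdfs dfs -ls /user/hdfs/test
    ```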


    But you can temporarily override the HDFS default replication factor by passing:

    -D dfs.replication=1
    

     

    This works well when passed with a MapReduce job; the override then applies only to that specific job.
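    For example, with a job whose driver goes through ToolRunner (the jar and class names below are hypothetical), the generic -D option is picked up per job:

    ```shell
    # Only the output files this job writes get replication factor 1;
    # the cluster-wide dfs.replication setting is untouched.
    hadoop jar my-job.jar com.example.MyJob -D dfs.replication=1 /input /output
    ```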

     
      August 24, 2021 1:41 PM IST
  • By default the replication factor is 3, which is the standard in most distributed systems: with a replication factor of 3 (the HDFS default) there is one original block and two replicas. When working on a single-node cluster (a single machine), we usually set it to 1, because keeping 3 copies on the same machine brings no benefit. Put simply: in a multi-node cluster the replication factor should be 3 to survive failures, and on a single machine it should be 1.
      June 11, 2019 4:29 PM IST
  • Open the hdfs-site.xml file. This file is usually found in the conf/ folder of the Hadoop installation directory. Change or add the following property to hdfs-site.xml:
    <property> 
    <name>dfs.replication</name> 
    <value>3</value> 
    <description>Block Replication</description> 
    </property>
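    After editing hdfs-site.xml, you can confirm the value the client actually sees with getconf:

    ```shell
    # Prints the effective value of dfs.replication from the client's configuration.
    hdfs getconf -confKey dfs.replication
    ```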


    You can also change the replication factor on a per-file basis using the Hadoop FS shell.

    [jpanda@localhost ~]$ hadoop fs -setrep -w 3 /my/file

    Alternatively, you can change the replication factor of all the files under a directory.

    [jpanda@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir

      August 11, 2021 1:37 PM IST