I have this in my namenode only:
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
In my data nodes, I have this:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Now my question is, will the replication factor be 3 or 1?
At the moment, the output of hdfs dfs -ls hdfs:///user/hadoop-user/data/0/0/0 shows a replication factor of 1 (the second column):
-rw-r--r-- 1 hadoop-user supergroup 68313 2015-11-06 19:32 hdfs:///user/hadoop-user/data/0/0/0/00099954tnemhcatta.bin
Appreciate your answer.
The unit of replication in HDFS is the block: files are split into blocks, and it is blocks that get replicated. When a new, non-empty file is written, its blocks are replicated according to the replication factor in effect for that file, which can be set in several different places: the cluster-wide dfs.replication default, a per-file value supplied by the client at creation time, or a later change via the setrep command (shown below). Note that dfs.replication is a client-side setting: the configuration of the node that writes the file, not the NameNode's, decides the replication factor of a new file.
A block is replicated under different conditions: when its file is first written, when a replica is lost (for example, after a DataNode failure), or when the file's replication factor is increased.
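To check the replication factor currently in effect for a file, you can query it directly with the stat command and its %r format specifier (using the file from the question as an example):
hdfs dfs -stat "%r" hdfs:///user/hadoop-user/data/0/0/0/00099954tnemhcatta.bin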
Replica placement is rack-aware: the NameNode ensures that at least 1 replica is stored on a different rack than the other replicas. The choice of location is made according to the DataNodes' state (available capacity, total capacity), and the NameNode tries to balance the work between DataNodes. For example, if a block is replicated twice, the two replicas are stored on DataNodes in different racks. If we decide to increase the number of replicas to 3, the 3rd replica is written to a node located in one of those 2 racks. Conversely, if we have 3 replicas and the replication factor decreases to 2, the extra replica is removed from the rack holding 2 replicas.
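You can see where the replicas of a file's blocks actually landed with fsck (any HDFS path works; here it is the directory from the question):
hdfs fsck /user/hadoop-user/data/0/0/0 -files -blocks -locations
With -racks instead of -locations, the output also shows the rack of each replica's DataNode.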
Replication is executed through a replication pipeline. When a block is written to the first DataNode, that DataNode forwards the block to the 2nd DataNode. Once the 2nd DataNode finishes writing the block, it forwards it to the 3rd DataNode, and so on, until the last DataNode supposed to hold the block is reached.
A good rule of thumb for choosing the number of replicas is to specify more replicas for frequently read or important files. This increases not only fault tolerance but also read performance.
You can change the replication factor of a file using the command:
hdfs dfs -setrep -w 3 /user/hdfs/file.txt
You can also change the replication factor of a directory using the command:
hdfs dfs -setrep -R 2 /user/hdfs/test
But changing the replication factor of a directory affects only the existing files under it; new files created in the directory will still get the cluster's default replication factor (dfs.replication from hdfs-site.xml).
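A minimal sketch of that behavior, assuming a cluster default of 3 and a hypothetical newfile.txt:
hdfs dfs -setrep -R 2 /user/hdfs/test        # existing files under the directory now have replication 2
hdfs dfs -put newfile.txt /user/hdfs/test    # a new file is created with replication 3 (the default)
hdfs dfs -ls /user/hdfs/test                 # the second column shows each file's replication factor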
You can temporarily override the HDFS default replication factor (effectively turning replication off with a value of 1) by passing:
-D dfs.replication=1
This works well when you pass it with a Map/Reduce job; it applies to that job only.
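For example, assuming a job driver that uses ToolRunner (so that generic options like -D are parsed), a hypothetical invocation might look like this; the jar and class names are placeholders:
hadoop jar my-job.jar com.example.MyJob -D dfs.replication=1 /input /output
The same generic option also works with plain FS shell commands, e.g. writing a single file with a one-off replication factor:
hdfs dfs -D dfs.replication=1 -put localfile.txt /user/hadoop-user/data/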
The cluster-wide default is set with the dfs.replication property in hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Block Replication</description>
</property>
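To verify which default value the client will actually use, you can ask the configuration directly:
hdfs getconf -confKey dfs.replication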
You can also change the replication factor on a per-file basis using the Hadoop FS shell.
[jpanda@localhost ~]$ hadoop fs -setrep -w 3 /my/file
Alternatively, you can change the replication factor of all the files under a directory.
[jpanda@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir
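Afterwards you can confirm the change with a listing; as in the question's output, the number in the second column is the replication factor:
[jpanda@localhost ~]$ hadoop fs -ls /my/dir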