bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv | hadoop fs -put - /user/username/folder/output.csv
Unfortunately, there is no efficient way to merge multiple files into one (unless you want to look into Hadoop appending, but in your version of Hadoop that is disabled by default and potentially buggy) without copying the files to one machine and then back into HDFS, whether you do that with shell commands like the one above or in your own program.
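For comparison, here is the explicit round trip through a local temporary file, which the piped command above avoids (a sketch; the ./merged.csv temp path and the csv*.csv glob are assumptions, not from the original):

hadoop fs -getmerge /user/username/folder/csv*.csv ./merged.csv
hadoop fs -put ./merged.csv /user/username/folder/output.csv
rm ./merged.csv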
In order to merge two or more files into a single file and store it in HDFS, you need a folder in the HDFS path containing the files that you want to merge.
Here I have a folder named merge_files which contains the files that I want to merge.
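You can list the folder to confirm what it contains (the path matches the merge command below):

hadoop fs -ls /user/edureka_425640/merge_files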
Then you can execute the following command to merge the files and store the result in HDFS:
hadoop fs -cat /user/edureka_425640/merge_files/* | hadoop fs -put - /user/edureka_425640/merged_files
The merged_files file need not be created manually; the command above creates it automatically to store your output. You can view the merged result using the following command:
hadoop fs -cat merged_files
Suppose we have a folder containing several empty files along with some non-empty ones. To delete only the empty files, we can use the command below:
hdfs dfs -rm $(hdfs dfs -ls -R /user/A/ | grep -v "^d" | awk '{if ($5 == 0) print $8}')
Here I have a folder, temp_folder, with three files, two of them empty and one non-empty; after running the command, only the non-empty file remains. The same pipeline can also be written in stages, as shown below.
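A staged version of the same pipeline (a sketch; it assumes the folder lives at /user/A/temp_folder and that a GNU xargs with the -r flag is available):

hdfs dfs -ls -R /user/A/temp_folder | grep -v "^d" | awk '{if ($5 == 0) print $8}' | xargs -r hdfs dfs -rm

Here grep -v "^d" drops directories (their permission string starts with d), awk prints the path (field 8) of every entry whose size (field 5) is zero, and xargs -r passes those paths to hdfs dfs -rm, doing nothing if the list is empty.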
The Hadoop -getmerge command is used to merge multiple files in HDFS (Hadoop Distributed File System) and write them to one single output file in our local file system.
We want to merge the two files present in our HDFS, file1.txt and file2.txt, into a single file, output.txt, in our local file system.
Step 1: Let’s look at the contents of file1.txt and file2.txt that are available in our HDFS.
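Assuming the files live under /Hadoop_File (as set up just below), you can print their contents with -cat:

hdfs dfs -cat /Hadoop_File/file1.txt
hdfs dfs -cat /Hadoop_File/file2.txt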
In this case, we have copied both of these files into the Hadoop_File folder in HDFS. If you don’t know how to create the directory and copy files to HDFS, use the commands below.
hdfs dfs -mkdir /Hadoop_File
hdfs dfs -copyFromLocal /home/dikshant/Documents/hadoop_file/file1.txt /home/dikshant/Documents/hadoop_file/file2.txt /Hadoop_File
Both files are now inside the /Hadoop_File directory in HDFS.
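You can verify the copy with a listing:

hdfs dfs -ls /Hadoop_File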
Step 2: Now use the -getmerge command to merge these files into a single output file in our local file system, following the procedure below.
Syntax:
hdfs dfs -getmerge -nl /path1 /path2 ... /pathN /destination
-nl adds a newline character at the end of each file, so the contents of the n source files are separated by newlines in the merged output. In this case we merge the files into the hadoop_file folder inside the Documents directory.
hdfs dfs -getmerge -nl /Hadoop_File/file1.txt /Hadoop_File/file2.txt /home/dikshant/Documents/hadoop_file/output.txt
Now let’s check whether the files were merged into output.txt.
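Since output.txt is an ordinary local file (the destination used above), a plain cat is enough:

cat /home/dikshant/Documents/hadoop_file/output.txt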
If the contents of file1.txt and file2.txt both appear, the files were merged successfully into our output.txt file.