
merging two files in hadoop

  • I am a newbie to the Hadoop framework, so it would help me if someone could guide me through this. I have two types of files:

    dirA/ --> file_a, file_b, file_c

    dirB/ --> another_file_a, another_file_b...

    Files in directory A contain transaction information.

    So something like:

    id, time_stamp
    1 , some_time_stamp
    2 , some_another_time_stamp
    1 , another_time_stamp

    This kind of information is scattered across all the files in dirA. The first thing to do is: given a time frame (let's say last week), I want to find all the unique ids that appear within that time frame.

    Then, save that result to a file.

    Now, the dirB files contain the address information. Something like:

    id, address, zip code
    1, fooadd, 12345
    and so on

    So, I take all the unique ids output in the first step as input and then find the address and zip code for each.

    Basically, the final output is like a SQL merge (join).

    Find all the unique ids within a time frame and then merge in the address information.

    I would greatly appreciate any help. Thanks
      June 12, 2019 11:43 AM IST
    0
  • The error comes from trying to redirect the standard output of the command back to HDFS. There are ways to do this, though, using the hadoop fs -put command with the source argument being a hyphen (so it reads from standard input):

    bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv | hadoop fs -put - /user/username/folder/output.csv

    -getmerge also outputs to the local file system, not HDFS.
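
    If you do want to go through the local file system, a -getmerge followed by -put round trip also works (a rough sketch, reusing the example paths above; adjust them to your layout):

    # merge the HDFS files into one local file...
    hadoop fs -getmerge /user/username/folder/csv*.csv /tmp/merged.csv
    # ...then copy that local file back into HDFS
    hadoop fs -put /tmp/merged.csv /user/username/folder/output.csv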

    Unfortunately there is no efficient way to merge multiple files into one (unless you want to look into Hadoop 'appending', but in your version of Hadoop that is disabled by default and potentially buggy) without copying the files to one machine and then back into HDFS, whether you do that with:

    • a custom MapReduce job with a single reducer and a custom mapper/reducer that retains the file ordering (remember each line will be sorted by the keys, so your key will need to be some combination of the input file name and line number, and the value will be the line itself), or
    • the FsShell commands, depending on your network topology - i.e. does your client console have a good-speed connection to the datanodes? This is certainly the least effort on your part, and will probably complete quicker than an MR job doing the same (since everything has to go through one machine anyway, so why not your local console?).
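
    On the 'appending' option mentioned above: newer Hadoop releases also ship hdfs dfs -appendToFile, which appends one or more local files (or standard input) to an existing HDFS file, provided append support is enabled on your cluster. A rough sketch with placeholder paths:

    # append a local file to an existing file in HDFS (requires append to be enabled)
    hdfs dfs -appendToFile /tmp/extra_rows.csv /user/username/folder/output.csv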
      September 6, 2021 1:38 PM IST
    0
  • In order to merge two or more files into a single file and store it in HDFS, you need a folder in the HDFS path containing the files that you want to merge.

    Here, I have a folder named merge_files which contains the files that I want to merge.

    Then you can execute the following command to merge the files and store the result in HDFS:

    hadoop fs -cat /user/edureka_425640/merge_files/* | hadoop fs -put - /user/edureka_425640/merged_files
    


    The merged_files file need not be created manually; it will be created automatically to store your output when you run the above command. You can view your output using the following command (here, merged_files holds my merged result):

    hadoop fs -cat merged_files
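
    If you first want to confirm that the merged file exists and check its size, you can list it (using the same absolute path as in the command above):

    hadoop fs -ls /user/edureka_425640/merged_files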
    

     

    Suppose we have a folder with multiple empty files and some non-empty files, and we want to delete the files that are empty. We can use the command below:

    hdfs dfs -rm $(hdfs dfs -ls -R /user/A/ | grep -v "^d" | awk '{if ($5 == 0) print $8}')
    


    Here I have a folder, temp_folder, with three files: two are empty and one is non-empty.
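
    Before deleting anything, you can run just the listing part of that pipeline to preview which files would be removed (this relies on the default -ls -R output layout, where the 5th column is the file size and the 8th is the path):

    # print only zero-byte files; grep -v "^d" drops directory entries
    hdfs dfs -ls -R /user/A/ | grep -v "^d" | awk '{if ($5 == 0) print $8}'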

      August 24, 2021 1:45 PM IST
    0
  • The Hadoop -getmerge command is used to merge multiple files in HDFS (Hadoop Distributed File System) into one single output file in our local file system.

    We want to merge the two files present inside our HDFS, i.e. file1.txt and file2.txt, into a single file output.txt in our local file system.

    Steps To Use -getmerge Command

    Step 1: Let’s see the content of file1.txt and file2.txt that are available in our HDFS.

    [Image: Content of File1.txt]

    [Image: Content of File2.txt]

    In this case, both of these files have been copied into my HDFS in the /Hadoop_File folder. If you don’t know how to make the directory and copy files to HDFS, follow the commands below.

    • Making the /Hadoop_File directory in our HDFS
      hdfs dfs -mkdir /Hadoop_File
    • Copying files to HDFS

      hdfs dfs -copyFromLocal /home/dikshant/Documents/hadoop_file/file1.txt /home/dikshant/Documents/hadoop_file/file2.txt /Hadoop_File


    Both of these files are now inside my /Hadoop_File directory in HDFS.

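    To confirm, you can list the directory (a quick check, using the same /Hadoop_File path as above):

    hdfs dfs -ls /Hadoop_File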

    Step 2: Now it’s time to use the -getmerge command to merge these files into a single output file in our local file system. For that, follow the procedure below.

    Syntax:

    hdfs dfs -getmerge -nl /path1 /path2 ... /path_n /local_destination

    -nl is used for adding a newline; it will add a newline between the contents of the n files. In this case we merge them into output.txt inside the hadoop_file folder in my Documents directory.

    hdfs dfs -getmerge -nl /Hadoop_File/file1.txt /Hadoop_File/file2.txt /home/dikshant/Documents/hadoop_file/output.txt


    Now let’s see whether the files got merged into the output.txt file or not.


    You can easily verify that the files were merged successfully into our output.txt file.
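
    A quick way to verify is to print the merged file from the local file system (assuming the same output path used in the -getmerge command above):

    cat /home/dikshant/Documents/hadoop_file/output.txt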

      August 11, 2021 1:35 PM IST
    0

  • You tagged this as pig, so I'm guessing you're looking to use it to accomplish this? If so, I think that's a great choice - this is really easy in pig!

    -- load the transaction data (dirA) and the address data (dirB)
    times = LOAD 'dirA' USING PigStorage(', ') AS (id:int, time:long);
    addresses = LOAD 'dirB' USING PigStorage(', ') AS (id:int, address:chararray, zipcode:chararray);
    -- keep only the rows that fall inside the requested time frame
    filtered_times = FILTER times BY (time >= $START_TIME) AND (time <= $END_TIME);
    -- project the ids and de-duplicate them
    just_ids = FOREACH filtered_times GENERATE id;
    distinct_ids = DISTINCT just_ids;
    -- join the unique ids back to the address data
    result = JOIN distinct_ids BY id, addresses BY id;

    Where $START_TIME and $END_TIME are parameters you can pass to the script.
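
    For example, one way to pass them when running the script (the script name and timestamp values here are just placeholders):

    pig -param START_TIME=1559347200 -param END_TIME=1559952000 merge_ids.pig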
      June 12, 2019 11:44 AM IST
    0