I am currently starting a project titled "Cloud computing for time series mining algorithms using Hadoop". The data I have is HDF files totalling over a terabyte. As far as I know, Hadoop expects text files as input for further processing (the map-reduce task). So one option is to convert all my .hdf files to text files, which is going to take a lot of time.
Alternatively, I could find a way to use the raw HDF files directly in map-reduce programs. So far I have not been successful in finding any Java code that reads HDF files and extracts data from them. If somebody has a better idea of how to work with HDF files, I would really appreciate the help.
Thanks, Ayush
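For the conversion route, a minimal sketch of dumping one dataset to text with the h5py Python library (I have not found a Java equivalent yet); the file and dataset names below are placeholders, and it assumes a 2-D numeric dataset:

```python
# Sketch: dump an HDF5 dataset to tab-separated text so it can be fed to a
# map-reduce job as ordinary text input. Assumes the h5py library and a 2-D
# numeric dataset; names are placeholders, not taken from my actual data.
import h5py

def hdf5_to_text(hdf_path, dataset_name, out_path):
    with h5py.File(hdf_path, "r") as f:
        data = f[dataset_name][()]  # read the whole dataset into memory
    with open(out_path, "w") as out:
        for row in data:
            out.write("\t".join(str(v) for v in row) + "\n")
```

For terabyte-scale files this would need to be run per-dataset and streamed rather than read fully into memory, but it shows the shape of the conversion.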
I have this only in my namenode:
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
In my data nodes, I have this:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
Now my question is, will the replication factor be 3 or 1?
At the moment, the output of hdfs dfs -ls hdfs:///user/hadoop-user/data/0/0/0 shows a replication factor of 1:
-rw-r--r-- 1 hadoop-user supergroup 68313 2015-11-06 19:32 hdfs:///user/hadoop-user/data/0/0/0/00099954tnemhcatta.bin
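For reference, the per-file replication can also be inspected or changed directly with standard HDFS commands (assuming a running cluster; the path is the one from my listing above):

```shell
# Print just the replication factor recorded for a single file (%r)
hdfs dfs -stat %r /user/hadoop-user/data/0/0/0/00099954tnemhcatta.bin

# Change replication for existing files under a directory and wait
# until re-replication completes
hdfs dfs -setrep -w 3 /user/hadoop-user/data/0/0/0
```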
I am a newbie to the Hadoop framework, so it would help me if someone can guide me through this. I have two types of files:

dirA/ --> file_a, file_b, file_c
dirB/ --> another_file_a, another_file_b, ...

Files in directory A contain transaction information, something like:

id, time_stamp
1, some_time_stamp
2, some_another_time_stamp
1, another_time_stamp

This kind of information is scattered across all the files in dirA. The first step is: I give a time frame (let's say last week) and I want to find all the unique ids present within that time frame, and save them to a file.

The dirB files contain address information, something like:

id, address, zip code
1, fooadd, 12345

and so on. I then take the unique ids output by the first step as input and look up the address and zip code for each. Basically the final output is like an SQL merge: find all the unique ids within a time frame, then merge in the address information.

I would greatly appreciate any help. Thanks
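To make the intended flow concrete, here is the two-stage logic in plain Python (on the cluster each stage would become its own MapReduce job, the second typically as a reduce-side join keyed on id). The column layouts are the ones described above; timestamps are treated as integers purely for the sketch:

```python
# Stage 1: scan dirA-style transaction lines "id, timestamp" and collect the
# unique ids whose timestamp falls inside [start, end].
def unique_ids_in_window(transaction_lines, start, end):
    ids = set()
    for line in transaction_lines:
        rec_id, ts = (field.strip() for field in line.split(","))
        if start <= int(ts) <= end:
            ids.add(rec_id)
    return ids

# Stage 2: scan dirB-style address lines "id, address, zip" and keep only the
# records whose id was selected in stage 1 (the "SQL merge" step).
def join_addresses(ids, address_lines):
    out = []
    for line in address_lines:
        rec_id, address, zip_code = (field.strip() for field in line.split(","))
        if rec_id in ids:
            out.append((rec_id, address, zip_code))
    return out
```

In the MapReduce version, stage 1's output file of ids would either be broadcast to stage 2's mappers (a map-side join, if the id set is small) or both inputs would be tagged and shuffled on id for a reduce-side join.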