Section 1: Installation and Configuration
In this case, the Ubuntu machine serves as both master and slave, and the CentOS machine as a slave only. The daemons therefore run as follows:
Daemons on the master node (Ubuntu): NameNode, SecondaryNameNode, DataNode, TaskTracker, JobTracker
Daemons on the slave node (CentOS): DataNode, TaskTracker
1. Download and install Hadoop on both computers, making sure they are installed into the same location (highly recommended), by following http://b2ctran.wordpress.com/2013/08/26/install-hadoop-1-1-2-on-ubuntu-12-04/.
2. Edit /etc/hosts on each computer to add the host names, and make sure the machines can ping each other by host name.
2.1 On the master node, make sure your host name is listed after your external IP, not 127.0.0.1, so that the DataNodes can find the NameNode.
Use sudo netstat -tuplen to verify.
Hosts file for the master node (Ubuntu):
127.0.0.1       localhost
184.108.40.206  ncw01111.test.com ncw01111 # added here to prevent binding to the loopback interface
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Hosts file for the slave node (CentOS):
127.0.0.1       ncw02222 localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
220.127.116.11  ncw01111.test.com ncw01111
# I added the master node's IP and host name here when the slave could not ping the master;
# at that time the Ubuntu host name had not been added to the DNS server yet.
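The loopback check in 2.1 can be automated. The following is a minimal Python sketch (ncw01111 is this tutorial's master host name; substitute your own) that warns when a host name is bound to a loopback address in /etc/hosts:

```python
# Minimal sketch: warn if a host name is bound to a loopback address in
# /etc/hosts, which would make the NameNode listen on 127.0.0.1 only.
# "ncw01111" is an assumption taken from this tutorial's setup.

def loopback_bound(hosts_text, hostname):
    """Return True if hostname appears on a 127.0.0.1 or 127.0.1.1 line."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, tokenize
        if fields and fields[0] in ("127.0.0.1", "127.0.1.1") and hostname in fields[1:]:
            return True
    return False

if __name__ == "__main__":
    with open("/etc/hosts") as f:
        if loopback_bound(f.read(), "ncw01111"):
            print("WARNING: ncw01111 resolves to loopback; DataNodes cannot reach it")
        else:
            print("OK: ncw01111 is not bound to loopback")
```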
2.2 Set up passwordless SSH between the computers so they can talk to each other without a password.
3. Configure the master node (Ubuntu in this case):
3.1 Edit the masters and slaves files located in the conf folder (NOTE: on the master node only):
# Edit the masters file
sudo vim $HADOOP_PREFIX/conf/masters
# then add ncw01111 into the file
# Edit the slaves file
sudo vim $HADOOP_PREFIX/conf/slaves
# then add ncw01111 and ncw02222 into the file, one host name per line
4. Edit the configuration files: core-site.xml, mapred-site.xml, and hdfs-site.xml are the same on all computers within the cluster.
core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/bee/projects/hdfs_test</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ncw01111:54310</value>
    <final>true</final>
  </property>
</configuration>
mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ncw01111:54311</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/bee/projects/mapred/system</value>
    <final>true</final>
  </property>
</configuration>
hdfs-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
5. Format the HDFS file system (run this on the master node):
hadoop namenode -format
6. Start the cluster
start-all.sh
# or, use the following start-up sequence
start-dfs.sh
start-mapred.sh
Then use jps to check the daemons:
# master node, Ubuntu
9700 SecondaryNameNode
10093 TaskTracker
9169 NameNode
10367 Jps
9808 JobTracker
9432 DataNode
# slave node, CentOS
26392 Jps
26201 TaskTracker
26095 DataNode
Please note that correct jps output does not guarantee that your cluster is problem-free. I once had an issue where the DataNode could not connect to the NameNode (Problem 1 in the Problem and Solution section below), yet jps still produced the expected output.
Therefore, always check your log files for problems.
If you can see all the nodes of your cluster at http://ncw01111:50030/machines.jsp?type=active, your cluster is probably in good shape.
7. Put data into cluster and run a MapReduce job
I used 99 years of NOAA data as an example, 263 GB in total. The records were preprocessed into a single file per year, e.g. 1901.all.
7.1 Put the data into the cluster, compressing each year's file on the fly (here $year holds the path to one year's file, e.g. 1901.all):
gzip -c $year | hadoop fs -put - noaa/cluster/`basename $year.gz`
7.2 Run the Python version of the MapReduce job on the NOAA data:
# do not use relative paths in the following command (see Problem and Solution #2)
hadoop jar /opt/hadoop-1.1.2/contrib/streaming/hadoop-streaming-1.1.2.jar \
    -mapper /home/bee/projects/hadoop/max_temperature_map.py \
    -reducer /home/bee/projects/hadoop/max_temperature_reduce.py \
    -input /user/bee/noaa/cluster/ \
    -output /user/bee/noaa/cluster_out
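The two scripts referenced above are not shown in this post. The following is a hedged sketch of the logic such a max-temperature mapper and reducer typically contain; the fixed-width field offsets (year at columns 16-19, signed temperature at 88-92, quality flag at 93) are assumptions about the standard NCDC record format, not taken from this post:

```python
# Sketch of the mapper/reducer logic for the max-temperature job.
# ASSUMED record layout (standard NCDC fixed-width format): year at
# val[15:19], signed temperature (tenths of a degree) at val[87:92],
# quality flag at val[92].
import re
from itertools import groupby

def map_records(lines):
    """Mapper: emit (year, temp) for every valid reading."""
    for line in lines:
        val = line.strip()
        if len(val) < 93:
            continue  # skip malformed records
        year, temp, quality = val[15:19], val[87:92], val[92]
        if temp != "+9999" and re.match(r"[01459]", quality):
            yield year, int(temp)

def reduce_records(pairs):
    """Reducer: pairs arrive sorted by key, as Hadoop streaming guarantees."""
    for year, group in groupby(pairs, key=lambda kv: kv[0]):
        yield year, max(temp for _, temp in group)
```

In the real scripts, the mapper prints tab-separated "year	temp" lines to stdout and the reducer reads the sorted lines back from stdin; that text interface is how streaming wires the two together.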
7.3 Running time comparison for the 99 years of NOAA data:
Single node (Ubuntu only): Finished in: 59mins, 42sec
Two Nodes Cluster (Ubuntu and CentOS): Finished in: 60mins, 12sec
Two Nodes Cluster (Ubuntu and CentOS, using a combiner): Finished in: 34mins, 12sec
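The combiner speed-up in the last row comes from max being associative and commutative: partial maxima can be computed map-side and merged later without changing the answer, so in Hadoop streaming the same reducer script can be passed again via the -combiner option. A small sketch (with made-up sample readings) of why this is safe:

```python
# Why a max reducer can double as a combiner: computing per-year maxima
# over partial splits, then over the partials, gives the same answer as
# one pass over all the data. (Sample readings below are hypothetical.)

def max_by_year(pairs):
    out = {}
    for year, temp in pairs:
        out[year] = max(temp, out.get(year, temp))
    return out

readings = [("1901", 300), ("1901", 317), ("1902", 200), ("1902", 244)]
# Two "map tasks", each running the combiner locally:
partial_1 = max_by_year(readings[:2])
partial_2 = max_by_year(readings[2:])
# The reducer merges the (much smaller) partial results:
merged = max_by_year(list(partial_1.items()) + list(partial_2.items()))
assert merged == max_by_year(readings)  # identical to a single pass
```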
7.4 Check the results in HDFS
hadoop dfs -tail noaa/cluster_out/part-00000
1901 317
1902 244
...
1999 568
7.5 Retrieve the job results from HDFS
hadoop dfs -copyToLocal noaa/cluster_out/part-00000 /home/bee/projects/hadoop/output/cluster_out
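The retrieved part file is plain text with one pair per line; a minimal Python sketch for loading it back (assuming the default tab separator between key and value):

```python
# Parse a streaming job's output file: one "year<TAB>max_temp" per line.

def load_results(text):
    results = {}
    for line in text.splitlines():
        if line.strip():
            year, temp = line.split("\t")
            results[year] = int(temp)
    return results

# e.g.:
# with open("/home/bee/projects/hadoop/output/cluster_out") as f:
#     results = load_results(f.read())
```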
Section 2: Problem and Solution
1. DataNode cannot connect to the NameNode:
2013-09-05 15:12:26,822 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ncw01111/18.104.22.168:54310. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
When you check the open ports with netstat, you find that port 54310 is bound to the loopback IP (127.0.0.1):
tcp6 0 0 127.0.0.1:54310 :::* LISTEN 1001 624829 24900/java
Solution: check your /etc/hosts file and make sure your host name is not listed after 127.0.0.1 or 127.0.1.1.
Put your external IP and host name pair in the hosts file if it does not exist.
Then check the port again:
tcp6 0 0 22.214.171.124:54310 :::* LISTEN 1001 624829 24900/java
2. "# of failed Map Tasks exceeded allowed limit" error:
ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
Solution: go to master:50030/jobdetails.jsp?jobid=XXXXX (you can find the job id in your Hadoop output) to look for the real error.
After debugging, I found an I/O error in my case. Fix: replace any relative paths in your job command with absolute paths when using Hadoop streaming with a Python mapper and reducer.