Thursday, May 27, 2010

Updating Cloudera Hadoop To Latest Patch On CentOS

Warning: The steps below should only be used when updating Hadoop within the same major version, for instance from 0.20.1+169.56 to 0.20.1+169.88. They should not be used for an upgrade (for example, going from 0.20 to 0.21).

We have a small cluster in place, so the steps below work for us. Larger clusters will need a more automated approach.

Here are the step-by-step instructions. Before starting, ensure that no map/reduce jobs or HDFS write jobs are running.

Step 1: Create a screen session on all hadoop nodes
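
A named screen session makes it easy to reattach if the SSH connection drops. For example (the session name here is just an illustration):

- start a named screen session
screen -S hadoop-update

- reattach later if the connection drops
screen -r hadoop-update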

Step 2: Prepare for any disaster that might happen

On the master node where the NameNode is currently running, take a fresh checkpoint. This will serve as the last image to recover from if things go wrong. To take a manual checkpoint you need to enter safe mode; the series of commands below shows how.


- check current mode
[sudhirv@hadoop-cluster-3 bin]$ hadoop dfsadmin -safemode get
Safe mode is OFF

- enter safe mode
[sudhirv@hadoop-cluster-3 bin]$ hadoop dfsadmin -safemode enter
Safe mode is ON

- check safe mode is ON
[sudhirv@hadoop-cluster-3 bin]$ hadoop dfsadmin -safemode get
Safe mode is ON

- run manual checkpoint
[sudhirv@hadoop-cluster-1 ~]$ hadoop dfsadmin -saveNamespace

- ensure a checkpoint was taken by looking at the image timestamp
[sudhirv@hadoop-cluster-1 ~]$ ls -ltr /var/lib/hadoop-0.20/cache/hadoop/dfs/name/image
total 4
-rw-rw-r-- 1 hadoop hadoop 157 May 26 16:49 fsimage

- leave safe mode
[sudhirv@hadoop-cluster-1 ~]$ hadoop dfsadmin -safemode leave
Safe mode is OFF

- confirm safe mode is off
[sudhirv@hadoop-cluster-1 ~]$ hadoop dfsadmin -safemode get
Safe mode is OFF


Step 3: Record the current hadoop version and configuration version. The hadoop version will change after the update; the configuration version should stay the same.


sudo su -

[root@hadoop-cluster-3 ~]# hadoop version
Hadoop 0.20.1+169.56
Subversion -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3
Compiled by root on Tue Feb 9 13:40:08 EST 2010

[root@hadoop-cluster-3 ~]# alternatives --display hadoop-0.20-conf
hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.cluster
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.pseudo - priority 30
/etc/hadoop-0.20/conf.cluster - priority 50
Current `best' version is /etc/hadoop-0.20/conf.cluster.


Step 4: Shut down the hadoop services on the node and check to ensure they are stopped. It is recommended to start with the nodes that host the NameNode and JobTracker, as a version mismatch between the NN/JT and the DN/TT won't work in most cases.
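
Assuming the standard Cloudera init scripts were installed with the packages, stopping and verifying the daemons looks roughly like this; run only the services that actually exist on the node:

- on the master, stop the NameNode and JobTracker
sudo service hadoop-0.20-namenode stop
sudo service hadoop-0.20-jobtracker stop

- on worker nodes, stop the DataNode and TaskTracker instead
sudo service hadoop-0.20-datanode stop
sudo service hadoop-0.20-tasktracker stop

- confirm no hadoop java processes are left running
ps -ef | grep hadoop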

Step 5: Check that the new hadoop update patch shows up in the yum update list. If it does not, you will need to configure the Cloudera repo correctly.



yum list updates
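
If the update list is long, you can narrow it to the Hadoop packages with a package glob (output will vary with your repo setup):

yum list updates "hadoop-0.20*"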


Step 6: Run the update



sudo su -
[root@hadoop-cluster-3 ~]# yum update hadoop-0.20


Step 7: Check the updated version to ensure the update was applied and that the config alternatives were not changed.



[root@hadoop-cluster-3 ~]# hadoop version
Hadoop 0.20.1+169.88
Subversion -r ded54b29979a93e0f3ed773175cafd16b72511ba
Compiled by root on Thu May 20 22:22:39 EDT 2010

[root@hadoop-cluster-3 ~]# alternatives --display hadoop-0.20-conf
hadoop-0.20-conf - status is auto.
link currently points to /etc/hadoop-0.20/conf.cluster
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.pseudo - priority 30
/etc/hadoop-0.20/conf.cluster - priority 50
Current `best' version is /etc/hadoop-0.20/conf.cluster.


Step 8: Start the hadoop services on the node and tail the logs to ensure no errors are being reported.
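
Assuming the same Cloudera init scripts and the default CDH log location, this looks roughly like the following on the master (the log directory and file names are assumptions; adjust to your install):

- start the daemons that belong on this node
sudo service hadoop-0.20-namenode start
sudo service hadoop-0.20-jobtracker start

- watch the logs for errors while the daemons come up
tail -f /var/log/hadoop-0.20/*namenode*.log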

Wednesday, May 19, 2010

HDFS Relative path error when running map/reduce jobs

Recently, while running a map/reduce job, we encountered this error:


java.io.IOException: Can not get the relative path:
base = hdfs://hadoop-master:8020/5/_temporary/_attempt_201005150057_0006_r_000000_0
child = hdfs://hadoop-master/5/_temporary/_attempt_201005150057_0006_r_000000_0/part-r-00000
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getFinalPath(FileOutputCommitter.java:200)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:146)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:165)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:118)
at org.apache.hadoop.mapred.Task.commit(Task.java:779)
at org.apache.hadoop.mapred.Task.done(Task.java:691)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
at org.apache.hadoop.mapred.Child.main(Child.java:170)


The job kept failing across many retry attempts, and the JobTracker and TaskTracker logs continuously dumped this error. It is thrown from FileOutputCommitter.getFinalPath(), the top frame of the stack trace above. Note that the base path includes the :8020 port while the child path does not, which is why the committer cannot compute a relative path between them.

I researched this for quite a while and finally found this excerpt from the Pro Hadoop book: "An example of an HDFS URI is hdfs://NamenodeHost[:8020]/. The file system protocol is hdfs, the host to contact for services is NamenodeHost, and the port to connect to is 8020, which is the default port for HDFS. If the default 8020 port is used, the URI may be simplified as hdfs://NamenodeHost/. This value may be altered by individual jobs. You can choose an arbitrary port for the hdfs NameNode."

Currently our "fs.default.name" parameter in the core-site.xml file is set to "hdfs://hadoop-master.ic:8020". Based on the above text, I went ahead and removed the default port from the parameter and restarted the NameNode. Ran the job again and voila, it works now.
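
For reference, this is roughly what the property looks like in our core-site.xml after dropping the port (the hostname is ours; yours will differ):

<property>
  <name>fs.default.name</name>
  <value>hdfs://hadoop-master.ic</value>
</property>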

Sunday, May 02, 2010

Fluid app opens default browser on sign-in

I am a big fan of Fluid and use it to create apps for most of my email/social sites. I also have one created for Campfire, which I use to communicate with my team. Recently my Campfire app stopped working (not sure if this happened after I upgraded to a newer Fluid version or because the Campfire devs started using a single sign-on service), in the sense that when I clicked on 'Signin' it would open my default browser and forward me to "https://launchpad.37signals.com/signin". On further googling I found that this was happening because the Campfire Fluid app did not have permission to allow other site domains, meaning the sign-in process was forwarding to a different domain than the Campfire-specific one. For instance, my Campfire domain is "https://icrossing.campfirenow.com" while sign-in redirects to "https://launchpad.37signals.com/signin" for authentication. All it needed to work was setting the app permission to allow browsing to other domains.

Go to the Fluid app -> Preferences -> Advanced and enable the option that allows browsing to other domains.