Tuesday, August 24, 2010

Namenode editlog (dfs.name.dir) multiple storage directories

Recently a question was posted on the Apache Hadoop user mailing list regarding the behavior when multiple storage directories are specified for the property "dfs.name.dir". The documentation on the official site is not clear about what happens if one of the specified storage directories becomes inaccessible, whereas the documentation for other properties such as "dfs.data.dir" and "mapred.local.dir" clearly states that the system will ignore any directory that is inaccessible. See the link below for reference:

http://hadoop.apache.org/common/docs/current/hdfs-default.html

There were many comments on the thread; see the link below for the complete post:

http://lucene.472066.n3.nabble.com/what-will-happen-if-a-backup-name-node-folder-becomes-unaccessible-td1253293.html#a1253293

In this post I summarize my tests, run on the Cloudera distribution, to verify the behavior: the namenode ignores any storage directory that becomes inaccessible and bails out only when it can no longer write to any of the specified directories. The series of tests below is pretty much self-explanatory.




hadoop@training-vm:~$ hadoop version
Hadoop 0.20.1+152
Subversion -r c15291d10caa19c2355f437936c7678d537adf94
Compiled by root on Mon Nov 2 05:15:37 UTC 2009

hadoop@training-vm:~$ jps
8923 Jps
8548 JobTracker
8467 SecondaryNameNode
8250 NameNode
8357 DataNode
8642 TaskTracker

hadoop@training-vm:~$ /usr/lib/hadoop/bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode

hadoop@training-vm:~$ mkdir edit_log_dir1

hadoop@training-vm:~$ mkdir edit_log_dir2

hadoop@training-vm:~$ ls
edit_log_dir1 edit_log_dir2

hadoop@training-vm:~$ ls -ltr /var/lib/hadoop-0.20/cache/hadoop/dfs/name
total 8
drwxr-xr-x 2 hadoop hadoop 4096 2009-10-15 16:17 image
drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 15:56 current

hadoop@training-vm:~$ cp -r /var/lib/hadoop-0.20/cache/hadoop/dfs/name edit_log_dir1

hadoop@training-vm:~$ cp -r /var/lib/hadoop-0.20/cache/hadoop/dfs/name edit_log_dir2
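
A quick sanity check (not part of the original session) is to list the copied metadata; because the source directory itself was copied in, the files land under a name/ subdirectory inside each new location:

# expect to see fsimage, edits, fstime and VERSION in each
ls -l edit_log_dir1/name/current edit_log_dir2/name/current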

------ hdfs-site.xml with the new dirs added

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
<name>dfs.name.dir</name>
<value>/var/lib/hadoop-0.20/cache/hadoop/dfs/name,/home/hadoop/edit_log_dir1,/home/hadoop/edit_log_dir2</value>
</property>
<property>
<name>fs.checkpoint.period</name>
<value>600</value>
</property>
<property>
<name>dfs.namenode.plugins</name>
<value>org.apache.hadoop.thriftfs.NamenodePlugin</value>
</property>
<property>
<name>dfs.datanode.plugins</name>
<value>org.apache.hadoop.thriftfs.DatanodePlugin</value>
</property>
<property>
<name>dfs.thrift.address</name>
<value>0.0.0.0:9090</value>
</property>
</configuration>
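
Note that the value of dfs.name.dir is a comma-separated list of paths with no spaces in between. Before starting the daemons, a quick pre-flight check along these lines (just a sketch, using the paths from the config above) confirms each directory exists and is writable by the hadoop user:

for d in /var/lib/hadoop-0.20/cache/hadoop/dfs/name \
         /home/hadoop/edit_log_dir1 \
         /home/hadoop/edit_log_dir2; do
  if [ -d "$d" ] && [ -w "$d" ]; then
    echo "OK       $d"
  else
    echo "MISSING  $d"   # not a directory or not writable by this user
  fi
done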

---- start all daemons

hadoop@training-vm:~$ /usr/lib/hadoop/bin/start-all.sh
starting namenode, logging to
/usr/lib/hadoop/bin/../logs/hadoop-hadoop-namenode-training-vm.out
localhost: starting datanode, logging to
/usr/lib/hadoop/bin/../logs/hadoop-hadoop-datanode-training-vm.out
localhost: starting secondarynamenode, logging to
/usr/lib/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-training-vm.out
starting jobtracker, logging to
/usr/lib/hadoop/bin/../logs/hadoop-hadoop-jobtracker-training-vm.out
localhost: starting tasktracker, logging to
/usr/lib/hadoop/bin/../logs/hadoop-hadoop-tasktracker-training-vm.out


-------- namenode log confirms all dirs are in use

2010-08-24 16:20:48,718 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = training-vm/127.0.0.1
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.1+152
STARTUP_MSG: build = -r c15291d10caa19c2355f437936c7678d537adf94;
compiled by 'root' on Mon Nov 2 05:15:37 UTC 2009
************************************************************/
2010-08-24 16:20:48,815 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
Initializing RPC Metrics with hostName=NameNode, port=8022
2010-08-24 16:20:48,819 INFO org.apache.hadoop.hdfs.server.namenode.NameNode:
Namenode up at: localhost/127.0.0.1:8022
2010-08-24 16:20:48,821 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=NameNode, sessionId=null
2010-08-24 16:20:48,822 INFO
org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics:
Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
2010-08-24 16:20:48,894 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
fsOwner=hadoop,hadoop
2010-08-24 16:20:48,894 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
supergroup=supergroup
2010-08-24 16:20:48,894 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
isPermissionEnabled=false
2010-08-24 16:20:48,903 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
Initializing
FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
2010-08-24 16:20:48,905 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
FSNamesystemStatusMBean
2010-08-24 16:20:48,937 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory
/home/hadoop/edit_log_dir1 is not formatted.
2010-08-24 16:20:48,937 INFO org.apache.hadoop.hdfs.server.common.Storage: Formatting ...
2010-08-24 16:20:48,937 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory
/home/hadoop/edit_log_dir2 is not formatted.
2010-08-24 16:20:48,937 INFO org.apache.hadoop.hdfs.server.common.Storage: Formatting ...
2010-08-24 16:20:48,938 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 41
2010-08-24 16:20:48,947 INFO org.apache.hadoop.hdfs.server.common.Storage:
Number of files under construction = 0
2010-08-24 16:20:48,947 INFO org.apache.hadoop.hdfs.server.common.Storage:
Image file of size 4357 loaded in 0 seconds.

---- directories confirmed in use

hadoop@training-vm:~$ ls -ltr edit_log_dir1
total 12
drwxr-xr-x 4 hadoop hadoop 4096 2010-08-24 16:01 name
-rw-r--r-- 1 hadoop hadoop 0 2010-08-24 16:20 in_use.lock
drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 16:20 image
drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 16:20 current

hadoop@training-vm:~$ ls -ltr edit_log_dir2
total 12
drwxr-xr-x 4 hadoop hadoop 4096 2010-08-24 16:01 name
-rw-r--r-- 1 hadoop hadoop 0 2010-08-24 16:20 in_use.lock
drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 16:20 image
drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 16:20 current
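
To convince yourself that the namenode is really writing to all three locations, a rough check (a sketch, assuming the default name dir path shown earlier) is to compare checksums of the image and edits files across the directories; they should be identical:

# checksums should match across all three storage directories
md5sum /var/lib/hadoop-0.20/cache/hadoop/dfs/name/current/edits \
       /home/hadoop/edit_log_dir1/current/edits \
       /home/hadoop/edit_log_dir2/current/edits
md5sum /var/lib/hadoop-0.20/cache/hadoop/dfs/name/current/fsimage \
       /home/hadoop/edit_log_dir1/current/fsimage \
       /home/hadoop/edit_log_dir2/current/fsimage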

----- secondary namenode checkpoint worked fine

...
2010-08-24 16:27:10,756 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
localhost:50070putimage=1&port=50090&machine=127.0.0.1&token=-
18:1431678956:1255648991179:1282692430000:1282692049090
2010-08-24 16:27:11,008 WARN org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode:

Checkpoint done. New Image Size: 4461
....
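
To keep an eye on checkpoints, grepping the secondarynamenode log and watching the fsimage timestamps in each storage directory is enough (the .log file name below is assumed; it sits next to the .out file from the startup output):

grep "Checkpoint done" /usr/lib/hadoop/logs/hadoop-hadoop-secondarynamenode-training-vm.log
ls -l /home/hadoop/edit_log_dir1/current/fsimage /home/hadoop/edit_log_dir2/current/fsimage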

--- directory listing and put work fine

hadoop@training-vm:~$ hadoop fs -ls /user/training
Found 3 items
drwxr-xr-x - training supergroup 0 2010-06-30 13:18 /user/training/grep_output
drwxr-xr-x - training supergroup 0 2010-06-30 13:14 /user/training/input
drwxr-xr-x - training supergroup 0 2010-06-30 15:30 /user/training/output

hadoop@training-vm:~$ hadoop fs -put /etc/hadoop/conf.with-desktop/hdfs-site.xml /user/training

hadoop@training-vm:~$ hadoop fs -ls /user/training
Found 4 items
drwxr-xr-x - training supergroup 0 2010-06-30 13:18 /user/training/grep_output
-rw-r--r-- 1 hadoop supergroup 987 2010-08-24 16:25 /user/training/hdfs-site.xml
drwxr-xr-x - training supergroup 0 2010-06-30 13:14 /user/training/input
drwxr-xr-x - training supergroup 0 2010-06-30 15:30 /user/training/output


------ delete one of the directories
hadoop@training-vm:~$ rm -rf edit_log_dir2

hadoop@training-vm:~$ ls -ltr
total 4
drwxr-xr-x 5 hadoop hadoop 4096 2010-08-24 16:20 edit_log_dir1
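
Deleting the directory is the bluntest way to simulate a failure; a gentler alternative (not what I did here) is to take away permissions, which should look much the same to the namenode and is easy to undo:

# make the directory inaccessible without destroying its contents
chmod 000 /home/hadoop/edit_log_dir2
# ... run the tests ...
chmod 755 /home/hadoop/edit_log_dir2   # restore access afterwards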

-- namenode logs

No errors or warnings in the logs.
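
A quick way to double-check is to grep the namenode log (again, the .log file name is assumed, next to the .out file from the startup output):

grep -iE "warn|error|fatal" /usr/lib/hadoop/logs/hadoop-hadoop-namenode-training-vm.log | tail -20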

-------- namenode still running

hadoop@training-vm:~$ jps
12426 NameNode
12647 SecondaryNameNode
12730 JobTracker
14090 Jps
12535 DataNode
12826 TaskTracker

---- puts and ls work fine

hadoop@training-vm:~$ hadoop fs -ls /user/training
Found 4 items
drwxr-xr-x - training supergroup 0 2010-06-30 13:18 /user/training/grep_output
-rw-r--r-- 1 hadoop supergroup 987 2010-08-24 16:25 /user/training/hdfs-site.xml
drwxr-xr-x - training supergroup 0 2010-06-30 13:14 /user/training/input
drwxr-xr-x - training supergroup 0 2010-06-30 15:30 /user/training/output

hadoop@training-vm:~$ hadoop fs -put /etc/hadoop/conf.with-desktop/core-site.xml /user/training

hadoop@training-vm:~$ hadoop fs -put /etc/hadoop/conf.with-desktop/mapred-site.xml /user/training

hadoop@training-vm:~$ hadoop fs -ls /user/training
Found 6 items
-rw-r--r-- 1 hadoop supergroup 338 2010-08-24 16:28 /user/training/core-site.xml
drwxr-xr-x - training supergroup 0 2010-06-30 13:18 /user/training/grep_output
-rw-r--r-- 1 hadoop supergroup 987 2010-08-24 16:25 /user/training/hdfs-site.xml
drwxr-xr-x - training supergroup 0 2010-06-30 13:14 /user/training/input
-rw-r--r-- 1 hadoop supergroup 454 2010-08-24 16:29 /user/training/mapred-site.xml
drwxr-xr-x - training supergroup 0 2010-06-30 15:30 /user/training/output

------- secondary namenode checkpoint is successful

2010-08-24 16:37:11,455 WARN org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode:
Checkpoint done. New Image Size: 4671
....
2010-08-24 16:47:11,884 WARN org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode:
Checkpoint done. New Image Size: 4671
...
2010-08-24 16:57:12,264 WARN org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode:
Checkpoint done. New Image Size: 4671

------- after 30 mins

hadoop@training-vm:~$ jps
12426 NameNode
12647 SecondaryNameNode
12730 JobTracker
16256 Jps
12535 DataNode
12826 TaskTracker
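
To bring the lost directory back into service, one option is simply to recreate it and bounce the daemons; as the startup log above showed, the namenode formats any storage directory it finds unformatted:

mkdir /home/hadoop/edit_log_dir2
/usr/lib/hadoop/bin/stop-all.sh
/usr/lib/hadoop/bin/start-all.sh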
