Tuesday, July 06, 2010

Hadoop error logs: org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException

2010-06-06 23:28:04,195 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(, storageID=DS-1219700830-,
infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
blk_-7647836931304231161_18783 is valid, and cannot be written to.
at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1069)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.&lt;init&gt;(BlockReceiver.java:98)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:259)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
at java.lang.Thread.run(Thread.java:619)

Most of the research online pointed to an issue with the open file limit on the cluster instances. I did some analysis and found the following:

-- current open files
[sudhirv@hadoop-cluster-2 ~]$ sudo /usr/sbin/lsof | wc -l

-- allowed open file limit cutoff
[hadoop@hadoop-cluster-2 ~]$ ulimit -n
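A quick way to compare the two numbers for the DataNode process itself (rather than the whole box) is to read them out of /proc. This is a sketch, not part of the original write-up; it assumes a Linux host where the DataNode runs as the "hadoop" user:

```shell
#!/bin/sh
# Compare the DataNode's currently open file descriptors against its
# per-process soft limit (assumes it runs as user "hadoop" -- adjust pgrep).
DN_PID=$(pgrep -u hadoop -f DataNode | head -n 1)
if [ -n "$DN_PID" ]; then
  OPEN=$(ls "/proc/$DN_PID/fd" | wc -l)                              # descriptors in use
  LIMIT=$(awk '/Max open files/ {print $4}' "/proc/$DN_PID/limits")  # soft limit
  echo "open=$OPEN limit=$LIMIT"
else
  echo "no DataNode process found"
fi
```

If "open" is regularly close to (or above) "limit", the DataXceiver errors above are the symptom you would expect to see.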

As per the Pro Hadoop book, pg. 112: "Hadoop Core uses large numbers of file descriptors for MapReduce, and the DFSClient uses a large number of file descriptors for communicating with the HDFS NameNode and DataNode server processes. Most sites immediately bump up the number of file descriptors to 64,000, and large sites, or sites that have MapReduce jobs that open many files, might go higher." Looking at the output of the commands above, the current number of open file descriptors is greater than the configured limit.

See the reference links below for more suggestions.

Decided to set the limit to 8192 and monitor the logs. Modified the open file limit on all Hadoop cluster boxes to 8192. After monitoring the logs for a few days, the error no longer appears.
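To make that change persistent across reboots, the usual place on these systems is /etc/security/limits.conf. The fragment below is a sketch of what was applied; it assumes the Hadoop daemons run as the "hadoop" user and that PAM reads limits.conf on login (the exact mechanism varies by distribution):

```
# /etc/security/limits.conf -- raise the open file limit for the hadoop user
# (daemons must be restarted under a fresh login session to pick this up)
hadoop  soft  nofile  8192
hadoop  hard  nofile  8192
```

Setting both the soft and hard values keeps `ulimit -n` (soft) and `ulimit -Hn` (hard) in agreement.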

-- verify the new hard and soft limits
[root@hadoop-cluster-1 ~]# ulimit -Hn
[root@hadoop-cluster-1 ~]# ulimit -Sn


