Wednesday, July 21, 2010

Installing Hadoop Native Libraries : JAVA_HOME not set correctly error

I followed the instructions here to install the Hadoop native libraries for the Cloudera distribution and got the JAVA_HOME not set error when running the ant compile-native target. Here's the step-by-step procedure to install the native libraries and resolve the error.

Follow the instructions here first to ensure you have the native libraries installed in your OS.

To configure the Hadoop native libraries, follow the steps below. Since all our cluster nodes were of the same configuration (architecture- and OS-wise), I built the native libraries on one node and copied them onto the others. If your nodes differ, you will have to repeat the steps below on each Hadoop node.

Step 1: Install and configure Apache Ant on the hadoop cluster node

Step 2: Download the source for your corresponding Hadoop version.

- find your hadoop version

$ hadoop version
Hadoop 0.20.1+169.88
Subversion -r ded54b29979a93e0f3ed773175cafd16b72511ba
Compiled by root on Thu May 20 22:22:39 EDT 2010

- download the corresponding SRPM for the above version.
We use the Cloudera distribution, so I got mine from the link below

- Install the SRPM to get the source. This basically dumped the source in my
folder. If it dumped a tar.gz file, go ahead and untar it.

$ rpm -i hadoop-0.20-0.20.1+169.88-1.src.rpm

Step 3: Link the 'src' folder in source to the hadoop installation folder

cd /usr/lib/hadoop-0.20 -- this is basically the HADOOP_HOME
sudo ln -s /usr/src/redhat/SOURCES/hadoop-0.20.1+169.88/hadoop-0.20.1+169.88/src src

Step 4: Run the ant task

ant -Dcompile.native=true compile-native

If the above task fails with a 'jni.h' not found error, then configure the JAVA_HOME path:

[exec] checking jni.h usability... /usr/lib/hadoop-0.20/src/native/configure: line
19091: test: !=: unary operator expected
[exec] no
[exec] checking jni.h presence... no
[exec] checking for jni.h... no
[exec] configure: error: Native java headers not found. Is $JAVA_HOME set correctly?

The following error occurred while executing this line:
/usr/lib/hadoop-0.20/build.xml:487: exec returned: 1

To resolve the error, open the build.xml file and add a JAVA_HOME env entry to the compile-core-native target. After the change, it should look like the below:

<exec dir="${build.native}" executable="sh" failonerror="true">
  <env key="OS_NAME" value="${}"/>
  <env key="JAVA_HOME" value="/usr/java"/>
  <env key="OS_ARCH" value="${os.arch}"/>
  <env key="JVM_DATA_MODEL" value="${}"/>
  <env key="HADOOP_NATIVE_SRCDIR" value="${native.src.dir}"/>
  <arg line="${native.src.dir}/configure"/>
</exec>

The above will resolve the issue.

A successful installation should show something like this

$ ls -ltr /usr/lib/hadoop-0.20/build/native/Linux-amd64-64/lib
total 212
-rw-r--r-- 1 root root 12953 Jul 21 16:43 Makefile
-rwxr-xr-x 1 root root 73327 Jul 21 16:43
-rw-r--r-- 1 root root 850 Jul 21 16:43
-rw-r--r-- 1 root root 114008 Jul 21 16:43 libhadoop.a
lrwxrwxrwx 1 root root 18 Jul 21 17:04 ->
lrwxrwxrwx 1 root root 18 Jul 21 17:04 ->

Run a test job to make sure the native libs are being picked up properly. Use the test jar matching your version of Hadoop. In the test below, confirm that the native libs are getting picked up: check for the statement "Successfully loaded & initialized native-zlib library".

$ hadoop jar /usr/lib/hadoop-0.20/hadoop-0.20.1+169.88-test.jar testsequencefile -seed
0 -count 1000 -compressType RECORD xxx -codec -check
10/07/22 11:46:27 INFO io.TestSequenceFile: count = 1000
10/07/22 11:46:27 INFO io.TestSequenceFile: megabytes = 1
10/07/22 11:46:27 INFO io.TestSequenceFile: factor = 10
10/07/22 11:46:27 INFO io.TestSequenceFile: create = true
10/07/22 11:46:27 INFO io.TestSequenceFile: seed = 0
10/07/22 11:46:27 INFO io.TestSequenceFile: rwonly = false
10/07/22 11:46:27 INFO io.TestSequenceFile: check = true
10/07/22 11:46:27 INFO io.TestSequenceFile: fast = false
10/07/22 11:46:27 INFO io.TestSequenceFile: merge = false
10/07/22 11:46:27 INFO io.TestSequenceFile: compressType = RECORD
10/07/22 11:46:27 INFO io.TestSequenceFile: compressionCodec =
10/07/22 11:46:27 INFO io.TestSequenceFile: file = xxx
10/07/22 11:46:27 INFO io.TestSequenceFile: creating 1000 records with RECORD compression
10/07/22 11:46:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library
10/07/22 11:46:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
10/07/22 11:46:27 INFO compress.CodecPool: Got brand-new compressor
10/07/22 11:46:27 INFO compress.CodecPool: Got brand-new decompressor
10/07/22 11:46:28 INFO compress.CodecPool: Got brand-new decompressor
10/07/22 11:46:28 INFO io.TestSequenceFile: done sorting 1000 debug
10/07/22 11:46:28 INFO io.TestSequenceFile: sorting 1000 records in memory for debug
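Rather than eyeballing the whole job output, you can grep its stderr for the confirmation lines. A minimal sketch, run here against the sample log lines shown above (in practice you would capture the real job output, e.g. with `2>&1 | tee job.log`, and grep that instead):

```shell
# Sample lines copied from the job output above; replace with your captured log.
sample_log='10/07/22 11:46:27 INFO util.NativeCodeLoader: Loaded the native-hadoop library
10/07/22 11:46:27 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library'

if printf '%s\n' "$sample_log" | grep -q 'native-zlib'; then
  echo "native zlib OK"
else
  echo "native zlib NOT loaded"
fi
```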

Looking online, some other folks have had issues with the above test; they had to manually add the native libs to JAVA_LIBRARY_PATH. This issue has since been fixed. If you are hitting it, here is a link on how to manually add the libs to the path

Friday, July 16, 2010

Ning: Use your own domain from HostMonster

Here are the instructions on how to set the DNS records to point your domain from HostMonster to Ning. Currently my Ning site is at "" and I wanted to point it to "", so I purchased "" on HostMonster. Under cPanel in HostMonster there is a link for "Advanced DNS Editor". Here are the instructions on how to locate the DNS editor.

Click on the DNS editor link and choose "" from the "Select a Domain" dropdown list.

Take a screenshot of the current settings (in case you mess up, you can always restore them to what they were using the screenshot).

Step 1: Delete any current "A" type record with name ""

Step 2: Set up a new "A" type record like the below screen shot

Step 3: Set up a new "CNAME" type record like the below screen shot

Now access the "" site or the "" site and it should point to your Ning page. Depending on the configured TTL, it might take a while for the new mapping to propagate. You can try flushing your DNS cache using the steps from this site

Thursday, July 15, 2010

Hadoop HDFS Error: Could not complete write to file

We recently have been seeing this error in our Hadoop NameNode logs quite a bit:

2010-06-14 05:15:30,428 WARN org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.completeFile: failed to complete (unknown) because dir.getFileBlocks() is null and pendingFile is null
Could not complete write to file (unknown) by DFSClient_-44010819
at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(
at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.hadoop.ipc.RPC$
at org.apache.hadoop.ipc.Server$Handler$
at org.apache.hadoop.ipc.Server$Handler$
at Method)

Researching further, we found that a new map/reduce job had been deployed recently that merges daily files. The errors pretty much started showing up after this job was deployed, and it is cronned to run every day. The job takes a bunch of small files as input and merges them into a single compressed file. In the process, it creates a temp file with the merged content, verifies it, compresses it to a final file, and then deletes the temp file. The errors were being thrown when the temp file was getting deleted: at times the delete (metadata on the NameNode) happened before the file was replicated to all nodes, and that triggered this error. Currently the HDFS API does not provide a synchronous method to wait until a file is fully replicated.

To resolve this, we started using the HDFS trash functionality. HDFS best practices suggest "Enable HDFS trash, and avoid programmatic deletes - prefer the trash facility." If trash is enabled (and, it should be noted, by default it is not), files that are deleted using the Hadoop filesystem shell are moved into a special hidden trash directory rather than being deleted immediately. The trash directory is emptied periodically by the system, and the deletion period is configurable. Any files that are mistakenly deleted can be recovered manually by moving them out of the trash directory. Note that the trash facility is a user-level feature: it is not used when programmatically deleting files. As a best practice, consider moving data to the trash directory (to be deleted by the system after the configured period) rather than doing an immediate delete.
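Trash retention is controlled by the fs.trash.interval property in core-site.xml, given in minutes; 0 (the default) disables trash. A minimal sketch of the config — the one-day value here is just an example, pick a period that suits your recovery window:

```xml
<!-- core-site.xml -->
<property>
  <name>fs.trash.interval</name>
  <!-- keep trashed files for 1440 minutes (one day); 0 disables trash -->
  <value>1440</value>
</property>
```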

We changed the code to move files to trash instead of deleting them, and that resolved the errors in the logs. Below is a common utility method we use across our code base when removing files.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public static void moveToTrash(Configuration conf, Path path) throws IOException {
    Trash t = new Trash(conf);
    boolean isMoved = t.moveToTrash(path);
    if (!isMoved) {
        // moveToTrash returns false when trash is disabled or the path is already trashed
        logger.error("Trash is not enabled or file is already in the trash.");
    }
}

Tuesday, July 06, 2010

Hadoop error logs: org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException

2010-06-06 23:28:04,195 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(, storageID=DS-1219700830-,
infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
blk_-7647836931304231161_18783 is valid, and cannot be written to.
at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.&lt;init&gt;(
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(

Most of the research online pointed to an issue with the open-file limit on the cluster instances. I did some analysis and found the below:

-- current open files
[sudhirv@hadoop-cluster-2 ~]$ sudo /usr/sbin/lsof | wc -l

-- allowed open file limit cutoff
[hadoop@hadoop-cluster-2 ~]$ ulimit -n

As per the Pro Hadoop book, pg. 112: "Hadoop Core uses large numbers of file descriptors for MapReduce, and the DFSClient uses a large number of file descriptors for communicating with the HDFS NameNode and DataNode server processes. Most sites immediately bump up the number of file descriptors to 64,000, and large sites, or sites that have MapReduce jobs that open many files, might go higher." Looking at the commands above, I see the current count of open file descriptors is greater than the set file limit.

See reference links below to find more suggestions

I decided to set the limit to 8192 and monitor the logs. I modified the open-file limit on all Hadoop cluster boxes to 8192, monitored the logs for a few days, and we don't see the error anymore.

[root@hadoop-cluster-1 ~]# ulimit -Hn
[root@hadoop-cluster-1 ~]# ulimit -Sn
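To make the raised limit persistent across reboots, it can be set in /etc/security/limits.conf. A sketch, assuming the Hadoop daemons run as a "hadoop" user — adjust the user name and values for your cluster:

```
# /etc/security/limits.conf
hadoop  soft  nofile  8192
hadoop  hard  nofile  8192
```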