HBase Slowdown and Region Inconsistencies

At work, we had an issue with HBase where one of our regionservers was registering with the HBase master under both its hostname and its IP address.  This caused a lot of slowness: regions were being accessed through the IP address entry, while the hostname entry was the one actually returning the data.

First, I needed to isolate the problem to this node.  Since I knew the regionserver’s data is stored on the underlying datanode, I wasn’t concerned about shutting the regionserver off.

Act I.  Shutting the regionserver off.

Well, that worked.  However, looking at the HBase master, the IP was still checking in.  The next step was to see what processes were running on the node.  SSH onto the regionserver host and run something like:

ps -ef | grep regionserver

When I did this, there was an orphaned regionserver process.  Take note of the child and parent PIDs in case you kill the process and still have an orphan.

kill <PID>

Now I was free of any lingering regionservers.  Looking at the HBase master, the regionserver IP was finally listed among the dead nodes.
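The check-and-kill step above can be sketched as two small shell helpers.  This is only a sketch, not what I ran at the time: list_procs and stop_pid are hypothetical names, the 30 second default timeout is an arbitrary assumption, and the ps format options assume a POSIX-style ps.

```shell
#!/bin/sh
# List processes whose command line matches a pattern, showing PID and parent PID.
# The ps snapshot is taken first so the awk filter never shows up in its own output.
list_procs() {
  snapshot=$(ps -eo pid,ppid,etime,args)
  printf '%s\n' "$snapshot" | awk -v pat="$1" 'NR == 1 || $0 ~ pat'
}

# Ask a process to exit with SIGTERM, wait up to TIMEOUT seconds (default 30),
# then fall back to SIGKILL if it is still running.
stop_pid() {
  pid=$1
  timeout=${2:-30}
  kill "$pid" 2>/dev/null || return 0   # nothing to do: already gone
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      kill -9 "$pid" 2>/dev/null
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
}
```

Running list_procs regionserver shows the PID and PPID columns side by side, and stop_pid <PID> gives the process a chance to exit cleanly before forcing it.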

At this point, I brought the regionserver back.  However, as it came back, there were several regions still stuck in transition.  This is the dreaded RIT issue!

Act II.  Remove the Regions In Transition

The typical steps for fixing regions in transition are to stop HBase, remove the znodes, and start HBase.

And that’s what I did.

So, I stopped HBase and watched the logs to see if there were going to be any errors.  Things shut down rather quickly, so I logged into the ZooKeeper command line interface (zkcli).

hbase zkcli or zookeeper-client

In here, you want to remove the /hbase znode.

rmr /hbase
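One caveat if you are on a newer ZooKeeper: from 3.5 on, the CLI deprecates rmr in favor of deleteall.  A rough sketch for picking the right command from a version string follows; zk_delete_cmd is a hypothetical helper, and you would have to supply the version yourself.

```shell
#!/bin/sh
# Print the recursive-delete command for a given ZooKeeper version.
# $1: version string such as 3.4.6 or 3.5.9   $2: znode path
zk_delete_cmd() {
  major=${1%%.*}      # text before the first dot
  rest=${1#*.}        # text after the first dot
  minor=${rest%%.*}   # second version component
  if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 5 ]; }; then
    echo "deleteall $2"
  else
    echo "rmr $2"
  fi
}
```

For example, zk_delete_cmd 3.4.6 /hbase prints the same rmr /hbase used above.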

Then I started HBase back up.

Act III.  Fixing Regions in Transition (Again)

So, we didn’t actually fix the regions.  Some of the regions were still tied to the IP of the orphaned regionserver process.  This is where you bring out the big guns.

hbase hbck -repair

When I ran this, it checked all the HBase tables and regionservers for inconsistencies.  It ended up repairing the regions that were tied to the orphaned process, and it found quite a few files that had no references at all.
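If you would rather look before you leap, hbck can be run read-only first and repaired only when it reports problems.  The sketch below assumes an HBase 1.x-era hbck that still supports -repair and ends its report with a "Status: OK" or "Status: INCONSISTENT" line; hbck_status and the hbck.log file name are hypothetical.

```shell
#!/bin/sh
# Read hbck output on stdin and print the value of the last "Status:" line
# (OK or INCONSISTENT), or UNKNOWN if no such line was seen.
hbck_status() {
  awk '/^Status:/ { status = $2 } END { print (status == "" ? "UNKNOWN" : status) }'
}

# Example use (assumes the hbase command is on PATH):
#   hbase hbck -details > hbck.log 2>&1
#   if [ "$(hbck_status < hbck.log)" = "INCONSISTENT" ]; then
#     hbase hbck -repair
#   fi
```

Keeping the read-only report in a log file also gives you something to compare against after the repair.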

In the end, these three steps fixed most of the slowness for us.