Baby Steps into HBase

Today, after reading (the amazing and invaluable!) "Understanding HBase and BigTable" while researching schemas for Google App Engine, I took my first tentative steps into using HBase. About HBase:

HBase is the Hadoop database. Its (sic) an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.

HBase’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Try it if your plans for a data store run to big.

Well, my plans don’t run to big, but they do run to indexed over time. Since every cell in an HBase table has a timestamp, it is really easy to snapshot data over time, and to “roll back” a query as though it were asked at any point in the past. For data that changes rarely, but for which one wants a historical record, this might make querying with history much simpler.
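To make the idea concrete, here is a toy sketch in Python of what "every cell has a timestamp" buys you. It is an in-memory model only, not the HBase API: the class name VersionedTable, its put/get methods, and the "org:unit" column name are all made up for illustration.

```python
import bisect
from collections import defaultdict


class VersionedTable:
    """Toy in-memory model of timestamped cells (NOT the HBase API)."""

    def __init__(self):
        # (row, column) -> parallel lists, kept sorted by timestamp
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def put(self, row, column, value, ts):
        """Store a new version of a cell at timestamp ts."""
        key = (row, column)
        i = bisect.bisect_right(self._timestamps[key], ts)
        self._timestamps[key].insert(i, ts)
        self._values[key].insert(i, value)

    def get(self, row, column, as_of=None):
        """Return the newest value written at or before as_of (latest if None)."""
        key = (row, column)
        stamps = self._timestamps.get(key, [])
        if not stamps:
            return None
        if as_of is None:
            return self._values[key][-1]
        i = bisect.bisect_right(stamps, as_of)
        return self._values[key][i - 1] if i else None


table = VersionedTable()
table.put("alice", "org:unit", "Research", ts=100)
table.put("alice", "org:unit", "Sales", ts=200)

print(table.get("alice", "org:unit"))             # Sales (latest version)
print(table.get("alice", "org:unit", as_of=150))  # Research (as of ts=150)
print(table.get("alice", "org:unit", as_of=50))   # None (before any write)
```

The "roll back" query is just a lookup with as_of set to some earlier timestamp; nothing has to be copied or restored.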

Historical Data Example

Think about how an organization changes over time. Employees enter and leave, and business units might be bought and sold. One approach to modeling this is to take a snapshot every day, and store that in an RDBMS. The snapshots will have a lot of redundant information, since an org doesn’t really change very much from day to day.

A simpler model is to enter a new snapshot of the organization only when it changes, essentially overwriting the previous configuration. Since HBase automatically labels cells with a timestamp, this comes for free.
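A rough way to see the difference between the two models is to count how many records each one would store. The sketch below is plain Python with a made-up five-day history; the function names and the employee data are hypothetical, and the "delete marker" for a departing employee is an assumption about how one might record a departure, not something HBase does for you here.

```python
def snapshot_rows(daily_states):
    """Daily-snapshot model: every employee is stored again every day."""
    return sum(len(state) for state in daily_states)


def change_rows(daily_states):
    """Change-only model: store a cell only when it differs from the day before."""
    stored = 0
    previous = {}
    for state in daily_states:
        # New employees and employees whose unit changed each cost one write.
        for employee, unit in state.items():
            if previous.get(employee) != unit:
                stored += 1
        # Assumption: a departure is recorded as one delete marker.
        stored += sum(1 for employee in previous if employee not in state)
        previous = state
    return stored


# Hypothetical five-day history: employee -> business unit.
days = [
    {"alice": "Research", "bob": "Sales"},
    {"alice": "Research", "bob": "Sales"},
    {"alice": "Research", "bob": "Sales", "carol": "Sales"},  # carol joins
    {"alice": "Sales", "bob": "Sales", "carol": "Sales"},     # alice moves
    {"alice": "Sales", "carol": "Sales"},                     # bob leaves
]

print(snapshot_rows(days))  # 12
print(change_rows(days))    # 5
```

The gap grows with every quiet day: a day with no changes adds a full copy to the snapshot model and nothing to the change-only one.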

Setting it up

Using Ole-Martin Mørk’s instructions was a breeze!  Even though I know almost nothing about Java and the Java environment, I managed it.  I followed them, with these modifications:

  1. After downloading, unzipping, and symbolic linking to ~/hbase, I put the whole thing under version control ( $ git init; git add .; git commit -m "initial checkin, as unpacked from source" ), so that if I foul up anything, I can easily revert!
  2. Edit ~/hbase/conf/ to set the right “JAVA_HOME”, which for me (Debian) is: export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

Since I don’t have passwordless ssh set up to localhost (in fact, “Connection refused” suggests that nothing is listening for ssh on port 22 at all), I get this error:

~/hbase$ ~/hbase/bin/
localhost: ssh: connect to host localhost port 22: Connection refused

The rest of the example seems to run fine though, and I’m in no mood to really track this down, since I’m still in the experiment phase.
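For whenever I do feel like tracking it down, a quick first check is whether anything accepts TCP connections on localhost port 22 at all. This is a minimal sketch using only Python's standard socket module; the helper name port_is_open is made up, and it only tells you whether a connection succeeds, not whether passwordless login would work.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    if port_is_open("localhost", 22):
        print("Something is listening on localhost:22")
    else:
        print("Nothing accepted a connection on localhost:22")
```

If nothing is listening, the fix is most likely to install or start an ssh server rather than to fiddle with keys.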

Future Steps

I’m not sure whether I’ll be going any deeper anytime soon, since I have a lot of SQLAlchemy code built around handling these sorts of ‘historical’ queries (where inserting and updating are the real difficulties!), but I do quite like the idea of easily versioned, map-like data stores.

