Baby Steps into HBase
Posted: July 15, 2009
HBase is the Hadoop database. Its (sic) an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.
HBase’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Try it if your plans for a data store run to big.
Well, my plans don’t run to big, but they do run to indexed over time. Since every cell in an HBase table has a timestamp, it is really easy to snapshot data over time and “roll back” a query as though it were asked at any point in the past. For data that changes rarely, but for which one wants a historical record, this might make querying with history much simpler.
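To make the idea concrete, here is a toy in-memory model of timestamped cells in Python. It is only a sketch of the concept, not the HBase API: the class name VersionedTable, its put/get methods, and the "org:unit" column name are all made up for illustration.

```python
import bisect


class VersionedTable:
    """Toy model of timestamped cells. Hypothetical; NOT the HBase API."""

    def __init__(self):
        # (row, column) -> two parallel lists, sorted by timestamp
        self._cells = {}

    def put(self, row, column, value, ts):
        """Store a new version of a cell without discarding older ones."""
        stamps, values = self._cells.setdefault((row, column), ([], []))
        i = bisect.bisect_right(stamps, ts)
        stamps.insert(i, ts)
        values.insert(i, value)

    def get(self, row, column, as_of=None):
        """Return the newest value written at or before `as_of`.

        With as_of=None the latest version is returned.
        """
        stamps, values = self._cells.get((row, column), ([], []))
        if not stamps:
            return None
        if as_of is None:
            return values[-1]
        i = bisect.bisect_right(stamps, as_of)
        return values[i - 1] if i else None


table = VersionedTable()
table.put("alice", "org:unit", "Research", ts=100)
table.put("alice", "org:unit", "Sales", ts=200)

print(table.get("alice", "org:unit"))             # Sales (latest version)
print(table.get("alice", "org:unit", as_of=150))  # Research (as of ts=150)
```

The same question asked with a different as_of gives the answer as it stood at that time, which is the “roll back” behaviour I am after.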
Historical Data Example
Think about how an organization changes over time. Employees enter and leave; business units might be bought and sold. One approach to modeling this is to take a snapshot every day and store it in an RDBMS. The snapshots will have a lot of redundant information, since an org doesn’t really change very much from day to day.
A simpler model is to enter a new snapshot of the organization only when it changes, essentially overwriting the previous configuration. Since HBase automatically labels cells with a timestamp, the history comes for free: older versions stay retrievable by timestamp, up to the number of versions the column family is configured to keep.
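As a rough illustration of how much the daily snapshots repeat themselves, here is a small Python sketch that keeps a snapshot only when it differs from the last one kept. The function names changes_only and snapshot_as_of and the sample data are hypothetical, chosen just for this example.

```python
def changes_only(daily_snapshots):
    """Keep a (day, snapshot) pair only if it differs from the last one kept."""
    kept = []
    for day, snapshot in daily_snapshots:
        if not kept or kept[-1][1] != snapshot:
            kept.append((day, snapshot))
    return kept


def snapshot_as_of(kept, day):
    """Return the most recent kept snapshot recorded on or before `day`."""
    result = None
    for kept_day, snapshot in kept:
        if kept_day > day:
            break
        result = snapshot
    return result


# Five daily snapshots of a tiny org; bob leaves on day 4.
daily = [
    (1, {"alice": "Research", "bob": "Sales"}),
    (2, {"alice": "Research", "bob": "Sales"}),
    (3, {"alice": "Research", "bob": "Sales"}),
    (4, {"alice": "Research"}),
    (5, {"alice": "Research"}),
]

kept = changes_only(daily)
print(len(daily), "daily snapshots ->", len(kept), "stored versions")
print(snapshot_as_of(kept, 3))  # the org as it looked on day 3
```

Only the days on which something changed are stored, yet a query for any day in between still gets an answer.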
Setting it up
Using Ole-Martin Mørk’s instructions was a breeze! Even though I know almost nothing about Java and the Java environment, I managed it. I followed them, with these modifications:
- After downloading, unzipping, and symlinking to ~hbase, I put the whole thing under version control ( $ git init; git add * ; git commit -m "initial checkin, as unpacked from source" ), so that if I foul anything up, I can easily revert!
- Edit ~hbase/conf/hbase-env.sh to set the right JAVA_HOME, which for me (Debian) is: export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
Since I don’t have passwordless ssh set up to localhost (in fact, “Connection refused” suggests nothing is listening on port 22 at all), I get this error:
localhost: ssh: connect to host localhost port 22: Connection refused
The rest of the example seems to run fine though, and I’m in no mood to really track this down, since I’m still in the experiment phase.
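For anyone who does want to narrow it down, a first step would be to check whether anything accepts connections on port 22 at all. Here is a minimal Python sketch using only the standard library. The helper name port_is_open is my own, and it only tests TCP reachability, not whether passwordless login would work.

```python
import socket


def port_is_open(host="localhost", port=22, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    print("ssh reachable on localhost:22:", port_is_open())
```

If this prints False, no ssh server is listening, and setting up keys would not help until one is installed and running.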
I’m not sure whether I’ll be going any deeper anytime soon, since I have a lot of SQLAlchemy code built around handling these sorts of ‘historical’ queries (where inserting and updating are the real difficulties!), but I do like the idea of easily versioned, map-like data stores quite well.