Snippet: Size of PostgreSQL Tables using SqlAlchemy

This does a decent job of getting approximate sizes of the tables and indexes in a PG database, using SA.

# SELECT pg_database_size('geekdb');
# with indexes
# SELECT pg_size_pretty(pg_total_relation_size('big_table'));
# without
# SELECT pg_size_pretty(pg_relation_size('big_table'));

from sqlalchemy import MetaData

dbparts = dict(user = some_user,
    pw = some_pw,
    db = somedb,
    host = "localhost")
dburi = "postgresql://%(user)s:%(pw)s@%(host)s/%(db)s" % dbparts

def table_sizes(dburi, pretty=False):
    """
    Args:
        dburi.  string.  url of the db to be reflected
        pretty. bool.  pretty-print the sizes (GB, MB, etc.) instead of returning raw byte counts

    Returns:
        dict::

            {table_name:  {'_total': int, '_table': int,
                'index_name': int, ...}}

    Note:  Only works on PG, since it uses PG-specific functions
    """
    out = dict()
    m = MetaData(dburi)
    e = lambda x: m.bind.execute(x).first()[0]  # run a query, return the single scalar result
    m.reflect()
    q_pretty_total = "SELECT pg_size_pretty(pg_total_relation_size('%s'))"
    q_pretty_relation = "SELECT pg_size_pretty(pg_relation_size('%s'))"
    q_total = "SELECT pg_total_relation_size('%s')"
    q_relation = "SELECT pg_relation_size('%s')"

    for table_name,table in m.tables.iteritems():
        t = dict()
        if pretty:
            t['_total'] = e(q_pretty_total % table_name)
            t['_table'] = e(q_pretty_relation % table_name)
        else:
            t['_total'] = e(q_total % table_name)
            t['_table'] = e(q_relation % table_name)
        for idx in table.indexes:
            name = idx.name
            if pretty:
                t[name] = e(q_pretty_relation % name)
            else:
                t[name] = e(q_relation % name)
        
        out[table_name] = t
    
    return out
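
A quick usage sketch, assuming the placeholder connection details above have been filled in with a real database:

sizes = table_sizes(dburi, pretty=True)
for table_name, info in sorted(sizes.items()):
    print table_name, info['_total']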


String Join Aggregate Function in PostgreSQL 8.1

Sometimes I store data in denormalized ways, even in Postgres. This can have performance benefits, and definitely can be faster to develop, easier to understand, and in general, Lower Stress ™. I tend to use comma-delimited text in those sorts of denormalized fields, and in that scenario it’s useful to have an aggregate function that joins such fields. So, based on Abulyadi and some chatter in freenode#postgresql (vol7ron, xzilla, others), here is an 8.1 idiom for a “string join aggregate”:

/* 8.3+? */
CREATE AGGREGATE array_agg(anyelement) (
    SFUNC=array_append,
    STYPE=anyarray,
    INITCOND='{}'
);
/* 8.1; the format for CREATE AGGREGATE changes in later versions */
CREATE AGGREGATE array_agg (
    SFUNC = array_append,
    BASETYPE = anyelement,
    STYPE = anyarray,
    INITCOND = '{}'
);

Once the array_agg is created, you can call it like this:

SELECT array_to_string(array_agg(some_field), ',') FROM some_table;

Quibble, a Damn Small Query Language (DSQL) Using Python

This intermediate-level article will demonstrate how to use the filter idiom, delegation tables, list comprehensions, generator expressions, and the operator module to create a compact but expandable language for querying data.

When many people hear the word ‘query’, their minds jump to Structured Query Language (SQL).  Now I love SQL as much as anyone[1].  Using SQL for queries is wonderful when one’s data is already loaded into a SQL database[2].  Sometimes the Real World (TM) conspires against this, since:

  • the data might be heterogeneous
  • the data might be easy to express in Python terms, but tedious to refactor into a normalized form.  As a quick example, consider a dict of sets, which would require a join, a foreign key, and actual *gasp* schema design.
  • one might not have access to a database (though with SQLite being embedded in Python from 2.5 onward, this is less an issue)
  • one might have irrational biases against schemas and the straitjacketing they impose on agile development and programmer whimsy.  I suffer from this bias myself, and attend regular SQL indoctrination meetings, but so far it’s not sticking!  NoSQL Forever!
  • SQL is enterprisey, but not Web2.0, man!

That said, SQL has lots of advantages:

  • Extremely flexible, complex querying
  • Widely deployed
  • (etc, etc.)

Let’s begin by building a list of dictionaries to query against.  These could be any objects that support a dictionary interface.  Note that these objects are heterogeneous.  Also note that they are quite contrived, and rather boring.

# a list of dicts to query against
data = [
    dict(a=None, b=1, c=[1,2,3]),
    dict(a=13, d=dict(a=1,b=2)),
    dict(c=13,e="some string"),
    dict(c=10,e="some other string"),
    dict(a=10,e="some other string"),
    {('author','email'): ('Gregg Lind','gregg.lind at fakearoo.com')},
]

Now that we have some data, we’re going to build a simple query language called Quibble [3] to search against it.  We will be using the filter/pipeline idiom.  The filter idiom is quite simple:  if an object matches some condition, keep it; else continue on.  On Unix, this is a very simple type of pipeline; when one wants venture capital, call it “map-reduce”.  While Python has a filter function (http://docs.python.org/library/functions.html#filter), a plain list comprehension will be quite a bit simpler to use for our dumb purposes.
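
As a throwaway illustration of the filter idiom (not part of Quibble itself), here is that pattern as a bare list comprehension over the data list above:

# keep a dict if it passes the test, drop it otherwise
has_a = [d for d in data if d.get('a') is not None]
# -> the two dicts that carry a non-None 'a' key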

Next we will build a delegation table.  This simple mapping maps names like “<=” to functions.  When people talk about the power of ‘functions as first-class objects’, this is part of what they’re on about: we can map shorthand names to *unevaluated functions*.

To make our lives easier, Quibble will use a simple convention for defining what counts as a valid query operator.  An ‘operator function’ must take exactly two arguments, following this format:

     my_operator(some_dict[key], value)

Luckily for us [4], the functions in the python operator module http://docs.python.org/library/operator.html mostly take this form.  Having this same calling convention will make it possible to just drop the ‘right’ function in.

import operator
operators = {
    "<" : operator.lt,
    "<=" : operator.le,
    "==" : operator.eq,
    "!=" : operator.ne,
    ">=" : operator.ge,
    ">"  : operator.gt,
    "in" : operator.contains,
    "nin" : lambda x,y: not operator.contains(x,y),
}

Note that for ‘nin’, we had to wrap it.  Python’s lambda expressions make this easy, and the resulting code is still easy to read.  We could also use a real named function here, like this:

def nin_(x,y):
    return x not in y

Or a simpler lambda:

"nin" :  lambda x,y:  x not in y,
def query(D,key,val,operator="=="):
    '''
    D:  a list of dictionaries
    key:  the key to query
    val:  the value
    operator:  "==", ">=", "in", etc.

    Returns the elements x in D for which operator(x.get(key,None), val) is true
    '''
    try:
        op = operators[operator]
    except KeyError:
        raise ValueError, "operator must be one of %r" % operators
    return [x for x in D if op(x.get(key,None),val)]

print "1st version"
print query(data,'a',1)
print query(data,'c',None,'!=')

Excellent.  Time to retire to a private island.  Oh wait, you want to define new functions?  Chain these queries together?  It should handle exceptions?  We can fix those.

A more fundamental problem with this filter approach is that defining “or” conditions is quite awkward, since filters only reduce the input set at each stage.  We will clean this up as well (but it will be ugly).
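
To see the awkwardness concretely, here is a rough sketch (using the first-version query above) of one way to fake an “or”: run each condition separately over the full input, then merge the hits by hand.

# faking "a == 10 OR c == 10" with two separate filter passes and a manual merge
a_hits = query(data, 'a', 10)
c_hits = query(data, 'c', 10)
either = a_hits + [x for x in c_hits if x not in a_hits]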

Let’s add some functionality.

  • operator can be any two-argument function
  • return an iterator instead of a list
  • tee the original input, in case it too is an iterator that we don’t want to exhaust
  • add a keynotfound argument, to change what happens if the key isn't found in the dict

import itertools
import inspect
def _can_take_at_least_n_args(f, n=2):
    ''' helper to check that a function can take at least n positional args'''
    # getargspec returns (args, varargs, keywords, defaults)
    (pos, varargs, kwargs, defaults) = inspect.getargspec(f)
    if varargs is not None or len(pos) >= n:
        return True
    else:
        return False

def query(D,key,val,operator="==", keynotfound=None):
    '''
    D:  a list of dictionaries
    key:  the key to query
    val:  the value
    operator:  "==", ">=", "in", et all, or any two-arg function
    keynotfound:  value if key is not found

    Returns elements in D such that operator(D.get(key,None), val) is true
    '''
    D = itertools.tee(D,2)[1]  # take a teed copy

    # let's let operator be any two argument callable function, *then*
    # fall back on the delegation table.
    if callable(operator):
        if not _can_take_at_least_n_args(operator,2):
            raise ValueError("operator must take at least 2 arguments")
            # alternately, we could wrap it in a lambda, like:
            #   op = lambda x, y: operator(x)
            # but then we'd have to check how many args it really wants (incl. 0!)
        op = operator
    else:
        op = operators.get(operator,None)
    if not op:
        raise ValueError, "operator must be one of %r, or a two-argument function" % operators

    def try_op(f,x,y):
        try:
            return f(x,y)
        except Exception:
            return False

    return (x for x in D if try_op(op, x.get(key,keynotfound),val))

print "2nd version"
print list(query(data,'a',1))
print list(query(data,'c',None,'!='))
at_fakaroo = lambda k,v:  "fakearoo" in k[1] # v will be irrelevant
print list(query(data, ('author','email'), None, at_fakaroo, keynotfound=('','')))

That is looking quite a bit more powerful!  It still has lots of problems:

  • ‘or’ isn’t well supported.
  • we handle all errors in the function equivalently — by eating them!  This will make it really hard to debug, since none of us writes perfect code.
  • chaining queries is doable via nesting, but it’s ugly (see below).
  • relies on the dictionary interface
  • awkward to peer inside nested components
  • doesn’t handle attribute lookup easily (but could be modified to, using getattr http://docs.python.org/library/functions.html#getattr)
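
For the record, here is the nested-call chaining that the list above complains about, built only from the second-version query defined earlier; it works, but it reads inside-out:

# c > 8, then "other" in e, by nesting one query inside another
nested = list(query(query(data, 'c', 8, '>'), 'e', 'other', 'in'))
print nested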

Let’s try to make a “Queryable” object that chains operations via method calls (something like
SQLAlchemy’s generative selects, http://www.sqlalchemy.org/docs/05/sqlexpression.html#intro-to-generative-selects-and-transformations):

class Queryable(object):
    def __init__(self,D):
        self.D = itertools.tee(D,2)[1]

    def tolist(self):
        return list(itertools.tee(self.D,2)[1])

    def query(self,*args,**kwargs):
        return Queryable(query(self.D,*args,**kwargs))

    q = query

print "3rd version, Queryable"
# c > 8, then "other" in e
Q = Queryable(data).q('c',8,'>')
print Q.tolist()
Q = Q.q('e', 'other', 'in')
print Q.tolist()

This is OKAY, but it still has plenty of code smell.

  • lots of tee madness
  • ugly “tolist” method
  • we’re the query optimizer… we’re guaranteed that at least one pass will be O(n), since there is no indexing, and no smarts at all in the querying.

Next steps / alternatives:

Knowing when to give up!

Like any domain-specific language, Quibble (as written here) walks a very fine line between functionality and complexity (okay, it stumbles over the line drunkenly, but not by too much!).  If we need much more complexity in our queries (or object model), then we’re back to writing Python, and investigating a proper solution (SQL, Mongo, etc.) is probably worthwhile.  For a simple reporting language, or debugging, or a simple command line interface, this might be plenty.

Happy Yule!

Notes:

1. Not true, I hate it.

2. Unless it’s super complex to query, involves lots of joins, or the query optimizer is off drunk at the pub, or stars are poorly aligned.

3. Quibble — from Query Bibble, Bibble being an ancient Etruscan word for a teething ring.

4. Well, actually, not lucky at all.  Like most scientific papers, this article pretends that inquiry is orderly.  I knew that I wanted to talk about the operator module, and most of the functions in operator take this form, so it seems like a sensible first-approximation convention.


Installing PlPython (Postgres 8.1 on Centos 4)

createlang kept failing for me with the error shown below (PG 8.1 on CentOS 4, from when dinosaurs walked).  I tried this:

$ sudo yum install postgresql-python.x86_64

But this wasn’t enough to get createlang going.

$ sudo -u postgres createlang plpythonu mydb
Password:
createlang: language installation failed: ERROR:  could not access file "$libdir/plpython": No such file or directory

It turns out that there is a non-obvious dependency:

$ sudo yum install postgresql-python.x86_64 postgresql-pl.x86_64

$ sudo -u postgres createlang --echo plpythonu test3
SELECT oid FROM pg_catalog.pg_language WHERE lanname = 'plpythonu';
CREATE LANGUAGE "plpythonu";

Thus, postgresql-pl.x86_64 is a sooper sekrit dependency.

Good luck!

(ps.:  createlang --echo is useful)


Baby Steps into HBase

Today, after reading (the amazing and invaluable!) Understanding HBase and BigTable, while researching schemas for Google App Engine, I took my first tentative steps into using HBase.  About HBase:

HBase is the Hadoop database. Its (sic) an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.

HBase’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Try it if your plans for a data store run to big.

Well, my plans don’t run to big, but they do run to indexed over time.  Since every cell in an HBase table has a timestamp, it makes it really easy to snapshot data over time, and “rollback” a query as though it was asked at any point in the past.   For data that changes rarely over time, but for which one wants a historical record, this might make querying with history much simpler.

Historical Data Example

Think about how an organization changes over time.  Employees enter and leave, and business units might be bought and sold.  One approach to modeling this is to take a snapshot every day and store it in an RDBMS.  The snapshots will hold a lot of redundant information, since an org doesn’t really change very much.

A simpler model is to enter a new snapshot of the organization only when it changes, essentially overwriting the previous configuration.  Since HBase automatically labels cells with a timestamp, the history comes for free.
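
A rough sketch of what that could look like from Python, assuming the happybase client and a hypothetical ‘org’ table whose column family keeps multiple versions (the client library and the table/column names are my own assumptions, not something from the original setup):

import happybase

connection = happybase.Connection('localhost')   # assumes a local HBase Thrift server
table = connection.table('org')                  # hypothetical table with column family 'cf'

# write a new snapshot only when the org actually changes;
# HBase stamps the cell with the current time automatically
table.put('engineering', {'cf:members': 'alice,bob'})

# later, read back older versions of the same cell (how many are kept
# depends on the column family's VERSIONS setting)
for value, ts in table.cells('engineering', 'cf:members', versions=5, include_timestamp=True):
    print ts, value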

Setting it up

Using Ole-Martin Mørk’s instructions was a breeze!  Even though I know almost nothing about Java and the Java environment, I managed it.  I followed them, with these modifications:

  1. After downloading, unzipping, and symbolically linking to ~hbase, I version control the whole thing ($ git init; git-add *; git ci -m "initial checkin, as unpacked from source"), so that if I foul up anything, I can easily revert!
  2. Edit ~hbase/conf/hbase-env.sh to have the right “JAVA_HOME” which for me (Debian) is  -> export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

Since I don’t have passwordless ssh set up to localhost, I get this error:

~/hbase$ ~/hbase/bin/start-hbase.sh
localhost: ssh: connect to host localhost port 22: Connection refused

The rest of the example seems to run fine though, and I’m in no mood to really track this down, since I’m still in the experiment phase.

Future Steps

I’m not sure whether I’ll be going any deeper anytime soon, since I have a lot of SqlAlchemy code built around handling these sorts of ‘historical’ queries (where inserting and updating are the real difficulties!), but I do like the idea of easily versioned, map-like data stores quite a bit.