Updated Bash IPython Alias That Works for 0.10 and 0.11.

In July 2011, IPython released version 0.11, which changes/breaks the embedded shell API.

https://github.com/ipython/ipython/issues/269

http://ipython.org/ipython-doc/stable/interactive/reference.html#embedding

I use this bash alias to get it to work right under both versions (it tries 0.11 first, then falls back to 0.10):

alias ipython='python -c "from IPython import embed; embed()" 2>/dev/null || python -c "import IPython; IPython.Shell.IPShell().mainloop()"'
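
The same fallback works from inside a Python script, roughly (a sketch mirroring the alias: try the 0.11 API, fall back to the 0.10 call on ImportError):

try:
    from IPython import embed          # 0.11 and later
    embed()
except ImportError:
    import IPython                     # 0.10
    IPython.Shell.IPShell().mainloop()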


Release Announcement: Duck 0.1, a ‘simple data’ and JSON Verifier, written in Python.

As a Boxing Day present to myself, I have released version 0.1 of Duck, a ‘simple data’ and JSON verifier. Extending the metaphor of duck typing, I often find it useful to see if a ‘simple data’ structure (one made of dicts, lists, ints, floats, strings, booleans and the like) is ‘close enough’ to what I need to be useful.

Use cases:

  1. checking json (using duck.jsonduck)
  2. simple guards on kwargs, or particular function arguments

Project is at: https://github.com/gregglind/duck
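
To give the flavor, here is a tiny hypothetical sketch of the kind of ‘close enough’ check I mean (illustration only; see the project for Duck’s actual interface):

# hypothetical illustration of 'close enough' checking; not Duck's real API
def close_enough(data, spec):
    '''True if `data` roughly matches `spec`, where spec values are
    types or nested dicts of types.'''
    if isinstance(spec, dict):
        return (isinstance(data, dict) and
                all(k in data and close_enough(data[k], v)
                    for k, v in spec.items()))
    if isinstance(spec, type):
        return isinstance(data, spec)
    return data == spec

print close_enough({'name': 'duck', 'tags': [1, 2]},
                   {'name': str, 'tags': list})   # True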

I could also use help with it, since I haven’t done a public release of anything packaged in a long time! In particular, I could use advice on:

Housekeeping:
————————-

  • pypi / pip setup?
  • license?
  • where the code should *really* live
  • integrating test stuff into setup.py

Look and Feel:
—————————

  • does what I am doing even make sense?
  • is it *too magical*?

Quibble, a Damn Small Query Language (DSQL) Using Python

This intermediate-level article will demonstrate how to use the filter idiom, delegation tables, list comprehensions, generator expressions, and the operator module to create a compact but expandable query language for querying data.

When many people hear the word ‘query’, their minds jump to Structured Query Language (SQL).  Now I love SQL as much as anyone[1].  Using SQL for queries is wonderful when one’s data is already loaded into a SQL database[2].  Sometimes the Real World (TM) conspires against this, since:

  • the data might be heterogeneous
  • the data might be easy to express in Python terms, but tedious to refactor into a normalized form.  As a quick example, consider a dict of sets, which would require a join and a foreign key and actual *gasp* schema design.
  • one might not have access to a database (though with SQLite being embedded in Python from 2.5 onward, this is less of an issue)
  • one might have irrational biases against schemas and the straitjacketing they impose on agile development and programmer whimsy.  I suffer from this bias myself, and attend regular SQL indoctrination meetings, but so far it’s not sticking!  NoSQL Forever!
  • SQL is enterprisey, but not Web2.0, man!

That said, SQL has lots of advantages:

  • Extremely flexible, complex querying
  • Widely deployed
  • (etc, etc.)

Let’s begin by building a list of dictionaries to query against.  These could be any list of objects that support a dictionary interface.  Note that these objects are heterogeneous.  Also note they are quite contrived, and rather boring.

# a list of dicts to query against
data = [
    dict(a=None, b=1, c=[1,2,3]),
    dict(a=13, d=dict(a=1,b=2)),
    dict(c=13,e="some string"),
    dict(c=10,e="some other string"),
    dict(a=10,e="some other string"),
    {('author','email'): ('Gregg Lind','gregg.lind at fakearoo.com')},
]

Now that we have some data, we’re going to build a simple query language called Quibble [3] to search against it.  We will be using the filter/pipeline idiom.  The filter idiom is quite simple:  if an object matches some condition, keep it; otherwise, skip it and continue on.  On Unix, this is a very simple type of pipeline; when one wants venture capital, call it “map-reduce”.  While Python has a filter function (http://docs.python.org/library/functions.html#filter), a list comprehension will be quite a bit simpler to use for our dumb purposes.

Next we will build a delegation table.  This simple mapping maps shorthand names like “<=” to functions.  When people talk about the power of ‘functions as first-class objects’, this is part of what they’re on about: we can map shorthand names to *unevaluated functions* and call them only when needed.

To make our lives easier, Quibble will use a simple convention for defining what is a valid query operator.  An ‘operator function’ must take exactly two arguments, following this format:

     my_operator(some_dict[key], value)

Luckily for us [4], the functions in the Python operator module http://docs.python.org/library/operator.html mostly take this form.  Having this same calling convention will make it possible to just drop the ‘right’ function in.

import operator
operators = {
    "<" : operator.lt,
    "<=" : operator.le,
    "==" : operator.eq,
    "!=" : operator.ne,
    ">=" : operator.ge,
    ">"  : operator.gt,
    "in" : operator.contains,
    "nin" : lambda x,y: not operator.contains(x,y),
}
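
As a quick sanity check (and a preview of the calling convention the query function below will use), looking up and applying an entry works like this:

op = operators["<"]
print op(3, 5)                       # True, same as operator.lt(3, 5)
print operators["in"]([1,2,3], 2)    # True, same as 2 in [1,2,3]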

Note that with ‘nin’, we had to wrap it.  Python’s lambda expressions make this easy, and the resulting code is still easy to read.  We could also use a true named function here, like this:

def nin_(x,y):
    return x not in y

Or a simpler lambda:

"nin" :  lambda x,y:  x not in y,
def query(D,key,val,operator="=="):
    '''
    D:  a dictionary
    key:  the key to query
    val:  the value
    operator:  "==", ">=", "in", et all.

    Returns elements in D such that operator(D.get(key,None), val) is true
    '''
    try:
        op = operators[operator]
    except KeyError:
        raise ValueError, "operator must be one of %r" % operators
    return [x for x in D if op(x.get(key,None),val)]

print "1st version"
print query(data,'a',1)
print query(data,'c',None,'!=')

Excellent.  Time to retire to a private island.  Oh wait, you want to define new functions?  Chain these queries together?  It should handle exceptions?  We can fix those.

A more fundamental problem with this filter approach is that defining “or” conditions is quite awkward, since filters reduce the input set at each stage, but we will clean this up as well (but it will be ugly).
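
To see the awkwardness concretely, the closest thing to an ‘or’ at this point is to fall back to a raw comprehension over the original data and apply both tests by hand (a quick sketch using the delegation table directly):

# rows where a == 13 OR c == 13, bypassing query() entirely
either = [x for x in data
          if operators["=="](x.get('a', None), 13)
          or operators["=="](x.get('c', None), 13)]
print either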

Let’s add some functionality.

  • let operator be any two-argument function
  • return an iterator instead of a list
  • tee the original input, in case it is itself an iterator we don’t want to exhaust
  • add a keynotfound argument, to change what happens if the key isn't found in the dict

import itertools
import inspect
def _can_take_at_least_n_args(f,n=2):
    ''' helper to check that a function can take at least n positional args'''
    (pos, varargs, varkw, defaults) = inspect.getargspec(f)
    if varargs is not None or len(pos) >= n:
        return True
    else:
        return False
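
# quick sanity check of the helper, with three throwaway lambdas
# (not part of query(); safe to delete):
#   _can_take_at_least_n_args(lambda x, y: x)      -> True
#   _can_take_at_least_n_args(lambda x: x)         -> False
#   _can_take_at_least_n_args(lambda *args: args)  -> True (*args absorbs anything)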

def query(D,key,val,operator="==", keynotfound=None):
    '''
    D:  a list of dictionaries
    key:  the key to query
    val:  the value
    operator:  "==", ">=", "in", et all, or any two-arg function
    keynotfound:  value if key is not found

    Returns elements in D such that operator(D.get(key,None), val) is true
    '''
    D = itertools.tee(D,2)[1]  # take a teed copy

    # let's let operator be any two argument callable function, *then*
    # fall back on the delegation table.
    if callable(operator):
        if not _can_take_at_least_n_args(operator,2):
            raise ValueError("operator must take at least 2 arguments")
            # alternately, we could wrap a one-argument callable in a lambda, like:
            # op = lambda x,y: operator(x)
            # but then we'd have to check how many args it really wants (inc. 0!)
        op = operator
    else:
        op = operators.get(operator,None)
    if not op:
        raise ValueError, "operator must be one of %r, or a two-argument function" % operators

    def try_op(f,x,y):
        try:
            return f(x,y)
        except Exception, exc:
            return False  # eat any error and treat it as "no match"

    return (x for x in D if try_op(op, x.get(key,keynotfound),val))

print "2nd version"
print list(query(data,'a',1))
print list(query(data,'c',None,'!='))
at_fakearoo = lambda k,v:  "fakearoo" in k[1] # v will be irrelevant
print list(query(data, ('author','email'), None, at_fakearoo, keynotfound=('','')))

That is looking quite a bit more powerful!  It still has lots of problems:

  • ‘or’ isn’t well supported.
  • we handle all errors in the function equivalently — by eating them!  This will make it really hard to debug, since none of us writes perfect code.
  • chaining queries is doable via nesting, but it’s ugly (see below).
  • relies on the dictionary interface
  • awkward to peer inside nested components
  • doesn’t handle attribute lookup easily (but could be modified to, using getattr http://docs.python.org/library/functions.html#getattr)
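
Here is the nesting mentioned above: since query() returns an iterator over the same sort of dictionaries it consumes, its output can be fed straight back in, at the cost of reading inside-out:

# c > 8 AND "other" in e, by nesting one query inside another
print list(query(query(data, 'c', 8, '>'), 'e', 'other', 'in'))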

Let’s try to make a “Queryable” object that chains operations via method calls (something like
SQLAlchemy generative selects http://www.sqlalchemy.org/docs/05/sqlexpression.html#intro-to-generative-selects-and-transformations):

class Queryable(object):
    def __init__(self,D):
        self.D = itertools.tee(D,2)[1]

    def tolist(self):
        return list(itertools.tee(self.D,2)[1])

    def query(self,*args,**kwargs):
        return Queryable(query(self.D,*args,**kwargs))

    q = query

print "3rd version, Queryable"
# c > 8 and "other" in e
Q = Queryable(data).q('c',8,'>')
print Q.tolist()
Q = Q.q('e', 'other', 'in')
print Q.tolist()

This is okay, but it still has plenty of code smell.

  • lots of tee madness
  • ugly “tolist” method
  • we’re the query optimizer… every query is guaranteed to take at least one O(n) pass, since there is no indexing, and no smarts at all in the querying.

Next steps / alternatives:

Knowing when to give up!

Like any domain specific language, Quibble (as written here) walks a very fine line between functionality and complexity (okay, it stumbles over the line drunkenly, but not by too much!).  If we need much more complexity in our queries (or object model), then we’re back to writing Python, and investigating a proper solution (SQL, Mongo, etc.) is probably worthwhile!  For a simple reporting language, or debugging, or a simple command-line interface, this might be plenty.

Happy Yule!

Notes:

1. Not true, I hate it.

2. Unless it’s super complex to query, involves lots of joins, or the query optimizer is off drunk at the pub, or stars are poorly aligned.

3. Quibble — from Query Bibble, Bibble being an ancient Etruscan word for a teething ring.

4. Well, actually, not lucky at all.  Like most scientific papers, this article pretends that inquiry is orderly.  I knew that I wanted to talk about the operator module, and most of the functions in operator take this form, so it seems like a sensible first-approximation convention.


Two Simple Tips to Speed up Python Time Parsing

  1. Sometimes, date parsing and formatting in Python take a long time.  It can be worth writing custom date-string converters that sacrifice generality for speed.
  2. Another oddity:  setting the timezone by force can speed up code as well, like this: os.environ['TZ'] = 'GMT'

Both tips are demo’d and tested in the code snippet below.

import os
import time

def _convert_date(string, year=None):
    ''' take a log string, turn it into time epoch, tuple, string

    >>> _convert_date('Aug 19 13:45:01',2009)
    (1250689501, (2009, 8, 19, 13, 45, 1, 2, 231, 0), 'Aug 19 13:45:01')
    '''
    if year is None:  year = time.gmtime()[0]

    # strptime-based version; this profiled ~4x slower than _convert_date2 below
    tt = list(time.strptime("%s " % year + string, "%Y %b %d %H:%M:%S"))
    tt[-1] = 0 # turn off timezone
    tt = tuple(tt)
    ts = int(time.mktime(tt))
    return (ts,tt,string)

_months = dict(jan=1,feb=2,mar=3,apr=4,may=5,jun=6,jul=7,aug=8,sep=9,oct=10,nov=11,dec=12)
def _convert_date2(string, year=None):
    ''' take a log string, turn it into time epoch, tuple, string

    >>> _convert_date2('Aug 19 13:45:01',2009)
    (1250689501, (2009, 8, 19, 13, 45, 1, 2, 231, 0), 'Aug 19 13:45:01')
    '''
    if year is None:  year = time.gmtime()[0]

    # was strptime, but that profiled ~4x slower:
    #tt = list(time.strptime("%s " % year + string, "%Y %b %d %H:%M:%S"))
    mon,d,t = string.split()
    h,m,s = t.split(":")
    mon = _months[mon.lower()]
    tt = [year, mon,d,h,m,s,0,0,0]
    tt = tuple([int(v) for v in tt])
    ts = int(time.mktime(tt))
    tt = time.gmtime(ts)
    return (ts,tt,string)

assert _convert_date('Aug 19 13:45:01',2009) == _convert_date2('Aug 19 13:45:01',2009)

# %timeit is an ipython macro that is like timeit.Timer with brains:
# it figures out how many loops to run heuristically.
# key fact:  a microsecond is 1000 nanoseconds

timeit _convert_date('Aug 19 13:45:01',2009)
timeit _convert_date2('Aug 19 13:45:01',2009)
os.environ['TZ'] = 'GMT'
timeit _convert_date('Aug 19 13:45:01',2009)
timeit _convert_date2('Aug 19 13:45:01',2009)

Results (Python 2.4.3 on x64 Linux):

timeit _convert_date('Aug 19 13:45:01',2009)
10000 loops, best of 3: 62 µs per loop

In [11]: timeit _convert_date2('Aug 19 13:45:01',2009)
10000 loops, best of 3: 18.3 µs per loop

In [12]: os.environ['TZ'] = 'GMT'

In [13]: timeit _convert_date('Aug 19 13:45:01',2009)
10000 loops, best of 3: 60.2 µs per loop

In [14]: timeit _convert_date2('Aug 19 13:45:01',2009)
100000 loops, best of 3: 13.3 µs per loop

The Win Factor:

  • custom parser:  ~300% (62 µs → 18.3 µs per call)
  • setting TZ:  another ~20% (18.3 µs → 13.3 µs for the custom parser)

Feedback and additional speedup improvements welcome.

(Thanks to Jon Nelson of the Pycurious Blog for the TZ idea.)


Deepcopy is a Pig (For Simple Data)

“Being this easy ain’t cheap.”  “There’s no such thing as a free lunch.”

We’ve all heard these tropes before, right?  Sometimes, without testing, it’s hard to see exactly how much that lunch costs.  This week’s example:  Python’s copy.deepcopy.

I tend to fancy myself as using a lot of functional programming techniques in my code, and as part of that, I try to avoid modifying data by side-effect.  Deepcopy makes it easy to copy the original structure, modify the copy, and return it.  After some profiling and timing work, I saw that, of all things, deepcopy was the bottleneck!

Sure, it’s bulletproof, battle-tested, and designed to do the Right Thing ™ in almost every case!  But for simple data structures, it can be overkill, since it does so much accounting, reference tracking, and the like.

Most of the data I see in my day job has simple formats: mainly dictionaries of lists, sets, strings, tuples, and integers.  These are the basic Python types we know and love, easily representable (in plain text, HTML, tables), and easy to munge and transmit (using JSON or the like).  In short, they’re nice to work with, and transparent.

As it turns out, when we control the input data, we don’t need to worry as much about robustness.  Sure, the code below for “deepish_copy” doesn’t handle classes, nested iterables, generators, or even nesting to arbitrary depth.  But it runs fast, as the speed results below show.

import timeit
from copy import deepcopy

def deepish_copy(org):
    '''
    much, much faster than deepcopy, for a dict of the simple python types.
    '''
    out = dict().fromkeys(org)
    for k,v in org.iteritems():
        try:
            out[k] = v.copy()   # dicts, sets
        except AttributeError:
            try:
                out[k] = v[:]   # lists, tuples, strings, unicode
            except TypeError:
                out[k] = v      # ints

    return out

def test_deepish_copy():
    o1 = dict(name = u"blah", id=1, att0 = (1,2,3), att1 = range(10), att2 = set(range(10)))
    o2 = deepish_copy(o1)
    assert o2 == o1, "not equal, but should be"
    del o2['att1'][-1]
    assert o2 != o1, "are equal, shouldn't be"

#prun for ii in xrange(1000):  o2 = deepcopy(o1)
#prun for ii in xrange(1000):  o2 = deepish_copy(o1)

o1 = dict(name = u"blah", id=1, att0 = (1,2,3), att1 = range(10), att2 = set(range(10)))

a = timeit.Timer("o2 = deepish_copy(o1)","from __main__ import deepish_copy,o1")
b = timeit.Timer("o2 = deepcopy(o1)","from __main__ import deepcopy,o1")

# 64-bit Linux, 1 GHz chip, Python 2.4.3
a.repeat(3,number=20000)
# [0.45441699028015137, 0.41893100738525391, 0.46757102012634277]
b.repeat(3,number=20000)
# [2.5441901683807373, 2.5316669940948486, 2.4751369953155518]

Using the custom-written code speeds things up quite a bit (5-fold!).  For me, where this copying *was* the bottleneck, and I have to iterate over hundreds of thousands of these things, it made a noticeable difference in total run time.  The ten minutes it took to write this code were worth it.  So was profiling (using IPython’s simple %prun macro).

As always, to end with another cliché:  your mileage may vary… but if you’re not relying on the car manufacturer to design an engine for exactly your needs, you can probably improve on it.


Adventures in Nose: nosetests and unbuffered stdout.

In some of our server code, we like to ensure we get unbuffered output, as in Perl.  In Python, this is easy to do:


import sys
import os

sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)

However, the wonderful nosetests will barf on this code, because it reassigns sys.stdout to a cStringIO buffer in order to capture output.

This is a workaround:


import sys
import os

try:   # get unbuffered output
    sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
except AttributeError, exc: # under nose, sys.stdout is reassigned to a string buffer
    pass

def test():
    assert 1

Alternatively, run nosetests -s, which disables the output capture feature.

Cf:  my embarrassing bug report over at nose


Simple “object-db” using JSON and python-sqlite

As part of a much larger project, I have a group of “snapshots” of a complicated data structure.  I need to save these in a persistent way, and continue to have access to them when needed.  My solution is to output the snapshots as JSON and store them in a SQLite database*, where they will be persistent on disk as “jlobs” (JSON large objects).

This “sqlite as object-db” has several advantages:

  1. atomic transactions,
  2. easy database replication,
  3. jlob can easily change format without affecting schema
  4. very light runtime requirements.

Building off of the sqlite3 manual, it is easy to see how to extract the JSON back *out* of the database.

There are  drawbacks to this approach, of course:

  1. you’re responsible for building and maintaining tables indexing any queryable elements of your jlob, if you want to be able to access them using SQL (see the sketch after the example below).
  2. sql normalization purists will throw up when they look at your schema

(*Note: if you are on CentOS 5 and do not have access to Python 2.5, make sure that you install python-sqlite2 (for example, from one of these rpms) rather than updating your python-sqlite in place.  BAD THINGS WILL HAPPEN, including breaking yum.)

#!/usr/bin/python
import sys
if sys.version_info >= (2,5):
    import sqlite3
else:
    from pysqlite2 import dbapi2 as sqlite3

try:
    import json
except ImportError:
    import simplejson as json

sqlite3.register_converter("json", json.loads)

conn = sqlite3.connect(":memory:",
    detect_types=sqlite3.PARSE_DECLTYPES|sqlite3.PARSE_COLNAMES)
c = conn.cursor()
c.row_factory = sqlite3.Row  # fields by name
d = conn.cursor()  # normal row

json_string = json.dumps( dict(a=1,b=[1,2,3]))
conn.execute('''
    create table snapshot(
          id INTEGER PRIMARY KEY AUTOINCREMENT,
          mydata json);
    ''')
conn.execute('''
    insert into snapshot values
       (null, ?)''', (json_string,))

R1 = c.execute("select * from snapshot").fetchone()['mydata']
R2 = d.execute("select * from snapshot").fetchone()[1]
R3 = conn.execute("select * from snapshot").fetchone()[1]

assert R1==R2==R3 == {'a': 1, 'b': [1, 2, 3]}, "all should be equal"
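
Addressing drawback #1 above, the usual workaround is to keep a plain, indexed column alongside the jlob for any field you want to query from SQL, and maintain it yourself at insert time.  A rough sketch, reusing the connection and converter from above:

# keep one field of the jlob queryable from SQL by duplicating it into a real column
conn.execute('''
    create table snapshot2(
          id INTEGER PRIMARY KEY AUTOINCREMENT,
          a INTEGER,      -- extracted copy of mydata['a'], kept in sync by us
          mydata json);
    ''')
conn.execute("create index snapshot2_a on snapshot2(a)")

doc = dict(a=1, b=[1,2,3])
conn.execute("insert into snapshot2 values (null, ?, ?)",
             (doc.get('a'), json.dumps(doc)))

row = conn.execute("select mydata from snapshot2 where a = ?", (1,)).fetchone()
assert row[0] == doc, "the registered json converter hands back the dict"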