Release Announcement: Duck 0.1, a ‘simple data’ and JSON verifier, written in Python.

As a Boxing Day present to myself, I have released version 0.1 of Duck, a ‘simple data’ and JSON verifier. Extending the metaphor of duck typing, I often find it useful to check whether a ‘simple data’ structure (one made of dicts, lists, ints, floats, strings, booleans and the like) is ‘close enough’ to what I need.

Use cases:

  1. checking JSON (using duck.jsonduck)
  2. simple guards on kwargs or particular function arguments
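
To make ‘close enough’ concrete, here is a hand-rolled sketch of the general idea. This is not Duck’s actual API (see the repo for that), just an illustration of structural checking on simple data:

# NOT Duck's actual API: a hand-rolled illustration of checking
# whether simple data is 'close enough' to a spec
def quacks_like(data, spec):
    if isinstance(spec, type):           # e.g. int, str
        return isinstance(data, spec)
    if isinstance(spec, dict):           # every spec key must match
        return (isinstance(data, dict) and
                all(k in data and quacks_like(data[k], v)
                    for k, v in spec.items()))
    if isinstance(spec, list):           # spec[0] describes elements
        return (isinstance(data, list) and
                all(quacks_like(item, spec[0]) for item in data))
    return data == spec                  # literal match

assert quacks_like({'name': 'duck', 'legs': 2},
                   {'name': str, 'legs': int})
assert not quacks_like({'name': 'duck'},
                       {'name': str, 'legs': int})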

Project is at: https://github.com/gregglind/duck

I could also use help with it, since I haven’t done a public release of anything packaged in a long time! In particular, I could use advice on:

Housekeeping:
————————-

  • pypi / pip setup?
  • license?
  • where the code should *really* live
  • integrating test stuff into setup.py

Look and Feel:
—————————

  • does what I am doing even make sense?
  • is it *too magical*?

The Hassle of Unicode (and Getting On With It in Python)

Let’s face facts.

  1. Unicode is a hassle [1]
  2. Not using unicode is a hassle, especially if you speak one of those “weirdo” languages and, *gasp*, you want to read text *in your own language*.

I was faced with a simple task: take some text, process it, and print out some results (in JSON). This should be trivial, and in a world where programming had been invented by a multinational consortium and designed from day one to be compatible with all text, maybe it would be. Instead, we have a world with a rich history of mutual incompatibility.

Find out how I solved this in Python
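
The full write-up is in that post; the gist is to decode bytes to unicode at the boundary, do the processing on unicode, and encode again on the way out. A minimal sketch (assuming UTF-8 input and Python 2.x; input.txt and the word count are made up for illustration):

# -*- coding: utf-8 -*-
# minimal sketch: decode at the boundary, process unicode,
# encode on the way out (input.txt is a made-up example)
import codecs
try:
    import json
except ImportError:
    import simplejson as json

f = codecs.open('input.txt', 'r', encoding='utf-8')
text = f.read()        # unicode from here on
f.close()

words = text.split()   # stand-in for the real processing

result = json.dumps({'word_count': len(words)}, ensure_ascii=False)
print result.encode('utf-8')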


Simple “object-db” using JSON and python-sqlite

As part of a much larger project, I have a group of “snapshots” of a complicated data structure. I need to save these in a persistent way and continue to have access to them when needed. My solution is to output the snapshots as JSON and store them in a sqlite database*, where they will persist on disk as “jlobs” (JSON large objects).

This “sqlite as object-db” approach has several advantages:

  1. atomic transactions
  2. easy database replication
  3. the jlob can change format without affecting the schema
  4. very light runtime requirements

Building off the sqlite3 manual, it is easy to see how to extract the JSON back *out* of the database.

There are drawbacks to this approach, of course:

  1. you’re responsible for building and maintaining tables that index any queryable elements of your jlob, if you want to access them using SQL (see the sketch after the example code below)
  2. SQL normalization purists will throw up when they look at your schema

(*Note: if you are on CentOS 5 and do not have access to Python 2.5, make sure that you install python-sqlite2, for example from one of these rpms, rather than updating your python-sqlite in place. BAD THINGS WILL HAPPEN, including breaking yum.)

#!/usr/bin/python
import sys
if sys.version_info >= (2, 5):
    import sqlite3
else:
    from pysqlite2 import dbapi2 as sqlite3

try:
    import json
except ImportError:
    import simplejson as json

# any column declared with type "json" is decoded on the way out
sqlite3.register_converter("json", json.loads)

conn = sqlite3.connect(":memory:",
    detect_types=sqlite3.PARSE_DECLTYPES | sqlite3.PARSE_COLNAMES)
c = conn.cursor()
c.row_factory = sqlite3.Row  # fields by name
d = conn.cursor()            # normal tuple rows

json_string = json.dumps(dict(a=1, b=[1, 2, 3]))
conn.execute('''
    create table snapshot(
          id INTEGER PRIMARY KEY AUTOINCREMENT,
          mydata json);
    ''')
conn.execute('''
    insert into snapshot values
       (null, ?)''', (json_string,))

# all three reads run the converter, whatever the row factory
R1 = c.execute("select * from snapshot").fetchone()['mydata']
R2 = d.execute("select * from snapshot").fetchone()[1]
R3 = conn.execute("select * from snapshot").fetchone()[1]

assert R1 == R2 == R3 == {'a': 1, 'b': [1, 2, 3]}, "all should be equal"
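
To make drawback 1 concrete, here is a hedged sketch (continuing the session above; the snapshot_index table and its columns are illustrative, not from my actual project) of the bookkeeping you take on: a side table indexing one queryable field of the jlob, kept in sync by hand.

# a hand-maintained side table indexing the 'a' field of each jlob;
# names are illustrative, and you must update it on every write
conn.execute('''
    create table snapshot_index(
          snapshot_id INTEGER REFERENCES snapshot(id),
          a INTEGER);
    ''')

snap = dict(a=2, b=['x', 'y'])
cur = conn.execute('insert into snapshot values (null, ?)',
                   (json.dumps(snap),))
conn.execute('insert into snapshot_index values (?, ?)',
             (cur.lastrowid, snap['a']))

# now the field is queryable in SQL without parsing every jlob
row = conn.execute('''
    select s.mydata from snapshot s
      join snapshot_index i on i.snapshot_id = s.id
     where i.a = ?''', (2,)).fetchone()
assert row[0] == snap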


Quick and Dirty JSON Speed Testing in Python

[See updated post for analysis using timeit]

As per Poromenos’ request on Reddit, I decided to do a bit of expansion on my cryptic comment about the major JSON packages in Python (simplejson, cjson, demjson):

My conclusion: use demjson if you really, really want to make sure everything is right and you don’t care at all about time. Use simplejson if you’re in the 99% of users who want reasonable performance over a broad range of objects. And use the enhanced cjson 1.0.3x if you’re in the camp with reasonable JSON inputs and you need much faster (10x) speed… that is, if the JSON step is actually the bottleneck.

More worrisome: demjson didn’t handle the unicode string I threw at it properly…
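
For flavor, here is a minimal sketch of the quick-and-dirty approach (this is not the original benchmark code; it assumes simplejson and cjson are installed and just wall-clocks N encode/decode round-trips):

# quick-and-dirty timing sketch, NOT the original benchmark code;
# note cjson uses encode/decode rather than dumps/loads
import time
import simplejson
import cjson

SAMPLE = {'a': 1, 'b': [1, 2, 3] * 100, 'c': u'caf\xe9'}
N = 10000

def bench(name, encode, decode):
    start = time.time()
    for _ in xrange(N):
        decode(encode(SAMPLE))
    print '%-12s %.3fs' % (name, time.time() - start)

bench('simplejson', simplejson.dumps, simplejson.loads)
bench('cjson', cjson.encode, cjson.decode)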

benchmark code and more in-depth discussion