Two Simple Tips to Speed up Python Time Parsing
October 12, 2009
- Sometimes, date parsing and formatting in Python takes a long time. It can be worth writing custom datestring converters to sacrifice generality for speed.
- Another oddity: setting the timezone by force can speed up code as well, like this: os.environ['TZ'] = 'GMT'
Both tips are demoed and tested in the code snippet below.
import os
import time
def _convert_date(string, year=None):
    ''' take a log string, turn it into time epoch, tuple, string
    >>> _convert_date('Aug 19 13:45:01',2009)
    (1250689501, (2009, 8, 19, 13, 45, 1, 2, 231, 0), 'Aug 19 13:45:01')
    '''
    if year is None: year = time.gmtime()[0]
    # general strptime version; this profiled roughly 3-4x slower than _convert_date2
    tt = list(time.strptime("%s " % year + string, "%Y %b %d %H:%M:%S"))
    tt[-1] = 0 # tm_isdst = 0: no daylight-saving adjustment
    tt = tuple(tt)
    ts = int(time.mktime(tt))
    return (ts,tt,string)
_months = dict(jan=1,feb=2,mar=3,apr=4,may=5,jun=6,jul=7,aug=8,sep=9,oct=10,nov=11,dec=12)
def _convert_date2(string, year=None):
    ''' take a log string, turn it into time epoch, tuple, string
    >>> _convert_date2('Aug 19 13:45:01',2009)
    (1250689501, (2009, 8, 19, 13, 45, 1, 2, 231, 0), 'Aug 19 13:45:01')
    '''
    if year is None: year = time.gmtime()[0]
    # was, but this profiled 4x slower
    #tt = list(time.strptime("%s " % year + string, "%Y %b %d %H:%M:%S"))
    mon,d,t = string.split()
    h,m,s = t.split(":")
    mon = _months[mon.lower()]
    tt = [year, mon,d,h,m,s,0,0,0]
    tt = tuple([int(v) for v in tt])
    ts = int(time.mktime(tt))
    tt = time.gmtime(ts)
    return (ts,tt,string)
assert _convert_date('Aug 19 13:45:01',2009) == _convert_date2('Aug 19 13:45:01',2009)
#%timeit is an ipython macro that is like timeit.Timer with brains!
# including figuring out how many loops to run heuristically
# key fact: a microsecond is 1000 nanoseconds
timeit _convert_date('Aug 19 13:45:01',2009)
timeit _convert_date2('Aug 19 13:45:01',2009)
os.environ['TZ'] = 'GMT'
timeit _convert_date('Aug 19 13:45:01',2009)
timeit _convert_date2('Aug 19 13:45:01',2009)
Results (Python 2.4.3 on x64 Linux):
timeit _convert_date('Aug 19 13:45:01',2009)
10000 loops, best of 3: 62 µs per loop
In [11]: timeit _convert_date2('Aug 19 13:45:01',2009)
10000 loops, best of 3: 18.3 µs per loop
In [12]: os.environ['TZ'] = 'GMT'
In [13]: timeit _convert_date('Aug 19 13:45:01',2009)
10000 loops, best of 3: 60.2 µs per loop
In [14]: timeit _convert_date2('Aug 19 13:45:01',2009)
100000 loops, best of 3: 13.3 µs per loop
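The Win Factor:
- custom parser: roughly 3.4x faster (62 µs down to 18.3 µs per call)
- setting TZ: about 27% less time for the custom parser (18.3 µs down to 13.3 µs), but only about 3% for the strptime version (62 µs down to 60.2 µs)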
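For readers without IPython, here is a rough Python 3 sketch of the same comparison using the standard timeit module. The names parse_strptime, parse_custom and MONTHS are made up for this illustration. It also swaps time.mktime for calendar.timegm, which always treats the tuple as UTC, so the numbers it prints are not directly comparable to the results above.

```python
import calendar
import time
import timeit

# Hypothetical lookup table, same idea as _months in the post.
MONTHS = dict(jan=1, feb=2, mar=3, apr=4, may=5, jun=6,
              jul=7, aug=8, sep=9, oct=10, nov=11, dec=12)

def parse_strptime(s, year):
    """General route: let time.strptime do the parsing."""
    tt = time.strptime("%d %s" % (year, s), "%Y %b %d %H:%M:%S")
    return calendar.timegm(tt)  # assumption: the log time is UTC

def parse_custom(s, year):
    """Hand-rolled route: split the fixed-format string ourselves."""
    mon, day, clock = s.split()
    h, m, sec = clock.split(":")
    tt = (year, MONTHS[mon.lower()], int(day), int(h), int(m), int(sec), 0, 0, 0)
    return calendar.timegm(tt)

sample = "Aug 19 13:45:01"
assert parse_strptime(sample, 2009) == parse_custom(sample, 2009)

# Best of 3 runs of 10000 calls each, reported per call.
for fn in (parse_strptime, parse_custom):
    best = min(timeit.repeat(lambda: fn(sample, 2009), number=10000, repeat=3))
    print("%s: %.2f us per loop" % (fn.__name__, best / 10000 * 1e6))
```

The hand-rolled parser only works because the input format is fixed; any other layout raises an error instead of being parsed.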
Feedback and additional speedup improvements welcome.
(Thanks to Jon Nelson of the Pycurious Blog for the TZ idea.)
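A small Python 3 sketch of the TZ tip, with one addition that is not in the code above: a call to time.tzset(). The Python documentation says a change to TZ should not be relied on to take effect without it. tzset only exists on Unix, hence the hasattr guard.

```python
import os
import time

# Pin the process timezone, as in the post.
os.environ["TZ"] = "GMT"
if hasattr(time, "tzset"):   # Unix only
    time.tzset()             # ask the C library to re-read TZ

# mktime interprets the tuple as local time, which is now GMT.
stamp = int(time.mktime((2009, 8, 19, 13, 45, 1, 0, 0, 0)))
print(stamp, time.tzname)
```

Pinning the timezone changes what "local time" means for the whole process, so this only suits programs whose timestamps are meant to be GMT anyway.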
Deepcopy is a Pig (For Simple Data)
May 7, 2009
“Being this easy ain’t cheap.” “There’s no such thing as a free lunch.”
We’ve all heard these tropes before, right? Sometimes, without testing, it’s hard to see exactly how much that lunch costs. This week’s example: Python’s copy.deepcopy.
I tend to fancy myself as using a lot of functional programming techniques in my code, and as part of that, I try to avoid modifying data by side-effect. Deepcopy makes it easy to copy the original structure, modify the copy, and return it. After some profiling and timing work, I saw that, of all things, deepcopy was the bottleneck!
Sure, it’s bulletproof, battle-tested, and designed to do the Right Thing ™ in almost every case! But for simple data structures, it can be overkill, since it does so much accounting, reference tracking, and the like.
Most of the data I see in my day job has simple formats: mainly dictionaries of lists, sets, strings, tuples, and integers. These are the basic Python types we know and love, easily representable (in plain text, HTML, tables), and easy to munge / transmit (using JSON or the like). In short, they’re nice to work with, and transparent.
As it turns out, when we control the input data, we don’t need to worry as much about robustness. Sure, the code below for “deepish_copy” doesn’t handle classes, nested iterables, generators, or even nesting to arbitrary depth. But it runs fast, as the speed results below show.
import timeit
from copy import deepcopy
def deepish_copy(org):
    '''
    much, much faster than deepcopy, for a dict of the simple python types.
    '''
    out = dict().fromkeys(org)
    for k,v in org.iteritems():
        try:
            out[k] = v.copy() # dicts, sets
        except AttributeError:
            try:
                out[k] = v[:] # lists, tuples, strings, unicode
            except TypeError:
                out[k] = v # ints
    return out
def test_deepish_copy():
    o1 = dict(name = u"blah", id=1, att0 = (1,2,3), att1 = range(10), att2 = set(range(10)))
    o2 = deepish_copy(o1)
    assert o2 == o1, "not equal, but should be"
    del o2['att1'][-1]
    assert o2 != o1, "are equal, shouldn't be"
#prun for ii in xrange(1000): o2 = deepcopy(o1)
#prun for ii in xrange(1000): o2 = deepish_copy(o1)
o1 = dict(name = u"blah", id=1, att0 = (1,2,3), att1 = range(10), att2 = set(range(10)))
a = timeit.Timer("o2 = deepish_copy(o1)","from __main__ import deepish_copy,o1")
b = timeit.Timer("o2 = deepcopy(o1)","from __main__ import deepcopy,o1")
# 64-bit linux, 1 GHz chip, python 2.4.3
a.repeat(3,number=20000)
# [0.45441699028015137, 0.41893100738525391, 0.46757102012634277]
b.repeat(3,number=20000)
# [2.5441901683807373, 2.5316669940948486, 2.4751369953155518]
Using the custom-written code speeds things up quite a bit (more than 5-fold!). For me, where this copying *was* the bottleneck, and I have to iterate over hundreds of thousands of these things, it made a noticeable difference in total run time. Taking the 10 minutes it took to write this code was worth it. So was profiling (using ipython’s simple %prun macro).
As always, to end with another cliche: your mileage may vary… but if you’re not relying on the car manufacturers to design an engine for exactly your needs, you can probably improve it.
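For anyone on Python 3, a minimal sketch of the same idea follows. It assumes the same kind of input (a dict whose values are simple built-in types). One difference from the Python 2 code: lists now have a .copy() method, so they take the first branch rather than the slice.

```python
import timeit
from copy import deepcopy

def deepish_copy(org):
    """Copy a dict one level deep; values must be simple built-in types."""
    out = {}
    for k, v in org.items():
        try:
            out[k] = v.copy()      # dicts, sets, and (in Python 3) lists
        except AttributeError:
            try:
                out[k] = v[:]      # tuples, strings
            except TypeError:
                out[k] = v         # ints and other non-sliceable scalars
    return out

o1 = dict(name="blah", id=1, att0=(1, 2, 3),
          att1=list(range(10)), att2=set(range(10)))
o2 = deepish_copy(o1)
assert o2 == o1
del o2["att1"][-1]                 # mutating the copy...
assert len(o1["att1"]) == 10       # ...leaves the original alone

fast = min(timeit.repeat(lambda: deepish_copy(o1), number=20000, repeat=3))
slow = min(timeit.repeat(lambda: deepcopy(o1), number=20000, repeat=3))
print("deepish_copy: %.3fs  deepcopy: %.3fs" % (fast, slow))
```

The copy is only one level deep: a list inside a list is still shared between the original and the copy, which is exactly the generality being traded away.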
Adventures in Nose: nosetests and unbuffered stdout.
March 6, 2009
In some of our server code, we like to ensure we get unbuffered output, like in perl. In Python, this is easy to do:
import os
import sys
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
However, the wonderful nosetests will barf on this code, because it reassigns stdout to a cStringIO for capturing, and a cStringIO has no fileno().
This is a workaround:
import sys
import os
try:
    # get unbuffered output
    sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
except AttributeError, exc:
    # under nose, sys.stdout is reassigned to a string buffer
    pass
def test():
    assert 1
Alternately, run nosetests -s, which disables the output capture feature.
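The snippets above are Python 2. Python 3 refuses to open a text-mode file with buffering set to 0, so the closest equivalent I know of is a write-through text wrapper over an unbuffered binary file. This is a sketch under that assumption, not the approach from the post; make_unbuffered is a hypothetical helper name.

```python
import io
import os

def make_unbuffered(stream):
    """Return a write-through text stream on stream's file descriptor.

    Falls back to the stream itself when it has no real descriptor,
    for example a StringIO installed by a test runner's output capture.
    """
    try:
        fd = stream.fileno()
    except (AttributeError, io.UnsupportedOperation):
        return stream
    raw = open(fd, "wb", buffering=0, closefd=False)  # unbuffered binary file
    encoding = getattr(stream, "encoding", None) or "utf-8"
    return io.TextIOWrapper(raw, encoding=encoding, write_through=True)

# A StringIO stands in for the capture buffer: it is returned unchanged.
captured = io.StringIO()
assert make_unbuffered(captured) is captured
```

In a server script the result would be assigned to sys.stdout, as in the original workaround.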
Simple “object-db” using JSON and python-sqlite
December 5, 2008
As part of a much larger project, I have a group of “snapshots” of a complicated data structure. I need to save these in a persistent way, and continue to have access to them, when needed. My solution is to output the snapshots as JSON, and store them into a sqlite database*, where they will be persistent on disk as “jlobs” (json large objects).
This “sqlite as object-db” has several advantages:
- atomic transactions,
- easy database replication,
- jlob can easily change format without affecting schema
- very light runtime requirements.
Building off of the sqlite3 manual, it is easy to see how to extract the json back *out* of the database.
There are drawbacks to this approach, of course:
- you’re responsible for building and maintaining tables indexing any queryable elements of your jlob, if you want to be able to access them using SQL.
- sql normalization purists will throw up when they look at your schema
(*Note: if you are on CentOS 5 and do not have access to Python 2.5, make sure that you install python-sqlite2 (for example, from one of these rpms) rather than updating your python-sqlite in place. Otherwise, BAD THINGS WILL HAPPEN, including breaking yum.)
#!/usr/bin/python
import sys
if sys.version_info >= (2,5):
    import sqlite3
else:
    from pysqlite2 import dbapi2 as sqlite3
try:
    import json
except ImportError:
    import simplejson as json
sqlite3.register_converter("json", json.loads)
conn = sqlite3.connect(":memory:",
    detect_types=sqlite3.PARSE_DECLTYPES|sqlite3.PARSE_COLNAMES)
c = conn.cursor()
c.row_factory = sqlite3.Row # fields by name
d = conn.cursor() # normal row
json_string = json.dumps( dict(a=1,b=[1,2,3]))
conn.execute('''
create table snapshot(
id INTEGER PRIMARY KEY AUTOINCREMENT,
mydata json);
''')
conn.execute('''
insert into snapshot values
(null, ?)''', (json_string,))
R1 = c.execute("select * from snapshot").fetchone()['mydata']
R2 = d.execute("select * from snapshot").fetchone()[1]
R3 = conn.execute("select * from snapshot").fetchone()[1]
assert R1==R2==R3 == {'a': 1, 'b': [1, 2, 3]}, "all should be equal"
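The first drawback listed above (indexing queryable elements yourself) can be sketched like this in Python 3. The owner column, the snapshot_owner index and the save_snapshot helper are all hypothetical, invented for this illustration; the idea is simply to copy one field out of the jlob into its own indexed column at insert time.

```python
import json
import sqlite3

sqlite3.register_converter("json", json.loads)
conn = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
conn.execute("""
    create table snapshot(
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        owner TEXT,   -- hypothetical queryable field, copied out of the jlob
        mydata json)
""")
conn.execute("create index snapshot_owner on snapshot(owner)")

def save_snapshot(conn, data):
    """Store the jlob and keep the indexed column in step with it."""
    conn.execute("insert into snapshot (owner, mydata) values (?, ?)",
                 (data["owner"], json.dumps(data)))

save_snapshot(conn, {"owner": "alice", "b": [1, 2, 3]})
save_snapshot(conn, {"owner": "bob", "b": []})

# Plain SQL can now find a jlob without parsing every row.
rows = conn.execute("select mydata from snapshot where owner = ?",
                    ("alice",)).fetchall()
print(rows[0][0])
```

Funnelling every write through one helper is what keeps the indexed column and the jlob from drifting apart.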
Len() calls can be SLOW in Berkeley Database and Python bsddb.
September 26, 2008
In my day-to-day coding work, I make extensive use of Berkeley DB (bdb) hash and btree tables. They’re really fast, easy-ish to use, and work for the apps I need them for (persistent storage of json and other small data structures).
So, this python code was having all kinds of weird slowdowns for me, and it was the len() call (of all things) that was causing the issue!
As it turns out, sometimes the Berkeley database does have to iterate over all keys to give a proper answer. Even the “fast stats” *number of records* call has to
References:
Jesus Cea’s comments on why bdb’s don’t know how many keys they have
db_stat tool description
DB->stat api
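One possible workaround, sketched in Python 3, is to keep the record count yourself so that len() never has to ask the database. CountedStore is a hypothetical wrapper, not part of bsddb, and a plain dict stands in for the database here. It assumes a single writer, with every write going through the wrapper.

```python
class CountedStore:
    """Wrap a dict-like store and keep the record count in memory."""

    def __init__(self, db):
        self._db = db
        self._count = len(db)   # pay for the full scan once, at open time

    def __setitem__(self, key, value):
        if key not in self._db:
            self._count += 1    # only new keys change the count
        self._db[key] = value

    def __delitem__(self, key):
        del self._db[key]       # raises KeyError before the count is touched
        self._count -= 1

    def __getitem__(self, key):
        return self._db[key]

    def __len__(self):
        return self._count      # no trip to the underlying store

store = CountedStore({"a": 1})
store["b"] = 2
store["a"] = 3                  # overwrite, count unchanged
print(len(store))
```

If another process writes to the same file, the cached count goes stale, so this only suits the single-writer case.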
Dumping and loading a bsddb, for humans.
September 26, 2008
Sometimes things happen with Python shelves that screw up the bsddb’s (Berkeley DB [bdb] databases*) that power them. A common way for this to happen is when two apps have it open for writing, and something goes flooey like both try to write to the same page. The bsddb emits this helpful error:
DBRunRecoveryError: [Terror, death and destruction will ensue]
or something equally opaque and non-reassuring.
So how to run the recovery, eh? Assuming you have the db_dump and db_load tools on your platform, take hints from Library and Extension FAQ and try this bash snippet:
#!/usr/bin/bash
## example usage:
## $ bdb_repair /path/to/my.db
function bdb_repair {
    BDIR=`dirname "$1"`   # /path/to/dir
    BADDB=`basename "$1"` # bad.db
    (   # subshell: the caller's working directory is left alone
        # each step runs only if the previous one succeeded
        cd "$BDIR" &&
        cp "$BADDB" "$BADDB.bak" &&            # seriously! back it up first
        db_dump -f "$BADDB.dump" "$BADDB" &&   # might take a while
        db_load -f "$BADDB.dump" "$BADDB.repaired" &&
        cp -f "$BADDB.repaired" "$BADDB"
    )
}
So far, I’ve had universal success with this method.
If any bash gurus want to improve the error handling here, I’d appreciate it.
FOOTNOTES
* Yes, I know this is redundant.
Using Python “timeit” to time functions
September 12, 2008
Python’s timeit module is not intuitive for me.
Steve D’Aprano’s thread on the python mailing list: Timing a function object… and especially Carl Banks’s response of
def foo(x):
    return x+1
import timeit
timeit.Timer("foo(1)","from __main__ import foo")
was a godsend!
Reading this hint contradicted my blithe statement to a friend this morning that “the only time __main__ appears in user code is to determine when a script is being run from the command line”. Such ignorance and folly!
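The Timer above is only constructed; to get numbers out of it you still have to call timeit() or repeat(). Here is a short Python 3 sketch. It uses the globals argument (available since Python 3.5) in place of the "from __main__ import foo" setup string, which is a swap made for this example so it also works when the code is not running as __main__.

```python
import timeit

def foo(x):
    return x + 1

# Statement as a string, with foo supplied through globals.
t = timeit.Timer("foo(1)", globals={"foo": foo})
total = t.timeit(number=100000)            # seconds for 100000 calls
print("%.3f us per call" % (total / 100000 * 1e6))

# repeat() runs the whole measurement several times; take the minimum.
runs = t.repeat(repeat=3, number=100000)
print("best of 3: %.3f s" % min(runs))

# A callable works too, with no setup string at all.
t2 = timeit.Timer(lambda: foo(1))
print("%.3f s" % t2.timeit(number=100000))
```

The minimum of the repeats is usually the number to report, since the slower runs mostly measure other activity on the machine.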