Len() calls can be SLOW in Berkeley Database and Python bsddb.

In my day-to-day coding work, I make extensive use of Berkeley DB (bdb) hash and btree tables. They’re really fast, easy-ish to use, and work for the apps I need them for (persistent storage of json and other small data structures).

So, some Python code of mine was having all kinds of weird slowdowns, and it was the len() call (of all things) that was causing the issue!

As it turns out, sometimes the Berkeley database really does have to iterate over all the keys to give a proper answer. Even the “fast stats” number-of-records call has to fall back to a full traversal in some cases to report an accurate count.
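Here’s a minimal sketch of the difference, assuming the legacy bsddb module (bsddb3 has the same API) and a made-up path to an existing hash table. len() walks the database for an exact count, while DB->stat with the DB_FAST_STAT flag only returns the cheap, possibly stale statistics:

import time
import bsddb
from bsddb import db

h = bsddb.hashopen('/tmp/bigtable.db', 'c')   # hypothetical path to an existing, large table

t0 = time.time()
exact = len(h)                       # may traverse every page of the database
t1 = time.time()
stats = h.db.stat(db.DB_FAST_STAT)   # dict of cheap statistics; counts may be stale
t2 = time.time()

print('len():     %d keys in %.4f s' % (exact, t1 - t0))
print('fast stat: %s keys in %.4f s' % (stats.get('nkeys'), t2 - t1))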

References:
Jesus Cea’s comments on why bdbs don’t know how many keys they have
db_stat tool description
DB->stat API


Dumping and loading a bsddb, for humans.

Sometimes things happen with Python shelves that screw up the bsddbs (Berkeley DB [bdb] databases*) that power them. A common way for this to happen is when two apps have the same database open for writing and something goes flooey, like both trying to write to the same page. The bsddb emits this helpful error:

DBRunRecoveryError: [Terror, death and destruction will ensue] or something equally opaque and non-reassuring
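For context, here is roughly where that shows up from Python (a sketch; the path matches the example below, and whether it surfaces on open or on a later read/write depends on what got corrupted):

import shelve
from bsddb import db   # bsddb3.db on newer Pythons

try:
    s = shelve.open('/path/to/my.db')
    s['key'] = {'some': 'value'}
except db.DBRunRecoveryError as err:
    print('shelf needs recovery: %s' % err)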

So how to run the recovery, eh? Assuming you have the db_dump and db_load tools on your platform, take hints from the Library and Extension FAQ and try this bash snippet:

#!/usr/bin/env bash

## example usage:
## $ bdb_repair  /path/to/my.db
function bdb_repair {
  BDIR=`dirname "$1"`      #  /path/to/dir
  BADDB=`basename "$1"`    #  bad.db
  cd "$BDIR"  && \
  cp "$BADDB" "$BADDB.bak"                      # seriously!  back it up first
  db_dump -f "$BADDB.dump" "$BADDB"             # might take a while
  db_load -f "$BADDB.dump" "$BADDB.repaired"
  cp -f "$BADDB.repaired" "$BADDB"              # overwrite the bad db with the repaired one
  cd -
}

So far, I’ve had universal success with this method.

If any bash gurus want to improve the error handling here, I’d appreciate it.

FOOTNOTES
* Yes, I know this is redundant.


The 100 Doors Puzzle in R at Rosetta Code

From time to time I see a puzzle at Rosetta Code that interests me, and I post an R solution for it.  This time it was the 100 doors puzzle.

Problem: You have 100 doors in a row that are all initially closed. You make 100 passes by the doors. The first time through, you visit every door and toggle the door (if the door is closed, you open it; if it is open, you close it). The second time you only visit every 2nd door (door #2, #4, #6, …). The third time, every 3rd door (door #3, #6, #9, …), etc, until you only visit the 100th door.

Question: What state are the doors in after the last pass? Which are open, which are closed?

The code for this in R is pretty simple:

# UNOPTIMIZED
doors_puzzle <- function(ndoors=100, passes=100) {
    doors <- rep(FALSE, ndoors)
    for (ii in seq(1, passes)) {
        mask <- seq(0, ndoors, ii)
        doors[mask] <- !doors[mask]
    }
    return(which(doors == TRUE))
}
doors_puzzle()

## optimized version... we only have to go up to the square root of 100,
## since only the perfect squares (the doors with an odd number of divisors) end up open
seq(1, sqrt(100))**2


Monty Hall in R

Inspired by paddy3118, I decided to write up a Monty Hall simulation in R for Rosetta Code.   Enjoy!

… The rules of the game show are as follows: After you have chosen a door, the door remains closed for the time being. The game show host, Monty Hall, who knows what is behind the doors, now has to open one of the two remaining doors, and the door he opens must have a goat behind it…
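(The Rosetta Code entry itself is in R; as a rough sketch of the same simulation logic, here is a Python version, with an arbitrary trial count.)

import random

def monty_hall(trials=100000):
    stay_wins = switch_wins = 0
    for _ in range(trials):
        doors = [1, 2, 3]
        car = random.choice(doors)
        pick = random.choice(doors)
        # Monty opens a goat door that is not the contestant's pick
        monty = random.choice([d for d in doors if d not in (pick, car)])
        # switching means taking the one door that is neither picked nor opened
        switched = [d for d in doors if d not in (pick, monty)][0]
        stay_wins += (pick == car)
        switch_wins += (switched == car)
    print('stay wins:   %.3f' % (float(stay_wins) / trials))
    print('switch wins: %.3f' % (float(switch_wins) / trials))

monty_hall()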

R-project code follows


Quick and (less?)-Dirty JSON Speed Testing in Python

Back in a previous article, I made some bold claims.   After a good vetting on Reddit, the incomparable effbot pointed me toward timeit (cf: notes).

Quick, dirty, and quite possibly deeply flawed.

The profiler’s designed for profiling, not benchmarking, and Python code running under the profiler runs a lot slower than usual — but C code isn’t affected at all.

To get proper results, use the timeit module instead.

So, here is a revised analysis.  It still looks like cjson strongly outperforms the others.*  Most interestingly, I tried oblivion95’s suggestion to read in json using eval, and that seems slower than cjson, which seems implausible to me.  I look forward to corrections.
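The general shape of the benchmark looks like this (a sketch only, not the actual script linked at the end of the post; the sample data and repeat counts are made up):

import timeit

setup = """
import cjson, simplejson, demjson
nested_dict = {'a': [1, 2, 3], 'b': {'c': 'hello', 'd': [4.5, 6.7]}}
nested_json = simplejson.dumps(nested_dict)
"""

tests = [
    ('cjson dump',      'cjson.encode(nested_dict)'),
    ('simplejson dump', 'simplejson.dumps(nested_dict)'),
    ('demjson dump',    'demjson.encode(nested_dict)'),
    ('cjson load',      'cjson.decode(nested_json)'),
    ('simplejson load', 'simplejson.loads(nested_json)'),
    ('demjson load',    'demjson.decode(nested_json)'),
    ('eval load',       'eval(nested_json)'),
]

for name, stmt in tests:
    timer = timeit.Timer(stmt, setup)
    # five repeats of 1000 calls each, reported fastest-first like the tables below
    print('%-16s %s' % (name, sorted(timer.repeat(5, 1000))))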

Results

dumping to JSON

cjson dump nested_dict
0.096393 0.096989 0.097203 0.097859 0.098357
demjson dump nested_dict
4.589573 4.601798 4.609123 4.621567 4.625506
simplejson dump nested_dict
0.595901 0.596267 0.596555 0.597104 0.597633

cjson dump ustring
0.024242 0.024264 0.024453 0.024475 0.024548
demjson dump ustring
2.350742 2.363112 2.364416 2.365360 2.374244
simplejson dump ustring
0.039637 0.039668 0.039820 0.039890 0.039976

loading from JSON

cjson load nested_dict_json
0.042304 0.042332 0.042936 0.043246 0.043858
demjson load nested_dict_json
8.317319 8.332928 8.334701 8.367242 8.371535
simplejson load nested_dict_json
1.858826 1.862957 1.864221 1.864268 1.868705
eval load nested_dict_json
0.484512 0.485497 0.487538 0.487866 0.488751

cjson load ustring_json
0.045566 0.045803 0.045846 0.046027 0.046056
demjson load ustring_json
3.391110 3.401287 3.403575 3.408148 3.416667
simplejson load ustring_json
0.243784 0.244193 0.244920 0.245126 0.246061
eval load ustring_json
0.121635 0.121801 0.122561 0.123064 0.123563

Code and footnotes follow


Using Python “timeit” to time functions

Python’s timeit module is not intuitive for me.

Steve D’Aprano’s thread on the Python mailing list, Timing a function object, and especially Carl Banks’s response of


def foo(x):
    return x+1

import timeit
timeit.Timer("foo(1)", "from __main__ import foo").timeit()  # seconds for 1,000,000 calls of foo(1)

was a godsend!
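For completeness, timeit also has a command-line form that handles the setup string with -s, so the __main__ trick isn’t needed there (the module name below is made up):

python -m timeit -s "from mymodule import foo" "foo(1)"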

Reading this hint contradicted my blithe statement to a friend this morning that “the only time __main__ appears in user code is to determine when a script is being run from the command line”. Such ignorance and folly!


Quick and Dirty JSON Speed Testing in Python

[See updated post for analysis using timeit]

As per Poromenos‘ request on Reddit, I decided to do a bit of expansion on my cryptic comment about the major json packages in python (simplejson, cjson, demjson):

My conclusion: use demjson if you really, really want to make sure everything is right and you don’t care at all about time. Use simplejson if you’re in the 99% of all users who want reasonable performance over a broad range of objects, and use the enhanced cjson 1.0.3x if you’re in the camp with reasonable json inputs and you need much faster (10x) speed…. that is, if the json step is the bottleneck.

More worrisome — demjson didn’t handle the unicode string I threw at it properly…
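A quick way to check that sort of thing is a round-trip test; this is a sketch with an arbitrary non-ASCII string, not the actual string from my benchmark:

import cjson, simplejson, demjson

u = u'M\xfcnchen caf\xe9 \u65e5\u672c\u8a9e'
for name, enc, dec in [('cjson',      cjson.encode,     cjson.decode),
                       ('simplejson', simplejson.dumps, simplejson.loads),
                       ('demjson',    demjson.encode,   demjson.decode)]:
    try:
        ok = (dec(enc(u)) == u)
    except Exception as err:
        ok = 'raised %s' % err.__class__.__name__
    print('%-10s unicode round-trip: %s' % (name, ok))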

benchmark code and more in-depth discussion