Sorting Dictionaries by Value in Python (improved?)

Sorting dictionaries by value in Python is a chronic annoyance. The dictionary implementation is very efficient, well-thought out, and easy-to-use, but sorting by value isn’t part of the spec. So, I was pretty excited when read this post on sorting by value at Digital Sanitation Engineering, I was excited. Finally, a method that “is the fastest way to do this and it uses the least amount of memory. Enjoy.”

Oh, enjoy I will.

But then, a wrench in the works. I’m not sure the claim in true. PEP 265 has a different opinion on the matter, and suggests a different idiom, so I decided to do a little testing.

(Preview for time impatient: the PEP 265 version is 10x faster than any others in this test.)

def sbv0(adict,reverse=False):
    ''' proposed at Digital Sanitation Engineering
    http://blog.modp.com/2007/11/sorting-python-dict-by-value.html '''
    return sorted(adict.iteritems(), key=lambda (k,v): (v,k), reverse=reverse)

def sbv1(d,reverse=False):
    '''  explicit list expansion '''
    L = [(k,v) for (k,v) in d.iteritems()]
    return sorted(L, key=lambda x: x[1] , reverse=reverse)

def sbv2(d,reverse=False):
    '''  generator '''
    L = ((k,v) for (k,v) in d.iteritems())
    return sorted(L, key=lambda x: x[1] , reverse=reverse)

def sbv3(d,reverse=False):
    ''' using a lambda to get the key, rather than "double-assignment" '''

    return sorted(d.iteritems(), key=lambda x: x[1] , reverse=reverse)

def sbv4(d,reverse=False):
    ''' using a formal function to get the sorting key, rather than a lambda'''
    def sk(x):  return x[1]
    return sorted(d.iteritems(), key=sk , reverse=reverse)

def sk(x):  return x[1]

def sbv5(d,reverse=False):
    ''' using a formal function, defined in outer scope
    to get the sorting key, rather than a lambda
    '''
    return sorted(d.iteritems(), key=sk , reverse=reverse)

from operator import itemgetter
def sbv6(d,reverse=False):
    ''' proposed in PEP 265, using  the itemgetter '''
    return sorted(d.iteritems(), key=itemgetter(1), reverse=True)

D = dict(zip(range(100),range(100)))

from profile import run

run("for ii in xrange(10000):  sbv0(D, reverse=True)")
run("for ii in xrange(10000):  sbv1(D, reverse=True)")
run("for ii in xrange(10000):  sbv2(D, reverse=True)")
run("for ii in xrange(10000):  sbv3(D, reverse=True)")
run("for ii in xrange(10000):  sbv4(D, reverse=True)")
run("for ii in xrange(10000):  sbv5(D, reverse=True)")
run("for ii in xrange(10000):  sbv6(D, reverse=True)")

>>> run("for ii in xrange(10000):  sbv0(D, reverse=True)")
         1020003 function calls in 5.500 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.040    0.000    0.040    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    2.880    0.000    5.440    0.001 :1(sbv0)
  1000000    2.520    0.000    2.520    0.000 :4()
        1    0.060    0.060    5.500    5.500 :1(?)
        1    0.000    0.000    5.500    5.500 profile:0(for ii in xrange(10000):  sbv0(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

>>> run("for ii in xrange(10000):  sbv1(D, reverse=True)")
         1020003 function calls in 5.030 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.030    0.000    0.030    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    3.080    0.000    4.930    0.000 :1(sbv1)
  1000000    1.820    0.000    1.820    0.000 :4()
        1    0.100    0.100    5.030    5.030 :1(?)
        1    0.000    0.000    5.030    5.030 profile:0(for ii in xrange(10000):  sbv1(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

>>> run("for ii in xrange(10000):  sbv2(D, reverse=True)")
         2030003 function calls in 8.900 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.020    0.000    0.020    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    4.460    0.000    8.870    0.001 <stdin>:1(sbv2)
  1010000    2.630    0.000    2.630    0.000 <stdin>:3(<generator expression>)
  1000000    1.760    0.000    1.760    0.000 <stdin>:4(<lambda>)
        1    0.030    0.030    8.900    8.900 <string>:1(?)
        1    0.000    0.000    8.900    8.900 profile:0(for ii in xrange(10000):  sbv2(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

>>> run("for ii in xrange(10000):  sbv3(D, reverse=True)")
         1020003 function calls in 4.860 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.050    0.000    0.050    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    2.650    0.000    4.820    0.000 :1(sbv3)
  1000000    2.120    0.000    2.120    0.000 :4()
        1    0.040    0.040    4.860    4.860 :1(?)
        1    0.000    0.000    4.860    4.860 profile:0(for ii in xrange(10000):  sbv3(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

>>> run("for ii in xrange(10000):  sbv4(D, reverse=True)")
         1020003 function calls in 4.850 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.020    0.000    0.020    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    2.770    0.000    4.810    0.000 :1(sbv4)
  1000000    2.020    0.000    2.020    0.000 :3(sk)
        1    0.040    0.040    4.850    4.850 :1(?)
        1    0.000    0.000    4.850    4.850 profile:0(for ii in xrange(10000):  sbv4(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

>>> run("for ii in xrange(10000):  sbv5(D, reverse=True)")
         1020003 function calls in 4.670 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.010    0.000    0.010    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    2.690    0.000    4.650    0.000 :1(sbv5)
  1000000    1.950    0.000    1.950    0.000 :1(sk)
        1    0.020    0.020    4.670    4.670 :1(?)
        1    0.000    0.000    4.670    4.670 profile:0(for ii in xrange(10000):  sbv5(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

>>> run("for ii in xrange(10000):  sbv6(D, reverse=True)")
         20003 function calls in 0.460 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.020    0.000    0.020    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    0.340    0.000    0.360    0.000 :1(sbv6)
        1    0.100    0.100    0.460    0.460 :1(?)
        1    0.000    0.000    0.460    0.460 profile:0(for ii in xrange(10000):  sbv6(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

I can’t speak to memory use, but I can speak to speed…. itemgetter, and functions in the operator module in general, are fast. Use them. As usual, the overhead of function calling kills performance, and eliminating those calls makes all the difference.

Good luck, pythonistas!

gL

(comments welcome!)

Advertisements