Sorting Dictionaries by Value in Python (improved?)
August 30, 2008
Sorting dictionaries by value in Python is a chronic annoyance. The dictionary implementation is very efficient, well-thought out, and easy-to-use, but sorting by value isn’t part of the spec. So, I was pretty excited when read this post on sorting by value at Digital Sanitation Engineering, I was excited. Finally, a method that “is the fastest way to do this and it uses the least amount of memory. Enjoy.”
Oh, enjoy I will.
But then, a wrench in the works. I’m not sure the claim in true. PEP 265 has a different opinion on the matter, and suggests a different idiom, so I decided to do a little testing.
(Preview for time impatient: the PEP 265 version is 10x faster than any others in this test.)
def sbv0(adict,reverse=False):
''' proposed at Digital Sanitation Engineering
http://blog.modp.com/2007/11/sorting-python-dict-by-value.html '''
return sorted(adict.iteritems(), key=lambda (k,v): (v,k), reverse=reverse)
def sbv1(d,reverse=False):
''' explicit list expansion '''
L = [(k,v) for (k,v) in d.iteritems()]
return sorted(L, key=lambda x: x[1] , reverse=reverse)
def sbv2(d,reverse=False):
''' generator '''
L = ((k,v) for (k,v) in d.iteritems())
return sorted(L, key=lambda x: x[1] , reverse=reverse)
def sbv3(d,reverse=False):
''' using a lambda to get the key, rather than "double-assignment" '''
return sorted(d.iteritems(), key=lambda x: x[1] , reverse=reverse)
def sbv4(d,reverse=False):
''' using a formal function to get the sorting key, rather than a lambda'''
def sk(x): return x[1]
return sorted(d.iteritems(), key=sk , reverse=reverse)
def sk(x): return x[1]
def sbv5(d,reverse=False):
''' using a formal function, defined in outer scope
to get the sorting key, rather than a lambda
'''
return sorted(d.iteritems(), key=sk , reverse=reverse)
from operator import itemgetter
def sbv6(d,reverse=False):
''' proposed in PEP 265, using the itemgetter '''
return sorted(d.iteritems(), key=itemgetter(1), reverse=True)
D = dict(zip(range(100),range(100)))
from profile import run
run("for ii in xrange(10000): sbv0(D, reverse=True)")
run("for ii in xrange(10000): sbv1(D, reverse=True)")
run("for ii in xrange(10000): sbv2(D, reverse=True)")
run("for ii in xrange(10000): sbv3(D, reverse=True)")
run("for ii in xrange(10000): sbv4(D, reverse=True)")
run("for ii in xrange(10000): sbv5(D, reverse=True)")
run("for ii in xrange(10000): sbv6(D, reverse=True)")
>>> run("for ii in xrange(10000): sbv0(D, reverse=True)")
1020003 function calls in 5.500 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.040 0.000 0.040 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 2.880 0.000 5.440 0.001 :1(sbv0)
1000000 2.520 0.000 2.520 0.000 :4()
1 0.060 0.060 5.500 5.500 :1(?)
1 0.000 0.000 5.500 5.500 profile:0(for ii in xrange(10000): sbv0(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
>>> run("for ii in xrange(10000): sbv1(D, reverse=True)")
1020003 function calls in 5.030 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.030 0.000 0.030 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 3.080 0.000 4.930 0.000 :1(sbv1)
1000000 1.820 0.000 1.820 0.000 :4()
1 0.100 0.100 5.030 5.030 :1(?)
1 0.000 0.000 5.030 5.030 profile:0(for ii in xrange(10000): sbv1(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
>>> run("for ii in xrange(10000): sbv2(D, reverse=True)")
2030003 function calls in 8.900 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.020 0.000 0.020 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 4.460 0.000 8.870 0.001 <stdin>:1(sbv2)
1010000 2.630 0.000 2.630 0.000 <stdin>:3(<generator expression>)
1000000 1.760 0.000 1.760 0.000 <stdin>:4(<lambda>)
1 0.030 0.030 8.900 8.900 <string>:1(?)
1 0.000 0.000 8.900 8.900 profile:0(for ii in xrange(10000): sbv2(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
>>> run("for ii in xrange(10000): sbv3(D, reverse=True)")
1020003 function calls in 4.860 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.050 0.000 0.050 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 2.650 0.000 4.820 0.000 :1(sbv3)
1000000 2.120 0.000 2.120 0.000 :4()
1 0.040 0.040 4.860 4.860 :1(?)
1 0.000 0.000 4.860 4.860 profile:0(for ii in xrange(10000): sbv3(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
>>> run("for ii in xrange(10000): sbv4(D, reverse=True)")
1020003 function calls in 4.850 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.020 0.000 0.020 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 2.770 0.000 4.810 0.000 :1(sbv4)
1000000 2.020 0.000 2.020 0.000 :3(sk)
1 0.040 0.040 4.850 4.850 :1(?)
1 0.000 0.000 4.850 4.850 profile:0(for ii in xrange(10000): sbv4(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
>>> run("for ii in xrange(10000): sbv5(D, reverse=True)")
1020003 function calls in 4.670 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.010 0.000 0.010 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 2.690 0.000 4.650 0.000 :1(sbv5)
1000000 1.950 0.000 1.950 0.000 :1(sk)
1 0.020 0.020 4.670 4.670 :1(?)
1 0.000 0.000 4.670 4.670 profile:0(for ii in xrange(10000): sbv5(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
>>> run("for ii in xrange(10000): sbv6(D, reverse=True)")
20003 function calls in 0.460 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.020 0.000 0.020 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 0.340 0.000 0.360 0.000 :1(sbv6)
1 0.100 0.100 0.460 0.460 :1(?)
1 0.000 0.000 0.460 0.460 profile:0(for ii in xrange(10000): sbv6(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
I can’t speak to memory use, but I can speak to speed…. itemgetter, and functions in the operator module in general, are fast. Use them. As usual, the overhead of function calling kills performance, and eliminating those calls makes all the difference.
Good luck, pythonistas!
gL
(comments welcome!)