Sorting Dictionaries by Value in Python (improved?)
Posted: August 30, 2008 Filed under: performance, Python 24 Comments »Sorting dictionaries by value in Python is a chronic annoyance. The dictionary implementation is very efficient, well-thought out, and easy-to-use, but sorting by value isn’t part of the spec. So, I was pretty excited when read this post on sorting by value at Digital Sanitation Engineering, I was excited. Finally, a method that “is the fastest way to do this and it uses the least amount of memory. Enjoy.”
Oh, enjoy I will.
But then, a wrench in the works. I’m not sure the claim in true. PEP 265 has a different opinion on the matter, and suggests a different idiom, so I decided to do a little testing.
(Preview for time impatient: the PEP 265 version is 10x faster than any others in this test.)
def sbv0(adict,reverse=False):
''' proposed at Digital Sanitation Engineering
http://blog.modp.com/2007/11/sorting-python-dict-by-value.html '''
return sorted(adict.iteritems(), key=lambda (k,v): (v,k), reverse=reverse)
def sbv1(d,reverse=False):
''' explicit list expansion '''
L = [(k,v) for (k,v) in d.iteritems()]
return sorted(L, key=lambda x: x[1] , reverse=reverse)
def sbv2(d,reverse=False):
''' generator '''
L = ((k,v) for (k,v) in d.iteritems())
return sorted(L, key=lambda x: x[1] , reverse=reverse)
def sbv3(d,reverse=False):
''' using a lambda to get the key, rather than "double-assignment" '''
return sorted(d.iteritems(), key=lambda x: x[1] , reverse=reverse)
def sbv4(d,reverse=False):
''' using a formal function to get the sorting key, rather than a lambda'''
def sk(x): return x[1]
return sorted(d.iteritems(), key=sk , reverse=reverse)
def sk(x): return x[1]
def sbv5(d,reverse=False):
''' using a formal function, defined in outer scope
to get the sorting key, rather than a lambda
'''
return sorted(d.iteritems(), key=sk , reverse=reverse)
from operator import itemgetter
def sbv6(d,reverse=False):
''' proposed in PEP 265, using the itemgetter '''
return sorted(d.iteritems(), key=itemgetter(1), reverse=True)
D = dict(zip(range(100),range(100)))
from profile import run
run("for ii in xrange(10000): sbv0(D, reverse=True)")
run("for ii in xrange(10000): sbv1(D, reverse=True)")
run("for ii in xrange(10000): sbv2(D, reverse=True)")
run("for ii in xrange(10000): sbv3(D, reverse=True)")
run("for ii in xrange(10000): sbv4(D, reverse=True)")
run("for ii in xrange(10000): sbv5(D, reverse=True)")
run("for ii in xrange(10000): sbv6(D, reverse=True)")
>>> run("for ii in xrange(10000): sbv0(D, reverse=True)")
1020003 function calls in 5.500 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.040 0.000 0.040 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 2.880 0.000 5.440 0.001 :1(sbv0)
1000000 2.520 0.000 2.520 0.000 :4()
1 0.060 0.060 5.500 5.500 :1(?)
1 0.000 0.000 5.500 5.500 profile:0(for ii in xrange(10000): sbv0(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
>>> run("for ii in xrange(10000): sbv1(D, reverse=True)")
1020003 function calls in 5.030 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.030 0.000 0.030 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 3.080 0.000 4.930 0.000 :1(sbv1)
1000000 1.820 0.000 1.820 0.000 :4()
1 0.100 0.100 5.030 5.030 :1(?)
1 0.000 0.000 5.030 5.030 profile:0(for ii in xrange(10000): sbv1(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
>>> run("for ii in xrange(10000): sbv2(D, reverse=True)")
2030003 function calls in 8.900 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.020 0.000 0.020 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 4.460 0.000 8.870 0.001 <stdin>:1(sbv2)
1010000 2.630 0.000 2.630 0.000 <stdin>:3(<generator expression>)
1000000 1.760 0.000 1.760 0.000 <stdin>:4(<lambda>)
1 0.030 0.030 8.900 8.900 <string>:1(?)
1 0.000 0.000 8.900 8.900 profile:0(for ii in xrange(10000): sbv2(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
>>> run("for ii in xrange(10000): sbv3(D, reverse=True)")
1020003 function calls in 4.860 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.050 0.000 0.050 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 2.650 0.000 4.820 0.000 :1(sbv3)
1000000 2.120 0.000 2.120 0.000 :4()
1 0.040 0.040 4.860 4.860 :1(?)
1 0.000 0.000 4.860 4.860 profile:0(for ii in xrange(10000): sbv3(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
>>> run("for ii in xrange(10000): sbv4(D, reverse=True)")
1020003 function calls in 4.850 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.020 0.000 0.020 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 2.770 0.000 4.810 0.000 :1(sbv4)
1000000 2.020 0.000 2.020 0.000 :3(sk)
1 0.040 0.040 4.850 4.850 :1(?)
1 0.000 0.000 4.850 4.850 profile:0(for ii in xrange(10000): sbv4(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
>>> run("for ii in xrange(10000): sbv5(D, reverse=True)")
1020003 function calls in 4.670 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.010 0.000 0.010 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 2.690 0.000 4.650 0.000 :1(sbv5)
1000000 1.950 0.000 1.950 0.000 :1(sk)
1 0.020 0.020 4.670 4.670 :1(?)
1 0.000 0.000 4.670 4.670 profile:0(for ii in xrange(10000): sbv5(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
>>> run("for ii in xrange(10000): sbv6(D, reverse=True)")
20003 function calls in 0.460 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.020 0.000 0.020 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 0.340 0.000 0.360 0.000 :1(sbv6)
1 0.100 0.100 0.460 0.460 :1(?)
1 0.000 0.000 0.460 0.460 profile:0(for ii in xrange(10000): sbv6(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
I can’t speak to memory use, but I can speak to speed…. itemgetter, and functions in the operator module in general, are fast. Use them. As usual, the overhead of function calling kills performance, and eliminating those calls makes all the difference.
Good luck, pythonistas!
gL
(comments welcome!)
“Sorting dictionaries by value in Python is a chronic annoyance.”
If you find that you’re having to sort your dictionaries by value all the time, you’re using the wrong data structures.
Let me clarify.
I find it to be an annoyance because some of the more obvious idioms aren’t the fast ones. Normally, I run into this sorting issue when I want to *output* the sorted dictionary for some reason. And when I do that I might have a million or 10 million of them, so even small differences in timing add up. The rest of the time, during access/storage, dictionaries are a sensible structure for my data.
I appreciate your general point about choosing the right structures for the right job.
gL
“If you find that you’re having to sort your dictionaries by value all the time, you’re using the wrong data structures”
Oh bull. Using a hash table as a counter for as-yet-unseen items is a common idiom that drastically simplifies code legibility. I don’t even think there’s a better way to handle this extremely common use-case. Sure if you know exactly what items you’re counting you can allocate a specific counter for each one, but then how do you sort it? Are you suggesting that you should predict exactly what items are being counted, and then add a big mess of if x > y nested logic to handle it afterward? What ARE you suggesting actually?
sbv2 code is no generator function (yield missing). Don’t understand the performance difference between sbv2 and sbv1 (code identical).
Probably run tests with correct generator function, could explain the better performance of sbv2 in comparison to sbv1.
But than wrong source code for sbv2 is presented.
test….
Sorry about the typo in sbv2. I don’t remember what I was trying to explicitly test in sbv2, but it seems like a dumb test in hindsight, since I’m just “double-generating”, and sbv3 is also a generator based solution, with less overhead. I have corrected the times for sbv2 to reflect this. Good catch!
cj_:
The most charitable interpretation I had of Observer’s comment was that if I was constantly doing order queries against a dict, I should be using a tree (and saving the data in sorted order) rather than sorting a dict over and over again. My problem space is more often that I have to only sort each dict once (at the end, after using them to count occurances), but I have scads of them.
Hi writeonly, it’s nickg from Digital Sanitation Engineering. You win the battle of best way or sorting by value! I deprecated the original article, and made a new one:
http://blog.modp.com/2008/09/sorting-python-dictionary-by-value-take.html
Good job!
Hi Again.. 1) I hit reject on your last comment by mistake, and 2) someone corrected my original algorithm. It should be lambda(k,v): v (sort by value) NOT lambda(k,v):(v,k) (sort by value then key). It speeds it up quite a bit but itemgetter still wins. –nickg
Hey Nick,
I tried to benchmark your proposed simpler lambda:
def sbv7(adict,reverse=False): return sorted(adict.iteritems(), key=lambda (k,v): v, reverse=reverse)
and it didn’t perform much differently than the original. It might if there was more (any?) key duplication in the dictionary, but that’s a story for another day.
>>> run(“for ii in xrange(10000): sbv7(D, reverse=True)”)
1020003 function calls in 5.710 CPU seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.040 0.000 0.040 0.000 :0(iteritems)
1 0.000 0.000 0.000 0.000 :0(setprofile)
10000 3.180 0.000 5.650 0.001 :1(sbv7)
1000000 2.430 0.000 2.430 0.000 :4()
1 0.060 0.060 5.710 5.710 :1(?)
1 0.000 0.000 5.710 5.710 profile:0(for ii in xrange(10000): sbv7(D, reverse=True))
0 0.000 0.000 profile:0(profiler)
Thanks for this discussion! It really helped on a project. I’m sure you know this, but I didn’t, so for anyone else who doesn’t, it’s even faster to just get a sorted list of the keys in order of their value, than to return the list of tuples. I added some tests to your code with the following results:
sorted(d, key=d.__getitem__, reverse=reverse)
10003 function calls in 0.200 CPU seconds
While for the fastest above I got:
20003 function calls in 0.370 CPU seconds
Until I found this, also in the PEP you referenced, I was using a list comprehension to get just a list of the keys in sorted order out of the return list of tuples. This eliminated that step, and is faster
Thanks again for posting!
Also, I benchmarked getting only the top 100 results of the sort via a simple slice and via the nlargest function from the heapq module.
sorted(d, key=d.__getitem__, reverse=reverse)[:100]
10003 function calls in 0.220 CPU seconds
nlargest(100, d, d.__getitem__)
50003 function calls in 0.950 CPU seconds
Definitely not worth it to use the nlargest function!
Paul, thanks for the tips. Looking through the code for nlargest, there’s a lot more going on there. Maybe this would be a good place to do a patch
_nlargest = nlargest
def nlargest(n, iterable, key=None):
“”"Find the n largest elements in a dataset.
Equivalent to: sorted(iterable, key=key, reverse=True)[:n]
“”"
in1, in2 = tee(iterable)
keys = in1 if key is None else map(key, in1)
it = zip(keys, map(neg, count()), in2) # decorate
result = _nlargest(n, it)
return list(map(itemgetter(2), result)) # undecorate
[...] Gregg Lind (aka “write-only”) replied to the article with a comment pointing to his performance notes. PEP 0265 has the “best answer” that is at least 2x [...]
Great post. I ran into the same requirement and sv6 seems to address it really well.
Yet, as also mentioned in PEP 265, this is not really an approach obvious to newbies. Can you give me -and doubtlessly some other newbies- an example of how to get the output back in the dict structure which was passed to sv6?
Here is an example. Having passed the following data:
{’127.0.0.1′: ['no', '9999'], ‘remote_host’: ['no', '888'], ‘localhost’: ['yes', '9999']}
i get this back
[('localhost', ['yes', '9999']), (‘remote_host’, ['no', '8888']), (’127.0.0.1′, ['no', '9999'])]
From what i have learned so far, it looks like a list with tuples and lists inside those tuples.
Thanks in advance.
So, in your example, you have a dict of lists, or a dict of “X”s. The general pattern for what you get back is:
_LIST_ of _TUPLES_, where each TUPLE has 2 elements: (KEY, X).
It has to be list (or it wouldn’t be orderable), and the tuple makes sense since the data in each position has a meaning (KEY, X).
I’m glad you found this helpful!
I seem to have hit submit too quick.
The input was
{’127.0.0.1′: ['yes', '9999'], ‘remote_host’: ['no', '8888'], ‘localhost’: ['yes', '9999']}
and the output is
[('remote_host', ['no', '8888']), (’127.0.0.1′, ['yes', '9999']), (‘localhost’, ['yes', '9999'])]
>> It has to be list (or it wouldn’t be orderable), and the tuple makes sense since the data in each position has a meaning (KEY, X).
In that case, i think i will have to just use the list.
I already tried to convert it back to a dict, but that caused me to end up with my original input.
Thanks for your answer.
This is maybe non-obvious, but if you convert it back to a dictionary, then you lose the ordering! That’s the whole point of sorting, it to end up with something sorted. Think about it a bit
To use:
for (k,v) in sbv6(my_dict): do_function(k,v) # example
In a few years this will be just a little easier for everyone
Ordered dictionaries in 3.1:
http://docs.python.org/dev/py3k/whatsnew/3.1.html
@Andy,
Python 3.1 ordered dictionaries won’t fix this problem unfortunately. All ordered dictionaries do is ensure that the dictionary always iterates in the same way, based on *insertion order*.
There’s an error in sbv6; you say ‘reverse=True’ instead of ‘reverse=reverse’.
I noticed that too.
I think this is skewing the results, however insignificantly.
[...] And I’d like the results to be sorted. A quick trawl across the internet and I found this post about sorting dictionaries in Python. For an introduction to sorting in Python, this article [...]
[...] is a really neat discussion on writeonly’s[1] blog that covers the testing done by digital sanitation engineering[2]. Here is the [...]