Sorting Dictionaries by Value in Python (improved?)

Sorting dictionaries by value in Python is a chronic annoyance. The dictionary implementation is very efficient, well-thought out, and easy-to-use, but sorting by value isn’t part of the spec. So, I was pretty excited when read this post on sorting by value at Digital Sanitation Engineering, I was excited. Finally, a method that “is the fastest way to do this and it uses the least amount of memory. Enjoy.”

Oh, enjoy I will.

But then, a wrench in the works. I’m not sure the claim in true. PEP 265 has a different opinion on the matter, and suggests a different idiom, so I decided to do a little testing.

(Preview for time impatient: the PEP 265 version is 10x faster than any others in this test.)

def sbv0(adict,reverse=False):
    ''' proposed at Digital Sanitation Engineering
    http://blog.modp.com/2007/11/sorting-python-dict-by-value.html '''
    return sorted(adict.iteritems(), key=lambda (k,v): (v,k), reverse=reverse)

def sbv1(d,reverse=False):
    '''  explicit list expansion '''
    L = [(k,v) for (k,v) in d.iteritems()]
    return sorted(L, key=lambda x: x[1] , reverse=reverse)

def sbv2(d,reverse=False):
    '''  generator '''
    L = ((k,v) for (k,v) in d.iteritems())
    return sorted(L, key=lambda x: x[1] , reverse=reverse)

def sbv3(d,reverse=False):
    ''' using a lambda to get the key, rather than "double-assignment" '''

    return sorted(d.iteritems(), key=lambda x: x[1] , reverse=reverse)

def sbv4(d,reverse=False):
    ''' using a formal function to get the sorting key, rather than a lambda'''
    def sk(x):  return x[1]
    return sorted(d.iteritems(), key=sk , reverse=reverse)

def sk(x):  return x[1]

def sbv5(d,reverse=False):
    ''' using a formal function, defined in outer scope
    to get the sorting key, rather than a lambda
    '''
    return sorted(d.iteritems(), key=sk , reverse=reverse)

from operator import itemgetter
def sbv6(d,reverse=False):
    ''' proposed in PEP 265, using  the itemgetter '''
    return sorted(d.iteritems(), key=itemgetter(1), reverse=True)

D = dict(zip(range(100),range(100)))

from profile import run

run("for ii in xrange(10000):  sbv0(D, reverse=True)")
run("for ii in xrange(10000):  sbv1(D, reverse=True)")
run("for ii in xrange(10000):  sbv2(D, reverse=True)")
run("for ii in xrange(10000):  sbv3(D, reverse=True)")
run("for ii in xrange(10000):  sbv4(D, reverse=True)")
run("for ii in xrange(10000):  sbv5(D, reverse=True)")
run("for ii in xrange(10000):  sbv6(D, reverse=True)")

>>> run("for ii in xrange(10000):  sbv0(D, reverse=True)")
         1020003 function calls in 5.500 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.040    0.000    0.040    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    2.880    0.000    5.440    0.001 :1(sbv0)
  1000000    2.520    0.000    2.520    0.000 :4()
        1    0.060    0.060    5.500    5.500 :1(?)
        1    0.000    0.000    5.500    5.500 profile:0(for ii in xrange(10000):  sbv0(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

>>> run("for ii in xrange(10000):  sbv1(D, reverse=True)")
         1020003 function calls in 5.030 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.030    0.000    0.030    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    3.080    0.000    4.930    0.000 :1(sbv1)
  1000000    1.820    0.000    1.820    0.000 :4()
        1    0.100    0.100    5.030    5.030 :1(?)
        1    0.000    0.000    5.030    5.030 profile:0(for ii in xrange(10000):  sbv1(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

>>> run("for ii in xrange(10000):  sbv2(D, reverse=True)")
         2030003 function calls in 8.900 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.020    0.000    0.020    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    4.460    0.000    8.870    0.001 <stdin>:1(sbv2)
  1010000    2.630    0.000    2.630    0.000 <stdin>:3(<generator expression>)
  1000000    1.760    0.000    1.760    0.000 <stdin>:4(<lambda>)
        1    0.030    0.030    8.900    8.900 <string>:1(?)
        1    0.000    0.000    8.900    8.900 profile:0(for ii in xrange(10000):  sbv2(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

>>> run("for ii in xrange(10000):  sbv3(D, reverse=True)")
         1020003 function calls in 4.860 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.050    0.000    0.050    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    2.650    0.000    4.820    0.000 :1(sbv3)
  1000000    2.120    0.000    2.120    0.000 :4()
        1    0.040    0.040    4.860    4.860 :1(?)
        1    0.000    0.000    4.860    4.860 profile:0(for ii in xrange(10000):  sbv3(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

>>> run("for ii in xrange(10000):  sbv4(D, reverse=True)")
         1020003 function calls in 4.850 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.020    0.000    0.020    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    2.770    0.000    4.810    0.000 :1(sbv4)
  1000000    2.020    0.000    2.020    0.000 :3(sk)
        1    0.040    0.040    4.850    4.850 :1(?)
        1    0.000    0.000    4.850    4.850 profile:0(for ii in xrange(10000):  sbv4(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

>>> run("for ii in xrange(10000):  sbv5(D, reverse=True)")
         1020003 function calls in 4.670 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.010    0.000    0.010    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    2.690    0.000    4.650    0.000 :1(sbv5)
  1000000    1.950    0.000    1.950    0.000 :1(sk)
        1    0.020    0.020    4.670    4.670 :1(?)
        1    0.000    0.000    4.670    4.670 profile:0(for ii in xrange(10000):  sbv5(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

>>> run("for ii in xrange(10000):  sbv6(D, reverse=True)")
         20003 function calls in 0.460 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    0.020    0.000    0.020    0.000 :0(iteritems)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
    10000    0.340    0.000    0.360    0.000 :1(sbv6)
        1    0.100    0.100    0.460    0.460 :1(?)
        1    0.000    0.000    0.460    0.460 profile:0(for ii in xrange(10000):  sbv6(D, reverse=True))
        0    0.000             0.000          profile:0(profiler)

I can’t speak to memory use, but I can speak to speed…. itemgetter, and functions in the operator module in general, are fast. Use them. As usual, the overhead of function calling kills performance, and eliminating those calls makes all the difference.

Good luck, pythonistas!

gL

(comments welcome!)


25 Comments on “Sorting Dictionaries by Value in Python (improved?)”

  1. Observer says:

    “Sorting dictionaries by value in Python is a chronic annoyance.”

    If you find that you’re having to sort your dictionaries by value all the time, you’re using the wrong data structures.

  2. writeonly says:

    Let me clarify.

    I find it to be an annoyance because some of the more obvious idioms aren’t the fast ones. Normally, I run into this sorting issue when I want to *output* the sorted dictionary for some reason. And when I do that I might have a million or 10 million of them, so even small differences in timing add up. The rest of the time, during access/storage, dictionaries are a sensible structure for my data.

    I appreciate your general point about choosing the right structures for the right job.

    gL

  3. cj_ says:

    “If you find that you’re having to sort your dictionaries by value all the time, you’re using the wrong data structures”

    Oh bull. Using a hash table as a counter for as-yet-unseen items is a common idiom that drastically simplifies code legibility. I don’t even think there’s a better way to handle this extremely common use-case. Sure if you know exactly what items you’re counting you can allocate a specific counter for each one, but then how do you sort it? Are you suggesting that you should predict exactly what items are being counted, and then add a big mess of if x > y nested logic to handle it afterward? What ARE you suggesting actually?

  4. test says:

    sbv2 code is no generator function (yield missing). Don’t understand the performance difference between sbv2 and sbv1 (code identical).
    Probably run tests with correct generator function, could explain the better performance of sbv2 in comparison to sbv1.
    But than wrong source code for sbv2 is presented.

  5. writeonly says:

    test….

    Sorry about the typo in sbv2. I don’t remember what I was trying to explicitly test in sbv2, but it seems like a dumb test in hindsight, since I’m just “double-generating”, and sbv3 is also a generator based solution, with less overhead. I have corrected the times for sbv2 to reflect this. Good catch!

  6. writeonly says:

    cj_:

    The most charitable interpretation I had of Observer’s comment was that if I was constantly doing order queries against a dict, I should be using a tree (and saving the data in sorted order) rather than sorting a dict over and over again. My problem space is more often that I have to only sort each dict once (at the end, after using them to count occurances), but I have scads of them.

  7. nickg says:

    Hi writeonly, it’s nickg from Digital Sanitation Engineering. You win the battle of best way or sorting by value! I deprecated the original article, and made a new one:

    http://blog.modp.com/2008/09/sorting-python-dictionary-by-value-take.html

    Good job!

  8. nickg says:

    Hi Again.. 1) I hit reject on your last comment by mistake, and 2) someone corrected my original algorithm. It should be lambda(k,v): v (sort by value) NOT lambda(k,v):(v,k) (sort by value then key). It speeds it up quite a bit but itemgetter still wins. –nickg

  9. writeonly says:

    Hey Nick,

    I tried to benchmark your proposed simpler lambda:

    def sbv7(adict,reverse=False): return sorted(adict.iteritems(), key=lambda (k,v): v, reverse=reverse)

    and it didn’t perform much differently than the original. It might if there was more (any?) key duplication in the dictionary, but that’s a story for another day.

    >>> run(“for ii in xrange(10000): sbv7(D, reverse=True)”)
    1020003 function calls in 5.710 CPU seconds

    Ordered by: standard name

    ncalls tottime percall cumtime percall filename:lineno(function)
    10000 0.040 0.000 0.040 0.000 :0(iteritems)
    1 0.000 0.000 0.000 0.000 :0(setprofile)
    10000 3.180 0.000 5.650 0.001 :1(sbv7)
    1000000 2.430 0.000 2.430 0.000 :4()
    1 0.060 0.060 5.710 5.710 :1(?)
    1 0.000 0.000 5.710 5.710 profile:0(for ii in xrange(10000): sbv7(D, reverse=True))
    0 0.000 0.000 profile:0(profiler)

  10. Paul says:

    Thanks for this discussion! It really helped on a project. I’m sure you know this, but I didn’t, so for anyone else who doesn’t, it’s even faster to just get a sorted list of the keys in order of their value, than to return the list of tuples. I added some tests to your code with the following results:

    sorted(d, key=d.__getitem__, reverse=reverse)
    10003 function calls in 0.200 CPU seconds

    While for the fastest above I got:
    20003 function calls in 0.370 CPU seconds

    Until I found this, also in the PEP you referenced, I was using a list comprehension to get just a list of the keys in sorted order out of the return list of tuples. This eliminated that step, and is faster :)

    Thanks again for posting!

  11. Paul says:

    Also, I benchmarked getting only the top 100 results of the sort via a simple slice and via the nlargest function from the heapq module.

    sorted(d, key=d.__getitem__, reverse=reverse)[:100]
    10003 function calls in 0.220 CPU seconds

    nlargest(100, d, d.__getitem__)
    50003 function calls in 0.950 CPU seconds

    Definitely not worth it to use the nlargest function!

  12. writeonly says:

    Paul, thanks for the tips. Looking through the code for nlargest, there’s a lot more going on there. Maybe this would be a good place to do a patch :)

    _nlargest = nlargest
    def nlargest(n, iterable, key=None):
    “””Find the n largest elements in a dataset.

    Equivalent to: sorted(iterable, key=key, reverse=True)[:n]
    “””
    in1, in2 = tee(iterable)
    keys = in1 if key is None else map(key, in1)
    it = zip(keys, map(neg, count()), in2) # decorate
    result = _nlargest(n, it)
    return list(map(itemgetter(2), result)) # undecorate

  13. [...] Gregg Lind (aka “write-only”) replied to the article with a comment pointing to his performance notes. PEP 0265 has the “best answer” that is at least 2x [...]

  14. Andy says:

    Great post. I ran into the same requirement and sv6 seems to address it really well.

    Yet, as also mentioned in PEP 265, this is not really an approach obvious to newbies. Can you give me -and doubtlessly some other newbies- an example of how to get the output back in the dict structure which was passed to sv6?

    Here is an example. Having passed the following data:

    {‘127.0.0.1′: ['no', '9999'], ‘remote_host': ['no', '888'], ‘localhost': ['yes', '9999']}

    i get this back

    [(‘localhost’, ['yes', '9999']), (‘remote_host’, ['no', '8888']), (‘127.0.0.1′, ['no', '9999'])]

    From what i have learned so far, it looks like a list with tuples and lists inside those tuples.

    Thanks in advance.

  15. Gregg Lind says:

    So, in your example, you have a dict of lists, or a dict of “X”s. The general pattern for what you get back is:

    _LIST_ of _TUPLES_, where each TUPLE has 2 elements: (KEY, X).

    It has to be list (or it wouldn’t be orderable), and the tuple makes sense since the data in each position has a meaning (KEY, X).

    I’m glad you found this helpful!

  16. Andy says:

    I seem to have hit submit too quick.
    The input was

    {‘127.0.0.1′: ['yes', '9999'], ‘remote_host': ['no', '8888'], ‘localhost': ['yes', '9999']}

    and the output is

    [(‘remote_host’, ['no', '8888']), (‘127.0.0.1′, ['yes', '9999']), (‘localhost’, ['yes', '9999'])]

  17. Andy says:

    >> It has to be list (or it wouldn’t be orderable), and the tuple makes sense since the data in each position has a meaning (KEY, X).

    In that case, i think i will have to just use the list.

    I already tried to convert it back to a dict, but that caused me to end up with my original input.

    Thanks for your answer.

  18. Gregg Lind says:

    This is maybe non-obvious, but if you convert it back to a dictionary, then you lose the ordering! That’s the whole point of sorting, it to end up with something sorted. Think about it a bit :)

    To use:

    for (k,v) in sbv6(my_dict): do_function(k,v) # example

  19. Andy says:

    In a few years this will be just a little easier for everyone :)

    Ordered dictionaries in 3.1:

    http://docs.python.org/dev/py3k/whatsnew/3.1.html

  20. Gregg Lind says:

    @Andy,

    Python 3.1 ordered dictionaries won’t fix this problem unfortunately. All ordered dictionaries do is ensure that the dictionary always iterates in the same way, based on *insertion order*.

  21. Daniel Drucker says:

    There’s an error in sbv6; you say ‘reverse=True’ instead of ‘reverse=reverse’.

  22. [...] And I’d like the results to be sorted.  A quick trawl across the internet and I found this post about sorting dictionaries in Python.  For an introduction to sorting in Python, this article [...]

  23. [...] is a really neat discussion on writeonly’s[1] blog that covers the testing done by digital sanitation engineering[2]. Here is the [...]

  24. A Unisa Courses official was injured Tuesday during a pre-Independence Day fireworks demonstration in Mount Prospect,
    Ill. Provide illustrations if possible, like pictures
    of the kitchen and the numbers of staff involved. The equipment
    is also known for smothering fires leaving no residue of any kind.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.