Deepcopy is a Pig (For Simple Data)

“Being this easy ain’t cheap.”  “There’s no such thing as a free lunch.”

We’ve all heard these tropes before, right?  Sometimes, without testing, it’s hard to see exactly how much that lunch costs.  This week’s example:  Python’s copy.deepcopy.

I like to use a lot of functional programming techniques in my code, and as part of that, I try to avoid modifying data by side effect.  Deepcopy makes it easy to copy the original structure, modify the copy, and return it.  But after some profiling and timing work, I saw that, of all things, deepcopy was the bottleneck!

Sure, it’s bulletproof, battle-tested, and designed to do the Right Thing ™ in almost every case!  But for simple data structures, it can be overkill, since it does so much accounting, reference tracking, and the like.
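That bookkeeping is visible from the outside: deepcopy threads a memo dict (id of original → its copy) through every level, so shared and cyclic references come out right. A quick illustration (Python 3 syntax here; the post's own timings are from Python 2):

```python
from copy import deepcopy

x = {"nums": [1, 2, 3]}
memo = {}
y = deepcopy(x, memo)   # pass our own memo so we can inspect it afterwards

# deepcopy recorded every container it visited, keyed by id(original)
assert id(x) in memo and id(x["nums"]) in memo
assert y == x and y["nums"] is not x["nums"]
```

All that tracking is exactly the accounting you pay for even when your data has no shared references at all.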

Most of the data I see in my day job has simple formats: mainly dictionaries of lists, sets, strings, tuples, and integers — the basic Python types we know and love, easily represented (as plain text, HTML, or tables) and easy to munge or transmit (as JSON or the like).  In short, they’re nice to work with, and transparent.

As it turns out, when we control the input data, we don’t need to worry as much about robustness.  Sure, the code below for deepish_copy doesn’t handle classes, nested containers, generators, or nesting to arbitrary depth.  But it runs fast, as the timing results below show.

import timeit
from copy import deepcopy

def deepish_copy(org):
    '''
    much, much faster than deepcopy, for a dict of the simple python types.
    '''
    out = dict.fromkeys(org)
    for k, v in org.iteritems():
        try:
            out[k] = v.copy()   # dicts, sets
        except AttributeError:
            try:
                out[k] = v[:]   # lists, tuples, strings, unicode
            except TypeError:
                out[k] = v      # ints

    return out

def test_deepish_copy():
    o1 = dict(name = u"blah", id=1, att0 = (1,2,3), att1 = range(10), att2 = set(range(10)))
    o2 = deepish_copy(o1)
    assert o2 == o1, "not equal, but should be"
    del o2['att1'][-1]
    assert o2 != o1, "are equal, shouldn't be"

# ipython: %prun for ii in xrange(1000):  o2 = deepcopy(o1)
# ipython: %prun for ii in xrange(1000):  o2 = deepish_copy(o1)

o1 = dict(name = u"blah", id=1, att0 = (1,2,3), att1 = range(10), att2 = set(range(10)))

a = timeit.Timer("o2 = deepish_copy(o1)","from __main__ import deepish_copy,o1")
b = timeit.Timer("o2 = deepcopy(o1)","from __main__ import deepcopy,o1")

# 64-bit linux, 1 gHz chip, python 2.4.3
a.repeat(3,number=20000)
# [0.45441699028015137, 0.41893100738525391, 0.46757102012634277]
b.repeat(3,number=20000)
# [2.5441901683807373, 2.5316669940948486, 2.4751369953155518]

Using the custom-written code speeds things up quite a bit (5-fold!).  For me, where this copying *was* the bottleneck, and I have to iterate over hundreds of thousands of these things, it made a noticeable difference in total run time.  The 10 minutes it took to write this code were well worth it.  So was profiling (using ipython’s simple %prun magic).

As always, to end with another cliché: your mileage may vary… but if you’re not relying on the car manufacturer to design an engine for exactly your needs, you can probably improve it.


4 Comments on “Deepcopy is a Pig (For Simple Data)”

  1. Ultrasick says:

    Great speed improvement. It also works on Python 2.6.4.

  2. bc says:

    I was just testing this as a means of inter-thread communication. It’s faster to do a pickle->unpickle sequence than use copy.deepcopy! I was hoping deepcopy would be faster but no, it’s slower by a factor of ~2.6. This was using an object tree 5 levels deep with each object having 5 children.
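    bc’s pickle round trip can be sketched like this (`pickle_copy` is a name made up for the sketch; Python 3 syntax). Whether it beats deepcopy depends on the shape of the data, so measure as he did:

    ```python
    import pickle

    def pickle_copy(obj):
        # round-trip through the C pickler instead of copy.deepcopy's
        # per-object Python-level dispatch
        return pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

    tree = {"value": 0,
            "children": [{"value": i, "children": []} for i in range(5)]}
    clone = pickle_copy(tree)
    assert clone == tree and clone is not tree
    clone["children"][0]["value"] = 99
    assert tree["children"][0]["value"] == 0   # original is untouched
    ```

    The catch: everything in the structure must be picklable, which rules out open files, locks, and the like.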

  3. Louis Le Coeur says:

    That’s very useful. If you also want to support nested structures, you may wanna use a simple recursive function like the one below (note that it doesn’t seem necessary to use the [:] on strings…)

    def deepcopy(obj):  # deep copy (recursive) of simple dictionary/list types
        if isinstance(obj, dict):
            d = obj.copy()  # shallow dict copy
            for k, v in d.iteritems():
                d[k] = deepcopy(v)
        elif isinstance(obj, list):
            d = obj[:]  # shallow list copy
            i = len(d)
            while i:
                i -= 1
                d[i] = deepcopy(d[i])
        elif isinstance(obj, tuple):
            # tuples are immutable: rebuild rather than assign items in place
            d = tuple(deepcopy(v) for v in obj)
        else:
            d = obj  # a string, an int, or whatever…
        return d

  4. qznc says:

    There is a problem, if your data structure forms a DAG instead of a tree. E.g.:

    a = dict()
    org = {"key1": a, "key2": a}
    out = deepish_copy(org)

    While ‘org’ has two keys pointing to the identical object, the resulting ‘out’ has two keys pointing to different objects with the same contents.

    If your data is immutable this only wastes memory, but you cannot guarantee that.
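    qznc’s point in runnable form (Python 3 syntax; the dict comprehension stands in for deepish_copy’s per-value copy):

    ```python
    from copy import deepcopy

    a = [1, 2, 3]
    org = {"key1": a, "key2": a}            # both keys alias one list

    dc = deepcopy(org)
    assert dc["key1"] is dc["key2"]          # deepcopy's memo preserves the alias

    # a per-value shallow copy, as deepish_copy does, splits the alias:
    flat = {k: v[:] for k, v in org.items()}
    assert flat["key1"] is not flat["key2"]
    flat["key1"].append(4)
    assert flat["key2"] == [1, 2, 3]         # its former twin no longer follows
    ```

    If mutations to one alias are supposed to show up through the other, the shallow-per-value shortcut silently changes your program's behavior, not just its memory use.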

