TCPView saves the day

June 16, 2009

I was having some trouble with OpenVPN (windows) this morning, a confusing error:

TCP/UDP: Socket bind failed on local address [undef]:1194:
Address already in use (WSAEADDRINUSE)

Seems obvious right?  I killed OpenVPN, and looked around for other processes, but none were evident.  A friend suggested using TCPVIew, which like everything Mark Russinovich writes, is easy to install, super useful, and totally solved the problem almost instantly.

The culprit — PIDGIN, of all things!  For some reason, it had port 1194 all to itself!  Killing it, restarting OpenVPN worked like a breeze!

I was having some trouble installing a network card on my home server.  It wasn’t being autodetected.  I’m weak on Linux hardware  stuff, and networking in particular (since it’s always *just worked*), so this was starting from scratch.  I had a few extra handicaps, in that I didn’t have the orginal box, and this card was one I’d bought months ago, physically installed, and forgetten to configure, so my bad!  I didn’t even know the model number!
read on to see how I got it working

“Being this easy ain’t cheap.”  “There’s no such thing as a free lunch.”

We’ve all heard these tropes before, right?  Sometimes, without testing, it’s hard to see exactly how much that lunch costs.  This week’s example:  Python’s copy.deepcopy.

I tend to fancy myself as using a lot of functional programming techniques in my code, and as part of that, I try to avoid modifying data by side-effect.  Deepcopy makes it easy to copy the original structure, modify the copy, and return it.  After some profiling and timing work, I saw that, of all things, deepcopy was the bottleneck!

Sure, it’s bulletproof, battle-tested, and designed to do the Right Thing ™ in almost every case!  But for simple data structures, it can be overkill, since it does so much accounting, reference tracking, and the like.

Most of the data I see in my day job has simple formats: mainly dictionaries of lists, sets, strings, tuples, and integers. — the basic python types we know and love, easily representable (in plain text, html, tables), and easy to munge / transmit (using JSON or the like).  In short, they’re nice to work with, and transparent.

As it turns out, when we control the input data, we don’t need to worry as much about robustness.  Sure the code below for “deepish_copy” doesn’t handle classes, and nested iterables, or generators, or even nesting to arbitrary depth.  But, it runs fast, as the speed results below show.


import timeit
from copy import deepcopy

def deepish_copy(org):
    '''
    much, much faster than deepcopy, for a dict of the simple python types.
    '''
    out = dict().fromkeys(org)
    for k,v in org.iteritems():
        try:
            out[k] = v.copy()   # dicts, sets
        except AttributeError:
            try:
                out[k] = v[:]   # lists, tuples, strings, unicode
            except TypeError:
                out[k] = v      # ints

    return out

def test_deepish_copy():
    o1 = dict(name = u"blah", id=1, att0 = (1,2,3), att1 = range(10), att2 = set(range(10)))
    o2 = deepish_copy(o1)
    assert o2 == o1, "not equal, but should be"
    del o2['att1'][-1]
    assert o2 != o1, "are equal, shouldn't be"

#prun for ii in xrange(1000):  o2 = deepcopy(o1)
#prun for ii in xrange(1000):  o2 = dc2(o1)

o1 = dict(name = u"blah", id=1, att0 = (1,2,3), att1 = range(10), att2 = set(range(10)))

a = timeit.Timer("o2 = deepish_copy(o1)","from __main__ import deepish_copy,o1")
b = timeit.Timer("o2 = deepcopy(o1)","from __main__ import deepcopy,o1")

# 64-bit linux, 1 gHz chip, python 2.4.3
a.repeat(3,number=20000)
# [0.45441699028015137, 0.41893100738525391, 0.46757102012634277]
b.repeat(3,number=20000)
# [2.5441901683807373, 2.5316669940948486, 2.4751369953155518]

Using the custom written code speeds things up quite a bit (5 fold!).  For me, where this copying *was* the bottleneck, and I have to iterate over hundreds of thousands of these things, it made a noticible difference in total run time.  Taking the 10 minutes it took to write this code was worth it.   So was profiling (using ipython’s simple %prun macro).

As always, to end with another cliche:  your mileage may vary… but if you’re not relying on the car manufacturers to degisn an engine for exactly your needs,  you can probably improve it.

Some simple data is surprisingly hard to find.  Case in point: for some mapping projects, I wanted an adjacency list of the US states.  I couldn’t find one easily, so I made one.   Spread it far and wide!  This should be a pretty easy to digest form.  Figure out what states are next to each other (neighbors) with ease!

Potential gotchas:

  1. I say the four corners (UT,CO,NM,AZ) all touch. If you don’t like it, take them out.
  2. DC is a state here, but Guam, USVI and others aren’t.
  3. It’ll be a cold day in hell before I recognize Missouri.

# Author Gregg Lind
# License:  Public Domain.    I would love to hear about any projects you use if it for though!

AK
AL,MS,TN,GA,FL
AR,MO,TN,MS,LA,TX,OK
AZ,CA,NV,UT,CO,NM
CA,OR,NV,AZ
CO,WY,NE,KS,OK,NM,AZ,UT
CT,NY,MA,RI
DC,MD,VA
DE,MD,PA,NJ
FL,AL,GA
GA,FL,AL,TN,NC,SC
HI
IA,MN,WI,IL,MO,NE,SD
ID,MT,WY,UT,NV,OR,WA
IL,IN,KY,MO,IA,WI
IN,MI,OH,KY,IL
KS,NE,MO,OK,CO
KY,IN,OH,WV,VA,TN,MO,IL
LA,TX,AR,MS
MA,RI,CT,NY,NH,VT
MD,VA,WV,PA,DC,DE
ME,NH
MI,WI,IN,OH
MN,WI,IA,SD,ND
MO,IA,IL,KY,TN,AR,OK,KS,NE
MS,LA,AR,TN,AL
MT,ND,SD,WY,ID
NC,VA,TN,GA,SC
ND,MN,SD,MT
NE,SD,IA,MO,KS,CO,WY
NH,VT,ME,MA
NJ,DE,PA,NY
NM,AZ,UT,CO,OK,TX
NV,ID,UT,AZ,CA,OR
NY,NJ,PA,VT,MA,CT
OH,PA,WV,KY,IN,MI
OK,KS,MO,AR,TX,NM,CO
OR,CA,NV,ID,WA
PA,NY,NJ,DE,MD,WV,OH
RI,CT,MA
SC,GA,NC
SD,ND,MN,IA,NE,WY,MT
TN,KY,VA,NC,GA,AL,MS,AR,MO
TX,NM,OK,AR,LA
UT,ID,WY,CO,NM,AZ,NV
VA,NC,TN,KY,WV,MD,DC
VT,NY,NH,MA
WA,ID,OR
WI,MI,MN,IA,IL
WV,OH,PA,MD,VA,KY
WY,MT,SD,NE,CO,UT,ID

In some of our server code, we like to insure we get unbuffered output, like in perl.  In Python, this is easy to do:


import sys

sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)

However, using the wonderful nosetest for testing will barf on this code, because it reassigns stdout to a cStringIO for capturing.

This is a workaround:


import sys
import os

try:   # get unbuffered output
    sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
except AttributeError, exc: # under nose, sys.stdout is reassigned to a string buffer
    pass

def test():
    assert 1

Alternately, run nosetests -s, which disables the output capture feature.

Cf:  my embarrassing bug report over at nose

Quick hint on unix:


$ date -u "+%c" -d @1234567890
Fri 13 Feb 2009 11:31:30 PM UTC

cf:   Convert timestamp to date in Bash

Sometimes as part of a sqlite query, I want to use a list or other iterable as an argument.  Most commonly, this is useful for “IN” queries.  The sqlite (or possibly the python-sqlite wrapper), can’t properly interpret this sort of named place holder (cf. using lists as sqlite arguments).

For example:


SELECT  *  FROM  table  WHERE   id   IN (:arglist)

This won’t work in sqlite.   A workaround is given below.  Use the function “sql_in_to_org” change this construct to from an “IN” to an “OR”, and get a dict of the arguments.


def sql_in_to_or(sqlvar, args, name="arg" ):
    '''
    convert: sqlvar IN (args) to:
      sqlvar=:arg0 OR sqlvar=:arg1 ....

    returns "new query", dict of args

    To use in execute:

    Q = 'select * from table where (%(where)s) and time > :ts'
    iq, iq_args = sql_in_to_or( "table.id", range(5), name='id')
    placeholders = dict(ts=12323425)
    conn.execute(Q % dict(where=iq), placeholders.update(iq_args))     
    '''
    q_parts = list()
    q_args = dict()
    for (ii, arg) in enumerate(args):
        argname = "%s%i" % (name, ii)
        q_parts.append( "%s=:%s" % (sqlvar,argname) )
        q_args[argname] = arg

    query =  " OR ".join( q_parts )
    return (query, q_args)

(Thanks to Jon Nelson of the Pycurious Blog for the idea (we work together at Renesys))

Over at TheDailyWtf, hidden among some comments was an interesting dynamic programming problem:

Consider this problem:

George bought a sack of 100 pieces of candy at the store. 90 of the pieces are lemon flavored and ten are cherry flavored. Of the two, George prefers the lemon flavored candies.

Every day George randomly picks a piece of candy out of the bag. If it is lemon flavored, he eats it and puts the bag away for the next day.

But if the candy he chose is cherry flavored, he puts it back in the bag and then randomly picks a candy out of the bag and eats it regardless of the flavor. In other words, he’ll only put a piece of candy back at most once per day.

What are the odds that when one piece of candy remains, it will be lemon flavored?

I posed the problem at a company where I used to work. All but one person tried to do it recursively. The remaining person tried to do it using an Excel spreadsheet!!!

Maybe I (and some of the other posters on that thread) are morons, but Excel (or in my case, OpenOffice) seemed like a fine way to solve it, so I did.

Read more about the Lemon-Cherry problem, and download the speadsheets used to solve it

It can be awfully tempting to make some changes to an existing open-source project [1]. Some of that excitement diminishes when one realizes how long a git-svn clone will take on a large project repo, like Python. The gain git-svn gives you in terms of quick history lookup is taken as cost in the beginning.

Instead, we can do a “shallow-copy” to get the last few revisions. It seems that you need to use actual revisions numbers for the first argument to -r, but I could be wrong. I tried using HEAD~1000:HEAD

$ git-svn clone http://svn.python.org/projects/python/trunk/ python-dev -r 65000:HEAD.

If you find this is *still* taking too long, try canceling, changing into the directory and issue a:

$ git svn fetch

Good luck all!

Notes

  1. Finally got my first one into python, #4568: remove limitation in varargs callback example.

Let’s face facts.

  1. Unicode is a hassle [1]
  2. Not using unicode is a hassle, especially if you have one of thoese “weirdo” languages, and *gasp*, you want to read text *in your own language*.

I was faced with a simple task.  Take some text, process it, and print out some results (in JSON).  This should be trivial, and in a world where programming
was invented by a multinational consortium, and designed from the first day to be compatible with all text, maybe it would be.  Instead we have a world with a rich history of mutual incompatibility.

Find out how I solved this in Python