Being Slightly Smarter With seq

This post comes from my ever-growing file of ‘how the hell did I not know that’ issues.

On the cluster at work we have a bunch of machines with names like mysite12 or mysite237, so quite often I’m writing shell scripts that loop over all of these boxes to gather information. I usually do something like this:

for number in `seq 12 256`
do
    node=mysite$number
    echo $node
done

Which produces

mysite12
mysite13
.....
mysite256

It occurred to me today that this is such a staggeringly common thing to do that seq probably has a way of doing it already. Sure enough, after reading the man page it turns out that you can hand seq a printf-style format string, so I can create my node names purely in seq:

for node in `seq -f "mysite%g" 12 255`
do
    echo $node
done


mysite12
mysite13
....
mysite255
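
As an aside, the format string also takes the usual printf width flags, so (at least with GNU seq) something like the following should produce zero-padded names if your hosts happen to be numbered that way:

for node in `seq -f "mysite%03g" 1 5`
do
    echo $node
done

which would print mysite001 through mysite005.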

How has it taken me ten years to work that out?

Stepping Through Large Database Tables in Python

In order to report usage on our PBSPro compute cluster at work I wrote a simple set of Python scripts to dump the accounting information into a MySQL database. This has been working fine for the last year, churning out reports every month.
This week I had cause to generate some statistics aggregated across the full three years of data in the database. I’m using a mixture of Elixir and SQLAlchemy to talk to the database. Normally I would do something like this:

mybigtablequery = MyBigTable.query()

for job in mybigtablequery:
    if job.attribute == "thing":
        dosomething()

This worked fine when the database was quite small, but I was horrified to see that as the loop went on my Python process used more and more memory, because the database connection object never throws away a row once it has been loaded. Fortunately I found an answer on Stack Overflow.
So I ended up doing the following:

def batch_query(query, batch=10000):
    # Yield rows from a query in fixed-size chunks using LIMIT/OFFSET,
    # so only one batch of rows is held in memory at a time.
    offset = 0
    while True:
        returned_rows = False
        for elem in query.limit(batch).offset(offset):
            returned_rows = True
            yield elem
        # Stop once a batch comes back empty.
        if not returned_rows:
            break
        offset += batch

mybigtablequery = MyBigTable.query()

for job in batch_query(mybigtablequery, 50000):
    if job.attribute == "thing":
        dosomething()

“batch” is just an integer defining how many rows each query fetches. The larger it is, the more memory the Python interpreter will use, but the fewer queries the loop has to issue, so the more efficiently the code will run.
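
One caveat I should add myself (so treat it as my assumption rather than part of the original answer): LIMIT/OFFSET only steps through the table predictably if the query has a stable ordering, so it is probably safest to order the query by its primary key before handing it to batch_query. A minimal sketch, assuming the entity has the usual integer id column:

# Ordering by the primary key keeps each LIMIT/OFFSET window deterministic.
# Assumes MyBigTable has the default integer "id" primary key column.
mybigtablequery = MyBigTable.query().order_by(MyBigTable.id)

for job in batch_query(mybigtablequery, 50000):
    if job.attribute == "thing":
        dosomething()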