Limiting Parallelism

2006-05-22

Concurrency can be a great way to speed things up, but what happens when you have too much concurrency? Overloading a system or a network can be detrimental to performance. Often there is a peak in performance at a particular level of concurrency. Executing a particular number of tasks in parallel will be easier than ever with Twisted 2.5 and Python 2.5:

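from twisted.internet import defer, task

def parallel(iterable, count, callable, *args, **named):
    # Run callable over every element, with at most count calls in progress.
    coop = task.Cooperator()
    # Lazy generator: callable is invoked only when a worker asks for the next item.
    work = (callable(elem, *args, **named) for elem in iterable)
    # All count workers iterate the same generator object.
    return defer.DeferredList([coop.coiterate(work) for i in xrange(count)])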
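
The limit comes from every coiterate call receiving the same generator object: a worker starts a new call only when it asks the generator for its next item, and the post relies on the Cooperator waiting for a yielded Deferred before asking again. Here is a minimal plain-Python sketch of just the sharing part, with no Twisted involved; fetch, started, and workers are hypothetical names used only for illustration:

```python
def fetch(elem, log):
    # Hypothetical stand-in for real work: record that this item was started.
    log.append(elem)
    return elem

started = []
# Lazy generator, like `work` in parallel(): fetch runs only on demand.
work = (fetch(elem, started) for elem in ['a', 'b', 'c', 'd', 'e'])

# Three "workers" that all hold the SAME generator object.
workers = [work, work, work]

# Each worker advances the shared generator by one step.
first_round = [next(w) for w in workers]

# Only three items have been started so far; the remaining two are not
# touched until some worker asks for another item.
remaining = list(work)
```

Because the generator is shared, no item is handed out twice, and the number of items started at any moment can never exceed the number of workers.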

Here’s an example of using parallel to save the contents of a set of URLs, listed one per line in a text file, downloading at most fifty at a time:

from twisted.python import log
from twisted.internet import reactor
from twisted.web import client

def download((url, fileName)):
    return client.downloadPage(url, file(fileName, 'wb'))

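# Strip the trailing newline from each line; each output file is named after its line number.
urls = [(url.strip(), str(n)) for (n, url) in enumerate(file('urls.txt'))]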
finished = parallel(urls, 50, download)
finished.addErrback(log.err)
finished.addCallback(lambda ign: reactor.stop())
reactor.run()
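
To check that a scheme like this really caps concurrency, it helps to count how many calls are in flight at once. The sketch below is not the Twisted code above: it swaps in the standard library's asyncio (modern Python 3) so that it runs without any third-party packages, and demo, job, in_flight, and peak are hypothetical names. It keeps the same shape, with a fixed number of workers draining one shared iterator:

```python
import asyncio

async def parallel(iterable, count, func, *args, **named):
    # One shared iterator: a worker takes the next item only after it has
    # finished its current one, so at most `count` calls are in flight.
    work = iter(iterable)
    results = []

    async def worker():
        for elem in work:
            results.append(await func(elem, *args, **named))

    await asyncio.gather(*(worker() for _ in range(count)))
    return results

async def demo():
    in_flight = 0
    peak = 0

    async def job(n):
        # Hypothetical task: track concurrency, then pretend to do I/O.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return n * 2

    results = await parallel(range(20), 5, job)
    return peak, sorted(results)

peak, results = asyncio.run(demo())
```

With twenty jobs and five workers, peak records the highest number of jobs that overlapped. Results arrive in completion order, which is why the sketch sorts them before returning.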

[Edit: The original generator expression in this post was of the form ((yield foo()) for x in y). The yield here is completely superfluous, of course, so I have removed it.]

[Edit: The original post talked about Twisted 2.4 and Python 2.5. It has since turned out that Python 2.5 is too dissimilar to Python 2.4 for Twisted 2.4 to run on it. Twisted 2.5 is required to use Python 2.5.]