Monday, May 22, 2006

Limiting Parallelism

Concurrency can be a great way to speed things up, but what happens when you have too much of it? Overloading a system or a network can be detrimental to performance, and there is often a peak in performance at a particular level of concurrency. With Twisted 2.5 and Python 2.5, executing a fixed number of tasks in parallel is easier than ever:


from twisted.internet import defer, task

def parallel(iterable, count, callable, *args, **named):
    coop = task.Cooperator()
    # All `count` coiterate() calls below share this one generator, so
    # at most `count` results of `callable` are outstanding at a time.
    work = (callable(elem, *args, **named) for elem in iterable)
    return defer.DeferredList([coop.coiterate(work) for i in xrange(count)])
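

To see why this limits concurrency: every coiterate() call consumes the same work generator, so an idle worker simply pulls the next task from it, and no more than count tasks are ever in progress at once. Here is a minimal sketch demonstrating that with sleep-based stand-in tasks (the sleep helper is hypothetical; task.deferLater is available in later Twisted releases):


from twisted.internet import reactor, task

def sleep(seconds):
    # A stand-in "download": a Deferred which fires after a delay.
    return task.deferLater(reactor, seconds, lambda: None)

# Ten one-second tasks, two at a time: about five seconds in total,
# not one (fully parallel) or ten (fully serial).
finished = parallel([1] * 10, 2, sleep)
finished.addCallback(lambda ign: reactor.stop())
reactor.run()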


Here's an example of using this to save the contents of a bunch of URLs which are listed one per line in a text file, downloading at most fifty at a time:


from twisted.python import log
from twisted.internet import reactor
from twisted.web import client

def download((url, fileName)):
    return client.downloadPage(url, file(fileName, 'wb'))

# strip() drops the trailing newline from each line of the file.
urls = [(url.strip(), str(n)) for (n, url) in enumerate(file('urls.txt'))]
finished = parallel(urls, 50, download)
finished.addErrback(log.err)
finished.addCallback(lambda ign: reactor.stop())
reactor.run()


[Edit: The original generator expression in this post was of the form ((yield foo()) for x in y). The yield here is completely superfluous, of course, so I have removed it.]



[Edit: The original post talked about Twisted 2.4 and Python 2.5. It has since turned out that Python 2.5 is too dissimilar to Python 2.4 for Twisted 2.4 to run on it. Twisted 2.5 is required to use Python 2.5.]

11 comments:

  1. The Cooperator-based solution will only create 51 Deferreds. The DeferredSemaphore solution will create len(urls) + 1 Deferreds, all at once.

    The Cooperator-based solution also supports pluggable schedulers, though I haven't taken advantage of that in this case.
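
    For comparison, a minimal sketch of the DeferredSemaphore version under discussion (the function name is hypothetical):

    from twisted.internet import defer

    def parallel_all_at_once(iterable, count, callable, *args, **named):
        # One Deferred per element, all created up front; the semaphore
        # only gates execution so that `count` run at a time.
        sem = defer.DeferredSemaphore(count)
        return defer.DeferredList([
            sem.run(callable, elem, *args, **named) for elem in iterable])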

    ReplyDelete
  2. Unless I'm mistaken, the Cooperator version only downloads 50 URLs total, and the DeferredSemaphore version downloads all the URLs.

    If we said:
    - finished = defer.DeferredList(downloads)
    + finished = defer.DeferredList(downloads[:50])

    Now what advantage does using Cooperator have over using DeferredSemaphore?

    Pluggable schedulers sound interesting as well - can a scheduler introspect the tasks it handles, so that we could, for example (in this case), schedule only one download per host?

    ReplyDelete
  3. Alas, you are mistaken. The Cooperator version downloads all the URLs in the given file. The DeferredList only has 50 elements because only 50 parallel download "tasks" will be created. Each will download as quickly as it can and stop (fire its Deferred) when there are no more URLs to be downloaded.

    ReplyDelete
  4. Regarding schedulers - one could write a scheduler which did this, but it would involve changing the download function and the work generator into more easily introspectable objects. Right now the scheduler could only see that there is a generator expression and a Deferred; it would have no way of finding out what host a particular task would grab a URL from.
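
    For instance, one hypothetical shape for such an object (names invented for illustration):

    import urlparse
    from twisted.web import client

    class DownloadTask:
        # A scheduler could inspect `host` to group or throttle
        # downloads per host, rather than seeing only an opaque
        # generator expression and a Deferred.
        def __init__(self, url, fileName):
            self.url = url
            self.host = urlparse.urlparse(url)[1]
            self.fileName = fileName

        def start(self):
            return client.downloadPage(self.url, file(self.fileName, 'wb'))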

    ReplyDelete
  5. Very, very nifty - I'm already toying with some code using it (as, perhaps, I should have done before posting). Thanks for spreading the word on this :)

    ReplyDelete
  6. How would you make the generator into a more easily introspectable object?

    ReplyDelete
  7. Your post is dated 2006; would it work with Twisted 8.x/10.x too?

    I tried it with 8.x, but after 1K downloads it hangs somewhere.

    Another issue is 100% CPU load. Although, yes, the network bandwidth of 1 Mbyte/s is fully used while downloading the first 1K pages.
    (The usual multithreaded download implemented in Python leaves my CPU almost idle.)

    DISCLAIMER: I am a newbie in Twisted.

    ReplyDelete
    > Your post is dated 2006; would it work with Twisted 8.x/10.x too?

    Sure.

    > I tried it with 8.x, but after 1K downloads it hangs somewhere.

    The code is sort of a toy. It demonstrates task.Cooperator, but it's missing proper error handling, logging, and timeouts. There are lots of reasons it might hang when used to download lots of real URLs. For example, if there are enough very slow web servers being hit, then eventually the client will find itself just waiting for them. This isn't really "hung", but will appear as though it is. Another possibility is a network disruption, which leaves a connection to a server broken but never noticed. If enough of these accumulate, then no new work can be processed.
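
    As one small improvement along those lines, a sketch: errback each download individually, so that a single failed URL is logged rather than stopping one of the fifty workers outright.

    from twisted.python import log
    from twisted.web import client

    def download((url, fileName)):
        d = client.downloadPage(url, file(fileName, 'wb'))
        # Log and swallow this download's failure; otherwise the failure
        # propagates and terminates the cooperative task that yielded it.
        d.addErrback(log.err)
        return d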

    It's always a good idea to include timeouts on network operations, otherwise you're left at the whim of a random internet service which doesn't necessarily have your best interests in mind.

    For this particular case, note that `getPage` accepts a timeout parameter.
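
    For example (a sketch; the timeout keyword is forwarded to the underlying HTTP factory, so downloadPage accepts it as well):

    from twisted.web import client

    def download((url, fileName)):
        # Fail the Deferred if the server stalls for 30 seconds,
        # instead of waiting on it forever.
        return client.downloadPage(url, file(fileName, 'wb'), timeout=30)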

    > Another issue is 100% CPU load. Although, yes, the network bandwidth of 1 Mbyte/s is fully used while downloading the first 1K pages.

    It's hard for me to say anything meaningful about this. As a first approximation, I'll say great! A busy CPU means you're getting things done and is nothing to complain about. If you want a different story, you need to present more information. :) I suggest #twisted on irc.freenode.net or the twisted-python mailing list for any more in-depth conversations, though.

    ReplyDelete
  9. As I understand it (from _tasksWhileNotStopped), Cooperator is round-robin, right?

    I'd like to exhaust a list of queues by always scheduling from the longest queue.

    Are the lower-level primitives like DeferredSemaphore the way to go?

    ReplyDelete
  10. @neurosurg: Yes, it still works with 8.x and 10.x. Hard to say what might cause it to "hang" without knowing exactly what code you're running. Did you use my example code with a longer list of URLs? As far as CPU usage goes, I've definitely been able to pull at least 1Mbyte/sec and do processing on each page (like parse the HTML with a lenient parser). My CPU bottlenecks always came from the per-page processing, never from the downloading itself. Again, depends on the exact details of the code. :) It's possible there's something in Twisted which should be performing better though! (most software can be faster somehow :).

    ReplyDelete
  11. @manigel: Yes, it's round-robin. For alternate scheduling algorithms, I'd probably recommend building something very much like Cooperator, but which isn't simply round-robin. It would be very cool if the scheduling algorithm could even be a parameter to Cooperator, but the current iterator-based interface might rule that out.
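
    That said, the workers pull lazily from whatever iterator you hand to coiterate, so one sketch of a longest-queue policy is to encode it in the work generator itself (hypothetical, untested):

    def longest_queue_work(queues, callable):
        # Whenever an idle worker asks for its next task, hand it an
        # item from whichever queue is currently longest.
        while any(queues):
            queue = max(queues, key=len)
            yield callable(queue.pop(0))

    # Drop-in replacement for `work` in the original post's parallel():
    #     defer.DeferredList([coop.coiterate(work) for i in xrange(count)])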

    ReplyDelete