Saturday, December 20, 2014

Asynchronous Object Initialization - Patterns and Antipatterns

I caught Toshio Kuratomi's post about asyncio initialization patterns (or anti-patterns) on Planet Python. This is something I've dealt with a lot over the years using Twisted (one of the sources of inspiration for the asyncio developers).

To recap, Toshio wondered about a pattern involving asynchronous initialization of an instance. He wondered whether it was a good idea to start this work in __init__ and then explicitly wait for it in other methods of the class before performing the distinctive operations required by those other methods. Using asyncio (and using Toshio's example with some omissions for simplicity) this looks something like:

class Microblog:
    def __init__(self, ...):
        loop = asyncio.get_event_loop()
        self.init_future = loop.run_in_executor(None, self._reading_init)

    def _reading_init(self):
        # ... do some initialization work,
        # presumably expensive or otherwise long-running ...

   @asyncio.coroutine
    def sync_latest(self):
        # Don't do anything until initialization is done
        yield from self.init_future
        # ... do some work that depends on that initialization ...

It's quite possible to do something similar to this when using Twisted. It only looks a little bit difference:

class Microblog:
    def __init__(self, ...):
        self.init_deferred = deferToThread(self._reading_init)

    def _reading_init(self):
        # ... do some initialization work,
        # presumably expensive or otherwise long-running ...

   @inlineCallbacks
    def sync_latest(self):
        # Don't do anything until initialization is done
        yield self.init_deferred
        # ... do some work that depends on that initialization ...

Despite the differing names, these two pieces of code basical do the same thing:

  • run _reading_init in a thread from a thread pool
  • whenever sync_latest is called, first suspend its execution until the thread running _reading_init has finished running it

Maintenance costs

One thing this pattern gives you is an incompletely initialized object. If you write m = Microblog() then m refers to an object that's not actually ready to perform all of the operations it supposedly can perform. It's either up to the implementation or the caller to make sure to wait until it is ready. Toshio suggests that each method should do this implicitly (by starting with yield self.init_deferred or the equivalent). This is definitely better than forcing each call-site of a Microblog method to explicitly wait for this event before actually calling the method.

Still, this is a maintenance burden that's going to get old quickly. If you want full test coverage, it means you now need twice as many unit tests (one for the case where method is called before initialization is complete and another for the case where the method is called after this has happened). At least. Toshio's _reading_init method actually modifies attributes of self which means there are potentially many more than just two possible cases. Even if you're not particularly interested in having full automated test coverage (... for some reason ...), you still have to remember to add this yield statement to the beginning of all of Microblog's methods. It's not exactly a ton of work but it's one more thing to remember any time you maintain this code. And this is the kind of mistake where making a mistake creates a race condition that you might not immediately notice - which means you may ship the broken code to clients and you get to discover the problem when they start complaining about it.

Diminished flexibility

Another thing this pattern gives you is an object that does things as soon as you create it. Have you ever had a class with a __init__ method that raised an exception as a result of a failing interaction with some other part of the system? Perhaps it did file I/O and got a permission denied error or perhaps it was a socket doing blocking I/O on a network that was clogged and unresponsive. Among other problems, these cases are often difficult to report well because you don't have an object to blame the problem on yet. The asynchronous version is perhaps even worse since a failure in this asynchronous initialization doesn't actually prevent you from getting the instance - it's just another way you can end up with an incompletely initialized object (this time, one that is never going to be completely initialized and use of which is unsafe in difficult to reason-about ways).

Another related problem is that it removes one of your options for controlling the behavior of instances of that class. It's great to be able to control everything a class does just by the values passed in to __init__ but most programmers have probably come across a case where behavior is controlled via an attribute instead. If __init__ starts an operation then instantiating code doesn't have a chance to change the values of any attributes first (except, perhaps, by resorting to setting them on the class - which has global consequences and is generally icky).

Loss of control

A third consequence of this pattern is that instances of classes which employ it are inevitably doing something. It may be that you don't always want the instance to do something. It's certainly fine for a Microblog instance to create a SQLite3 database and initialize a cache directory if the program I'm writing which uses it is actually intent on hosting a blog. It's most likely the case that other useful things can be done with a Microblog instance, though. Toshio's own example includes a post method which doesn't use the SQLite3 database or the cache directory. His code correctly doesn't wait for init_future at the beginning of his post method - but this should leave the reader wondering why we need to create a SQLite3 database if all we want to do is post new entries.

Using this pattern, the SQLite3 database is always created - whether we want to use it or not. There are other reasons you might want a Microblog instance that hasn't initialized a bunch of on-disk state too - one of the most common is unit testing (yes, I said "unit testing" twice in one post!). A very convenient thing for a lot of unit tests, both of Microblog itself and of code that uses Microblog, is to compare instances of the class. How do you know you got a Microblog instance that is configured to use the right cache directory or database type? You most likely want to make some comparisons against it. The ideal way to do this is to be able to instantiate a Microblog instance in your test suite and uses its == implementation to compare it against an object given back by some API you've implemented. If creating a Microblog instance always goes off and creates a SQLite3 database then at the very least your test suite is going to be doing a lot of unnecessary work (making it slow) and at worst perhaps the two instances will fight with each other over the same SQLite3 database file (which they must share since they're meant to be instances representing the same state). Another way to look at this is that inextricably embedding the database connection logic into your __init__ method has taken control away from the user. Perhaps they have their own database connection setup logic. Perhaps they want to re-use connections or pass in a fake for testing. Saving a reference to that object on the instance for later use is a separate operation from creating the connection itself. They shouldn't be bound together in __init__ where you have to take them both or give up on using Microblog.

Alternatives

You might notice that these three observations I've made all sound a bit negative. You might conclude that I think this is an antipattern to be avoided. If so, feel free to give yourself a pat on the back at this point.

But if this is an antipattern, is there a pattern to use instead? I think so. I'll try to explain it.

The general idea behind the pattern I'm going to suggest comes in two parts. The first part is that your object should primarily be about representing state and your __init__ method should be about accepting that state from the outside world and storing it away on the instance being initialized for later use. It should always represent complete, internally consistent state - not partial state as asynchronous initialization implies. This means your __init__ methods should mostly look like this:

class Microblog(object):
    def __init__(self, cache_dir, database_connection):
        self.cache_dir = cache_dir
        self.database_connection = database_connection

If you think that looks boring - yes, it does. Boring is a good thing here. Anything exciting your __init__ method does is probably going to be the cause of someone's bad day sooner or later. If you think it looks tedious - yes, it does. Consider using Hynek Schlawack's excellent characteristic package (full disclosure - I contributed some ideas to characteristic's design and Hynek ocassionally says nice things about me (I don't know if he means them, I just know he says them)).

The second part of the idea an acknowledgement that asynchronous initialization is a reality of programming with asynchronous tools. Fortunately __init__ isn't the only place to put code. Asynchronous factory functions are a great way to wrap up the asynchronous work sometimes necessary before an object can be fully and consistently initialized. Put another way:

class Microblog(object):
    # ... __init__ as above ...

    @classmethod
    @asyncio.coroutine
    def from_database(cls, cache_dir, database_path):
        # ... or make it a free function, not a classmethod, if you prefer
        loop = asyncio.get_event_loop()
        database_connection = yield from loop.run_in_executor(None, cls._reading_init)
        return cls(cache_dir, database_connection)

Notice that the setup work for a Microblog instance is still asynchronous but initialization of the Microblog instance is not. There is never a time when a Microblog instance is hanging around partially ready for action. There is setup work and then there is a complete, usable Microblog.

This addresses the three observations I made above:

  • Methods of Microblog never need to concern themselves with worries about whether the instance has been completely initialized yet or not.
  • Nothing happens in Microblog.__init__. If Microblog has some methods which depend on instance attributes, any of those attributes can be set after __init__ is done and before those other methods are called. If the from_database constructor proves insufficiently flexible, it's easy to introduce a new constructor that accounts for the new requirements (named constructors mean never having to overload __init__ for different competing purposes again).
  • It's easy to treat a Microblog instance as an inert lump of state. Simply instantiating one (using Microblog(...) has no side-effects. The special extra operations required if one wants the more convenient constructor are still available - but elsewhere, where they won't get in the way of unit tests and unplanned-for uses.

I hope these points have made a strong case for one of these approaches being an anti-pattern to avoid (in Twisted, in asyncio, or in any other asynchronous programming context) and for the other as being a useful pattern to provide both convenient, expressive constructors while at the same time making object initializers unsurprising and maximizing their usefulness.

3 comments:

  1. That's called Dependency Injection. Otherwise known also as passing arguments to functions.

    ReplyDelete
  2. Thanks exarkun! This is a great explanation of why not to delay initialization while creating an object and where to perform long running, optional initialization steps instead.

    ReplyDelete
  3. One question, though: in this pattern, what's the best way to instantiate the class for posting-only usage (addressing the Loss of Control point)? Do you, in your application programmer role instead of the library programmer role, just make the decision to instantiate the class with a fake value for database_connection?

    ReplyDelete