Possible PyCon paper 2003-10-31

Early early early early draft. Just a bunch of notes, really.

On a framework for the transparent migration of user services from one
process to another

Circumstances exist where it is desirable to change the behavior of a
running application without interrupting service to the users it is
currently serving. In high-load environments, it is often the case that a
significant number of users are relying on the service at any given time.
In these environments, the simplistic approach of waiting for all users to
sign off, then shutting down and restarting the server software, is not
feasible.

Several solutions will be discussed, including techniques similar to those
for dealing with hardware failure, dynamic code loading/unloading
approaches, process re-execution, capability transfer, and gradual service
transfers.

hardware failure -

if you have several machines serving in a cluster arrangement, you need
to deal with one of them falling over anyway

load balancer in front of the cluster can keep track of which are up and
which are down

when software is to be upgraded, each machine can be taken down and
brought back up in turn

this looks just like a hardware fall over ("fail over") to the manager
and service to other customers can be maintained

with a little more work, individual machines in the cluster can be
brought down more gracefully (eg, stop giving them new connections, and wait
for all their existing connections to finish, *then* perform the procedure)

drawbacks -

requires cluster and manager

client connections may be long lasting, leaving a server out of
commission for hours or days

requires any server in the cluster to be able to handle any user

this is somewhat tractable, if the database can be connected to
remotely and is separate from the application running on each cluster
machine
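
The graceful variant described above (stop giving a machine new
connections, wait for its existing ones to finish, then upgrade) can be
sketched as a toy in-memory model. This is only an illustration: the
Backend and Balancer classes and their method names are made up for these
notes and do not correspond to any real load balancer's API.

```python
class Backend:
    """One machine in the cluster, as seen by the manager."""

    def __init__(self, name):
        self.name = name
        self.accepting = True  # may the balancer route new connections here?
        self.active = 0        # connections currently being served


class Balancer:
    """Toy manager that routes connections and drains machines."""

    def __init__(self, backends):
        self.backends = list(backends)

    def connect(self):
        # Route to the least-loaded backend that still accepts connections.
        candidates = [b for b in self.backends if b.accepting]
        if not candidates:
            raise RuntimeError("no backend available")
        chosen = min(candidates, key=lambda b: b.active)
        chosen.active += 1
        return chosen

    def disconnect(self, backend):
        backend.active -= 1

    def drain(self, backend):
        # Stop giving this machine new connections; existing ones continue.
        backend.accepting = False

    def ready_for_upgrade(self, backend):
        # Safe to take down once drained and idle.
        return not backend.accepting and backend.active == 0
```

Note that ready_for_upgrade() may stay false for a very long time, which
is the long-lived-connection drawback listed above.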

dynamic code loading/unloading -

many systems already use module systems

Java/Python can load modules on the fly from various locations; C has dlopen()

but support for unloading/reloading is harder; dlclose() doesn’t
release the shared library on many platforms

requires much of the system to be carefully architected beforehand (hmm,
this may be a requirement for any successful system); the module loading
code itself cannot be modified in this scheme (eg, the JVM cannot be
upgraded), meaning restarts may still be necessary from time to time.
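
A minimal Python sketch of reloading a module in a running process, using
importlib.reload() from the standard library. The module name greeter_demo
and its greet() function are invented for the example. It also shows one
reason reloading needs careful architecture: a reference taken before the
reload keeps pointing at the old code.

```python
import importlib
import os
import sys
import tempfile


def write_module(directory, name, source):
    """Write a module's source file into directory."""
    with open(os.path.join(directory, name + ".py"), "w") as f:
        f.write(source)


def demo():
    saved_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # keep stale bytecode out of this demo
    with tempfile.TemporaryDirectory() as d:
        sys.path.insert(0, d)
        try:
            write_module(d, "greeter_demo",
                         "def greet():\n    return 'old behavior'\n")
            importlib.invalidate_caches()
            mod = importlib.import_module("greeter_demo")
            before = mod.greet()
            stale = mod.greet  # reference held by some other part of the app

            # "Upgrade" the code on disk, then reload it in place.
            write_module(d, "greeter_demo",
                         "def greet():\n    return 'new behavior, reloaded'\n")
            importlib.reload(mod)
            after = mod.greet()

            # Lookups through the module see the new code; the old
            # reference still runs the old function object.
            return before, after, stale()
        finally:
            sys.path.remove(d)
            sys.modules.pop("greeter_demo", None)
            sys.dont_write_bytecode = saved_flag
```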

process re-execution -

this involves storing state somewhere, execl()’ing the binary of the
running program, and restoring internal state from storage.

preserves file descriptors automatically, but little other state

allows changes to be made to any part of the application

brief suspension in service while state is serialized/deserialized

may be allowable for certain applications

only works on platforms with execl()
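
A rough sketch of the store / exec / restore cycle in Python. Assumptions
made for the example: state fits in a JSON file, and the new program
accepts a hypothetical "--restore PATH" argument and calls load_state() on
it. One Python-specific wrinkle: descriptors created by Python are
close-on-exec by default, so any that should survive the exec have to be
marked inheritable first.

```python
import json
import os
import sys
import tempfile


def save_state(state):
    """Write state to a temporary JSON file and return the file's path."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    return path


def load_state(path):
    """Read state written by save_state(), then delete the file."""
    with open(path) as f:
        state = json.load(f)
    os.unlink(path)
    return state


def prepare_for_exec(state, keep_fds=()):
    """Mark descriptors to survive the exec and store everything else."""
    for fd in keep_fds:
        # Without this, Python's default close-on-exec flag drops the fd.
        os.set_inheritable(fd, True)
    return save_state(state)


def reexec(state, script_path, keep_fds=()):
    """Replace this process with a fresh run of script_path; never returns."""
    state_path = prepare_for_exec(state, keep_fds)
    # "--restore" is a made-up flag the new program is assumed to understand.
    os.execl(sys.executable, sys.executable, script_path,
             "--restore", state_path)
```

The descriptor numbers themselves have to be part of the saved state (or
passed on the command line), since the new program has no other way to know
which inherited descriptors mean what.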

capability transfers -

like process re-execution

a new process is started, running the new code

all state required for communicating with clients (sessions, file
descriptors, etc) is handed to the new process when it is ready

the old process then shuts down

drawbacks -

requires system that can pass file descriptors between processes

brief suspension in service while transfer is made

system for passing current user state to a new process is complicated,
hard to generalize
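
A small sketch of the handover itself, assuming Python 3.9+ on a Unix
platform, where socket.send_fds() and socket.recv_fds() pass descriptors
over a Unix-domain socket. To stay self-contained, both "processes" live in
one process here and a socketpair stands in for a live client connection;
hand_over(), take_over() and the session layout are invented for the
example.

```python
import json
import socket


def hand_over(channel, client_conn, sessions):
    """Old process: send the client's descriptor plus session state."""
    payload = json.dumps(sessions).encode()
    socket.send_fds(channel, [payload], [client_conn.fileno()])


def take_over(channel):
    """New process: receive the descriptor and rebuild a socket from it."""
    data, fds, _flags, _addr = socket.recv_fds(channel, 65536, 1)
    return socket.socket(fileno=fds[0]), json.loads(data)


def demo():
    # Control channel between the old and the new process.
    channel_old, channel_new = socket.socketpair()
    # Stand-in for an established client connection served by the old one.
    client, served_by_old = socket.socketpair()

    hand_over(channel_old, served_by_old, {"alice": {"cart": 3}})
    served_by_old.close()  # the old process drops its copy and shuts down

    conn, sessions = take_over(channel_new)
    client.sendall(b"ping")  # the client never notices the switch
    received = conn.recv(4)

    for s in (client, conn, channel_old, channel_new):
        s.close()
    return sessions, received
```

The descriptor is the easy part; the session dictionary is where the "hard
to generalize" drawback shows up, since real user state is rarely this
simple to serialize.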

gradual service transfer -

like capability transfers

new process is brought up; when it is ready, old process stops accepting
new clients and new one begins doing so

old process continues serving existing clients

when all clients have signed off, old process exits
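
The lifecycle above, as a toy state model. Nothing here touches real
sockets; ServerProcess, accept() and hand_over() are names made up for
these notes, and the point is only the ordering of the steps.

```python
class ServerProcess:
    """Toy stand-in for one server process."""

    def __init__(self, name, accepting=False):
        self.name = name
        self.accepting = accepting  # does it take new clients?
        self.retiring = False       # has it been told to wind down?
        self.clients = set()
        self.exited = False

    def _maybe_exit(self):
        # A retiring process exits once its last client is gone.
        if self.retiring and not self.clients:
            self.exited = True

    def sign_off(self, client):
        self.clients.discard(client)
        self._maybe_exit()


def accept(client, processes):
    """Give a new client to whichever process is currently accepting."""
    for p in processes:
        if p.accepting:
            p.clients.add(client)
            return p
    raise RuntimeError("no process is accepting clients")


def hand_over(old, new):
    """New process is ready: it starts accepting, the old one stops."""
    new.accepting = True
    old.accepting = False
    old.retiring = True
    old._maybe_exit()  # exits at once if it had no clients left
```

As with draining a cluster machine, the old process may linger for as long
as its longest-lived client stays connected.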