Doubling bytestreams 2005-05-24

For a while now, we have been moving a somewhat hefty chunk of bytes, about 15 GB worth, from one machine to another every day. The bytes are a tar file, generated in real time and piped to ssh (connected to a host untarring the bytes). Pretty standard stuff, really. A while back we decided we wanted to send this tar to two hosts instead of one. No big deal, we just ran tar twice, piping the output to an ssh process connected to a different host each time. Worked like a charm.

Recently, we decided the load incurred by the second copy was heavy enough to be worth avoiding. Obvious solution: pipe tar to tee, send one of tee’s outputs to one ssh, the other to the other.

That doesn’t work. Whoops.

For whatever reason, tee chokes after a bit less than 8 GB of data. “write error”, it says, and poof, one of the streams is dead (always the same one, interestingly, while the other always carries on just fine). Rather than waste too much time trying to figure out who is at fault here (or perhaps as a way of doing so ;), I wrote this quickie:

#!/usr/bin/python

"""Write bytes read from one file to one or more files.
"""

import os, sys

from twisted.python import usage, log

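class Options(usage.Options):
    def opt_out(self, arg):
        # --out (or -o) may be given several times; collect every filename.
        vars(self).setdefault('out', []).append(arg)
    opt_o = opt_out

    def postOptions(self):
        # Once parsing is done, open each output and keep the raw descriptors.
        self.outfds = []
        for fname in vars(self).get('out', []):
            self.outfds.append(os.open(fname, os.O_WRONLY | os.O_CREAT))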

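def main(infd, *outfds):
    # *outfds arrives as a tuple; copy it to a list so dead outputs can be dropped.
    outfds = list(outfds)
    while True:
        data = os.read(infd, 2 ** 16)
        if not data:
            break
        # Walk backwards so deleting an entry does not shift the ones still to visit.
        for i in range(len(outfds) - 1, -1, -1):
            try:
                # os.write may accept fewer bytes than offered; loop until all are written.
                remaining = data
                while remaining:
                    remaining = remaining[os.write(outfds[i], remaining):]
            except OSError:
                log.msg("Error writing to %d" % (outfds[i],))
                log.err()
                del outfds[i]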

if __name__ == '__main__':
    o = Options()
    try:
        o.parseOptions()
    except usage.UsageError as e:
        # Bad command line: report it and exit non-zero instead of dumping a traceback.
        sys.exit(str(e))
    log.startLogging(sys.stdout)
    main(0, *o.outfds)

I called it yjoint, uncreative clod that I am (Hey, at least I didn’t call it pytee). It’s not exactly a drop-in tee replacement. We use it more or less like this:

exarkun@boson:~/$ tar c yummy_data | yjoint \
> --out >(ssh host1 tar x) \
> --out >(ssh host2 tar x)

Swapped tee out and yjoint in, and suddenly we are in business again.
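
A note on the >(…) bits in that command: they are bash process substitution. bash starts each command with a pipe on its stdin and hands yjoint a path (something like /dev/fd/63) in its place, which is why --out can simply open it like any other file. For anyone without bash handy, here is a rough sketch of the same fan-out done with subprocess. This is not yjoint itself: fan_out and its parameters are names made up for illustration, it assumes Python 3, and it has not been run against a 15 GB tar.

```python
import subprocess


def fan_out(infile, commands, chunk_size=2 ** 16):
    """Copy everything read from infile to the stdin of each command.

    Hypothetical helper, not part of yjoint. `infile` is a binary file
    object and `commands` is a list of argv lists. Returns the exit codes
    in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd, stdin=subprocess.PIPE) for cmd in commands]
    live = list(procs)
    while live:
        data = infile.read(chunk_size)
        if not data:
            break
        # Iterate over a copy so a failed child can be dropped mid-loop.
        for proc in list(live):
            try:
                proc.stdin.write(data)
            except OSError:
                # The child went away (e.g. broken pipe); keep feeding the rest.
                live.remove(proc)
    for proc in procs:
        try:
            # Closing flushes buffered data and signals EOF to the child.
            proc.stdin.close()
        except OSError:
            pass
    return [proc.wait() for proc in procs]


# Example call (host names as in the command above):
#   fan_out(sys.stdin.buffer, [["ssh", "host1", "tar", "x"],
#                              ["ssh", "host2", "tar", "x"]])
```

Like yjoint, it keeps feeding the surviving outputs when one of them dies. Unlike yjoint, it stops reading once every output is gone.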

I wonder what the deal with tee is?