Sunday, September 24, 2017

Finishing the txflashair Dockerfile

Some while ago I got a new wifi-capable camera. Of course, it has some awful proprietary system for actually transferring images to a real computer. Fortunately, it's all based on a needlessly complex HTTP interface which can fairly easily be driven by any moderately capable HTTP client. I played around with FlashAero a bit first but it doesn't do quite what I want out of the box and the code is a country mile from anything I'd like to hack on. It did serve as a decent resource for the HTTP interface to go alongside the official reference which I didn't find until later.

Fast forward a bit and I've got txflashair doing basically what I want - essentially, synchronizing the contents of the camera to a local directory. Great. Now I just need to deploy this such that it will run all the time and I can stop thinking about this mundane task forever. Time to bust out Docker, right? It is 2017 after all.

This afternoon I took the Dockerfile I'd managed to cobble together in the last hack session:

FROM python:2-alpine

COPY . /src
RUN apk add --no-cache python-dev
RUN apk add --no-cache openssl-dev
RUN apk add --no-cache libffi-dev
RUN apk add --no-cache build-base

RUN pip install /src

VOLUME /data

ENTRYPOINT ["txflashair-sync"]
CMD ["--device-root", "/DCIM", "--local-root", "/data", "--include", "IMG_*.JPG"]
and turn it into something halfway to decent and that produces something actually working to boot:
FROM python:2-alpine

RUN apk add --no-cache python-dev
RUN apk add --no-cache openssl-dev
RUN apk add --no-cache libffi-dev
RUN apk add --no-cache build-base
RUN apk add --no-cache py-virtualenv
RUN apk add --no-cache linux-headers

RUN virtualenv /app/env

COPY requirements.txt /src/requirements.txt
RUN /app/env/bin/pip install -r /src/requirements.txt

COPY . /src

RUN /app/env/bin/pip install /src

FROM python:2-alpine

RUN apk add --no-cache py-virtualenv

COPY --from=0 /app/env /app/env

VOLUME /data

ENTRYPOINT ["/app/env/bin/txflashair-sync"]
CMD ["--device-root", "/DCIM", "--local-root", "/data", "--include", "IMG_*.JPG"]

So, what have I done exactly? The change to make the thing work is basically just to install the missing py-virtualenv. It took a few minutes to track this down. netifaces has this as a build dependency. I couldn't find an apk equivalent to apt-get build-dep but I did finally track down its APKBUILD file and found that linux-headers was probably what I was missing. Et voila, it was. <>Perhaps more interesting, though, are the changes to reduce the image size. I began using the new-ish Docker feature of multi-stage builds. From the beginning of the file down to the 2nd FROM line defines a Docker image as usual. However, the second FROM line starts a new image which is allowed to copy some of the contents of the first image. I merely copy the entire virtualenv that was created in the first image into the second one, leaving all of the overhead of the build environment behind to be discarded.

The result is an image that only has about 50MiB of deltas (compressed, I guess; Docker CLI presentation of image/layer sizes seems ambiguous and/or version dependent) from the stock Alphine Python 2 image. That's still pretty big for what's going on but it's not crazy big - like including all of gcc, etc.

The other changes involving virtualenv are in support of using the multi-stage build feature. Putting the software in a virtualenv is not a bad idea in general but in this case it also provides a directory containing all of the necessary txflashair bits that can easily be copied to the new image. Note that py-virtualenv is also copied to the second image because a virtualenv does not work without virtualenv itself being installed, strangely.

Like this kind of thing? Check out Supporing Open Source on the right.

Friday, September 15, 2017

SSH to EC2 (Refrain)

Recently Moshe wrote up a demonstration of the simple steps needed to retrieve an SSH public key from an EC2 instance to populate a known_hosts file. Moshe's example uses the highly capable boto3 library for its EC2 interactions. However, since his blog is syndicated on Planet Twisted, reading it left me compelled to present an implementation based on txAWS instead.

First, as in Moshe's example, we need argv and expanduser so that we can determine which instance the user is interested in (accepted as a command line argument to the tool) and find the user's known_hosts file (conventionally located in ~):

from sys import argv
from os.path import expanduser
Next, we'll get an abstraction for working with filesystem paths. This is commonly used in Twisted APIs because it saves us from many path manipulation mistakes committed when representing paths as simple strings:
from filepath import FilePath
Now, get a couple of abstractions for working with SSH. Twisted Conch is Twisted's SSH library (client & server). KnownHostsFile knows how to read and write the known_hosts file format. We'll use it to update the file with the new key. Key knows how to read and write SSH-format keys. We'll use it to interpret the bytes we find in the EC2 console output and serialize them to be written to the known_hosts file.
from twisted.conch.client.knownhosts import KnownHostsFile
from twisted.conch.ssh.keys import Key
And speaking of the EC2 console output, we'll use txAWS to retrieve it. AWSServiceRegion is the main entrypoint into the txAWS API. From it, we can get an EC2 client object to use to retrieve the console output.
from txaws.service import AWSServiceRegion
And last among the imports, we'll write the example with inlineCallbacks to minimize the quantity of explicit callback-management code. Due to the simplicity of the example and the lack of any need to write tests for it, I won't worry about the potential problems with confusing tracebacks or hard-to-test code this might produce. We'll also use react to drive the whole thing so we don't need to explicitly import, start, or stop the reactor.
from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import react
With that sizable preamble out of the way, the example can begin in earnest. First, define the main function using inlineCallbacks and accepting the reactor (to be passed by react) and the EC2 instance identifier (taken from the command line later on):
@inlineCallbacks
def main(reactor, instance_id):
Now, get the EC2 client. This usage of the txAWS API will find AWS credentials in the usual way (looking at AWS_PROFILE and in ~/.aws for us):
    region = AWSServiceRegion()
    ec2 = region.get_ec2_client()
Then it's a simple matter to get an object representing the desired instance and that instance's console output. Notice these APIs return Deferred so we use yield to let inlineCallbacks suspend this function until the results are available.
    [instance] = yield ec2.describe_instances(instance_id)
    output = yield ec2.get_console_output(instance_id)
Some simple parsing logic, much like the code in Moshe's implementation (since this is exactly the same text now being operated on). We do take the extra step of deserializing the key into an object that we can use later with a KnownHostsFile object.
    keys = (
        Key.fromString(key)
        for key in extract_ssh_key(output.output)
    )
Then write the extracted keys to the known hosts file:
    known_hosts = KnownHostsFile.fromPath(
        FilePath(expanduser("~/.ssh/known_hosts")),
    )
    for key in keys:
        for name in [instance.dns_name, instance.ip_address]:
            known_hosts.addHostKey(name, key)
    known_hosts.save()
There's also the small matter of actually parsing the console output for the keys:
def extract_ssh_key(output):
    return (
        line for line in output.splitlines()
        if line.startswith(u"ssh-rsa ")
    )
And then kicking off the whole process:
react(main, argv[1:])
Putting it all together:
from sys import argv
from os.path import expanduser

from filepath import FilePath

from twisted.conch.client.knownhosts import KnownHostsFile
from twisted.conch.ssh.keys import Key

from txaws.service import AWSServiceRegion

from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import react

@inlineCallbacks
def main(reactor, instance_id):
    region = AWSServiceRegion()
    ec2 = region.get_ec2_client()

    [instance] = yield ec2.describe_instances(instance_id)
    output = yield ec2.get_console_output(instance_id)

    keys = (
        Key.fromString(key)
        for key in extract_ssh_key(output.output)
    )

    known_hosts = KnownHostsFile.fromPath(
        FilePath(expanduser("~/.ssh/known_hosts")),
    )
    for key in keys:
        for name in [instance.dns_name, instance.ip_address]:
            known_hosts.addHostKey(name, key)
    known_hosts.save()

def extract_ssh_key(output):
    return (
        line for line in output.splitlines()
        if line.startswith(u"ssh-rsa ")
    )

react(main, argv[1:])

So, there you have it. Roughly equivalent complexity to using boto3 and on its own there's little reason to prefer this to what Moshe has written about. However, if you have a larger Twisted-based application then you may prefer the natively asynchronous txAWS to blocking boto3 calls or managing boto3 in a thread somehow.

Also, I'd like to thank LeastAuthority (my current employer and operator of the Tahoe-LAFS-based S4 service which just so happens to lean heavily on txAWS) for originally implementing get_console_output for txAWS (which, minor caveat, will not be available until the next release of txAWS is out).

As always, if you like this sort of thing, check out the support links on the right.