Shipping Python Applications in Docker

Fri 17 March 2017 by Moshe Zadka

Introduction

When looking at open source examples or tutorials, one will often see Dockerfiles that look like this:

FROM python:3.6
COPY setup.py /mypackage
COPY src /mypackage/src
COPY requirements.txt /
RUN pip install -r /requirements.txt
RUN pip install /mypackage
ENTRYPOINT ["/usr/bin/mypackage-console-script", "some", "arguments"]

This typical naive example has multiple problems, which this article will show how to fix.

  • It uses a container with a functioning build environment
  • It has the potential for "loose" and/or incomplete requirements.txt, so that changes in PyPI (or potential compromise of it) change what is installed in the container.
  • It is based on python:3.6, a label which can point to different images at different times (for example, when Python 3.6 has a patch release).
  • It does not use virtual environments, leading to potential bad interactions.

In general, when thinking about the Docker container image as the "build" output of our source tree, it is good to aim for a reproducible build: building the same source tree should produce equivalent results. (Literally bit-wise identical results are a good thing to aim for, but there are many potential sources of spurious bit-wise changes, such as dates embedded in zip files.) One important reason is bug fixes: for example, a bug fix that changes a single line of Python code is usually a safe change, reasonably easy to understand and to test for regressions. However, if deploying it can also update Python from 3.6.1 to 3.6.2, or Twisted from 17.1 to 17.2, regressions are more likely and the change has to be tested more carefully.

Managing Base Images

Docker images are built in layers. In a Dockerfile, the FROM line imports the layers from the source image. Every other line adds exactly one layer.
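To see this layering concretely, docker history lists an image's layers, one row per layer -- roughly, one row per Dockerfile instruction, including those of the base images:

$ docker history python:3.6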

Container registries store images. Internally, they deduplicate identical layers. Images are indexed by user (or organization), repository name and tag. These are not immutable: it is possible to upload a new image, and point an old user/repository/tag at it.

In theory, it is possible to achieve immutability by content-addressing. However, registries usually garbage-collect layers with no pointers to them. The only safe thing to do is to own the images under one's own user or organization. Note that to "own" the images, all we need to do is to push them under our own name. The push will be fast -- since the actual content is already on the Docker registry side, it will short-circuit the actual upload.

Especially when working over a typical home internet connection, or an internet cafe connection, download speeds are usually reasonable but upload speeds are slow. This push does not depend on a fast upload speed, since it only uploads small hashes. However, it makes sure the images are kept forever, and that we have a stable pointer to them.

The following code shows an example of fork-tagging in this way into a user account. We use time, precise to the microsecond, to name the images. This means that running this script is all but guaranteed to result in unique image labels, and they even sort correctly.

import datetime, subprocess
tag = datetime.datetime.utcnow().isoformat()
tag = tag.replace(':', '-').replace('.', '-')
for ext in ['', '-slim']:
    image = "moshez/python36{}:{}".format(ext, tag)
    orig = "python:3.6{}".format(ext)
    subprocess.check_call(["docker", "pull", orig])
    subprocess.check_call(["docker", "tag", orig, image])
    subprocess.check_call(["docker", "push", image])

(This assumes the Docker client is already pre-authenticated, via running docker login.)

It produces images that look like:

moshez/python36-slim:2017-03-10T02-18-12-843046
   b5b6550a858c
   198.6 MB
moshez/python36:2017-03-10T02-18-12-843046
   a1782fa44ef7
   687.2 MB

The script is safe to run -- it will not clobber existing labels. It will require a source change to use the new images -- usually by changing a Dockerfile.
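For example, switching a Dockerfile from the older fork-tagged image used later in this article to the freshly pushed one is a one-line, easily reviewable diff:

-FROM moshez/python36:2017-03-09T04-50-49-169150
+FROM moshez/python36:2017-03-10T02-18-12-843046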

Python Third Party Dependencies

When installing dependencies, it is good to ensure that those cannot change. Modern pip gives great tools to do that, but those are rarely used. The following describes how to achieve reproducible builds while still allowing for easy upgrades.

"Loose" requirements specify programmer intent. They indicate what dependencies the programmer cares about: e.g., wanting a minimum version of a library. They are handwritten by the developer. Note that often that "hand-writing" can be little more than the single line of the "main" package (the one that contains the code which is written in-house). It will usually be a bit more than that: for example, pex and docker-py to support the build process.

Strict dependencies indicate the entire dependency chain: complete, with specific versions and hashes. Note that both must be checked in. Re-generating the strict dependencies is a source-tree change operation, no different from changing the project's own Python code, because it changes the build output.
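As an illustration only -- the exact contents depend on the project -- a loose requirements file for the examples in this article might look something like:

# requirements.loose.txt (illustrative)
remotemath          # the application being shipped
Twisted>=17.1       # a minimum-version constraint
pex                 # build tooling
docker-py           # build tooling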

The workflow for regenerating the strict dependencies might look like:

$ git checkout -b updating-third-party
$ docker build -t temp-image -f harden.docker .
$ docker run --rm -it temp-image > requirements.strict.txt
$ git commit -a -m 'Update 3rd party'
$ git push
$ # Follow code workflow

There is no way, currently, to statically analyze dependencies. Docker allows us to install the packages in an ephemeral environment, analyze the dependencies and compute the hashes, and then output the results. The output can be checked in, while the environment used for the installation is discarded: this is what the --rm flag above means.

For the workflow above to work, we need a Dockerfile and a script. The Dockerfile is short, since most of the logic is in the script. We use the reproducible base images, (i.e., the images we fork-tagged), copy the inputs, and let the entry point run the script.

# harden.docker
FROM moshez/python36:2017-03-09T04-50-49-169150
COPY harden-requirements requirements.loose.txt /
ENTRYPOINT ["/harden-requirements"]

The script itself is divided into two parts. The first part installs the loose requirements in a virtual environment, and then runs pip freeze. This gives us the "frozen" requirements -- with specific versions, but without hashes. We use --all to force freezing of packages pip thinks we should not. One of those packages is setuptools, which pex needs specific versions of.

# harden-requirements
import subprocess, sys
subprocess.check_output([sys.executable, "-m", "venv",
                         "/envs/loose"])
subprocess.check_output(["/envs/loose/bin/pip", "install", "-r",
                         "/requirements.loose.txt"])
frozen = subprocess.check_output(["/envs/loose/bin/pip",
                                  "freeze", "--all"])
with open("/requirements.frozen.txt", 'wb') as fp:
    fp.write(frozen)

The second part takes the frozen requirements, and uses pip-compile (part of the pip-tools package) to generate the hashes. The --allow-unsafe flag is the scarier-looking, but semantically equivalent, version of pip's --all.

subprocess.check_output(["/envs/loose/bin/pip", "install",
                         "pip-tools"])
output = subprocess.check_output(["/envs/loose/bin/pip-compile",
                                  "--allow-unsafe",
                                  "--generate-hashes",
                                  "requirements.frozen.txt"])
for line in output.decode('utf-8').splitlines():
    print(line)

Docker-in-Docker

(Thanks to Glyph Lefkowitz for explaining that well.)

Too many examples in the wild use the same container to build and deploy. Those end up shipping a container full of build tools, such as make and gcc, to production. That has many downsides -- size, performance, security and isolation.

In order to properly separate those two tasks, we will need to learn how to run containers from inside containers. Running a container as a daughter of a container is a bad idea. Instead, we take advantage of the client/server architecture of Docker.

Contrary to some misconception, docker run does not run a container. It connects to the Docker daemon, which runs the container. The client knows how to connect to the daemon based on environment variables. If the environment is vanilla, the client has a default: it connects to the UNIX domain socket at /var/run/docker.sock. The daemon always listens locally on that socket.

Thus, we can run a container from inside a container that has access to the UNIX domain socket. Such a socket is simply a file, in keeping with UNIX's "everything is a file" philosophy. Therefore, we can pass it in to the container by using host volume mounting.

$ docker run -v /var/run/docker.sock:/var/run/docker.sock ...

A Docker client running inside such a container will connect to the outside daemon, and can cause it to run other containers.
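A minimal sketch of that client side using docker-py (the alpine image is only a stand-in): run from inside a container started with the volume mount above, it causes the host daemon to start a sibling container.

import docker

# With no DOCKER_HOST-style variables set, from_env() falls back to
# the UNIX domain socket at /var/run/docker.sock.
client = docker.from_env()

# This starts a *sibling* container via the host daemon, not a nested one;
# with detach off, run() returns the container's output.
print(client.containers.run("alpine", ["echo", "hello from a sibling"]))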

Python Output Formats

Wheel

Wheels are specially structured zip files. They are designed so a simple unzip is all that is needed to install them. The pip program will build, and cache, wheels for any package it installs.
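The same machinery is available directly from the command line; for example (the package name is only illustrative):

$ pip wheel --wheel-dir ./wheelhouse pyramid

This builds (or reuses cached) wheels for pyramid and everything it depends on, placing them in ./wheelhouse -- the same trick the build script later in this article uses to populate its wheelhouse.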

Pex

Pex is an executable Python program. It does not embed the interpreter, or the standard library. It does, however, embed all third-party packages: running the Pex file on a vanilla Python installation works, as long as the Python versions are compatible.

Pex works because running python somefile.zip works by adding somefile.zip to sys.path, and then running __main__.py from the archive as its main file. The zip format is end-based: adding an arbitrary prefix to the content does not change the semantics of a zip file. Pex adds the prefix #!/usr/bin/env python.
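A minimal sketch of that mechanism, without pex itself (file names are illustrative):

# Demonstrate that a zip file with a __main__.py is runnable, and that
# prepending a shebang line does not break it.
import os, stat, zipfile

with zipfile.ZipFile("hello.zip", "w") as zf:
    zf.writestr("__main__.py", "print('hello from inside a zip')\n")

# The zip directory is located from the *end* of the file, so an
# arbitrary prefix is ignored.
with open("hello.zip", "rb") as fp:
    body = fp.read()
with open("hello", "wb") as fp:
    fp.write(b"#!/usr/bin/env python\n" + body)
os.chmod("hello", os.stat("hello").st_mode | stat.S_IXUSR)

# Now both "python hello.zip" and "./hello" print the message.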

Building a Simple Service Container Image

The following will use the remotemath package, which does slow remote arithmetic operations (multiplication and negation). It does not have much utility in production, but it is useful for pedagogical purposes, such as this one.

In general, build systems can get complicated. For our examples, however, a simple Python script is all the build infrastructure we need.

Much like in the requirements hardening example, we will follow the pattern of creating a docker container image and running it immediately. Note that the first two lines do not depend on the so-called "context" -- the files that are being copied into the docker container from the outside. Because of this, the docker build process will know to cache them, leading to fast rebuilds.

Building like this also allows us to avoid mounting source files into the Docker container -- and, with that, the sometimes subtle semantics of such mounting.

# build.docker
FROM moshez/python36:2017-03-09T04-50-49-169150
RUN python3 -m venv /buildenv && mkdir /wheelhouse
COPY requirements.strict.txt build-script remotemath.docker /
ENTRYPOINT ["/buildenv/bin/python", "/build-script"]

This Dockerfile depends on two things we have not mentioned: a build script and the remotemath.docker file. The build script itself has one subtlety -- because it calls out to pip, it cannot import the docker module until it has been installed. Indeed, since it starts out in a vanilla virtual environment, it does not have access to any non-built-in packages. We do need to make sure the docker-py PyPI package makes it into the requirements file.

# build-script
import os, shutil, subprocess, sys

def run(cmd, *args):
    # Run a command from the build virtual environment.
    args = [os.path.join('/buildenv/bin', cmd)] + list(args)
    return subprocess.check_call(args)

os.makedirs('/mnt/output')
shutil.copy('/remotemath.docker',
            '/mnt/output/remotemath.docker')
run('pip', 'install', '--require-hashes',
                      '-r', 'requirements.strict.txt')
run('pip', 'wheel', '--require-hashes',
                    '-r', 'requirements.strict.txt',
                    '--wheel-dir', '/wheelhouse/')
run('pex', '-o', '/mnt/output/twist.pex',
           '-m', 'twisted',
           '--repo', '/wheelhouse', '--no-index',
           'remotemath')

import docker
client = docker.from_env()
client.images.build(
    path='/mnt/output/',
    dockerfile='/mnt/output/remotemath.docker',
    tag='moshez/remotemath:{}'.format(sys.argv[1]),
    rm=True,
)
## TODO: Push here

Docker has no way to "build and run" in one step. We build an image, tagged as temp-image, like we did for hardening. When builds on a shared machine are likely, this tag needs to be a UUID, with some garbage-collection mechanism; a sketch of such a variant follows the commands below.

$ docker build -t temp-image -f build.docker .
$ docker run --rm -it \
         -v /var/run/docker.sock:/var/run/docker.sock \
         temp-image $(git rev-parse HEAD)
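A sketch of the UUID-based variant mentioned above, assuming uuidgen is available (the name is lower-cased because image names must be lowercase), with docker rmi as the simplest possible garbage collection:

$ TEMP_IMAGE="temp-image-$(uuidgen | tr 'A-Z' 'a-z')"
$ docker build -t "$TEMP_IMAGE" -f build.docker .
$ docker run --rm -it \
         -v /var/run/docker.sock:/var/run/docker.sock \
         "$TEMP_IMAGE" $(git rev-parse HEAD)
$ docker rmi "$TEMP_IMAGE"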

The Dockerfile for the production service is also short: here, we take advantage of two nice properties of the python -m twisted command (wrapped into pex as twist.pex):

  • It accepts a port on the command-line. Since arguments to Docker run are appended to the command-line, this can be run as docker run moshez/remotemath:TAG --port=tcp:8080.
  • It reaps adopted children, and thus can be used as PID 1.

# remotemath.docker
FROM moshez/python36-slim:2017-03-09T04-50-49-169150
COPY twist.pex /
ENTRYPOINT ["/twist.pex", "remotemath"]

Building a Container Image With Custom Code

So far we have packaged a ready-made application. That was non-trivial in itself, but now let us write an application.

Pyramid App

We are using the Pyramid framework to write an application. It will have one page, which will use the remote math service to calculate 3*4.

# setup.py
import setuptools
setuptools.setup(name='fancycalculator',
    packages=setuptools.find_packages(where='src'),
    package_dir={"": "src"},
    install_requires=['pyramid', 'requests'])

# src/fancycalculator/app.py
import os, requests, pyramid.config, pyramid.response

def multiply(request):
    res = requests.post(os.environ['REMOTE_MATH']+'/multiply', json=[3, 4])
    num, = res.json()
    return pyramid.response.Response('Result:{}'.format(num))

cfg = pyramid.config.Configurator()
cfg.add_route('multiply', '/')
cfg.add_view(multiply, route_name='multiply')
app = cfg.make_wsgi_app()

Changes

In many cases, one logical source code repository will be responsible for more than one Docker image. Obviously, in the case of a monorepo, this will happen. But even when using a repo per logical unit, often two services share so much code that it makes sense to separate them into two images.

For example, consider the admin panel of a web application and the application itself. It might make sense to build them as two image repositories, because they will not share all the code, and running them in separate containers allows separating permissions effectively. However, they will share enough code -- code that is of no general interest outside the project -- that building both out of the same source repository is probably a good idea.

This is what we will do in this example. We will change our build files, above, to also build the "fancy calculator" Docker image.

Two files need append-only changes: build.docker needs to add the new source files and build-script needs to actually build the new container.

# build.docker
RUN mkdir /fancycalculator
COPY setup.py /fancycalculator/
COPY src /fancycalculator/src/
COPY fancycalculator.docker /
# build-script
run('pip', 'wheel', '/fancycalculator',
    '--wheel-dir', '/wheelhouse')
run('pex', '-o', '/mnt/output/fc-twist.pex',
    '-m', 'twisted',
    '--repo', '/wheelhouse', '--no-index',
    'twisted', 'fancycalculator')
client.images.build(path='/mnt/output/',
    dockerfile='/mnt/output/fancycalculator.docker',
    tag='moshez/fancycalculator:{}'.format(sys.argv[1]),
    rm=True,
)

Again, in a more realistic scenario where a build system is already being used, it might be that a new build configuration file needs to be added -- and if all the source code is already in a subdirectory, that directory can be added as a whole in the build Dockerfile.

We need to add a Dockerfile -- the one for the production fancycalculator image. Again, using the Twisted WSGI container at the root allows us to avoid some complexity.

# fancycalculator.docker
FROM moshez/python36-slim:2017-03-09T04-50-49-169150
COPY fc-twist.pex /twist.pex
ENTRYPOINT ["/twist.pex", "web", "--wsgi", \
            "fancycalculator.app.app"]

Running Containers

Orchestration Framework

There are several popular orchestration frameworks: Kubernetes (sometimes shortened to k8s), Docker Swarm, Mesosphere and Nomad. Of those, Swarm is probably the easiest to get up and running.

In any case, it is best to make sure specific containers are written in an orchestration-framework-agnostic manner. The easiest way to achieve that is to have containers expect to connect to a DNS name for other services. All orchestration frameworks support some DNS integration in order to do service discovery.
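As an illustration with Docker Swarm (service and network names are arbitrary), the two images built earlier could find each other purely by DNS name on a shared overlay network:

$ docker network create --driver overlay mathnet
$ docker service create --name remotemath --network mathnet \
         moshez/remotemath:TAG --port=tcp:8080
$ docker service create --name fancycalculator --network mathnet \
         --publish 8080:8080 \
         -e REMOTE_MATH=http://remotemath:8080 \
         moshez/fancycalculator:TAG --port=tcp:8080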

Configuration and Secrets

Secrets can be bootstrapped from the secret system the orchestration framework uses. When not using an orchestration framework, it is possible to mount secret files directly into containers with -v. The reason for "bootstrapped" is that it is often easier to have the secret at that level be a PyNaCl private key, and to add application-level secrets by encrypting them with the corresponding PyNaCl public key (which can be checked into the repository) and putting them directly in the image. This improves developer velocity by allowing developers to add secrets directly without giving them any special privileges.
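A minimal sketch of that scheme using PyNaCl sealed boxes (key handling is simplified and all names are illustrative):

from nacl.public import PrivateKey, SealedBox

# Done once, out of band: the private key is what gets bootstrapped into
# the container; the public key can be checked into the repository.
bootstrap_private = PrivateKey.generate()
checked_in_public = bootstrap_private.public_key

# A developer encrypts a new application-level secret with the checked-in
# public key -- no special privileges are needed for this step.
ciphertext = SealedBox(checked_in_public).encrypt(b"database-password")

# At runtime, the application decrypts with the bootstrapped private key.
assert SealedBox(bootstrap_private).decrypt(ciphertext) == b"database-password"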

Storage

In general, much of Docker is used for "stateless" services. When needing to keep state, there are several options. One is to avoid an orchestration framework for the stateful services: deploy with Docker directly, but mount the area where long-term data is kept from the host. Backup, redundancy and fail-over solutions still need to be implemented, but this is no worse than without containers, and containers at least give an option of atomic upgrades.
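For example, a PostgreSQL container whose data directory is mounted from the host (the image and paths are just one common choice):

$ docker run -d --name db \
         -v /srv/postgres-data:/var/lib/postgresql/data \
         postgres:9.6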