Managing Dependencies

Sun 02 September 2018 by Moshe Zadka

(Thanks to Mark Rice for his helpful suggestions. Any mistakes or omissions that remain are my responsibility.)

Some Python projects are designed to be libraries, consumed by other projects. These are most of the things people consider "Python projects": for example, Twisted, Flask, and most other open source tools. However, things like mu are sometimes installed as an end-user artifact. More commonly, many web services are written as deployable Python applications. A good example is the issue tracking project trac.

Projects that are deployed must be deployed with their dependencies, and with the dependencies of those dependencies, and so forth. Moreover, at deployment time, a specific version must be deployed. If a project declares a dependency of flask>=1.0.1, for example, something needs to decide whether to deploy flask 1.0.1 or flask 1.0.2.

For clarity, in this text, we will refer to the declared compatibility statements in something like setup.py (e.g., flask>=1.0.1) as "intent" dependencies, since they document programmer intent. The specific dependencies that are eventually deployed will be referred as the "expressed" dependencies, since they are expressed in the actual deployed artifact (for example, a Docker image).

Usually, "intent" dependencies are defined in setup.py. This does not have to be the case, but it almost always is: since there is usually some "glue" code at the top, keeping everything together, it makes sense to treat it as a library -- albeit, one that sometimes is not uploaded to any package index.

When producing the deployed artifact, we need to decide on how to generate the expressed dependencies. There are two competing forces. One is the desire to be current: using the latest version of Django means getting all the latest bug fixes, and means getting fixes to future bugs will require moving less versions. The other is the desire to avoid changes: when deploying a small bug fix, changing all library versions to the newest ones might introduce a lot of change.

For this reason, most projects will check in the "artifact" (often called requirements.txt) into source control, produce actual deployed versions from that, and some procedure to update it.

A similar story can be told about the development dependencies, often defined as extra [dev] dependencies in setup.py, and resulting in a file dev-requirements.txt that is checked into source control. The pressures are a little different, and indeed, sometimes nobody bothers to check in dev-requirements.txt even when checking in requirements.txt, but the basic dynamic is similar.

The worst procedure is probably "when someone remembers to". This is not usually anyone's top priority, and most developers are busy with their regular day-to-day task. When an upgrade is necessary for some reason -- for example, a bug fix is available, this can mean a lot of disruption. Often this disruption manifests in that just upgrading one library does not work. It now depends on newer libraries, so the entire dependency graph has to be updated, all at once. All intermediate "deprecation warnings" that might have been there for several months have been skipped over, and developers are suddenly faced with several breaking upgrades, all at once. The size of the change only grows with time, and becomes less and less surmountable, making it less and less likely that it will be done, until it ends in a case of complete bitrot.

Sadly, however, "when someone remembers to" is the default procedure in the absence of any explicit procedure.

Some organizations, having suffered through the disadvantages of "when someone remembers to", decide to go to the other extreme: avoiding to check in the requirements.txt completely, and generating it on every artifact build. However, this means causing a lot of unnecessary churn. It is impossible to fix a small bug without making sure that the code is compatible with the latest versions of all libraries.

A better way to approach the problem is to have an explicit process of recalculating the expressed dependencies from the intent dependencies. One approach is to manufacture, with some cadence, code change requests that update the requirements.txt. This means they are resolved like all code changes: review, running automated tests, and whatever other local processes are implemented.

Another is to do those on a calendar based event. This can be anything from a manually-strongly-encouraged "update Monday", where on Monday morning, one of a developer tasks is to generate a requirements.txt updates for all projects they are responsible for, to including it as part of a time-based release process: for example, generating it on a cadence that aligns with agile "sprints", as part of the release of the code changes in a particular sprints.

When updating does reveal an incompatibility it needs to be resolved. One way is to update the local code: this certainly is the best thing to do when the problem is that the library changed an API or changed an internal implementation detail that was being used accidentally (...or intentionally). However, sometimes the new version has a bug in it that needs to be fixed. In that case, the intent is now to avoid that version. It is best to express the intent exactly as that: !=<bad version>. This means when an even newer version is released, hopefully fixing the bug, it will be used. If a new version is released without the bug fix, we add another != clause. This is painful, and intentionally so. Either we need to get the bug fixed in the library, stop using the library, or fork it. Since we are falling further and further behind the latest version, this is introducing risk into our code, and the increasing != clauses will indicate this pain: and encourage us to resolve it.

The most important thing is to choose a specific process for updating the expressed dependencies, clearly document it and consistently follow it. As long as such a process is chosen, documented and followed, it is possible to avoid the bitrot issue.

Categories
misc