Portable Python Binary Wheels

Mon 05 April 2021 by Moshe Zadka

Thanks to SurveyMonkey for encouraging me to do the research this post is based on.

It is possible to work with Python quite a bit and not be aware of some of the subtler details of package management. Since Python is a popular “glue” language, one of its core strengths is integrating with libraries written in other languages: from database drivers written in C, to numerical algorithms written in Fortran, to cryptographic algorithms written in Rust. In all these cases, one way to avoid frustrating installation errors in the target environment is to distribute pre-built code. However, while source code can be made portable, making the build output portable is a lot more complicated.

The Python manylinux project, composed of three PEPs, two software repositories, and support in pip, addresses how to accomplish that. These problems are hard, and few other ecosystems solve them as well as Python. The solution has many moving parts, developed over the course of ten years. Unfortunately, this means that understanding all of those is not easy.

While this post cannot make it easy, it can at least make it easier, by making sure all the details are in one place.

Wheels

Python packages come in two main forms:

  • Source
  • Wheels

Wheels are "pre-built" packages that are easier and faster to install. The name comes originally from a bad joke: the Monty Python Cheese Shop sketch, since PyPI used to be called "Cheese Shop" and cheese is sometimes sold in wheels. The name has been retconned for another bad joke, as a reference to the phrase "reinventing the wheel", allowing Python packaging talks to make cheap puns. For the kind of people who give packaging talks, or write explainers about packaging formats, these cheap jokes fill the void in what would otherwise be their soul.

Even for packages that include no native code, only pure Python, wheels have some advantages. They do not execute any potentially-fragile code on installation, and querying their dependencies can be done without a Python interpreter.
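As a rough illustration of why no package code has to run, here is a minimal sketch that reads the declared dependencies straight out of a wheel. It relies only on the wheel being a zip archive with a METADATA file in its .dist-info directory; the function name wheel_requirements is made up for this example. The sketch happens to be written in Python, but any language that can open a zip file and parse headers could do the same.

```python
import zipfile
from email.parser import Parser


def wheel_requirements(wheel_path):
    """Return the Requires-Dist entries recorded in a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # The metadata lives in <name>-<version>.dist-info/METADATA.
        metadata_name = next(
            name for name in wheel.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        text = wheel.read(metadata_name).decode("utf-8")
    # METADATA uses RFC 822-style headers, so the email parser can read it.
    headers = Parser().parsestr(text)
    return headers.get_all("Requires-Dist") or []
```

Nothing from the package itself is imported or executed here: the archive is only opened and read.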

However, when packages do include native code the story is more complicated.

C library

Let's start with the relatively straightforward part: portable binary wheels for Linux are called manylinux, not alllinux. This is because they rely on the GNU C library, and on specific features of it. There is another popular libc for Linux: musl. Manylinux makes no attempt to be compatible with musl-based Linux distributions, the most famous of which is Alpine Linux.

However, most other distributions derive from either Debian (for example, Ubuntu) or from Fedora (CentOS, RHEL, and more). Those all use the GNU C library.

GNU C library

GNU libc has an official "infinite backwards compatibility" policy: a binary built against libc6 version W.Z will run on libc6 version X.Y if X>W, or if X=W and Y>=Z.

Aside: the 6 in libc6 does not refer to the version of the GNU C Library: Linux only moved to adopt the GNU C Library in libc6. The libc4 library was written from scratch, while libc5 combined code from GNU C Library version 1 and some bits from the BSD C library. In libc6, Linux moved to rely on GNU C Library version 2.x, first released in January 1997. The GNU C Library is still, over twenty years later, on major version 2. We will ignore some nuances, and just treat all GNU C Library versions as 2.X.

The infinite compatibility policy means that binaries built against libc6 version 2.17, for example, are compatible with libc6 version 2.32.
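The compatibility rule amounts to an ordered comparison of (major, minor) pairs. A minimal sketch, with a function name invented for this example:

```python
def glibc_compatible(built_against, running_on):
    """Return True if a binary built against one glibc can run on another.

    Both arguments are (major, minor) tuples, such as (2, 17).
    """
    # Tuples compare lexicographically: major version first, then minor.
    return running_on >= built_against


print(glibc_compatible((2, 17), (2, 32)))  # built on 2.17, running on 2.32
print(glibc_compatible((2, 32), (2, 17)))  # built on 2.32, running on 2.17
```

The first call prints True and the second prints False: the policy only promises that older binaries keep working on newer libraries, not the reverse.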

Manylinux history

The relevant PEP is dense but worth reading. "Portable" is a loaded word, and unpacking it is important. The specific meaning of "portable" is encoded in the auditwheel policy file. This file concedes the main point: portability is a spectrum.

When the manylinux project started, in 2016, the oldest security-supported open source distribution was CentOS: specifically, CentOS 5.11. It was released in 2014. However, because CentOS tracks RHEL, and RHEL is conservative, the GNU C library (glibc, from now on) it used was 2.5: a version released in 2006.

Even then, it was clear that the "minimum" compatibility level would be a moving target. Because of that, the compatibility level was named manylinux1.

In 2018, the manylinux project moved to a more transparent naming scheme: the date in which the relevant compatible CentOS release was first released. Thus, instead of manylinux1, the next compatibility target (defined in 2018) was called manylinux2010, referencing CentOS 6.

In April 2019, manylinux2014 was defined as a compatibility tag, referencing CentOS 7.

At the beginning of 2021, Red Hat, in a controversial move, changed the way CentOS works, effectively nullifying the value of any future releases as a way of specifying a minimum supported glibc version.

The Python community decided to switch to a new scheme: directly naming the version of glibc supported. The first such tag, manylinux_2_24, was added in November 2020. The next release of auditwheel, 4.0, moves all releases to glibc-based tags, while keeping the original names as "aliases". It also adds a compatibility level manylinux_2_27.

Libc compatibility and beyond

The compatibility level of a manylinux wheel is defined by the glibc symbols it links against. However, this is not the only compatibility manylinux wheels care about: this just puts them on a serial line from "most compatible" to "least compatible".

Each compatibility level also includes:

  • A list of allowed libraries to dynamically link against.
  • Specific symbol versions and ABI flags that depend on both glibc and gcc.

However, many Python extensions include native code precisely because they need to link against a C library. As a concrete example, the mysqlclient wheel would not compile if the libmysql headers are not installed, and would not run if the libmysql shared library (of a version that matches the one the package was compiled against) is not installed.

It would seem that portable binary wheels are only of limited utility if they do not support the main use case. However, the auditwheel tool includes one more twist: patching ELF.

Elves

Elves predate Tolkien's Middle-Earth. They appear in many Germanic and Nordic mythologies: sometimes as do-gooders, sometimes as evil-doers, but always associated with having powerful magic.

Our context is no less magical, but more modern. ELF ("Executable and Linkable Format") is the format of executables and shared libraries in Linux, since libc5 (before that, Linux used the so-called a.out format).

When auditwheel is asked to repair a wheel for a specific platform version, it checks for any shared libraries the wheel links against that are not part of the pre-approved list. If it finds any, it copies them into the wheel and patches the extension modules to load the bundled copies. This means that, post repair, the new ("repaired") wheel will not depend on any libraries outside the approved list.
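The check at the heart of the repair step can be pictured as a set difference. The sketch below is not auditwheel's code: the function name is invented, the allowed list is a small illustrative subset of what the policy file contains, and the MySQL library name in the usage line is just an example input.

```python
# Illustrative subset only; the real list lives in the auditwheel policy file.
ALLOWED_LIBRARIES = frozenset({
    "libc.so.6",
    "libm.so.6",
    "libdl.so.2",
    "libpthread.so.0",
})


def libraries_to_bundle(needed, allowed=ALLOWED_LIBRARIES):
    """Return the needed shared libraries that are not on the allowed list."""
    return sorted(set(needed) - set(allowed))


# Example input: the libraries an extension module declares it needs.
print(libraries_to_bundle(["libc.so.6", "libmysqlclient.so.21", "libpthread.so.0"]))
```

Anything this returns is a library that would have to travel inside the wheel for the result to be portable.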

These repaired binary wheels will include the requested manylinux tag and the patched modules. They can be uploaded to PyPI or other Python packaging repositories (such as DevPI).

For pip to install the correct wheels, it needs to be up to date, so that it can inspect the OS and decide which manylinux tags are compatible.

Installing Binary Wheels

Because wheels tagged as linux_<cpu architecture> (for example, linux_x86_64) cannot be assumed to work on any platform other than the one they have been compiled for, PyPI rejects those. In order to upload a binary wheel for Linux to PyPI, it has to be tagged with a manylinux tag. It is possible to upload multiple manylinux wheels for a single package, each with a different compatibility target.
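The platform tag is the last dash-separated component of a wheel's file name, so it can be inspected without opening the file. A small sketch, with invented function names and made-up file names, that flags wheels carrying a bare linux_ tag:

```python
def platform_tags(wheel_filename):
    """Return the platform tags encoded in a wheel file name."""
    if not wheel_filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {wheel_filename!r}")
    stem = wheel_filename[: -len(".whl")]
    # Layout: name-version[-build]-python_tag-abi_tag-platform_tag
    platform = stem.split("-")[-1]
    # Several platform tags can be combined with "." in one file name.
    return platform.split(".")


def has_bare_linux_tag(wheel_filename):
    """Return True if any platform tag is a plain linux_<arch> tag."""
    return any(tag.startswith("linux_") for tag in platform_tags(wheel_filename))


print(has_bare_linux_tag("example-1.0-cp39-cp39-linux_x86_64.whl"))
print(has_bare_linux_tag("example-1.0-cp39-cp39-manylinux2014_x86_64.whl"))
```

The first call prints True and the second prints False. This only mirrors the naming rule described above; it is not how PyPI itself validates uploads.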

When installing packages, pip will prefer to use a wheel, if available, instead of a source distribution. When pip checks the availability of a wheel, it will introspect the platform it is running on, and map it to the list of compatible manylinux distributions. Since the list is changing, it is possible that a newer pip will recognize more compatibilities than an older pip.
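To get a feel for the introspection step, the standard library can report which C library the running interpreter is linked against. This is only a sketch with invented helper names; pip has its own detection logic, which is not reproduced here.

```python
import platform


def parse_glibc_version(lib, version):
    """Turn platform.libc_ver() output into a (major, minor) tuple, or None."""
    if lib != "glibc":
        # Not glibc (for example musl, or a non-Linux platform).
        return None
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def running_glibc():
    """Return the glibc version of the current interpreter, if any."""
    return parse_glibc_version(*platform.libc_ver())


print(running_glibc())
```

On a glibc-based distribution this prints a tuple such as (2, 31); elsewhere it prints None, which corresponds to no manylinux tag being compatible.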

Once pip finds the list of manylinux tags compatible with its platform, it will install the least-compatible wheel that is still compatible with the platform: for example, it will prefer manylinux2014 to manylinux2010 if both are compatible. If there are no binary wheels available, pip will fall back to installing from source. As mentioned before, installing from source, at the very least, requires a functional compiler and Python header files. It might also have specific build-time dependencies, depending on the package.
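The selection rule can be sketched as follows. This is a simplification, not pip's implementation: tags are written without the CPU architecture suffix, Python and ABI tags are ignored, and the function names are invented. The glibc versions behind manylinux2010 and manylinux2014 (2.12 and 2.17) are taken from the alias definitions in PEP 600.

```python
import re

# Legacy names are aliases for glibc-versioned tags.
LEGACY_ALIASES = {
    "manylinux1": "manylinux_2_5",
    "manylinux2010": "manylinux_2_12",
    "manylinux2014": "manylinux_2_17",
}


def required_glibc(tag):
    """Return the (major, minor) glibc version a manylinux tag requires."""
    tag = LEGACY_ALIASES.get(tag, tag)
    match = re.fullmatch(r"manylinux_(\d+)_(\d+)", tag)
    if match is None:
        raise ValueError(f"not a manylinux tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


def choose_wheel_tag(available_tags, platform_glibc):
    """Pick the newest manylinux tag the platform supports, or None."""
    compatible = [
        tag for tag in available_tags
        if required_glibc(tag) <= platform_glibc
    ]
    # None means no usable wheel: the installer must build from source.
    return max(compatible, key=required_glibc, default=None)


print(choose_wheel_tag(["manylinux2010", "manylinux2014"], (2, 31)))
print(choose_wheel_tag(["manylinux2014"], (2, 12)))
```

The first call prints manylinux2014, since both wheels are compatible and the newer target wins. The second prints None, the case where a source build is the only option.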