Interesting text encodings (and the people who love them)

Wed 13 December 2017 by Moshe Zadka

(Thanks to Tom Prince and Nelson Elhage for suggestions for improvement.)

Nowadays, almost all text will be encoded in UTF-8 -- for good reason: it is a well-thought-out encoding. Some of it will be in Latin-1, AKA ISO-8859-1, which is popular in the Western world. Less of it will be in other members of the ISO-8859 family (-2 or higher). Some text from Japan will occasionally still be in Shift-JIS. These encodings are all reasonable -- too reasonable.

What about more interesting encodings?

EBCDIC

Encodings turn a sequence of logical code points into a sequence of bytes. Bytes, in turn, are just sequences of ones and zeroes. Usually, we think of the ones and zeroes as mostly symmetric -- it wouldn't matter if the encoding mapped every byte to its "dual", where every bit is flipped. SSDs do not like long sequences of zeroes -- but neither do they like long sequences of ones.

What if there was no symmetry? What if every "one" weakened your byte?

This is the history of one of the most venerable media to carry digital information -- predating the computer, thanks to its use in automated weaving machines -- the punched card. It was so called because to make a "one", you would punch a hole, which the card reader detected by an electric circuit being completed. Punching too many holes made cards weak: likely to rip under the wear and tear the automated reading machines inflicted upon them in the drive to read cards ever faster.

EBCDIC (Extended Binary Coded Decimal Interchange Code) was the solution. "Extended" because it extends the Binary Coded Decimal standard -- numbers are encoded using one punch, which makes them easy to read with a hex editor. Letters are encoded with two. Nothing sorts correctly, of course, but that was not a big priority. Quoting from Wikipedia:

"The distinct encoding of 's' and 'S' (using position 2 instead of 1) was maintained from punched cards where it was desirable not to have hole punches too close to each other to ensure the integrity of the physical card.

Of course, it wouldn't be IBM if there weren't a whole host of encodings, subtly incompatible, all called EBCDIC. If you live in the US, you are supposed to use code page 1140 for your EBCDIC needs.

Luckily, if you ever need to connect your Python interpreter to a card-punch machine, the Unicode encodings have got you covered:

>>> "hello".encode('cp1140')
b'\x88\x85\x93\x93\x96'
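To see the sorting weirdness for yourself, here is a quick sketch using only the built-in cp1140 codec: in EBCDIC, lowercase letters sort before uppercase letters, and digits sort after both.

>>> sorted("aA1", key=lambda c: c.encode('cp1140'))
['a', 'A', '1']
>>> sorted("aA1")  # ASCII order, for comparison
['1', 'A', 'a']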

If you came to this post to learn skills that are immediately relevant to your day-to-day job and not at all obsolete, you're welcome.

KOI-8

Suppose you're a Russian speaker. You write your language using the Cyrillic alphabet, suspiciously absent from the American Standard Code for Information Interchange (ASCII), developed during the height of the Cold War between the US of A and the USSR. Some computers are going to have Cyrillic fonts installed -- and some are not. Suppose that it is the 80s, and the only languages that run fast enough on most computers are assembly and C. You want to make a character encoding that

  • Will look fine if someone has Cyrillic fonts installed
  • Can be converted to ASCII that will look kinda-sorta like the Cyrillic, with a program that is trivial to write in C.

KOI-8 is the result of this not-quite-thought experiment.

The code to convert from KOI-8 to kinda-sorta-lookalike ASCII, written in Python, would be:

MASK = (1 << 7) - 1  # 0b1111111: keep the low seven bits
with open('input', 'rb') as fin, open('output', 'wb') as fout:
    while True:
        c = fin.read(1)
        if not c:
            break
        c = bytes([c[0] & MASK])  # <--- this right here
        fout.write(c)

The MASK constant, written in binary, is just 0b1111111 (seven ones). The line with the arrow masks out the "high bit" of each input byte.
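To watch the mask do its kinda-sorta transliteration, here is a one-liner sketch using Python's built-in koi8_r codec ("привет" is Russian for "hello"):

>>> bytes(b & 0b1111111 for b in 'привет'.encode('koi8_r'))
b'PRIWET'

Note the W: masking maps в to "W" rather than "V".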

Sorting KOI-8 by byte value gives you a sort that is not even a little bit right for the alphabet: the letters are all jumbled up. But it does mean that trivial programs in C or assembly -- or sometimes even things that would try to read words out of old MS Word files -- could convert it to something that looks semi-readable on a display that is only configured to show ASCII characters, possibly because of a deep hardware limitation.

Punycode

How lovely it is, of course, to live in 2017 -- the future. We might not have flying cars. We might not even be wearing silver clothing. But by golly, at least our modern encodings make sense.

We send e-mails in UTF-8 to each other, containing wonderful emoji like "eggplant" or "syringe".

Of course, e-mail is old technology -- we send our eggplants, syringes and avocados via end-to-end encrypted Signal chat messages, unreadable by anyone but our intended recipient.

It is also easy to register our own site, and use an off-the-shelf SaaS offering, such as Wordpress or SquareSpace, to power it. And no matter what we want to put as our domain, we can... as long as it is ASCII-compatible, because DNS is also older than the end of the Cold War, and assumes ASCII only.

Seems like this isn't the future after all, which the suspicious lack of flying cars and silver clothing should really have alerted us to.

In our current times, which will be a future generation's benighted past, we must use yet another encoding to put our avocados and eggplants in the names of websites, where they rightly belong.

Enter Punycode, an encoding that is not afraid to ask the hard questions, like "are you sure that the order of encoded bits in the input and the output has to be the same?"

That is, if one string is a prefix of another, should its encoding be a prefix of the other? Just because UTF-8, EBCDIC, KOI-8 or Shift-JIS adhere to this rule doesn't mean we can't think outside the box!

Punycode rearranges the encoding so that all ASCII-compatible characters go to the beginning of the string, followed by a hyphen, followed by the output of a complicated algorithm designed to minimize the number of output bytes by assuming the encoded non-ASCII characters are close together in code point space.

Consider a simple declaration of love: "I<Red heart emoji>U".

>>> source = b'I\xe2\x9d\xa4U'
>>> declaration = source.decode('utf-8')
>>> declaration.encode('punycode')
b'IU-ony'

Note how, like a well-worn pickup line, I and U were put together, while the part that encodes the heart is at the end.

Consider the slightly more selfish declaration of self-love:

>>> source = b'I\xe2\x9d\xa4me'
>>> source.decode('utf-8').encode('punycode')
b'Ime-4r6a'

Note that even though the selfish declaration and the true love declaration share a two-character prefix, the encoded results share only one byte of prefix: the heart got moved to the end -- and it is not the same heart. Truly, every love is unique.
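The rearrangement is lossless, of course: decoding, again with the built-in punycode codec, puts the heart back where it belongs.

>>> b'IU-ony'.decode('punycode')
'I❤U'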

Punycode's romance with DNS, too, was fraught with drama: many browsers now will not display Unicode in the address bar, instead showing "xn--<punycode ASCII>" (the "xn--" prefix indicates a Punycoded label) as a security measure against phishing. It turns out there are a lot of characters in Unicode that look a lot like "a", leading to many interesting variants on "Paypal.com" and "Gmail.com" that are indistinguishable to most humans -- and, as it turns out, most users of the web are indeed of the Homo sapiens species.
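The standard library has this covered too: the built-in idna codec adds the "xn--" prefix for you. A small sketch, using a made-up domain ("bücher" is German for "books"):

>>> 'bücher.example'.encode('idna')
b'xn--bcher-kva.example'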