Text encoding survival lessons (part 1 of 2)

If you’ve ever manipulated text data, you must have suffered from the encoding syndrome: when strings start to look like they’ve gone all �����. My aim in this post (the first in a series of two) is to provide non-technical readers with the background knowledge about text encoding that should help them understand how to keep their level of ���� to a strict minimum in the future. The second post will summarize best practices related to text encoding that can be derived from this background knowledge.

Strings are stored in binary format on computers

In fact everything is stored as a sequence of 0’s and 1’s on a computer—that’s one of the most well-known facts about computers outside of computer science, and the reason why binary strings have become a common symbol for anything digital. As regards terminology, a single 0 or 1 is called a bit, and a sequence of 8 bits is called a byte. Bytes matter because they’re the standard way of chunking binary strings (hence the B in KB, MB, and so on). Just as a bit can take 2 values (0 and 1), a byte can take exactly 256 distinct values, namely 00000000, 00000001, 00000010, 00000011, …, 11111110, 11111111 (256 is 2 to the 8th, for the mathematically inclined).
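If you want to see this for yourself, here is a minimal sketch in Python (my own illustration, not part of the original discussion) that counts the values a byte can take and shows a couple of them in binary notation:

    # A byte is a sequence of 8 bits, so it can take 2 ** 8 = 256 distinct values.
    print(2 ** 8)                   # 256
    print(format(0, "08b"))         # 00000000 (the smallest value)
    print(format(255, "08b"))       # 11111111 (the largest value)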

Returning to text data with these notions at hand, we may summarize the problem of text encoding as that of mapping the internal representation of strings, which is in bytes, to their representation as hopefully meaningful sequences of characters in some language’s alphabet. And it would be a straightforward process if there were only one way of doing it, which is obviously not the ����.

The ASCII code

Back in the 60’s, when computer science was purely anglocentric, it was assumed that as far as the alphabet was concerned, not much was needed besides a to z and A to Z. This assumption didn’t suit many languages of the world very well, but anyway, it was the basis of the first widespread encoding, known as the American Standard Code for Information Interchange (ASCII), which is still in use half a century later.

The ASCII code is really just a conventional way of associating unaccented letters, numbers, and a bunch of useful characters (such as whitespace, dot, comma, underscore, ampersand, dollar, and so on) with bytes—or more precisely, with sequences of 7 bits. Why only 7? Because that’s what’s needed to represent the set of 128 characters in question. So for instance, A is 1000001, B is 1000010, a is 1100001, and so on (cf. this table).
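If you have Python at hand, you can check these correspondences yourself; this little sketch (again my own illustration, nothing the ASCII standard prescribes) prints the code and 7-bit representation of a few characters:

    # Each ASCII character corresponds to a number that fits in 7 bits.
    for character in ("A", "B", "a"):
        code = ord(character)
        print(character, code, format(code, "07b"))
    # A 65 1000001
    # B 66 1000010
    # a 97 1100001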

Code pages, or the return of the 8th bit

Now almost none of the world’s languages can be represented properly with the 128 characters of the ASCII code. Even those languages that do use these characters require extra ones such as é, ñ, and the like, and they form a rather influential group. Luckily for them, since binary data are chunked in bytes, every character in an ASCII-encoded string occupies a full byte while using only 7 of its 8 bits, which leaves a free, extra bit that can be put to good use.

So at some point in the 80’s, many people had the same idea: let’s prepend each of the 128 existing ASCII 7-bit strings with a 0 (so e.g. A goes from 1000001 to 01000001), and we’ll get 128 other bytes (those that begin with 1, such as 11000001) to assign to é, ñ & co. That was a smart move, although obviously no single set of 128 extra characters could satisfy the needs of every language, even in the restricted group of languages with the largest economic weight at that time.

As a result, while everyone agreed about what came to be known as low ASCII (namely from 00000000 to 01111111), a number of distinct proposals were made for high or extended ASCII (from 10000000 to 11111111), and each such proposal was called a code page. In particular, the ISO-8859 standard is a widely used family of 15 code pages, each of which assigns its own characters to the high-ASCII bytes: for instance, 11110001 is ñ in ISO-8859-1 (aka Latin-1), ń in ISO-8859-2, ё in ISO-8859-5, ρ in ISO-8859-7, and so on. And many more code pages have been proposed, notably by major players in the computer and software industry.
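To make this concrete, here is a small Python sketch (my own illustration, relying on Python’s built-in codecs) that decodes the very same byte, 11110001, with several ISO-8859 code pages:

    # One byte, several interpretations: it all depends on the assumed code page.
    byte = bytes([0b11110001])      # the high-ASCII byte 11110001 (0xF1)

    for code_page in ("iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7"):
        print(code_page, "->", byte.decode(code_page))
    # iso-8859-1 -> ñ
    # iso-8859-2 -> ń
    # iso-8859-5 -> ё
    # iso-8859-7 -> ρ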

Unicode

For work that does not require using more than one code page at a time, the code page system works rather well—as long as the elements of your work environment (hardware, software, data) are all attuned. When they cease to be, for instance because your text file is in ISO-8859-1 and your text processor believes it to be in plain ASCII, there comes the ����. With the advent of the World Wide Web in the 90’s, having to deal with documents in a wealth of mostly unidentified code pages became normal, and the need for a better solution became more obvious than ever. The solution that has been most widely adopted is Unicode.
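The following Python sketch (my own illustration of the mismatch scenario just described) shows what happens when bytes saved with one code page are read under the assumption of another encoding:

    # A file saved as ISO-8859-1 contains the byte 0xF1 (ñ), but it is read by a
    # program that assumes plain ASCII.
    data = "año".encode("iso-8859-1")             # b'a\xf1o'

    try:
        data.decode("ascii")
    except UnicodeDecodeError as error:
        print(error)                              # can't decode byte 0xf1 ...

    # Tools that don't give up entirely often substitute U+FFFD instead:
    print(data.decode("ascii", errors="replace")) # a�o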

As suggested by its name, Unicode has the considerable ambition of offering a single, unified solution to the problem of text encoding, one that works for all languages of the known world and beyond. It is based on a couple of simple and powerful insights, the most important of which is arguably that all characters of all alphabets should be assigned a unique and permanent number (a “code point”, in technical jargon), regardless of the scheme that will be used to represent these numbers in the memory of a given device. Building and maintaining this huge directory of characters (113,021 as of June 2014) has kept the Unicode Consortium busy for more than two decades.
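As an illustration (once more in Python, and once more just a sketch of my own), here are the code points of a few characters, written in the conventional U+ notation:

    # Every character has a unique Unicode code point, independent of how it
    # will eventually be stored as bytes on a given device.
    for character in ("A", "ñ", "ё", "ρ", "€"):
        print(character, "U+%04X" % ord(character))
    # A U+0041
    # ñ U+00F1
    # ё U+0451
    # ρ U+03C1
    # € U+20AC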

Unicode code points still have to be represented in memory somehow, and this is where things become more complicated (again). Indeed there are several ways of doing so (yes, again), the details of which are beyond the scope of this post. The most frequently used Unicode encodings are UTF-16 (which itself comes in two flavors called little-endian and big-endian), and UTF-8. UTF-8 encodes each code point in a variable number of bytes (from 1 to 4), with the nice feature that the 128 ASCII characters are encoded with a single byte, exactly like in the ASCII code; thus every ASCII-encoded string is also, in fact, a valid UTF-8 string (the reverse doesn’t hold, of course, since UTF-8 strings may contain multi-byte characters). In a domain that remains strongly anglocentric, this property of UTF-8 has probably played an important role in its impressive adoption rate.
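Here is one last Python sketch (my own illustration, not anything prescribed by the Unicode standard) showing how many bytes UTF-8 uses for a few characters:

    # UTF-8 uses 1 to 4 bytes per code point; plain ASCII characters keep
    # their familiar one-byte representation.
    for character in ("A", "ñ", "€", "𝄞"):
        encoded = character.encode("utf-8")
        print(character, len(encoded), "byte(s):", encoded.hex())
    # A 1 byte(s): 41
    # ñ 2 byte(s): c3b1
    # € 3 byte(s): e282ac
    # 𝄞 4 byte(s): f09d849e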

Coda and acknowledgments

There is a lot more to say about text encoding (including byte-order marks, combining characters, and so on), and I intend to say more in other posts, in particular in a follow-up about best practices. For the time being, I believe that this post provides enough background knowledge to understand why I recommend those practices—that is, when they’re posted. If you think otherwise, or indeed if you have any thoughts about this material, by all means leave a reply and let me know.

A lot of the inspiration for this post comes from an excellent and classic piece by Joel Spolsky. I’ve basically tried to do for an intended audience of non-technical users with a specific interest in text data what he did brilliantly for software developers.
