
Text encoding survival lessons (part 2 of 2)

In this second post on text encoding, I’d like to spell out a set of best practices, the rationale for which is best understood in the light of the first post in this series. If you read only one of the two posts, however, it should be this one.

1. Know thy encoding

Every single text string is associated with an encoding (ASCII, Latin-1, utf-8, etc.), and the computer needs to know which one it is in order to display the text correctly. In some cases a computer program might be able to guess, but the process is not entirely error-free. In other cases, the information will be explicitly indicated in the text itself (see 2 below), but it could be wrong. There’s no better situation for a text user than actually knowing which encoding is associated with a given string.
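A quick sketch in Python of what’s at stake: the very same bytes, decoded under two different encodings, yield very different text.

```python
# The same bytes, decoded under two different encodings.
data = "Zoé".encode("utf-8")     # b'Zo\xc3\xa9'

print(data.decode("utf-8"))      # Zoé   (correct guess)
print(data.decode("latin-1"))    # ZoÃ©  (wrong guess: mojibake)
```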

2. If a dedicated metadata field exists, use it

In particular, html and xml files make it possible to inscribe their encoding directly in their content (here are several ways of doing so). Importantly, make sure to update this information if you create alternate versions of a file with different encodings.
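As an illustration, Python’s standard library can write such a declaration for you when producing xml files. A sketch (the file name and location are arbitrary):

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.Element("text")
root.text = "début"

# xml_declaration=True inscribes the encoding in the file itself.
path = os.path.join(tempfile.gettempdir(), "sample.xml")
ET.ElementTree(root).write(path, encoding="iso-8859-1", xml_declaration=True)

with open(path, "rb") as f:
    print(f.readline())   # the declaration mentions iso-8859-1
```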

3. Consider inscribing the encoding in the name of files

This is especially useful for recognizing at first glance variant versions of the same text encoded in different ways. The metadata approach (see 2 above) does the job as well, but it requires opening the file and viewing its content. Both techniques can profitably be used in conjunction, file naming being primarily targeted at the user while metadata can easily be interpreted by the machine; if you use both, don’t forget to keep them synchronized.

If you choose to inscribe the encoding in file names, make sure to do it in a way that’s consistent with file naming best practices.
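For instance, a tiny Python helper along these lines keeps the suffix consistent (the naming scheme and the helper name `encoded_name` are just one possible convention, not a standard):

```python
from pathlib import Path

def encoded_name(path, encoding):
    """Suffix an encoding label just before the file extension."""
    p = Path(path)
    return p.with_name(f"{p.stem}_{encoding}{p.suffix}")

print(encoded_name("novel.txt", "utf8"))     # novel_utf8.txt
print(encoded_name("novel.txt", "latin1"))   # novel_latin1.txt
```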

4. Adopt standard encodings

To some extent, every known encoding is standard, so a better way to put this would be “adopt the most widely used encodings”. This will maximize the odds that your software, as well as that of the people you collaborate with, is able to deal with your data seamlessly. For western European languages, that would be Latin-1 (aka iso-8859-1) or utf-8 (ASCII too, but it’s only useful for English—or Latin).

5. Keep a master copy in the most restricted possible encoding

The character sets associated with some encodings are in a relationship of strict inclusion: from that point of view, ASCII is a subset of Latin-1, which in turn is a subset of utf-8, for instance. Converting data to a less restricted encoding (e.g. from ASCII to Latin-1) is a trivial operation that can usually be done using your favorite text editor (or with Orange Textable, for the adventurous). By contrast, converting to a more restricted encoding (e.g. from utf-8 to Latin-1) can imply a loss of information, because you’ll need to perform systematic substitutions such as œ to oe, for instance.
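In Python, for instance, widening is a one-liner, while narrowing fails outright until you substitute (a sketch, using the œ → oe substitution mentioned above):

```python
text = "Un cœur"

# Widening (to a less restricted encoding) is trivial and lossless.
as_utf8 = text.encode("utf-8")

# Narrowing may fail: œ has no slot in Latin-1...
try:
    text.encode("latin-1")
except UnicodeEncodeError:
    print("œ cannot be encoded in Latin-1")

# ...so a substitution such as œ -> oe is needed first.
substituted = text.replace("œ", "oe")
print(substituted.encode("latin-1"))   # b'Un coeur'
```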

Moreover, while utf-8 (and Unicode in general) can be expected to become the most widely supported encoding in the future, many applications are still unable to handle it correctly, so that for western European languages, ASCII or Latin-1 are probably the most portable choices today.

For these reasons, I recommend keeping a master copy of your texts in the most restricted encoding that can represent them properly, and generating converted copies in other encodings when needed. There are many situations where such copies are needed, in particular when interchange between various computer systems is expected, notably on the internet; in that case utf-8 is probably the safest bet.

That said, if you make the effort to create a copy of a text in a more restricted encoding, and carry out the necessary substitutions, by all means store it next to the Master Mold to spare yourself the inconvenience of going through this again, should the need arise.

6. Mind the BOM

The utf-16 Unicode encoding comes in two flavors known as “big-endian” and “little-endian”, whose differences matter little to this discussion. In order to distinguish between them, the Unicode standard prescribes that a specific character be included at the very beginning of the text, the so-called byte order mark or BOM.

Now some text editors, notably Notepad on Windows, add a kind of BOM even to utf-8 files, although it is not actually required to process them correctly (this mark can be viewed by opening your file in a text editor as if it were encoded in Latin-1: it will appear as ï»¿ at the beginning of the file). And since this BOM is not required, some programs seem to believe it’s not there, and treat it as if it were simply the initial substring of the text instead of ignoring it.

This can have all sorts of undesirable consequences, such as failure to recognize that a file is in a given format (e.g. xml), overestimation of the length of the file, loss of ASCII compatibility, and so on. Because of that, it’s safer not to include a BOM in utf-8 files. Some text editors will let you make that choice when you save files (see e.g. this post), otherwise you’ll need to be more crafty (see these indications).
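In Python, for instance, the dedicated “utf-8-sig” codec reads utf-8 files with or without a BOM, which makes stripping it easy (a quick sketch):

```python
import codecs

raw = codecs.BOM_UTF8 + "hello".encode("utf-8")   # utf-8 content preceded by a BOM

print(raw.decode("latin-1"))     # ï»¿hello -- the BOM seen through Latin-1 glasses
print(raw.decode("utf-8-sig"))   # hello    -- the BOM is silently stripped
```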

7. Normalize Unicode strings

One last thing needs to be discussed regarding Unicode. It turns out that Unicode offers several ways of writing certain letters, in particular those that bear one or more diacritics. For instance, ç can be written either as a single character (known as LATIN SMALL LETTER C WITH CEDILLA), or as the combination of an ordinary character and a so-called “combining” one: LATIN SMALL LETTER C + COMBINING CEDILLA.

Processing these variant forms raises several issues, chief among them the fact that they’re not considered equivalent in string and pattern matching operations: queries including one variant won’t match strings containing the other, words containing either variant will appear as distinct entries in an index, and so on. Furthermore, operations involving the notion of string length are also complicated by these variations: for instance, depending on the notation used, ça will be considered a 2- or 3-letter word.
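Python makes both problems easy to demonstrate (a quick sketch):

```python
composed = "\u00e7a"      # ç as a single character, then a
decomposed = "c\u0327a"   # c + COMBINING CEDILLA, then a

print(composed, decomposed)            # both display as ça
print(composed == decomposed)          # False: no match in string operations
print(len(composed), len(decomposed))  # 2 3
```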

Luckily, the solution has already been designed by the fine people of the Unicode Consortium. It is called NFC, “normalization form C” (for composition), and consists of a unique and standardized way of rewriting every string, in such a fashion that whenever a sequence of combined elements (e.g. c + COMBINING CEDILLA) has a “unitary” equivalent (e.g. ç), the latter is used systematically. While many Unicode strings are already in NFC (as noted on the site of Mark Davis—a very useful resource about NFC among other topics), converting those strings that are not in NFC is a crucial precaution in the context of text analysis. This is actually something many programs do automatically, including my very own Orange-Textable.
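In Python, this conversion is a one-liner with the standard unicodedata module:

```python
import unicodedata

decomposed = "c\u0327a"   # c + COMBINING CEDILLA, then a

nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == "\u00e7a")   # True: the precomposed ç is used systematically
print(len(nfc))           # 2
```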

Text encoding survival lessons (part 1 of 2)

If you’ve ever manipulated text data, you must have suffered from the encoding syndrome: when strings start to look like they’ve gone all ?????. My aim in this post (the first in a series of two) is to provide non-technical readers with the background knowledge about text encoding that should help them understand how to keep their level of ???? to a strict minimum in the future. The second post will summarize best practices related to text encoding that can be derived from this background knowledge.

Strings are stored in binary format on computers

In fact everything is stored as a sequence of 0’s and 1’s on a computer—that’s one of the most well-known facts about computers outside of computer science, and the reason why binary strings have become a common symbol for anything digital. As regards terminology, a single 0 or 1 is called a bit, and a sequence of 8 bits is called a byte. Bytes matter because they’re the standard way of chunking binary strings (hence the B in KB, MB, and so on). Just as a bit can take 2 values (0 and 1), a byte can take exactly 256 distinct values, namely 00000000, 00000001, 00000010, 00000011, …, 11111110, 11111111 (256 is 2 to the 8th, for the mathematically inclined).
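These figures are easy to check in Python:

```python
# A byte is 8 bits, hence 2 ** 8 = 256 possible values.
print(2 ** 8)                                  # 256
print(format(0, "08b"), format(255, "08b"))    # 00000000 11111111
```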

Returning to text data with these notions at hand, we may summarize the problem of text encoding as that of mapping the internal representation of strings, which is in bytes, to their representation as hopefully meaningful sequences of characters in some language’s alphabet. And it would be a straightforward process if there was only one way of doing it, which is obviously not the ????.

The ASCII code

Back in the 60’s, when computer science was purely anglocentric, it was assumed that, as far as the alphabet was concerned, not much was needed besides a to z and A to Z. This assumption didn’t suit many of the world’s languages very well, but anyway, it was the basis of the first widespread encoding, known as the American Standard Code for Information Interchange (ASCII), which is still in use half a century later.

The ASCII code is really just a conventional way of associating unaccented letters, numbers, and a bunch of useful characters (such as whitespace, dot, comma, underscore, ampersand, dollar, and so on) with bytes—or more precisely, with sequences of 7 bits. Why only 7? Because that’s what’s needed to represent the set of 128 characters in question. So for instance, A is 1000001, B is 1000010, a is 1100001, and so on (cf. this table).
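You can reproduce these codes in Python:

```python
# The 7-bit ASCII code of a few characters.
for char in "ABa":
    print(char, format(ord(char), "07b"))
# A 1000001
# B 1000010
# a 1100001
```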

Code pages, or the return of the 8th bit

Now almost none of the world’s languages can be represented properly with the 128 characters of the ASCII code. Even those languages that do use these characters require extra ones such as é, ñ, and the like, and they form a rather influential group. Luckily for them, since binary data are chunked in bytes, every character in an ASCII-encoded string contains a free, extra bit that can be put to good use.

So at some point in the 80’s, many people had the same idea: let’s prepend each of the 128 existing ASCII 7-bit strings with a 0 (so e.g. A goes from 1000001 to 01000001), and we’ll get 128 other bytes (those that begin with 1, such as 11000001) to assign to é, ñ & co. That was a smart move, although obviously no single set of 128 extra characters could satisfy the needs of every language, even in the restricted group of languages with the largest economic weight at that time.

As a result, while everyone agreed about what came to be known as low ASCII (namely from 00000000 to 01111111), a number of distinct proposals were made for high or extended ASCII (from 10000000 to 11111111), and each such proposal was called a code page. In particular, the ISO-8859 norm is a widely used set of 16 code pages which assign various characters to each high-ASCII byte: for instance, 11110001 is ñ in ISO-8859-1 (aka Latin-1), ń in ISO-8859-2, ё in ISO-8859-5, ρ in ISO-8859-7, and so on. And many more code pages have been proposed, notably by major players in the computer and software industry.
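Python’s codecs make the divergence easy to observe: the very same high-ASCII byte decodes to a different character under each code page (a quick sketch):

```python
byte = b"\xf1"   # 11110001, a high-ASCII byte

for codec in ("latin-1", "iso8859-2", "iso8859-5", "iso8859-7"):
    print(codec, byte.decode(codec))
# latin-1 ñ
# iso8859-2 ń
# iso8859-5 ё
# iso8859-7 ρ
```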


For work that does not require using more than one code page at a time, the code page system works rather well—as long as the elements of your work environment (hardware, software, data) are all attuned. When they cease to be, for instance because your text file is in ISO-8859-1 and your text processor believes it to be in plain ASCII, there comes the ????. With the advent of the World Wide Web in the 90’s, having to deal with documents in a wealth of mostly unidentified code pages became normal, and the need for a better solution became more obvious than ever. The solution that has been most widely adopted is Unicode.

As suggested by its name, Unicode has the considerable ambition of offering a single, unified solution to the problem of text encoding, one that works for all languages of the known world and beyond. It is based on a couple of simple and powerful insights, the most important of which is arguably that every character of every alphabet should be assigned a unique and permanent number (a “code point”, in technical jargon), regardless of the scheme used to represent these numbers in the memory of a given device. Building and maintaining this huge directory of characters (113,021 as of June 2014) has kept the Unicode Consortium busy for more than two decades.

Unicode code points still have to be represented in memory somehow, and this is where things become more complicated (again). Indeed there are several ways of doing so (yes, again), the details of which are beyond the scope of this post. The most often used Unicode encodings are UTF-16 (which itself comes in two flavors called little-endian and big-endian) and UTF-8. UTF-8 encodes each code point in a variable number of bytes (from 1 to 4), with the nice feature that the 128 ASCII characters are encoded with a single byte, exactly as in the ASCII code; thus every ASCII-encoded string is also a valid UTF-8 string (and conversely, a UTF-8 string containing only those characters is valid ASCII). In a domain that remains strongly anglocentric, this property of UTF-8 has probably played an important role in its impressive adoption rate.
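The variable width of UTF-8 is easy to observe in Python:

```python
# UTF-8 spends 1 to 4 bytes per code point; ASCII characters keep 1 byte.
for char in "Aé€𝄞":   # U+0041, U+00E9, U+20AC, U+1D11E
    print(char, len(char.encode("utf-8")))
# A 1
# é 2
# € 3
# 𝄞 4
```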

Coda and acknowledgments

There is a lot more to say about text encoding (including byte-order marks, combining characters, and so on), and I intend to say more in other posts, in particular in a follow-up about best practices. For the time being, I believe that this post provides enough background knowledge to understand why I recommend those practices—that is, when they’re posted. If you think otherwise, or if you have any thoughts about this material, by all means leave a reply and let me know.

A lot of the inspiration for this post comes from an excellent and classic piece by Joel Spolsky. I’ve basically tried to do for an intended audience of non-technical users with a specific interest in text data what he did so brilliantly for software developers.