Text encoding survival lessons (part 2 of 2)

In this second post on text encoding, I’d like to spell out a set of best practices, the rationale for which can best be understood in the light of the first post in this series. If you read only one of the two posts, however, it should be this one.

1. Know thy encoding

Every single text string is associated with an encoding (ASCII, Latin-1, utf-8, etc.), and the computer needs to know which one it is in order to display the text correctly. In some cases a computer program might be able to guess, but the process is not entirely error-free. In other cases, the information is explicitly indicated in the text itself (see 2 below), but it could be wrong. There’s no better position for a text user to be in than actually knowing which encoding is associated with a given string.
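If you really don’t know, a library such as chardet (a third-party Python package) can at least make an educated guess for you. Here is a minimal sketch; the guess and its confidence score should be taken with a grain of salt, which is precisely the point of this section:

    import chardet  # third-party library: pip install chardet

    # Some bytes whose interpretation depends on the assumed encoding.
    data = "Voici un cœur très fidèle, élevé à l’écart du brouhaha.".encode("utf-8")

    guess = chardet.detect(data)
    print(guess["encoding"], guess["confidence"])  # e.g. "utf-8" with some confidence score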

2. If a dedicated metadata field exists, use it

In particular, html and xml files make it possible to declare their encoding directly in their content (here are several ways of doing so). Importantly, make sure to update this information if you create alternate versions of a file in different encodings.
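As a minimal illustration (in Python, with a made-up file name), the point is simply that the encoding declared inside the file and the encoding actually used to save it must be one and the same:

    # The declared encoding and the encoding used to write the file must match.
    # "corpus.xml" is a made-up file name.
    encoding = "utf-8"  # a Latin-1 copy would use "iso-8859-1" instead
    document = (
        f'<?xml version="1.0" encoding="{encoding}"?>\n'
        f"<text>Un cœur fidèle.</text>\n"
    )
    # (A Latin-1 copy would also require replacing œ, see point 5 below.)
    with open("corpus.xml", "w", encoding=encoding) as f:
        f.write(document)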

3. Consider inscribing the encoding in the name of files

This is especially useful for recognizing at a glance variant versions of the same text encoded in different ways. The metadata approach (see 2 above) does the job as well, but it requires opening the file and viewing its content. Both techniques may be profitably used in conjunction, file naming being primarily targeted at the user while metadata can be easily interpreted by the machine; in that case, don’t forget to keep them synchronized.

If you choose to inscribe the encoding in file names, make sure to do it in a way that’s consistent with file naming best practices.
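By way of illustration, here is one possible (and entirely hypothetical) naming scheme, sketched in Python:

    from pathlib import Path

    def encoded_name(base: str, encoding: str, extension: str = ".txt") -> Path:
        """Append a file-name-friendly encoding label to a base name."""
        label = encoding.lower().replace("-", "")
        return Path(f"{base}_{label}{extension}")

    print(encoded_name("interview", "utf-8"))       # interview_utf8.txt
    print(encoded_name("interview", "iso-8859-1"))  # interview_iso88591.txt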

4. Adopt standard encodings

To some extent, every known encoding is standard, so a better way to put it would be “adopt the most widely used encodings”. This will maximize the odds that your software, as well as the software of the people you collaborate with, is able to deal with your data seamlessly. For Western European languages, that would be Latin-1 (aka iso-8859-1) or utf-8 (ASCII too, but it’s only useful for English, or for Latin).

5. Keep a master copy in the most restricted possible encoding

The character sets associated with some encodings stand in a relationship of strict inclusion: from that point of view, ASCII is a subset of Latin-1, which in turn is a subset of utf-8, for instance. Converting data to a less restricted encoding (e.g. from ASCII to Latin-1) is a trivial operation that can usually be done with your favorite text editor (or with Orange Textable, for the adventurous). By contrast, converting to a more restricted encoding (e.g. from utf-8 to Latin-1) can imply a loss of information, because you’ll need to perform systematic substitutions such as œ to oe.
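For instance, here is a rough Python sketch of such a downgrading conversion; the file names and the list of substitutions are only placeholders to be adapted to your own texts:

    # Convert a utf-8 master copy to Latin-1, applying the substitutions
    # that Latin-1 makes necessary; any character left unmapped will raise
    # an error, which is preferable to losing it silently.
    substitutions = {"œ": "oe", "Œ": "OE"}  # extend as needed

    with open("master_utf8.txt", encoding="utf-8") as f:
        text = f.read()
    for char, replacement in substitutions.items():
        text = text.replace(char, replacement)

    with open("copy_latin1.txt", "w", encoding="iso-8859-1") as f:
        f.write(text)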

Moreover, while utf-8 (and Unicode in general) can be expected to become the most widely supported encoding in the future, many applications are still unable to handle it correctly, so for Western European languages, ASCII and Latin-1 are probably the most portable choices today.

For these reasons, I recommend keeping a master copy of your texts in the most restricted encoding that can represent them properly, and generating copies converted to other encodings when needed. There are many situations where such conversions are needed, in particular when interchange between various computer systems is to be expected, notably on the internet; in that case utf-8 is probably the safest bet.

That said, if you make the effort to create a copy of a text in a more restricted encoding, and carry out the necessary substitutions, by all means store it next to the Master Mold to spare yourself the inconvenience of going through this again, should the need arise.

6. Mind the BOM

The utf-16 Unicode encoding comes in two flavors known as “big-endian” and “little-endian”, whose differences matter little for this discussion. In order to distinguish between them, the Unicode standard prescribes that a specific character be included at the very beginning of the text: the so-called byte order mark, or BOM.

Now some text editors, notably Notepad on Windows, add a kind of BOM even to utf-8 files, although it is not actually required to process them correctly (this mark can be viewed by opening your file in a text editor as if it were encoded in Latin-1: it will appear as ï»¿ at the beginning of the file). And since this BOM is not required, some programs seem to believe it isn’t there, and treat it as if it were simply the initial part of the text instead of ignoring it.

This can have all sorts of undesirable consequences, such as failure to recognize that a file is in a given format (e.g. xml), overestimation of the length of the file, loss of ASCII compatibility, and so on. Because of that, it’s safer not to include a BOM in utf-8 files. Some text editors will let you make that choice when you save files (see e.g. this post); otherwise you’ll need to be more crafty (see these indications).
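If you work in Python, a small experiment makes the issue tangible: the standard “utf-8-sig” codec writes and strips the BOM, whereas plain “utf-8” hands it back to you as if it were text (the file name below is made up):

    # Write a file with a BOM, then read it back with and without BOM handling.
    with open("notepad_export.txt", "w", encoding="utf-8-sig") as f:
        f.write("<doc>hello</doc>")

    with open("notepad_export.txt", encoding="utf-8") as f:
        print(repr(f.read()))    # '\ufeff<doc>hello</doc>': the BOM is kept
    with open("notepad_export.txt", encoding="utf-8-sig") as f:
        print(repr(f.read()))    # '<doc>hello</doc>': the BOM is stripped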

7. Normalize Unicode strings

One last thing needs to be discussed regarding Unicode. It turns out that Unicode offers several ways of writing certain letters, in particular those that bear one or more diacritics. For instance, ç can be written either as a single character (known as LATIN SMALL LETTER C WITH CEDILLA), or as the combination of an ordinary character and a so-called “combining” one: LATIN SMALL LETTER C + COMBINING CEDILLA.

Processing these variant forms raises several issues, chief among them the fact that they’re not considered equivalent in string and pattern matching operations: queries including one such variant won’t match strings containing the other, words containing either variant will appear as distinct entries in an index, and so on. Furthermore, operations involving the notion of string length are also complicated by these variations: for instance, depending on the notation used, ça will be counted as a 2- or 3-character-long word.

Luckily, the solution has already been designed by the fine people of the Unicode Consortium. It is called NFC, “normalization form C” (for composition), and consists in a unique, standardized way of rewriting every string such that whenever a sequence of combined elements (e.g. c + COMBINING CEDILLA) has a “unitary” equivalent (e.g. ç), the latter is used systematically. While many Unicode strings are already in NFC (as noted on the site of Mark Davis, a very useful resource about NFC among other topics), converting those strings that are not is a crucial precaution in the context of text analysis. This is actually something many programs do automatically, including my very own Orange Textable.
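To make this concrete, here is a short Python illustration using the standard library’s unicodedata module; it shows both the problem described above and the NFC cure:

    import unicodedata

    composed = "\u00e7a"      # ç as a single character, followed by a
    decomposed = "c\u0327a"   # c + COMBINING CEDILLA, followed by a

    print(composed == decomposed)            # False: the two spellings differ
    print(len(composed), len(decomposed))    # 2 3: and so do their lengths

    # After NFC normalization, both spellings are the same string.
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True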
