Text encoding survival lessons (part 2 of 2)

In this second post on text encoding, I’d like to spell out a set of best practices, the rationale for which is best understood in light of the first post in this series. If you read only one of the two posts, however, it should be this one.

1. Know thy encoding

Every single text string is associated with an encoding (ASCII, Latin-1, utf-8, etc.), and the computer needs to know which one in order to display the text correctly. In some cases a computer program might be able to guess, but the process is not entirely error-free. In other cases, the information will be explicitly indicated in the text itself (see 2 below), but it could be wrong. There’s no better position for a text user to be in than actually knowing which encoding is associated with a given string.
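
To see why guessing is risky, here is a minimal Python sketch (the sample string is arbitrary): decoding bytes with the wrong encoding often produces no error at all, just garbled text.

    # The word "déjà" encoded as UTF-8 bytes.
    raw = "déjà".encode("utf-8")

    print(raw.decode("utf-8"))    # déjà (correct, because we know the encoding)
    print(raw.decode("latin-1"))  # dÃ©jÃ  (wrong guess: no error, just mojibake)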

2. If a dedicated metadata field exists, use it

In particular, html and xml files offer the possibility to inscribe their encoding directly in their content (here are several ways of doing so): an html file can declare it with a <meta charset="utf-8"> tag in its head, for instance, and an xml file can state it in its opening declaration, e.g. <?xml version="1.0" encoding="utf-8"?>. Importantly, make sure to update this information if you create alternate versions of a file with different encodings.
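
If you produce xml files programmatically, most libraries will write the declaration for you. Here is a minimal sketch with Python’s standard library (file name and content are made up for the example):

    import xml.etree.ElementTree as ET

    root = ET.Element("play", attrib={"title": "Hamlet"})
    ET.SubElement(root, "line").text = "To be, or not to be"

    # Passing an explicit encoding and xml_declaration=True writes the matching
    # declaration (<?xml version='1.0' encoding='utf-8'?>) at the top of the file.
    ET.ElementTree(root).write("hamlet.xml", encoding="utf-8", xml_declaration=True)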

3. Consider inscribing the encoding in the name of files

This is especially useful for recognizing at first glance variant versions of the same text encoded in different ways (e.g. hamlet_latin1.txt vs. hamlet_utf8.txt). The metadata approach (see 2 above) does the job as well, but it requires opening the file and viewing its content. Both techniques may profitably be used in conjunction, file naming being primarily targeted at the user while metadata can easily be interpreted by the machine; if you do combine them, don’t forget to keep them synchronized.

If you choose to inscribe the encoding in file names, make sure to do it in a way that’s consistent with file naming best practices.

4. Adopt standard encodings

To some extent, every known encoding is standard, so a better way to say this would be “adopt the most widely used encodings”. This will maximize the odds that your software, as well as the software of the people you collaborate with, is able to deal with your data seamlessly. For western European languages, that would be Latin-1 (aka iso-8859-1) or utf-8 (ASCII too, but it’s only useful for English or Latin).

5. Keep a master copy in the most restricted possible encoding

The character sets associated with some encodings are in a relationship of strict inclusion: from that point of view, ASCII is a subset of Latin-1, which in turn is a subset of utf-8, for instance. Converting data to a less restricted encoding (e.g. from ASCII to Latin-1) is a trivial operation that can usually be done using your favorite text editor (or with Orange Textable, for the adventurous). By contrast, converting to a more restricted encoding (e.g. from utf-8 to Latin-1) can imply a loss of information, because you’ll need to perform systematic substitutions such as œ to oe, for instance.
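
To make the asymmetry concrete, here is a minimal Python sketch (the sample string is arbitrary): converting to a less restricted encoding always works, while converting to a more restricted one fails unless you substitute the problematic characters first.

    text = "un cœur déçu"    # œ has no Latin-1 equivalent; é and ç do

    utf8_bytes = text.encode("utf-8")    # to a less restricted encoding: always works

    try:
        latin1_bytes = text.encode("latin-1")
    except UnicodeEncodeError:
        # To a more restricted encoding: fails on œ, so substitute it first (lossy).
        latin1_bytes = text.replace("œ", "oe").encode("latin-1")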

Moreover, while utf-8 (and Unicode in general) can be expected to become the most widely supported encoding in the future, many applications are still unable to handle it correctly, so that for western European languages, ASCII or Latin-1 are probably the most portable choices today.

For these reasons, I recommend keeping a master copy of your texts in the most restricted encoding that can represent them properly, and generating converted copies in other encodings when needed. There are many situations where such copies are needed, in particular when interchange between various computer systems is to be expected, notably on the internet; in that case utf-8 is probably the safest bet.

That said, if you make the effort to create a copy of a text in a more restricted encoding and carry out the necessary substitutions, by all means store it next to the master copy to spare yourself the inconvenience of going through this again, should the need arise.
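
Generating such converted copies is a very short script in most languages. Here is a sketch in Python (file names are hypothetical), reading a Latin-1 master and writing a utf-8 copy, with the encoding inscribed in each name as suggested in 3 above:

    # Read the master copy, assumed here to be in Latin-1.
    with open("hamlet_latin1.txt", encoding="latin-1") as master:
        content = master.read()

    # Write a UTF-8 copy next to it.
    with open("hamlet_utf8.txt", "w", encoding="utf-8") as copy:
        copy.write(content)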

6. Mind the BOM

The utf-16 Unicode encoding comes in two flavors known as “big-endian” and “little-endian”, whose differences matter little to this discussion. In order to distinguish between them, the Unicode standard prescribes that a specific character be included at the very beginning of the text, the so-called byte order mark or BOM.

Now some text editors, notably Notepad on Windows, add a kind of BOM even to utf-8 files, although it is not actually required to process them correctly (this mark can be viewed by opening your file in a text editor as if it were encoded in Latin-1: it will appear as ï»¿ at the beginning of the file). And since this BOM is not required, some programs don’t expect it, and treat it as if it were simply the initial substring of the text instead of ignoring it.

This can have all sorts of undesirable consequences, such as failure to recognize that a file is in a given format (e.g. xml), overestimation of the length of the file, loss of ASCII compatibility, and so on. Because of that, it’s safer not to include a BOM in utf-8 files. Some text editors will let you make that choice when you save files (see e.g. this post); otherwise you’ll need to be more crafty (see these indications).
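
If you process files programmatically, Python’s standard library offers a simple way around the problem (the file names below are hypothetical): the utf-8-sig codec strips a leading BOM when reading, and plain utf-8 writes none.

    import codecs

    # Reading: utf-8-sig silently drops a leading BOM if the file has one.
    with open("notes.txt", encoding="utf-8-sig") as f:
        text = f.read()

    # Checking for a BOM at the byte level, should you ever need to:
    with open("notes.txt", "rb") as f:
        has_bom = f.read(3) == codecs.BOM_UTF8

    # Writing: plain utf-8 (unlike utf-8-sig) does not add a BOM.
    with open("notes_no_bom.txt", "w", encoding="utf-8") as f:
        f.write(text)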

7. Normalize Unicode strings

Regarding Unicode, one last point needs to be discussed: it turns out that Unicode offers several ways of writing certain letters, in particular those that bear one or more diacritic symbols. For instance, ç can be written either as a single character (known as LATIN SMALL LETTER C WITH CEDILLA), or as the combination of an ordinary character and a so-called “combining” one: LATIN SMALL LETTER C + COMBINING CEDILLA.

Processing these variant forms raises several issues, chief among which is the fact that they’re not considered equivalent in string and pattern matching operations: queries including one such variant won’t match strings containing the other one, words containing either variant will appear as distinct entries in an index, and so on. Furthermore, operations involving the notion of string length are also made more complex by these variations: for instance, depending on the notation used, ça will be considered a 2- or 3-letter-long word.
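
Both problems are easy to reproduce in Python (the strings below are just examples):

    precomposed = "\u00e7a"     # ç as a single character, followed by a
    decomposed  = "c\u0327a"    # c + COMBINING CEDILLA, followed by a

    print(precomposed, decomposed)             # both display as "ça"
    print(precomposed == decomposed)           # False: they don't match
    print(len(precomposed), len(decomposed))   # 2 3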

Luckily, the solution has already been designed by the fine people of the Unicode Consortium. It is called NFC, “normalization form C” (for composition), and consists of a unique and standardized way of rewriting every string, in such a fashion that whenever a sequence of combined elements (e.g. c + COMBINING CEDILLA) has a “unitary” equivalent (e.g. ç), the latter is used systematically. While many Unicode strings are already in NFC (as noted on the site of Mark Davis, a very useful resource about NFC among other topics), converting those strings that are not in NFC is a crucial precaution in the context of text analysis. This is actually something many programs do automatically, including my very own Orange Textable.
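
In Python, this normalization is performed by the standard library’s unicodedata module; continuing the example above:

    import unicodedata

    decomposed = "c\u0327a"      # c + COMBINING CEDILLA, followed by a
    normalized = unicodedata.normalize("NFC", decomposed)

    print(normalized == "\u00e7a")   # True: now matches the precomposed form
    print(len(normalized))           # 2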

On text file names

Chances are good that even a very casual interest in text data analysis will lead you to manipulate text files. In this post I’d like to discuss a couple of file naming practices that can save you a lot of time and hassle as you start working with more files, more often.

1. Use informative names

If you only ever work with a single file, calling it text might do. As soon as a second one enters the radar, calling it text2 will usually make for a poor and confusing naming scheme (when compared to, say, hamlet).

2. Always add a suffix

Like it or not, the vast majority of computers in the world are running Windows. On such computers, file names conventionally end with what’s called a file “suffix” or “extension”: a short string (usually 3 letters long) preceded by a dot, such as .txt, .pdf, or .html. This string is what lets the user and the computer know which app is meant to open a given file.

File suffixes are not needed on macOS or Linux, but using them systematically will prove immensely useful if your files somehow land on Windows someday. For text files, .txt is your safest bet unless you know that your data are in some specific format (e.g. xml).

That said, if you are a Windows user, I recommend deactivating the inconvenient system default setting that hides suffixes from you: here’s how to do that.

3. Use only non-accented alphanumeric characters

As you’ve probably experienced more than once and as we’ll discuss in another post, many characters tend to be wrongly interpreted when moving from computer to computer (sometimes even from app to app). The subset that’s most likely to be processed correctly everywhere is that of non-accented alphanumeric chars: from a to z, from A to Z, and from 0 to 9 (the underscore _ is also included in this subset, see 5. below).

Being a French speaker, I know that sometimes, sticking to non-accented chars hurts. Still, it hurts less when you choose to do it in the first place than when you’re forced to rename dozens of files because their names were corrupted when moving from your Mac at work to your PC at home or the other way round.

4. Use only lowercase chars

Non-accented uppercase letters are usually correctly handled by various apps and computers. The problem with them is that they are often used in an inconsistent fashion, notably for title, name, and abbreviation capitalization. As a result, you’ll often find yourself wondering whether you did (or should) capitalize these elements in a given file name or not, unless you decide no later than right now never to use uppercase letters again in file names.

5. Use underscore in place of whitespace

Using informative names often implies assembling multiple words, which in turn raises the question of legibility. With respect to that, it’s important to refrain from making the most intuitive move, namely inserting whitespace between words. Indeed, there are several contexts where file names won’t be processed correctly if they contain whitespace (notably when what’s known as a “command prompt” is involved, sometimes even without the user’s knowledge).

At the same time, sticking to lowercase letters obviously prevents you from adopting a mixed-case convention (e.g. declarationHumanRights.txt). Luckily there is a quite portable solution which is also, arguably, more legible: replacing whitespace with underscore, as in declaration_human_rights.txt.
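
To see why whitespace is risky, here is a small Python sketch (the command and file names are made up): a command prompt splits its input on whitespace, so an unquoted name containing spaces falls apart into several arguments.

    import shlex

    # How a command prompt would tokenize an unquoted file name with spaces:
    print(shlex.split("wc -l declaration human rights.txt"))
    # ['wc', '-l', 'declaration', 'human', 'rights.txt']  -> three "files" instead of one

    print(shlex.split("wc -l declaration_human_rights.txt"))
    # ['wc', '-l', 'declaration_human_rights.txt']        -> one file, as intended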

6. When using numbers, make them fixed-length

Lists of file names are often sorted in an order that seems alphabetical but is really “ASCII-betical” (we’ll talk more about ASCII in another post). As a result, sample10.txt will be placed between sample1.txt and sample2.txt, which is rarely useful.

In order to fix this, simply pad numbers with leading 0’s so that they’re all the same length: e.g. sample01.txt, sample02.txt, sample10.txt.
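
A quick Python illustration of both the problem and the fix (file names are just examples):

    names = ["sample1.txt", "sample2.txt", "sample10.txt"]
    print(sorted(names))
    # ['sample1.txt', 'sample10.txt', 'sample2.txt']   -> ASCII-betical, not numerical

    padded = [f"sample{i:02d}.txt" for i in (1, 2, 10)]
    print(sorted(padded))
    # ['sample01.txt', 'sample02.txt', 'sample10.txt'] -> sorts as intended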

7. When using dates, make them yyyy_mm_dd

This way of spelling out dates (e.g. transcript_2014_09_14.txt, or the more compact but less legible transcript_20140914.txt) ensures that the ASCII-betical order matches the chronological one.
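
If you generate such names programmatically, a small sketch along these lines keeps them consistent (the prefix and date are just examples):

    from datetime import date

    stamp = date(2014, 9, 14).strftime("%Y_%m_%d")
    print(f"transcript_{stamp}.txt")   # transcript_2014_09_14.txt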

8. Take advantage of ASCII-betical order for categorizing files

If your files are organized according to some form of (typically hierarchical) categorization, put these categories at the beginning of file names, from the broadest (or most important) to the finest (or least important). This will make it easy, or at least easier, to select a subset of them based on these categories (by sorting them in ASCII-betical order).

For instance, suppose that you have a set of 12 files: 6 in English and 6 in French, and within each language 3 recipes and 3 user manuals. If the most important categorization for your purposes is language, a useful naming scheme might be en_recipe_1.txt, fr_manual_3.txt, etc. If text type matters most, you should rather use recipe_en_1.txt.
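
A last Python sketch to illustrate (file names are made up): with categories at the front of the names, plain ASCII-betical sorting groups the files by category, and selecting a subset reduces to a prefix test.

    names = ["fr_manual_1.txt", "en_recipe_2.txt", "en_manual_1.txt", "fr_recipe_3.txt"]

    # Sorting groups files by language first, then by text type:
    print(sorted(names))
    # ['en_manual_1.txt', 'en_recipe_2.txt', 'fr_manual_1.txt', 'fr_recipe_3.txt']

    # Selecting all English files is a simple prefix test:
    print([n for n in names if n.startswith("en_")])
    # ['en_manual_1.txt', 'en_recipe_2.txt']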