
Text encoding survival lessons (part 2 of 2)

In this second post on text encoding, I’d like to spell out a set of best practices, the rationale for which is best understood in light of the first post in this series. If you read only one of the two posts, however, it should be this one.

1. Know thy encoding

Every single text string is associated with an encoding (ASCII, Latin-1, utf-8, etc.), and the computer needs to know which one it is in order to display the text correctly. In some cases a computer program might be able to guess, but the process is not entirely error-free. In other cases, the information will be explicitly indicated in the text itself (see 2 below), but it could be wrong. There’s no better configuration for a text user than actually knowing which encoding is associated with a given string.

2. If a dedicated metadata field exists, use it

In particular, html and xml files make it possible to declare their encoding directly in their content (here are several ways of doing so). Importantly, make sure to update this information if you create alternate versions of a file in different encodings.
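As an illustration, here is a minimal sketch in Python (the file name and content are hypothetical) of writing an xml file whose declaration matches the encoding actually used to save it:

```python
# Hypothetical example: the encoding named in the xml declaration
# should match the encoding passed to open().
encoding = "utf-8"
declaration = '<?xml version="1.0" encoding="{}"?>\n'.format(encoding)
body = "<text>Où est passé mon œuf ?</text>\n"

with open("hamlet_utf8.xml", "w", encoding=encoding) as f:
    f.write(declaration + body)
```

If you later convert such a file to another encoding, remember to update the declaration accordingly.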

3. Consider inscribing the encoding in the name of files

This is especially useful to recognize at first glance variant versions of the same text encoded in different ways. The metadata approach (see 2 above) does the job as well, but it requires opening the file and viewing its content. Both techniques may profitably be used in conjunction, file naming being primarily targeted at the user while metadata can easily be interpreted by the machine; if you do combine them, don’t forget to keep them synchronized.

If you choose to inscribe the encoding in file names, make sure to do it in a way that’s consistent with file naming best practices.

4. Adopt standard encodings

To some extent, every known encoding is standard, so a better way to say this would be “adopt the most widely used encodings”. This will maximize the odds that your software, as well as the software of the people you collaborate with, is able to deal with your data seamlessly. For western European languages, that would be Latin-1 (aka iso-8859-1) or utf-8 (ASCII too, but it’s only useful for English, or for Latin).

5. Keep a master copy in the most restricted possible encoding

The character sets associated with some encodings are in a relationship of strict inclusion: from that point of view, ASCII is a subset of Latin-1, which in turn is a subset of utf-8, for instance. Converting data to a less restricted encoding (e.g. from ASCII to Latin-1) is a trivial operation that can usually be done using your favorite text editor (or with Orange Textable, for the adventurous). By contrast, converting to a more restricted encoding (e.g. from utf-8 to Latin-1) can imply a loss of information, because you’ll need to perform systematic substitutions such as œ to oe, for instance.
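As an illustration, here is a minimal sketch (Python 3, hypothetical file names) of the trivial direction, converting a Latin-1 file to utf-8; the final comment shows how the reverse direction can fail:

```python
# Read a Latin-1 file and save a utf-8 copy (hypothetical file names).
with open("hamlet_latin1.txt", encoding="latin-1") as f:
    text = f.read()

with open("hamlet_utf8.txt", "w", encoding="utf-8") as f:
    f.write(text)

# The reverse direction may lose information: characters with no Latin-1
# equivalent raise an error unless you substitute them first.
# "œuvre".encode("latin-1")  # UnicodeEncodeError
```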

Moreover, while utf-8 (and Unicode in general) can be expected to become the most widely supported encoding in the future, many applications are still unable to handle it correctly, so that for western European languages, ASCII or Latin-1 are probably the most portable choices today.

For these reasons, I recommend keeping a master copy of your texts in the most restricted encoding that can be used to represent them properly, and generating converted copies in other encodings when needed. There are many situations where this may be needed, in particular when interchange between various computer systems is to be expected, notably on the internet; in this case utf-8 would probably be the safest bet.

That said, if you make the effort to create a copy of a text in a more restricted encoding, and carry out the necessary substitutions, by all means store it next to the master copy to spare yourself the inconvenience of going through this again, should the need arise.

6. Mind the BOM

The utf-16 Unicode encoding comes in two flavors known as “big-endian” and “little-endian”, whose differences matter little to this discussion. In order to distinguish between them, the Unicode standard prescribes that a specific character be included at the very beginning of the text, the so-called byte order mark or BOM.

Now some text editors, notably Notepad on Windows, add a kind of BOM even to utf-8 files, although it is not actually required to process them correctly (this mark can be viewed by opening your file in a text editor as if it were encoded in Latin-1, where it will appear as ï»¿ at the beginning of the file). And since this BOM is not required, some programs seem to believe it’s not there, and treat it as if it were simply the initial substring of the text instead of ignoring it.

This can have all sorts of undesirable consequences, such as failure to recognize that a file is in a given format (e.g. xml), overestimation of the length of the file, loss of ASCII compatibility, and so on. Because of that, it’s safer not to include a BOM in utf-8 files. Some text editors will let you make that choice when you save files (see e.g. this post), otherwise you’ll need to be more crafty (see these indications).
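If you do need to clean up such files, here is a minimal sketch (Python 3, hypothetical file names): the ‘utf-8-sig’ codec accepts a leading BOM and strips it, so that saving the text again with plain ‘utf-8’ yields a BOM-free file.

```python
# Read a utf-8 file, silently dropping a leading BOM if there is one.
with open("notes.txt", encoding="utf-8-sig") as f:
    text = f.read()

# Write the text back without a BOM.
with open("notes_no_bom.txt", "w", encoding="utf-8") as f:
    f.write(text)
```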

7. Normalize Unicode strings

One last thing needs to be discussed regarding Unicode: it turns out that Unicode offers several ways of writing certain letters, in particular those that bear one or more diacritics. For instance, ç can be written either as a single character (known as LATIN SMALL LETTER C WITH CEDILLA), or as the combination of an ordinary character and a so-called “combining” one: LATIN SMALL LETTER C + COMBINING CEDILLA.

Processing these variant forms raises several issues, chief among which the fact that they’re not considered equivalent in string and pattern matching operations: queries including one such variant won’t match strings containing the other one, words containing either variant will appear as distinct entries in an index, and so on. Furthermore, operations involving the notion of string length are also rendered more complex by these variations: for instance, depending on the notation used, ça will be considered a 2- or 3-character-long word.

Luckily, the solution has already been designed by the fine people of the Unicode Consortium. It is called NFC, “normalization form C” (for composition), and consists in a unique and standardized way of rewriting every string, in such a fashion that whenever a sequence of combined elements (e.g. c + COMBINING CEDILLA) has a “unitary” equivalent (e.g. ç), the latter is used systematically. While many Unicode strings are already in NFC (as noted on the site of Mark Davis, a very useful resource about NFC among other topics), converting those strings that are not in NFC is a crucial precaution in the context of text analysis. This is actually something many programs do automatically, including my very own Orange Textable.
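The sketch below (Python) illustrates both the problem and the solution: the two spellings of ça look identical on screen, yet they neither match nor have the same length until they are normalized with the standard unicodedata module.

```python
import unicodedata

composed = "\u00e7a"      # ç as a single code point, followed by a
decomposed = "c\u0327a"   # c + COMBINING CEDILLA, followed by a

print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 2 3

# Normalization form C rewrites the decomposed variant as the composed one.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)                  # True
```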

Text encoding survival lessons (part 1 of 2)

If you’ve ever manipulated text data, you must have suffered from the encoding syndrome: when strings start to look like they’ve gone all �����. My aim in this post (the first in a series of two) is to provide non-technical readers with the background knowledge about text encoding that should help them understand how to keep their level of ���� to a strict minimum in the future. The second post will summarize best practices related to text encoding that can be derived from this background knowledge.

Strings are stored in binary format on computers

In fact, everything is stored as a sequence of 0’s and 1’s on a computer; that’s one of the most well-known facts about computers outside of computer science, and the reason why binary strings have become a common symbol for anything digital. As regards terminology, a single 0 or 1 is called a bit, and a sequence of 8 bits is called a byte. Bytes matter because they’re the standard way of chunking binary strings (hence the B in KB, MB, and so on). Just as a bit can take 2 values (0 and 1), a byte can take exactly 256 distinct values, namely 00000000, 00000001, 00000010, 00000011, …, 11111110, 11111111 (256 is 2 to the 8th, for the mathematically inclined).

Returning to text data with these notions at hand, we may summarize the problem of text encoding as that of mapping the internal representation of strings, which is in bytes, to their representation as hopefully meaningful sequences of characters in some language’s alphabet. And it would be a straightforward process if there was only one way of doing it, which is obviously not the ����.

The ASCII code

Back in the 60’s, when computer science was purely anglocentric, it was assumed that, as far as the alphabet was concerned, not much was needed besides a to z and A to Z. This assumption didn’t suit many languages of the world very well, but anyway, it was the basis of the first widespread encoding, known as the American Standard Code for Information Interchange (ASCII), which is still in use half a century later.

The ASCII code is really just a conventional way of associating unaccented letters, numbers, and a bunch of useful characters (such as whitespace, dot, comma, underscore, ampersand, dollar, and so on) to bytes—or more precisely, to sequences of 7 bits. Why only 7? Because that’s what’s needed to represent the set of 128 characters in question. So for instance, A is 1000001, B is 1000010, a is 1100001, and so on (cf. this table).
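For the curious, the correspondence is easy to check in Python:

```python
# ASCII values of a few characters, shown as 7-bit binary strings.
for char in "ABa":
    print(char, ord(char), format(ord(char), "07b"))
# A 65 1000001
# B 66 1000010
# a 97 1100001
```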

Code pages, or the return of the 8th bit

Now almost none of the world’s languages can be represented properly with the 128 characters of the ASCII code. Even those languages that do use these characters require extra ones such as é, ñ, and the like, and they form a rather influential group. Luckily for them, since binary data are chunked in bytes, every character in an ASCII-encoded string contains a free, extra bit that can be put to good use.

So at some point in the 80’s, many people had the same idea: let’s prepend each of the 128 existing ASCII 7-bit strings with a 0 (so e.g. A goes from 1000001 to 01000001), and we’ll get 128 other bytes (those that begin with 1, such as 11000001) to assign to é, ñ & co. That was a smart move, although obviously no single set of 128 extra characters could satisfy the needs of every language, even in the restricted group of languages with the largest economic weight at that time.

As a result, while everyone agreed about what came to be known as low ASCII (namely from 00000000 to 01111111), a number of distinct proposals were made for high or extended ASCII (from 10000000 to 11111111), and each such proposal was called a code page. In particular, the ISO-8859 norm is a widely used set of 16 code pages which assign various characters to each high-ASCII byte: for instance, 11110001 is ñ in ISO-8859-1 (aka Latin-1), ń in ISO-8859-2, ё in ISO-8859-5, ρ in ISO-8859-7, and so on. And many more code pages have been proposed, notably by major players in the computer and software industry.
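The following sketch (Python) shows the phenomenon at work: the very same byte is decoded as a different character depending on the code page it is read through.

```python
# One byte, four code pages, four different characters.
byte = b"\xf1"  # 11110001
for codec in ("iso8859-1", "iso8859-2", "iso8859-5", "iso8859-7"):
    print(codec, byte.decode(codec))
# iso8859-1 ñ
# iso8859-2 ń
# iso8859-5 ё
# iso8859-7 ρ
```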

Unicode

For work that does not require using more than one code page at a time, the code page system works rather well—as long as the elements of your work environment (hardware, software, data) are all attuned. When they cease to be, for instance because your text file is in ISO-8859-1 and your text processor believes it to be in plain ASCII, there comes the ����. With the advent of the World Wide Web in the 90’s, having to deal with documents in a wealth of mostly unidentified code pages became normal, and the need for a better solution became more obvious than ever. The solution that has been most widely adopted is Unicode.

As suggested by its name, Unicode has the considerable ambition of offering a single, unified solution to the problem of text encoding, one that works for all languages of the known world and beyond. It is based on a couple of simple and powerful insights, the most important of which is arguably that all characters of all alphabets should be assigned a unique and permanent number (a “code point”, in technical jargon), regardless of the scheme that will be used to represent these numbers in the memory of a given device. Building and maintaining this huge directory of characters (113,021 as of June 2014) has kept the Unicode Consortium busy for more than two decades.

Unicode code points still have to be represented in memory somehow, and this is where things become more complicated (again). Indeed there are several ways of doing so (yes, again), the details of which are beyond the scope of this post. The most often used Unicode encodings are UTF-16 (which itself comes in two flavors called little-endian and big-endian) and UTF-8. UTF-8 encodes each code point in a variable number of bytes (from 1 to 4), with the nice feature that the 128 ASCII characters are encoded with a single byte, exactly like in the ASCII code; thus every ASCII-encoded string is also, in fact, a valid UTF-8 string (though the reverse does not hold). In a domain that remains strongly anglocentric, this property of UTF-8 has probably played an important role in its impressive adoption rate.
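Here is a minimal Python sketch of UTF-8’s variable-length design: plain ASCII characters keep their single-byte representation, while other code points take up to 4 bytes.

```python
# Number of bytes used by UTF-8 for a few code points.
for char in ("A", "é", "€", "𝄞"):
    print(char, len(char.encode("utf-8")))
# A 1
# é 2
# € 3
# 𝄞 4

# An ASCII-encoded string is byte-for-byte identical to its utf-8 encoding.
print("plain text".encode("ascii") == "plain text".encode("utf-8"))  # True
```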

Coda and acknowledgments

There is a lot more to say about text encoding (including byte-order marks, combining characters, and so on), and I intend to say more in other posts, in particular in a follow-up about best practices. For the time being, I believe that this post provides enough background knowledge to understand why I recommend those practices—that is, when they’re posted. If you think otherwise, or if you think anything about this material, by all means leave a reply and let me know.

A lot of the inspiration for this post comes from an excellent and classic piece by Joel Spolsky. I’ve basically tried to do for an intended audience of non-technical users with a specific interest in text data what he did brilliantly for software developers.

On text file names

Chances are good that even a very casual interest in text data analysis will lead you to manipulate text files. In this post I’d like to discuss a couple of file naming practices that can save you a lot of time and hassle when you start working with more such files, more frequently.

1. Use informative names

If you only ever work with a single file, calling it text might do. As soon as a second one enters the radar, calling it text2 will usually make for a poor and confusing naming scheme (when compared to, say, hamlet).

2. Always add a suffix

Like it or not, the vast majority of computers in the world are running Windows. On such computers, file names conventionally end with what’s called a file “suffix” or “extension”: a short string (usually 3 letters long) preceded by a dot, such as .txt, .pdf, or .html. This string is what lets the user and the computer know which app is meant to open a given file.

File suffixes are not needed on MacOS or Linux, but using them systematically will prove immensely useful if your files somehow land on Windows someday. For text files, .txt is your safest bet unless you know that your data are in some specific format (e.g. xml).

That said, if you are a Windows user, I recommend deactivating the inconvenient system default setting that hides suffixes from you: here’s how to do that.

3. Use only non-accented alphanumeric characters

As you’ve probably experienced more than once and as we’ll discuss in another post, many characters tend to be wrongly interpreted when moving from computer to computer (sometimes even from app to app). The subset that’s most likely to be processed correctly everywhere is that of non-accented alphanumeric chars: from a to z, from A to Z, and from 0 to 9 (the underscore _ is also included in this subset, see 5. below).

Being a French speaker, I know that sometimes, sticking to non-accented chars hurts. Still, it hurts less when you choose to do it in the first place than when you’re forced to rename dozens of files because their names were corrupted when moving from your Mac at work to your PC at home or the other way round.

4. Use only lowercase chars

Non-accented uppercase letters are usually correctly handled by various apps and computers. The problem with them is that they are often used in an inconsistent fashion, notably for capitalizing titles, names, and abbreviations. As a result, you’ll often find yourself wondering whether you did (or should) capitalize these elements in a given file name, unless you decide, no later than right now, never to use uppercase letters in file names again.

5. Use underscore in place of whitespace

Using informative names often implies assembling multiple words, which in turn raises the question of legibility. With respect to that, it’s important to refrain from making the most intuitive move, namely inserting whitespace between words. Indeed, there are several contexts where file names won’t be processed correctly if they contain whitespace (notably when what’s known as a “command prompt” is involved, sometimes even without the user’s knowledge).

At the same time, sticking to lowercase letters obviously prevents you from adopting a mixed-case convention (e.g. declarationHumanRights.txt). Luckily there is a quite portable solution which is also, arguably, more legible: replacing whitespace with underscore, as in declaration_human_rights.txt.
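To make rules 3 to 5 concrete, here is a minimal sketch (Python; the helper function and the example names are hypothetical, not a feature of any particular tool) that turns an arbitrary title into a name made of non-accented, lowercase characters joined by underscores:

```python
import re
import unicodedata

def to_file_name(title, suffix=".txt"):
    # Decompose accented characters, then drop everything outside ASCII.
    ascii_title = (unicodedata.normalize("NFKD", title)
                   .encode("ascii", "ignore")
                   .decode("ascii"))
    # Keep lowercase alphanumeric runs and join them with underscores.
    words = re.findall(r"[a-z0-9]+", ascii_title.lower())
    return "_".join(words) + suffix

print(to_file_name("Déclaration des Droits de l'Homme"))
# declaration_des_droits_de_l_homme.txt
```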

6. When using numbers, make them fixed-length

Lists of file names are often sorted in an order that seems alphabetical but is really “ASCII-betical” (we’ll talk more about ASCII in another post). As a result, sample10.txt will be placed between sample1.txt and sample2.txt, which is rarely useful.

In order to fix this, simply prepend numbers with 0’s so that they’re all the same length: e.g. sample01.txt, sample02.txt, sample10.txt.
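In Python, for instance, format specifications make the padding automatic (the file names are of course hypothetical):

```python
# Zero-padded numbers sort correctly in ASCII-betical order.
names = ["sample{:02d}.txt".format(i) for i in (1, 2, 10)]
print(sorted(names))
# ['sample01.txt', 'sample02.txt', 'sample10.txt']
```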

7. When using dates, make them yyyy_mm_dd

This way of spelling out dates (e.g. transcript_2014_09_14.txt, or the more compact but less legible transcript_20140914.txt) ensures that the ASCII-betical order matches the chronological one.
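If the dates come from a program rather than from manual typing, the standard datetime module can produce this format directly (a minimal sketch, hypothetical file name):

```python
import datetime

# Format a date as yyyy_mm_dd and embed it in a file name.
date = datetime.date(2014, 9, 14)
print("transcript_{}.txt".format(date.strftime("%Y_%m_%d")))
# transcript_2014_09_14.txt
```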

8. Take advantage of ASCII-betical order for categorizing files

If your files are organized according to some form of (typically hierarchical) categorization, put these categories at the beginning of file names, from the broadest (or most important) to the finest (or least important). This will make it easy, or at least easier, to select a subset of them based on these categories (by sorting them in ASCII-betical order).

For instance, suppose that you have a set of 12 files: 6 in English and 6 in French, and within each language 3 recipes and 3 user manuals. If the most important categorization for your purposes is language, a useful naming scheme might be en_recipe_1.txt, fr_manual_3.txt, etc. If text type matters most, use recipe_en_1.txt instead.
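A quick sanity check of the language-first scheme (Python, hypothetical file names): sorting groups the files by language first, then by text type.

```python
names = ["fr_manual_3.txt", "en_recipe_1.txt",
         "en_manual_2.txt", "fr_recipe_1.txt"]
print(sorted(names))
# ['en_manual_2.txt', 'en_recipe_1.txt', 'fr_manual_3.txt', 'fr_recipe_1.txt']
```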

Getting started

I’ve been thinking about starting an academic blog for long enough; it’s time for me to get started.

Why would I do such a thing? In part because I’ve come across such inspiring blogs as Ted Underwood’s, Matthew Jockers’, and Jeff Atwood’s, to cite just a few, which prove that if a blog is lame, it is not necessarily due to the nature of the medium (awkward as it may seem, this is actually intended as praise for these authors).

Also, I’m the author of an open source text analysis program called Orange Textable, whose most salient feature is its adoption of a visual programming interface. This design aims to provide non-technical users, typical humanists in particular, with (part of) the computing power normally reserved for programmers, albeit (almost) without coding, and of course within the restricted domain of text analysis.

Being the author of a software tool doesn’t make a compelling case for starting a blog per se, but the nature of Orange Textable‘s user interface is such that it can be more easily understood, I believe, through the examination of a number of diverse use cases. And since I’m frequently exploring such use cases in the course of the software’s development, I figured it might be useful to document them in passing on a blog.

I’d also like to use this channel as a means to share my views on some more general issues about Humanities computing and text analysis in particular. I hope that teaching in this field for a couple of years has given me a sense of the ways in which beginners can avoid picking up, right from the start, habits that they’ll want to get rid of once they get more experienced. Best practices, if you will.

And who knows, maybe this turns out to be a good way to draft a book on visual programming for text analysis…