Chances are good that even a very casual interest in text data analysis will lead you to manipulate text files. In this post I’d like to discuss a couple of file naming practices that can save you a lot of time and hassle when you start working with more such files, more frequently.
1. Use informative names
If you only ever work with a single file, calling it text might do. As soon as a second one appears on the radar, calling it text2 will usually make for a poor and confusing naming scheme (compared to, say, hamlet).
2. Always add a suffix
Like it or not, the vast majority of desktop and laptop computers in the world are running Windows. On such computers, file names conventionally end with what’s called a file “suffix” or “extension”: a short string (usually three letters long) preceded by a dot, such as .txt, .pdf, or .html. This string is what lets the user and the computer know which app is meant to open a given file.
File suffixes are not strictly needed on macOS or Linux, but using them systematically will prove immensely useful if your files ever land on a Windows machine. For text files, .txt is your safest bet unless you know that your data are in some specific format (e.g. .xml).
That said, if you are a Windows user, I recommend deactivating the inconvenient default system setting that hides suffixes from you: here’s how to do that.
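To make the "always add a suffix" habit concrete, here is a minimal Python sketch. The helper name ensure_suffix and the choice of .txt as the default are my own assumptions for illustration; adapt them to your data.

```python
from pathlib import Path


def ensure_suffix(name: str, default: str = ".txt") -> str:
    """Return the name unchanged if it already has a suffix, else append the default."""
    # Path.suffix is the last dot-separated part (e.g. ".xml"), or "" if there is none.
    return name if Path(name).suffix else name + default


print(ensure_suffix("hamlet"))      # hamlet.txt
print(ensure_suffix("hamlet.xml"))  # hamlet.xml
```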
3. Use only non-accented alphanumeric characters
As you’ve probably experienced more than once and as we’ll discuss in another post, many characters tend to be wrongly interpreted when moving from computer to computer (sometimes even from app to app). The subset that’s most likely to be processed correctly everywhere is that of non-accented alphanumeric chars: from a to z, from A to Z, and from 0 to 9 (the underscore _ is also included in this subset, see 5. below).
Being a French speaker, I know that sticking to non-accented chars sometimes hurts. Still, it hurts less when you choose to do it in the first place than when you’re forced to rename dozens of files because their names were corrupted when moving from your Mac at work to your PC at home, or the other way round.
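If you already have accented names lying around, something like the following Python sketch can help you spot and clean them. The function names is_portable and strip_accents are hypothetical, and the approach (Unicode decomposition, then dropping combining marks) is just one possible way to do it; it will not handle characters such as œ or ß, which have no accent to strip.

```python
import re
import unicodedata

# Letters a-z and A-Z, digits, underscore, plus an optional alphanumeric suffix.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+(\.[A-Za-z0-9]+)?")


def is_portable(name: str) -> bool:
    """True if the name uses only non-accented alphanumerics and underscores."""
    return SAFE_NAME.fullmatch(name) is not None


def strip_accents(name: str) -> str:
    """Split accented letters into base letter + accent, then drop the accents."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(is_portable("résumé_été.txt"))    # False
print(strip_accents("résumé_été.txt"))  # resume_ete.txt
```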
4. Use only lowercase chars
Non-accented uppercase letters are usually handled correctly by various apps and computers. The problem with them is that they are often used inconsistently, notably when capitalizing titles, names, and abbreviations. As a result, you’ll often find yourself wondering whether you capitalized (or should capitalize) these elements in a given file name, unless you decide right now never to use uppercase letters in file names again.
5. Use underscore in place of whitespace
Using informative names often implies assembling multiple words, which in turn raises the question of legibility. With respect to that, it’s important to refrain from making the most intuitive move, namely inserting whitespace between words. Indeed, there are several contexts where file names won’t be processed correctly if they contain whitespace (notably when what’s known as a “command prompt” is involved, sometimes even without the user’s knowledge).
At the same time, sticking to lowercase letters obviously prevents you from adopting a mixed-case convention (e.g. declarationHumanRights.txt). Luckily there is a quite portable solution which is also, arguably, more legible: replacing whitespace with underscores, as in declaration_human_rights.txt.
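Rules 4 and 5 are easy to apply mechanically. Here is a small Python sketch; the function name to_snake_name is made up for this post, and the way it splits mixed-case names (an underscore wherever a lowercase letter or digit is followed by an uppercase letter) is an assumption that may not suit every name.

```python
import re


def to_snake_name(name: str) -> str:
    """Lowercase a file name and join its words with underscores."""
    # Put an underscore at each lowercase-to-uppercase boundary (mixed-case names).
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Replace each run of whitespace with a single underscore.
    name = re.sub(r"\s+", "_", name.strip())
    return name.lower()


print(to_snake_name("declarationHumanRights.txt"))    # declaration_human_rights.txt
print(to_snake_name("Declaration Human Rights.txt"))  # declaration_human_rights.txt
```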
6. When using numbers, make them fixed-length
Lists of file names are often sorted in an order that seems alphabetical but is really “ASCII-betical” (we’ll talk more about ASCII in another post). As a result, sample10.txt will be placed between sample1.txt and sample2.txt, which is rarely useful.
To fix this, simply pad numbers with leading zeros so that they’re all the same length: e.g. sample01.txt, sample02.txt, sample10.txt.
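You can see the effect for yourself in Python, whose default string sort compares characters by their code, just like the ASCII-betical order described above. The two-digit width is an assumption; use three digits if you expect more than 99 files.

```python
numbers = (1, 2, 10)

plain = sorted(f"sample{n}.txt" for n in numbers)
print(plain)   # ['sample1.txt', 'sample10.txt', 'sample2.txt']

# The 02d format spec pads each number with leading zeros to two digits.
padded = sorted(f"sample{n:02d}.txt" for n in numbers)
print(padded)  # ['sample01.txt', 'sample02.txt', 'sample10.txt']
```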
7. When using dates, make them yyyy_mm_dd
Spelling out dates this way (e.g. transcript_2014_09_14.txt, or the more compact but less legible transcript_20140914.txt) ensures that the ASCII-betical order matches the chronological one.
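If you generate file names from a script, the date part can be produced for you. In this Python sketch, the helper name dated_name and the sample dates are invented for illustration.

```python
from datetime import date


def dated_name(stem: str, day: date) -> str:
    """Build a name such as transcript_2014_09_14.txt from a stem and a date."""
    # %Y is the 4-digit year; %m and %d are the zero-padded month and day.
    return f"{stem}_{day:%Y_%m_%d}.txt"


days = [date(2014, 9, 14), date(2013, 12, 1), date(2014, 1, 5)]
names = sorted(dated_name("transcript", d) for d in days)
print(names)
# ['transcript_2013_12_01.txt', 'transcript_2014_01_05.txt', 'transcript_2014_09_14.txt']
```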
8. Take advantage of ASCII-betical order for categorizing files
If your files are organized according to some form of (typically hierarchical) categorization, put these categories at the beginning of file names, from the broadest (or most important) to the finest (or least important). This will make it easy, or at least easier, to select a subset of them based on these categories (by sorting them in ASCII-betical order).
For instance, suppose that you have a set of 12 files: 6 in English and 6 in French, and within each language 3 recipes and 3 user manuals. If the most important categorization for your purposes is language, a useful naming scheme might be en_recipe_1.txt, fr_manual_3.txt, etc. If text type matters most, you should use recipe_en_1.txt instead.
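Here is how the 12-file example might play out in Python, as a rough sketch: the names are generated rather than read from disk, and selecting by prefix with startswith is just one simple way to pick out a category.

```python
from itertools import product

languages = ("en", "fr")
kinds = ("recipe", "manual")

# Language first, then text type, then number: 2 x 2 x 3 = 12 names.
names = sorted(
    f"{lang}_{kind}_{n}.txt"
    for lang, kind, n in product(languages, kinds, range(1, 4))
)

# Sorting groups the files by language first, so the French ones are contiguous...
print(names[:3])  # ['en_manual_1.txt', 'en_manual_2.txt', 'en_manual_3.txt']

# ...and selecting a whole category only requires looking at the prefix.
french = [name for name in names if name.startswith("fr_")]
print(len(french))  # 6
```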