How can you tell if a file is UTF-8 encoded or not?

Say you want to know if a particular file is encoded using UTF-8¹. On a UNIX box, you could just use the file command:

$ file *.txt
Housman - Loveliest of trees.txt:  ASCII English text
Millay - First fig.txt:  UTF-8 Unicode English text
Yeats - When You Are Old.txt:  ASCII English text

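Now, that looked wrong to me. I created the Housman & the Yeats files using vim, & vim is set to use UTF-8², so something is funny somewhere. (To be fair to file, a file that contains nothing but 7-bit ASCII characters is byte-for-byte identical whether you call it ASCII or UTF-8, so file reports the narrower label.)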
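One way to see why file says ASCII for those two files: ASCII is a subset of UTF-8, so a file only looks different as UTF-8 once it contains at least one byte above 0x7F. Here’s a rough sketch that counts such bytes. The function name high_bytes & the sample file names are made up for illustration.

```shell
# Count the bytes above 0x7F in a file. Zero means the file is pure ASCII,
# which is also valid UTF-8. (high_bytes is a made-up helper name.)
high_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Two throwaway sample files in a temporary directory.
dir=$(mktemp -d)
printf 'Loveliest of trees\n' > "$dir/ascii-only.txt"
printf 'caf\303\251\n' > "$dir/accented.txt"   # é is the two bytes C3 A9

high_bytes "$dir/ascii-only.txt"
high_bytes "$dir/accented.txt"
```

The first call should print 0 & the second 2, since only the accented file has any bytes outside the ASCII range.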

While poking around for a better way to tell whether a file is UTF-8, I discovered just the command I needed: isutf8. Yes, the name of the command is “is UTF8” all crammed together & lowercased, which certainly makes it easy to remember. It’s part of the moreutils package, which you can download & install. Here’s how I did it.

On my Linux box running Debian:

# apt-get install moreutils
…
Need to get 53.3 kB of archives.
After this operation, 188 kB of additional disk space will be used.
Get:1 http://http.us.debian.org/debian/ squeeze/main moreutils amd64 0.41 [53.3 kB]
Fetched 53.3 kB in 0s (163 kB/s)   
…

On my Mac, using Homebrew³:

$ brew install moreutils
==> Downloading http://mirrors.kernel.org/debian/pool/main/m/moreutils/moreutils_0.45.tar.gz
######################################################################## 100.0%
==> make isutf8 ifne pee sponge mispipe lckdo parallel
/usr/local/Cellar/moreutils/0.45: 15 files, 148K, built in 3 seconds

Now that isutf8 was installed, I tried again to see if those text files were UTF-8:

$ isutf8 *.txt
$

That’s right: nothing. As it should be. In typical UNIX fashion, no news is good news: isutf8 prints a line only for files that are NOT valid UTF-8. All three text files were in fact UTF-8, so the command printed nothing.
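Because isutf8 stays quiet on success, a script has to go by the exit status instead, which, as far as I know, is zero when the file checks out & non-zero otherwise. Below is a small sketch along those lines. check_utf8 is a hypothetical wrapper, & the fallback to iconv (converting UTF-8 to UTF-8 purely to see whether it chokes) is my own assumption for machines without moreutils, not something isutf8 does itself.

```shell
# Succeed (exit 0) if the file is valid UTF-8, fail otherwise.
# check_utf8 is a made-up name; the iconv branch is a fallback assumption.
check_utf8() {
    if command -v isutf8 >/dev/null 2>&1; then
        isutf8 "$1" >/dev/null 2>&1
    else
        iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
    fi
}

# Report only the files that fail, in the same no-news-is-good-news spirit.
for f in *.txt; do
    [ -e "$f" ] || continue      # skip the literal pattern if nothing matched
    check_utf8 "$f" || echo "$f: not valid UTF-8"
done
```

Run in a directory where every .txt file is valid UTF-8, the loop prints nothing at all.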

Let’s see what happens with some other files:

$ isutf8 *
Messenger Bags.numbers: line 1, char 1, byte offset 12: invalid UTF-8 code
Student Paper.doc: line 1, char 1, byte offset 1: invalid UTF-8 code
Tix.jpg: line 1, char 1, byte offset 1: invalid UTF-8 code

Yep. Those were definitely not UTF-8 encoded.
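It’s no surprise that binary files fail right away. A JPEG, for instance, always begins with the bytes FF D8, & 0xFF can never occur anywhere in valid UTF-8. If you’re curious what a file starts with, od will show you. A small sketch follows. first_bytes is a made-up helper, & the sample file is a fake four-byte stand-in for a JPEG header rather than my actual Tix.jpg.

```shell
# Print the first four bytes of a file as hex (first_bytes is a made-up name).
first_bytes() {
    od -An -tx1 -N4 "$1" | tr -d ' \n'
    echo
}

# A stand-in file holding only the usual JPEG/JFIF opening bytes.
sample=$(mktemp)
printf '\377\330\377\340' > "$sample"

first_bytes "$sample"
```

That should print ffd8ffe0, & the leading ff alone is enough to rule out UTF-8.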

I don’t think I’ll be using isutf8 constantly, but it’s sure a handy little tool to have around.⁴

¹ If you don’t know what UTF-8 is, read the Wikipedia article. Here’s the upshot: you want all your text editors & operating systems & web browsers to support & use UTF-8 by default. It makes life a lot easier.
² By putting set enc=utf-8 in my .vimrc file, of course.
³ What? You’re not using Homebrew? Head over to https://github.com/mxcl/homebrew & get that sucker installed! It’s far better than fink or MacPorts. More on Homebrew some other time.
⁴ Eagle-eyed readers might have noticed a list of software packages that were installed along with isutf8 when I gave the Homebrew listing. Looking over the list at the moreutils site, I think I’m going to have a lot to play with & write about over the coming months:

chronic: runs a command quietly unless it fails
combine: combine the lines in two files using boolean operations
ifdata: get network interface info without parsing ifconfig output
ifne: run a program if the standard input is not empty
isutf8: check if a file or standard input is utf-8
lckdo: execute a program with a lock held
mispipe: pipe two commands, returning the exit status of the first
parallel: run multiple jobs at once
pee: tee standard input to pipes
sponge: soak up standard input and write to a file
ts: timestamp standard input
vidir: edit a directory in your text editor
vipe: insert a text editor into a pipe
zrun: automatically uncompress arguments to command