Chainsaw on a Tire Swing

Blogging with teeth!

How can you tell if a file is UTF-8 encoded or not?

Say you want to know if a particular file is encoded using UTF-8 [1]. On a UNIX box, you could just use the file command:

$ file *.txt
Housman - Loveliest of trees.txt:  ASCII English text
Millay - First fig.txt:  UTF-8 Unicode English text
Yeats - When You Are Old.txt:  ASCII English text

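Now, I know that’s not the whole story. I created the Housman & the Yeats files using vim, & vim is set to use UTF-8 [2], so something is funny somewhere. (To be fair to file, a text file that contains nothing but ASCII characters is byte-for-byte identical in ASCII & UTF-8, so it reports the narrower label.)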
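If you want to see the overlap between ASCII & UTF-8 for yourself, here is a small Python sketch (Python is my choice for illustration; it isn't used anywhere else in this post). The sample strings are made up for the demonstration, & this is only a guess at why file answered the way it did, not a look at how file works internally.

```python
# Pure-ASCII text encodes to exactly the same bytes in ASCII and in UTF-8,
# so a tool that only inspects bytes cannot tell the two encodings apart.
ascii_only = "Loveliest of trees, the cherry now"  # illustrative sample text
with_accent = "caf\u00e9"  # hypothetical sample with one non-ASCII character

same_bytes = ascii_only.encode("ascii") == ascii_only.encode("utf-8")
print(same_bytes)  # True

# A non-ASCII character needs a multi-byte sequence in UTF-8,
# which is what gives a byte-sniffing tool something to detect.
print(with_accent.encode("utf-8"))  # b'caf\xc3\xa9'
```

So a file that happens to contain only ASCII characters is also perfectly valid UTF-8, whatever label a detection tool puts on it.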

In poking around to try to figure out a better method to find out if a file is UTF-8 or not, I discovered just the command I needed: isutf8. Yes, the name of the command is “is UTF8” all crammed together & lowercased, which certainly makes it easy to remember. It’s part of the moreutils package that you can download & install. Here’s how I did it.

On my Linux box running Debian:

# apt-get install moreutils
Need to get 53.3 kB of archives.
After this operation, 188 kB of additional disk space will be used.
Get:1 http://http.us.debian.org/debian/ squeeze/main moreutils amd64 0.41 [53.3 kB]
Fetched 53.3 kB in 0s (163 kB/s)

On my Mac, using Homebrew [3]:

$ brew install moreutils
==> Downloading http://mirrors.kernel.org/debian/pool/main/m/moreutils/moreutils_0.45.tar.gz
######################################################################## 100.0%
==> make isutf8 ifne pee sponge mispipe lckdo parallel
/usr/local/Cellar/moreutils/0.45: 15 files, 148K, built in 3 seconds

Now that isutf8 was installed, I tried again to see if those text files were UTF-8:

$ isutf8 *.txt
$ 

That’s right—nothing. As it should be. In typical UNIX fashion, no news is good news, & means that the command did NOT find any files that were NOT UTF-8. Or, to put it another way, all three text files were in fact UTF-8, so the command did nothing.

Let’s see what happens with some other files:

$ isutf8 *
Messenger Bags.numbers: line 1, char 1, byte offset 12: invalid UTF-8 code
Student Paper.doc: line 1, char 1, byte offset 1: invalid UTF-8 code
Tix.jpg: line 1, char 1, byte offset 1: invalid UTF-8 code

Yep. Those were definitely not UTF-8 encoded.
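For a rough idea of what a check like this involves, here is a minimal Python sketch. It is not isutf8's actual implementation (I haven't read its source); it just tries to decode the bytes as UTF-8 & reports where the first failure happens. The function name first_invalid_utf8_offset is made up for illustration, & its offsets are 0-based, which may not match the numbering isutf8 uses in its messages.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the 0-based byte offset of the first invalid UTF-8 sequence,
    or None if the whole input decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None


# Valid UTF-8 (plain ASCII is a subset of it): nothing to report.
print(first_invalid_utf8_offset("When You Are Old".encode("utf-8")))  # None

# Bytes that start like a JPEG header are not valid UTF-8 from the first byte.
print(first_invalid_utf8_offset(b"\xff\xd8\xff\xe0"))  # 0
```

To run it against a real file, you would read the file in binary mode (open(path, "rb")) & pass the resulting bytes to the function.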

I don’t think I’ll be using isutf8 constantly, but it’s sure a handy little tool to have around. [4]

  1. If you don’t know what UTF-8 is, read the Wikipedia article. Here’s the upshot: you want all your text editors & operating systems & web browsers to support & use UTF-8 by default. It makes life a lot easier.

  2. By putting set enc=utf-8 in my .vimrc file, of course.

  3. What? You’re not using Homebrew? Head over to https://github.com/mxcl/homebrew & get that sucker installed! It’s far better than fink or MacPorts. More on Homebrew some other time.

  4. Eagle-eyed readers might have noticed a list of software packages that were installed along with isutf8 when I gave the Homebrew listing. Looking over the list at the moreutils site, I think I’m going to have a lot to play with & write about over the coming months:

     • chronic: runs a command quietly unless it fails
     • combine: combine the lines in two files using boolean operations
     • ifdata: get network interface info without parsing ifconfig output
     • ifne: run a program if the standard input is not empty
     • isutf8: check if a file or standard input is utf-8
     • lckdo: execute a program with a lock held
     • mispipe: pipe two commands, returning the exit status of the first
     • parallel: run multiple jobs at once
     • pee: tee standard input to pipes
     • sponge: soak up standard input and write to a file
     • ts: timestamp standard input
     • vidir: edit a directory in your text editor
     • vipe: insert a text editor into a pipe
     • zrun: automatically uncompress arguments to command
