I think I’ve finally got UTF-8 support across the board in my HTML Editor, the custom software that I use to maintain this website. This is an improvement on my lame-ass "paletted text" approach [see previous entry], which worked fine except when I tried to run clean-up functions on the code and everything unusual got stripped out. I noticed my RSS feed was being exported with all non-ASCII characters removed, and decided it was about time I implemented at least minimal UTF-8 support.
UTF-8 and Unicode
UTF-8 is a method for encoding Unicode text [UTF-8 stands for Unicode Transformation Format, 8-bit encoding], so to understand UTF-8, you first need to know what Unicode is.
Unicode is fast becoming the standard for encoding characters from all written languages. In a nutshell it is simply an agreement by all parties that any given character from any given language will always be represented by the same numeric value. From the Unicode web site:
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
For example, the Greek character α [alpha] is always represented by the numeric value 945 [hexadecimal: 3B1, Unicode notation: U+03B1].
The problem with Unicode is that there are a rather large number of characters which can be represented, so it’s hard to know how many bytes should be used to store a character digitally. Growing up with ASCII [American Standard Code for Information Interchange] we programmers have become pretty comfortable with the convenience of using a single byte per character, but with only 256 possible combinations, this is woefully inadequate for the richness that is Unicode. Two bytes seemed sufficient a few years back — allowing over 65,000 characters — but Unicode has already outgrown that limit. Three bytes would be plenty — with more than 16 million possible values — but we programmers hate numbers which aren’t powers of two! So the next viable option is four bytes per character; more than four billion possible values, more than we’ll ever need, and in general an enormous waste of space.
This is where UTF-8 comes in. The design philosophy of UTF-8 can be paraphrased: ASCII is compact and ubiquitous, so let’s try to keep it, but let’s also add extensions to support the rest of the scripts of the world.
This is achieved by introducing variable byte lengths for characters, with high bits used to signal "extended" characters. For standard 7-bit ASCII characters [U+0000 to U+007F] ASCII and UTF-8 are identical; eg: "cat" translates to the same three bytes in both ASCII and UTF-8.
Character Range (hex)    Unicode (UCS-2/UTF-16)    UTF-8
0-7F                     00000000 0xxxxxxx         0xxxxxxx
80-7FF                   00000xxx xxxxxxxx         110xxxxx 10xxxxxx
800-FFFF                 xxxxxxxx xxxxxxxx         1110xxxx 10xxxxxx 10xxxxxx
10000-1FFFFF             - out of range -          11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
200000-3FFFFFF           - out of range -          111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
4000000-7FFFFFFF         - out of range -          1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
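To make those bit patterns concrete, here is a rough sketch in C of how a code point might be packed into UTF-8. The function name and interface are mine, purely for illustration; it covers the first four rows of the table, and the 5- and 6-byte rows follow exactly the same pattern.

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch only: pack one code point into buf using the bit patterns from
       the table above. Returns the number of bytes written, or 0 if the value
       falls outside the 4-byte range handled here. */
    static size_t utf8_encode(uint32_t cp, unsigned char buf[4])
    {
        if (cp < 0x80) {                      /* 0xxxxxxx */
            buf[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {              /* 110xxxxx 10xxxxxx */
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {            /* 1110xxxx 10xxxxxx 10xxxxxx */
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else if (cp < 0x200000) {           /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        return 0;                             /* 5/6-byte forms left out of this sketch */
    }

Feeding it 945 [U+03B1, our α from earlier] produces the two bytes CE B1.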
Note that all bytes of multi-byte UTF-8 characters have the high-bit set to one, and only the first byte of a multi-byte character has both its highest bits set. This means there can never be confusion about where a character starts. So in UTF-8, the combined Greek and Latin sequence aβcδe is represented by the following seven bytes, and looking at the high bits you can pick out the extended characters without too much trouble:
01100001 11001110 10110010 01100011 11001110 10110100 01100101
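If you want to let the computer do the squinting, here is a hypothetical little routine [again in C, names my own] that walks a UTF-8 byte string and reports which bytes start a character and which merely continue one [a byte of the form 10xxxxxx can only ever be a continuation byte].

    #include <stdio.h>

    /* Walk a UTF-8 string and flag which bytes begin a character: bytes of the
       form 10xxxxxx (0x80..0xBF) are continuation bytes, everything else
       starts a new character. */
    static void show_boundaries(const unsigned char *s)
    {
        for (; *s; s++) {
            int is_start = (*s & 0xC0) != 0x80;
            printf("%02X %s\n", *s, is_start ? "start" : "continuation");
        }
    }

    int main(void)
    {
        /* The aβcδe example: 61 CE B2 63 CE B4 65 */
        show_boundaries((const unsigned char *)"a\xCE\xB2" "c\xCE\xB4" "e");
        return 0;
    }

Run on the seven bytes above, it prints "start" for every byte except B2 and B4.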
Now the really clever bit about UTF-8 is that it is capable of passing unharmed through ASCII-only systems [programs which don’t even recognize UTF-8], thanks to the fact that each character beyond U+007F looks like a valid sequence of extended ASCII characters when read byte by byte. This is in stark contrast to other Unicode encodings such as UCS-2, which are full of zero bytes and therefore wreak havoc with ASCII processing systems. To an ASCII system, the seven bytes of aβcδe simply parse as seven extended-ASCII characters, the Greek letters showing up as gibberish like Î² and Î´ [assuming a Latin-1 style code page]. On the surface this may seem like corruption, but the important thing to note is that the bytes of a multi-byte sequence never stray into the plain ASCII range [no spurious NULs, newlines or other characters that ASCII software treats specially], and so the same string can be read and written out again as raw ASCII and then decoded later as the original UTF-8. With the exception of 7-bit text systems [a legacy email standard, unfortunately, for which the hideous UTF-7 had to be invented] UTF-8 should be able to pass through ASCII systems unscathed.
Cons
Although UTF-8 is an incredibly useful and largely backwards-compatible method of encoding the ever-growing Unicode character set, there are a few things to watch out for:
1. Not all byte sequences can be interpreted as valid UTF-8. Some might think this a good thing — like built-in data verification — but I find it simply annoying because it means you have to do error checking as you read a UTF-8 string to make sure it conforms to the rules, and if it doesn’t then you have to decide on an appropriate response to this error. For example, [C1, 34] is an invalid UTF-8 sequence because it has a lead byte which implies a two-byte character, and yet the following byte does not have its high bit set. I for one don’t want to reject text files where such invalid codes appear, and yet there is no single approach to dealing with them.
2. Conversely, there are multiple ways of encoding the same value in UTF-8, such that a naive parser will not notice the difference. This causes security risks, because a character can "sneak through" disguised as a higher value. Especially dangerous here are NULs, slashes, ampersands, percent signs and other symbols commonly given special treatment in software. The ampersand for instance should always be encoded with the single byte [26], but could easily be encoded as [C0, A6] or even [E0, 80, A6] in an attempt to slip by dodgy parsers [like mine for example]. Technically, such overlong sequences are illegal, but again the onus is on the software to check for them [a decoder that catches both this and the previous problem is sketched after this list].
3. UTF-8 is slower to process, by virtue of requiring any processing at all. This is unlike ASCII, which can be read directly into memory byte for byte. To find the fifth character of an ASCII string one simply reads the fifth byte, whereas UTF-8 requires every character up to the fifth to be read in order to establish how many bytes each one occupies. This can be a pain.
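For what it’s worth, here is the sort of checking a decoder ends up doing to catch both problems, sketched in C. The function name, interface and choice of error handling are mine and purely illustrative: it rejects malformed sequences like [C1, 34] by insisting on proper continuation bytes, and rejects overlong forms like [C0, A6] by insisting that the decoded value actually needed as many bytes as it used.

    #include <stdint.h>
    #include <stddef.h>

    /* Decode one UTF-8 character starting at s (length n), storing the code
       point in *cp and returning the number of bytes consumed, or 0 if the
       sequence is invalid: bad lead byte, missing/bad continuation bytes, or
       an overlong encoding such as C0 A6 for '&'. */
    static size_t utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *cp)
    {
        size_t len, i;
        uint32_t v, min;

        if (n == 0) return 0;

        if      (s[0] < 0x80) { *cp = s[0]; return 1; }
        else if (s[0] < 0xC0) return 0;                      /* stray continuation byte */
        else if (s[0] < 0xE0) { len = 2; v = s[0] & 0x1F; min = 0x80;    }
        else if (s[0] < 0xF0) { len = 3; v = s[0] & 0x0F; min = 0x800;   }
        else if (s[0] < 0xF8) { len = 4; v = s[0] & 0x07; min = 0x10000; }
        else return 0;                                       /* 5/6-byte forms not handled in this sketch */

        if (n < len) return 0;                               /* truncated sequence */

        for (i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80) return 0;             /* continuation byte expected */
            v = (v << 6) | (s[i] & 0x3F);
        }

        if (v < min) return 0;                               /* overlong: value fits in fewer bytes */

        *cp = v;
        return len;
    }

Finding the fifth character of a string is then a matter of calling something like this in a loop and adding up the byte counts as you go, which is exactly the extra work point 3 complains about.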
Points 1 and 2 are the biggest drawbacks for me. If UTF-8’s invalid and ambiguous byte sequences could be collapsed, I think it would be a brilliant encoding scheme. Sadly, point 2 could have been easily avoided with only a minor modification to the specification. Point 1 is trickier though.
Is it possible to devise a similar variable byte length encoding scheme where every conceivable byte sequence can be unambiguously interpreted as a valid character sequence, and every character sequence can be represented by one and only one byte sequence? I think probably yes, but it’s a bit late to worry about that now. [unfortunately it is impossible for such an encoding to also guarantee that f(A+B) will always be equivalent to f(A) + f(B), where A and B are arbitrary bytestreams and f is a string decoding function]
As it is, UTF-8 is the best we’ve got: It’s supported by almost everyone, it’s fairly easy to parse, and it replaces a hideously parochial code-page system, which benefit alone can hardly be overstated.