Re: FW: Algorithm

From: Mark Davis
Date: Mon Mar 29 1999 - 12:20:08 EST

Heuristics for distinguishing between ASCII-family encodings (ASCII, 8859
series, etc) and Unicode (UTF-8, UTF-16BE, UTF-16LE) are pretty easy.
They work well if you have a reasonable amount of data to analyse (a few
hundred bytes). [If you try to distinguish among all character sets
(Unicode, ASCII-family, EUC-family, EBCDIC-family, ISO 2022), it gets
quite complicated.]

Off the top of my head, here are some things to check for (others are
welcome to add to this):

Any time you hit a byte with the high bit on, verify that byte and the
following bytes are in UTF-8 format (see page A-7 of TUS 2.0). If they
aren't, you are definitely not in UTF-8. If you hit a few such cases,
and they all correspond to the UTF-8 definition, you are probably in
UTF-8.
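
A rough sketch of that UTF-8 check in Python, for illustration only. The
function name, the return strings and the min_sequences cutoff are
made up here, and the lead-byte ranges follow the current UTF-8
definition rather than the table in TUS 2.0. It is simplified: it does
not reject every overlong or surrogate encoding.

```python
def looks_like_utf8(data: bytes, min_sequences: int = 3) -> str:
    """Classify a byte sample as 'not-utf8', 'probably-utf8' or 'undecided'."""
    i = 0
    n = len(data)
    multibyte = 0  # number of well-formed multi-byte sequences seen so far
    while i < n:
        b = data[i]
        if b < 0x80:
            # High bit off: plain ASCII byte, tells us nothing either way.
            i += 1
            continue
        # High bit on: the byte must be a legal lead byte...
        if 0xC2 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF4:
            need = 3
        else:
            return "not-utf8"
        trail = data[i + 1 : i + 1 + need]
        if len(trail) < need:
            # The sample was cut off in the middle of a sequence; stop here.
            break
        # ...followed by the right number of 10xxxxxx trail bytes.
        if any((t & 0xC0) != 0x80 for t in trail):
            return "not-utf8"
        multibyte += 1
        i += 1 + need
    return "probably-utf8" if multibyte >= min_sequences else "undecided"
```

Pure ASCII input comes back as "undecided", since it never exercises
the test.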

For UTF-16, one test you can use is whether the bytes, when taken in
pairs, correspond to assigned Unicode characters. If you are checking
for a particular version of Unicode (e.g. 2.1), this works very well--for
example, 4142 is not an assigned Unicode character in that version, but
works fine as ASCII "AB". However, this is fragile: if you are sent text
in a future version of Unicode, your test will fail.
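
As a sketch of the byte-pair test, assuming Python's unicodedata module
stands in for the table of assigned characters (the function name and
the choice to count private-use and surrogate values as "not assigned"
are mine). Note that unicodedata tracks whatever Unicode version ships
with the interpreter, not 2.1, so 4142 now counts as assigned; that is
exactly the fragility described above.

```python
import unicodedata

# General categories treated here as "not an assigned character":
# Cn = unassigned or noncharacter, Cs = surrogate, Co = private use.
NOT_ASSIGNED = {"Cn", "Cs", "Co"}

def assigned_fraction(data: bytes, big_endian: bool = True) -> float:
    """Fraction of 16-bit units in data that map to assigned characters."""
    order = "big" if big_endian else "little"
    # Take the bytes in pairs; a trailing odd byte is ignored.
    units = [
        int.from_bytes(data[i : i + 2], order)
        for i in range(0, len(data) - 1, 2)
    ]
    if not units:
        return 0.0
    assigned = sum(
        1 for u in units if unicodedata.category(chr(u)) not in NOT_ASSIGNED
    )
    return assigned / len(units)
```

A value near 1.0 for one byte order and noticeably lower for the other
is a hint, not proof.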

However, there are checks you can use for the likelihood of text being
UTF-16:

- If you get a 00 byte (or other unusual control-character bytes) then
you are probably in UTF-16. SPACE (0020), TAB (0009), CR (000D) and LF
(000A) and common punctuation will often cause this to happen, even in
non-Latin texts.

- If you get lots of cases where every other byte is identical, you are
probably in UTF-16.

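- When you hit the above cases, you can use the parity of the byte
index (even or odd) to distinguish between UTF-16BE and UTF-16LE: the
00 or repeated byte is the high byte, which comes first (even index) in
UTF-16BE and second (odd index) in UTF-16LE.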
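
The checks above might be sketched like this in Python. The names and
the 0.5 threshold are arbitrary choices for illustration, not tuned
values. The idea is to split the sample into even-index and odd-index
bytes and see whether one column is dominated by a single value (00 for
Latin text, some other high byte for many other scripts).

```python
from collections import Counter

def dominant_share(column: bytes) -> float:
    """Share of the column taken up by its single most common byte value."""
    if not column:
        return 0.0
    return Counter(column).most_common(1)[0][1] / len(column)

def guess_utf16(data: bytes, threshold: float = 0.5) -> str:
    """Return 'utf-16be', 'utf-16le' or 'undecided' for a byte sample."""
    even, odd = data[0::2], data[1::2]  # bytes at even / odd indices
    share_even = dominant_share(even)
    share_odd = dominant_share(odd)
    # Neither column is dominated by one value: nothing to go on.
    if max(share_even, share_odd) < threshold or share_even == share_odd:
        return "undecided"
    # The repeated byte is the high byte; it comes first in big-endian.
    return "utf-16be" if share_even > share_odd else "utf-16le"
```

For example, guess_utf16("Hello, world".encode("utf-16-le")) gives
"utf-16le", while ordinary ASCII text has no dominant column and comes
back "undecided".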

