Re: FW: Algorithm

From: Mark Davis (marked@best.com)
Date: Mon Mar 29 1999 - 12:20:08 EST


Heuristics for distinguishing between ASCII-family encodings (ASCII, 8859
series, etc.) and Unicode (UTF-8, UTF-16BE, UTF-16LE) are pretty easy.
They work well if you have a reasonable amount of data to analyse (a few
hundred bytes). [If you try to distinguish among all character sets
(Unicode, ASCII-family, EUC-family, EBCDIC-family, ISO 2022), it gets
quite complicated.]

Off the top of my head, here are some things to check for (others are
welcome to add to this):

UTF-8:
Any time you hit a byte with the high bit on, verify that byte and the
following bytes are in UTF-8 format (see page A-7 of TUS 2.0). If they
aren't, you are definitely not in UTF-8. If you hit a few such cases,
and they all correspond to the UTF-8 definition, you are probably in
UTF-8.
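
A minimal sketch of the UTF-8 check above, in Python. The function names and the `min_sequences` threshold are illustrative assumptions, not part of any standard. It checks only the lead-byte/continuation-byte structure for sequences of up to four bytes; it does not reject overlong forms or encoded surrogates, which a strict validator would.

```python
def utf8_multibyte_count(data: bytes):
    """Count well-formed multi-byte sequences in data.

    Returns None as soon as a high-bit byte breaks the UTF-8
    lead/continuation pattern ("definitely not UTF-8").
    """
    count = 0
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:            # plain ASCII byte, nothing to verify
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:   # 110xxxxx: one continuation byte follows
            trail = 1
        elif 0xE0 <= b <= 0xEF: # 1110xxxx: two continuation bytes follow
            trail = 2
        elif 0xF0 <= b <= 0xF7: # 11110xxx: three continuation bytes follow
            trail = 3
        else:                   # stray continuation byte or invalid lead byte
            return None
        seq = data[i + 1 : i + 1 + trail]
        # Every continuation byte must look like 10xxxxxx.
        # Note: a sample cut off mid-sequence is also reported as invalid.
        if len(seq) < trail or any((c & 0xC0) != 0x80 for c in seq):
            return None
        count += 1
        i += 1 + trail
    return count


def looks_like_utf8(data: bytes, min_sequences: int = 3) -> bool:
    """'Probably UTF-8': no violations and at least a few multi-byte hits."""
    count = utf8_multibyte_count(data)
    return count is not None and count >= min_sequences
```

Pure ASCII input yields a count of 0, so it is never rejected but also never reported as "probably UTF-8"; that matches the idea that only high-bit bytes carry evidence either way.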

UTF-16BE/LE:
One test you can use is whether the bytes, when taken in pairs,
correspond to assigned Unicode characters. If you are checking for a
particular version of Unicode (e.g. 2.1), this works very well--for
example, 4142 is not an assigned Unicode character in that version, but
works fine as ASCII "AB". However, this is fragile: if you are sent text
in a future version of Unicode, your test will fail.
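
A rough sketch of the assigned-character test in Python. It relies on the standard `unicodedata` module, whose tables reflect whichever Unicode version the Python build ships (see `unicodedata.unidata_version`), not Unicode 2.1, so it is only an approximation of a version-pinned check. The function name `assigned_fraction` is an illustrative assumption.

```python
import unicodedata


def assigned_fraction(data: bytes, big_endian: bool = True) -> float:
    """Fraction of 16-bit units that are assigned code points.

    General category "Cn" means unassigned (this includes noncharacters).
    Surrogates ("Cs") and private-use characters ("Co") are counted as
    assigned here, which is a simplification.
    """
    order = "big" if big_endian else "little"
    # Take the bytes in pairs; a trailing odd byte is ignored.
    units = [
        int.from_bytes(data[i : i + 2], order)
        for i in range(0, len(data) - 1, 2)
    ]
    if not units:
        return 0.0
    assigned = sum(1 for u in units if unicodedata.category(chr(u)) != "Cn")
    return assigned / len(units)
```

Because the tables track the current standard, far fewer 16-bit values are unassigned than in 2.1, so the test discriminates less well today. That is the fragility described above, seen from the other direction.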

However, there are checks you can use for the likelihood of text being
UTF-16BE/LE:

- If you get a 00 byte (or other unusual control-character bytes) then
you are probably in UTF-16. SPACE (0020), TAB (0009), CR (000D) and LF
(000A) and common punctuation will often cause this to happen, even in
non-Latin texts.

- If you get lots of cases where every other byte is identical, you are
probably in UTF-16.

- When you hit the above cases, you can use the parity of the byte
index (even or odd) to distinguish between UTF-16BE and UTF-16LE.
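
The three checks above can be sketched together in Python. The function name `guess_utf16` and the `min_share` threshold are illustrative assumptions, and the byte index is counted from 0, so in UTF-16BE the high byte of each pair sits at an even index. This is a heuristic only: binary data or unusual text can fool it.

```python
from collections import Counter


def guess_utf16(data: bytes, min_share: float = 0.4):
    """Return "UTF-16BE", "UTF-16LE", or None if there is no clear signal."""
    even, odd = data[0::2], data[1::2]   # bytes at even / odd indexes
    if not even or not odd:
        return None

    # Check 1: 00 bytes. In UTF-16 they are the high byte of characters
    # such as SPACE, CR and LF, so their parity reveals the byte order.
    zeros_even, zeros_odd = even.count(0), odd.count(0)
    if zeros_even != zeros_odd:
        return "UTF-16BE" if zeros_even > zeros_odd else "UTF-16LE"

    # Check 2: every other byte identical. Measure how dominant the most
    # common byte value is at each parity.
    even_share = Counter(even).most_common(1)[0][1] / len(even)
    odd_share = Counter(odd).most_common(1)[0][1] / len(odd)
    if max(even_share, odd_share) < min_share or even_share == odd_share:
        return None

    # Check 3: the repeated byte is the high byte; its parity gives BE/LE.
    return "UTF-16BE" if even_share > odd_share else "UTF-16LE"
```

For example, Greek text encoded as UTF-16LE contains no 00 bytes if it has no spaces, but nearly every odd-index byte is 03, so the second check still identifies it.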

--
business: medavis2@us.ibm.com, mark@unicode.org
personal: mark@macchiato.com, http://www.macchiato.com
--


