Re: FW: Algorithm

From: Mark Davis (marked@best.com)
Date: Mon Mar 29 1999 - 12:20:08 EST

Next message: Annie Morin: "Re: [long] Use of Unicode in AbiWord"
Previous message: Michael Everson: "Re: (TC304.2141) Name of the Euro in European languages"
Next in thread: Paul Keinanen: "Re: FW: Algorithm"
Maybe reply: Paul Keinanen: "Re: FW: Algorithm"

Heuristics for distinguishing between ASCII-family encodings (ASCII, the
8859 series, etc.) and Unicode (UTF-8, UTF-16BE, UTF-16LE) are pretty easy.
They work well if you have a reasonable amount of data to analyse (a few
hundred bytes). [If you try to distinguish among all character sets
(Unicode, ASCII-family, EUC-family, EBCDIC-family, ISO 2022), it gets
quite complicated.]

Off the top of my head, here are some things to check for (others are
welcome to add to this):

UTF-8:
Any time you hit a byte with the high bit on, verify that byte and the
following bytes are in UTF-8 format (see page A-7 of TUS 2.0). If they
aren't, you are definitely not in UTF-8. If you hit a few such cases,
and they all correspond to the UTF-8 definition, you are probably in
UTF-8.
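
A minimal sketch of the UTF-8 check above, in Python. The function name,
the result labels and the min_sequences threshold are my own assumptions
for illustration. It uses the current four-byte definition of UTF-8 and
only checks lead and continuation byte structure, so it does not reject
every ill-formed sequence (some overlong forms slip through).

```python
def utf8_verdict(data: bytes, min_sequences: int = 3) -> str:
    """Return 'not-utf8', 'probably-utf8', or 'inconclusive'.

    min_sequences is an assumed threshold for "a few such cases".
    """
    i = 0
    n = len(data)
    sequences = 0  # well-formed multi-byte sequences seen so far
    while i < n:
        b = data[i]
        if b < 0x80:
            i += 1  # plain ASCII byte, tells us nothing
            continue
        # The lead byte fixes how many continuation bytes must follow.
        if 0xC2 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF4:
            need = 3
        else:
            return "not-utf8"  # stray continuation or invalid lead byte
        tail = data[i + 1:i + 1 + need]
        if any(not (0x80 <= t <= 0xBF) for t in tail):
            return "not-utf8"  # a following byte is not a continuation byte
        if len(tail) < need:
            break  # sample ends mid-sequence; do not count it either way
        sequences += 1
        i += 1 + need
    return "probably-utf8" if sequences >= min_sequences else "inconclusive"
```

Pure ASCII input comes back as 'inconclusive', since it never has a byte
with the high bit on and so gives no evidence either way.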

UTF-16BE/LE
One test you can use is whether the bytes, when taken in pairs,
correspond to assigned Unicode characters. If you are checking for a
particular version of Unicode (e.g. 2.1), this works very well--for
example, 4142 is not an assigned Unicode character in 2.1, but works fine
as ASCII "AB". However, this is fragile: if you are sent text in a future
version of Unicode, your test will fail.
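
A rough sketch of the assigned-character test, in Python. The function
name is hypothetical, and it checks against whatever Unicode version the
running Python bundles (see unicodedata.unidata_version), not 2.1. That
shows the fragility directly: U+4142 became an assigned CJK ideograph in
a later version of Unicode, so the "AB" example no longer trips this test
on a current runtime.

```python
import unicodedata


def fraction_assigned(data: bytes, byteorder: str = "big") -> float:
    """Share of 16-bit units in data that are assigned code points.

    "Assigned" here means general category other than Cn in the Unicode
    version bundled with this Python. A trailing odd byte is ignored.
    """
    units = [int.from_bytes(data[i:i + 2], byteorder)
             for i in range(0, len(data) - 1, 2)]
    if not units:
        return 0.0
    assigned = sum(unicodedata.category(chr(u)) != "Cn" for u in units)
    return assigned / len(units)
```

Call it once with byteorder="big" and once with "little"; a value close
to 1.0 for one byte order and clearly lower for the other is a hint, not
a proof.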

However, there are checks you can use for the likelihood of text being
UTF-16BE/LE:

- If you get a 00 byte (or other unusual control-character bytes) then
you are probably in UTF-16. SPACE (0020), TAB (0009), CR (000D) and LF
(000A) and common punctuation will often cause this to happen, even in
non-Latin texts.

- If you get lots of cases where every other byte is identical, you are
probably in UTF-16 (characters from the same script usually share the
same high byte).

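- When you hit the above cases, you can use the parity of the byte
index (even or odd) to distinguish between UTF-16BE and UTF-16LE:
counting from 0, a 00 or repeated byte at even offsets points to
UTF-16BE, and at odd offsets to UTF-16LE.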
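
The three checks above could be combined roughly as follows, in Python.
This is only a sketch: the function name, the minimum sample size and the
min_share cutoff are assumptions of mine, and a real detector would want
to weigh more evidence than this.

```python
from collections import Counter


def guess_utf16(data: bytes, min_share: float = 0.5) -> str:
    """Return 'utf-16be', 'utf-16le', or 'unknown'."""
    even = data[0::2]  # bytes at offsets 0, 2, 4, ...
    odd = data[1::2]   # bytes at offsets 1, 3, 5, ...
    if len(odd) < 8:
        return "unknown"  # too little data to say anything

    def top_share(part: bytes) -> float:
        # Fraction of positions taken by the single most common byte value.
        return Counter(part).most_common(1)[0][1] / len(part)

    zeros_even = even.count(0) / len(even)
    zeros_odd = odd.count(0) / len(odd)
    if zeros_even or zeros_odd:
        # 00 bytes present: the side holding more of them is the high byte.
        score_even, score_odd = zeros_even, zeros_odd
    else:
        # No 00 bytes: fall back to "every other byte is identical".
        score_even, score_odd = top_share(even), top_share(odd)
        if max(score_even, score_odd) < min_share:
            return "unknown"
    if score_even > score_odd:
        return "utf-16be"  # high byte at even offsets
    if score_odd > score_even:
        return "utf-16le"  # high byte at odd offsets
    return "unknown"
```

For example, guess_utf16("Hello, world! This is a test.".encode("utf-16-le"))
finds a 00 at every odd offset and reports 'utf-16le', while plain ASCII
text has no 00 bytes and no dominant repeated byte, so it comes back as
'unknown'.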

--
business: medavis2@us.ibm.com, mark@unicode.org
personal: mark@macchiato.com, http://www.macchiato.com
--

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:45 EDT