RE: How to distinguish UTF-8 from Latin-* ?

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jun 20 2000 - 16:19:32 EDT


Bob Rosenberg wrote:

> >
> >This was my concern, there is no way to distinguish UTF-8 from Latin-1 in
> >case of upper ASCII characters here.
>
> Yes there is - it's called a "Sanity Check". You parse the file looking for
> High-ASCII. If you find none - you are US-ASCII (or ISO-8859-1). Once you
> find one, you use the UTF-8 Suffix method to see how long the string should
> be IF it is UTF-8. Look at the next x characters to see if they have the
> correct suffix. If not, count as a Bad-UTF-8. If so, count as one
> Good-UTF-8. Once you roll off the end of the string resume scanning for
> another High-ASCII and do the check again. After finding 12 strings that
> start with High-ASCII (or bopping off the end of the file) check your
> GOOD/BAD counts. All BAD means ISO-8859-1. All GOOD means UTF-8.

Well, not necessarily. Granted, the distribution of precedent bytes and
successor bytes in UTF-8, when interpreted as ISO 8859-1, mostly results
in gibberish that is unlikely to appear in real text. The first byte of
a two-byte UTF-8 sequence is essentially an accented capital
letter in 8859-1 (0xC0..0xDF). And the successor bytes (0x80..0xBF) are
either C1 controls or come from the set of miscellaneous symbols, currency
signs, punctuation, etc., that are rather unlikely to occur directly
following an uppercase accented Latin letter.
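
To make the "gibberish" concrete, here is a minimal Python sketch that encodes a few accented letters as UTF-8 and then misreads the resulting bytes as ISO 8859-1. The helper name as_latin1 is made up for illustration and is not part of the original message.

```python
def as_latin1(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as ISO 8859-1."""
    return text.encode("utf-8").decode("latin-1")


for ch in "éñü":
    raw = ch.encode("utf-8")
    # Each of these letters becomes a lead byte in 0xC0..0xDF followed by
    # one successor byte in 0x80..0xBF.
    print(ch, raw.hex(" "), repr(as_latin1(ch)))
```

Each letter comes out as an accented capital followed by a symbol or a C1 control, a pairing that ordinary Latin-1 text seldom contains.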

But if I invented a hoity-toity company name with extra accents for
"class", such as LÂ·DÎ·DÀ® Productions, Inc., and sent this to you in
ISO 8859-1, as I am currently doing, your sanity check would fail in
this case and identify this file as UTF-8, with 3 characters misinterpreted.
(i.e., L<bullet>D<Greek letter eta>D. Productions, Inc.) Of course, a further
check for irregular UTF-8 sequences would discover that 0xC0 0xAE ==> U+002E
is not shortest-form UTF-8, and that the text might, therefore, not actually
be UTF-8, but even that cannot really be relied on.
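
The quoted sanity check and the shortest-form check can be sketched in Python as follows. This is one reading of the quoted description, not a definitive implementation: the function names and the sample count of 12 as a parameter are assumptions, and the byte string for the company name is reconstructed from the misreading described above (C2 B7, CE B7, C0 AE).

```python
def count_utf8_candidates(data: bytes, limit: int = 12):
    """Count high-byte runs that do or do not look like UTF-8 (structure only)."""
    good = bad = 0
    i = 0
    while i < len(data) and good + bad < limit:
        b = data[i]
        if b < 0x80:                 # plain ASCII, keep scanning
            i += 1
            continue
        # Number of successor bytes implied by the lead byte's bit pattern.
        if 0xC0 <= b <= 0xDF:
            need = 1
        elif 0xE0 <= b <= 0xEF:
            need = 2
        elif 0xF0 <= b <= 0xF7:
            need = 3
        else:                        # stray successor byte or invalid lead
            bad += 1
            i += 1
            continue
        trail = data[i + 1 : i + 1 + need]
        if len(trail) == need and all(0x80 <= t <= 0xBF for t in trail):
            good += 1
            i += 1 + need
        else:
            bad += 1
            i += 1
    return good, bad


def is_strict_utf8(data: bytes) -> bool:
    """Python's UTF-8 decoder also rejects non-shortest forms such as C0 AE."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Assumed bytes of the company name as sent in ISO 8859-1.
name = b"L\xc2\xb7D\xce\xb7D\xc0\xae Productions, Inc."
print(count_utf8_candidates(name))   # (3, 0): all GOOD, so "UTF-8"
print(is_strict_utf8(name))          # False: C0 AE is not shortest form
print(name.decode("latin-1"))        # the intended Latin-1 reading
```

The structural count reports three GOOD sequences and no BAD ones, so the heuristic alone would call this text UTF-8. Only the strict decode catches the overlong C0 AE, and a name without that last pair would pass both checks.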

> Mixed
> (with most being BAD) is ISO-8859-1 (the Goods are "noise"). Mostly Good
> with a few Bad are either malformed UTF-8 or ISO-8859-1 (with the bad luck
> of finding 2 byte strings that LOOK LIKE UTF-8).

Even entirely GOOD can have that bad luck, as this email itself
demonstrates.

--Ken


