RE: How to distinguish UTF-8 from Latin-* ?

From: Robert A. Rosenberg (bob.rosenberg@digitscorp.com)
Date: Thu Jun 22 2000 - 12:37:35 EDT


At 12:12 PM 06/20/2000 -0800, Kenneth Whistler wrote:
>Bob Rosenberg wrote:
>
> > >
> > >This was my concern, there is no way to distinguish UTF-8 from Latin-1 in
> > >case of upper ASCII characters here.
> >
> > Yes there is - it's called a "Sanity Check". You parse the file looking for
> > High-ASCII. If you find none - you are US-ASCII (or ISO-8859-1). Once you
> > find one, you use the UTF-8 Suffix method to see how long the string should
> > be IF it is UTF-8. Look at the next x characters to see if they have the
> > correct suffix. If not, count as a Bad-UTF-8. If so, count as one
> > Good-UTF-8. Once you roll off the end of the string resume scanning for
> > another High-ASCII and do the check again. After finding 12 strings that
> > start with High-ASCII (or bopping off the end of the file) check your
> > GOOD/BAD counts. All BAD means ISO-8859-1. All GOOD means UTF-8.
>
>Well, not necessarily. Granted, the distribution of precedent bytes and
>successor bytes in UTF-8, when interpreted as ISO 8859-1, mostly results
>in gibberish that is unlikely to appear in real text. The first byte of
>a two-byte UTF-8 sequence consists essentially of an accented capital
>letter in 8859-1 (0xC0..0xDF). And the successor bytes are either C1
>controls or come from the set of miscellaneous symbols, currency signs,
>punctuation, etc., that are rather unlikely to occur directly following
>an uppercase accented Latin letter.
>
>But if I invented a hoity-toity company name with extra accents for
>"class", such as, LÂ·DÎ·DÀ® Productions, Inc. and sent this to you in
>ISO 8859-1, as I am currently doing, your sanity check will fail in
>this case and identify this file as UTF-8, with 3 characters misinterpreted.
>(i.e., L<bullet>D<Greek letter eta>D. Productions, Inc.) Of course, a
>further check for irregular sequence UTF-8 would discover that
>0xC0 0xAE ==> U+002E is
>not shortest form UTF-8, and might, therefore, not actually be UTF-8,
>but even that cannot really be relied on.

True, you can FAKE an incorrect evaluation by plugging a trick string into
an otherwise low-ASCII file/message. My comment was aimed at normal (not
faked) files. I agree that I missed the extra sanity check of looking for
the shortest form, but if I remember the rules correctly, there is no
requirement that the shortest form be emitted - only a strong suggestion to
do so (with a stronger suggestion to accept a non-shortest form [i.e., "Be
liberal with what you accept and conservative with what you create"]). I
doubt that a real ISO-8859-1 file could be mistaken for a UTF-8 one without
being specially constructed to trick the sanity check. Note that the
12-string "universe" is just an attempt to guard against false positives
and could be adjusted for circumstances.
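
To make the shortest-form point concrete, here is a minimal sketch of the
arithmetic for two-byte sequences only. The function names are invented for
illustration and do not come from any library; the rule assumed is simply
that a two-byte sequence is shortest form only when it encodes U+0080 or
above.

```python
def decode_two_byte(lead: int, trail: int) -> int:
    """Decode a 2-byte UTF-8 sequence with no shortest-form check (illustrative helper)."""
    if not (0xC0 <= lead <= 0xDF and 0x80 <= trail <= 0xBF):
        raise ValueError("not a structurally valid 2-byte sequence")
    # 5 payload bits from the lead byte, 6 from the trailing byte.
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def is_shortest_form(lead: int, trail: int) -> bool:
    """Assumed rule: anything below U+0080 fits in one byte, so two bytes is overlong."""
    return decode_two_byte(lead, trail) >= 0x80


if __name__ == "__main__":
    for lead, trail in [(0xC0, 0xAE), (0xC2, 0xB7), (0xCE, 0xB7)]:
        cp = decode_two_byte(lead, trail)
        print(f"{lead:02X} {trail:02X} -> U+{cp:04X}, "
              f"shortest form: {is_shortest_form(lead, trail)}")
```

With this check added, 0xC0 0xAE decodes to U+002E but is flagged as not
shortest form, while 0xC2 0xB7 and 0xCE 0xB7 pass. Whether a detector should
count such a sequence as BAD is the policy question discussed above.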

> > Mixed
> > (with most being BAD) is ISO-8859-1 (the Goods are "noise"). Mostly Good
> > with a few Bad are either malformed UTF-8 or ISO-8859-1 (with the bad luck
> > of finding 2 byte strings that LOOK LIKE UTF-8).
>
>Even entirely GOOD can have that bad luck, as this email itself
>demonstrates.

Since this is a special message that was designed to spoof, not a real
message, I do not regard it as bad luck. If you can supply a sample of
normal text that would give a false reading, I'd be much more willing to
say that my claim of just doing a sanity check was overly simplistic.
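
For reference, the sanity check as described in the quoted text could be
sketched roughly as follows. This is only one reading of the description,
under several assumptions: all names are invented, the scan resumes one byte
after a BAD lead byte, a tie between GOOD and BAD is treated as ambiguous,
and no shortest-form check is made. The spoof bytes in the demo are
reconstructed from the description in the quote (bullet U+00B7, eta U+03B7,
and 0xC0 0xAE), so they are an assumption too.

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0  # bare continuation byte (0x80..0xBF) or 0xF8..0xFF


def sanity_check(data: bytes, max_samples: int = 12):
    """Count GOOD/BAD high-byte sequences, stopping after max_samples of them."""
    good = bad = 0
    i = 0
    while i < len(data) and good + bad < max_samples:
        if data[i] < 0x80:          # plain ASCII, keep scanning
            i += 1
            continue
        n = expected_length(data[i])
        tail = data[i + 1:i + n]
        if n and len(tail) == n - 1 and all(0x80 <= b <= 0xBF for b in tail):
            good += 1
            i += n                  # roll off the end of the sequence
        else:
            bad += 1
            i += 1                  # assumption: resume at the next byte
    return good, bad


def classify(good: int, bad: int) -> str:
    """Map the counts to a verdict following the rules in the quoted text."""
    if good == 0 and bad == 0:
        return "US-ASCII"
    if good == 0:
        return "ISO-8859-1"
    if bad == 0:
        return "UTF-8"
    if bad > good:
        return "ISO-8859-1 (GOOD hits are noise)"
    return "ambiguous: malformed UTF-8 or unlucky ISO-8859-1"


if __name__ == "__main__":
    samples = [
        b"L\xc2\xb7D\xce\xb7D\xc0\xae Productions, Inc.",  # the spoof, as ISO-8859-1 bytes
        "caf\u00e9 au lait".encode("latin-1"),
        "caf\u00e9 au lait".encode("utf-8"),
    ]
    for raw in samples:
        print(raw, "->", classify(*sanity_check(raw)))
```

Run on the spoof bytes, this sketch counts three GOOD and no BAD and so
reports UTF-8, which is exactly the false reading at issue. The ordinary
Latin-1 sample gives one BAD and is reported as ISO-8859-1.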

>--Ken



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:04 EDT