Re: How to distinguish UTF-8 from Latin-* ?

From: Daniel Biddle (deltab@osian.net)
Date: Wed Jun 21 2000 - 02:27:17 EDT


On Tue, 2000-06-20, Doug Ewell wrote:

> Kenneth Whistler <kenw@sybase.com> wrote:
>
> > But if I invented a hoity-toity company name with extra accents for
> > "class", such as, LÂ·DÏ·DÀ® Productions, Inc. and sent this to you in
> > ISO 8859-1, as I am currently doing, your sanity check will fail in
> > this case and identify this file as UTF-8, with 3 characters
> > misinterpreted.
>
> Still, you have to admit this is an extremely contrived case.

A much less contrived case, suggested by someone on this list a while ago:
it's easy to find Web pages containing "NESCAFÉ" or "NESTLÉ" in upper-case
letters, so there's a good chance that plain text files exist right now
containing "NESCAFÉ®" or "NESTLÉ®"; both of these strings could be
misinterpreted as containing a valid UTF-8 byte sequence.
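
To make the byte-level reason concrete, here is a rough Python sketch. The
helper name looks_like_utf8 is made up for illustration, and it assumes the
"sanity check" is simply "does the whole buffer decode as UTF-8?", which may
be cruder than what Doug has in mind. In ISO 8859-1, "É" is byte 0xC9 and
"®" is byte 0xAE; together they happen to form a well-formed two-byte UTF-8
sequence.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical sanity check: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ISO 8859-1 bytes: 0xC9 is E-acute, 0xAE is the registered sign.
samples = {
    "NESCAFE-acute alone": b"NESCAF\xc9",
    "NESCAFE-acute + (R)": b"NESCAF\xc9\xae",
    "NESTLE-acute + (R)": b"NESTL\xc9\xae",
}

for label, raw in samples.items():
    print(label, "->", looks_like_utf8(raw))

# 0xC9 0xAE read as UTF-8 is the single code point U+026E.
misread = b"NESCAF\xc9\xae".decode("utf-8")
print([hex(ord(ch)) for ch in misread[-1:]])
```

With 0xC9 alone at the end of the string the check rejects the data, because
a lead byte with no continuation byte is not valid UTF-8. Add the registered
sign and the same check accepts it, reading the two Latin-1 characters as one.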

-- 
Daniel Biddle <deltab@osian.net>


