Mixing UTF-8 and ISO 8859-1 (was: Normalization Form KC)

From: Doug Ewell (dewell@compuserve.com)
Date: Tue Aug 31 1999 - 01:54:06 EDT


>> Before all tools are fixed, all must normally write data in
>> ISO 8859-1 format. ISO 8859-1 with embedded UTF-8 would also
>> be ok. (Note: some of Markus Kuhn's objections to having a
>> base256 version of UTF-8 because it lacks essential properties
>
> No. I'm sorry, you will have to stop demanding stuff which is patently
> ridiculous.
>
> Tools that deal with ISO-8859-1 mixed with UTF-8 cannot be written.
> You can't even reliably autodetect between ISO-8859-1 and UTF-8.

The problem is not that such a tool is impossible to write (it can be
written), but that it won't work 100 percent of the time. It is commonly
pointed out that a byte in the range [0xC0, 0xDF] followed by a byte in
the range [0x80, 0xBF] is unlikely to occur in Latin-1 text, but it
isn't hard to create a contrived example:

        TURBO CAFÉ®

(my new hypothetical, Java-based, registered-trademarked product)
contains the Latin-1 sequence 0xC9 0xAE, which is perfectly legal UTF-8
for U+026E, LATIN SMALL LETTER LEZH. Your mixed UTF-8/8859-1 text tool
will dutifully convert my E-acute and circled-R into what, a voiced
lateral fricative? You do not want text tools that work less than 100
percent of the time.
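
To make the failure concrete, here is a minimal Python sketch of the kind
of heuristic described above. The function name naive_mixed_decode is
hypothetical, and the rule it applies (any byte in [0xC0, 0xDF] followed
by a byte in [0x80, 0xBF] is taken as a two-byte UTF-8 sequence, anything
else as Latin-1) is an assumption about how such a tool might work, not a
description of any real one.

```python
import unicodedata


def naive_mixed_decode(data: bytes) -> str:
    """Hypothetical heuristic decoder for 'Latin-1 with embedded UTF-8'.

    Only two-byte UTF-8 sequences are considered, to keep the sketch short.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        is_lead = 0xC0 <= b <= 0xDF
        has_trail = i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF
        if is_lead and has_trail:
            # Looks like UTF-8: combine 5 payload bits + 6 payload bits.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            # Otherwise read the byte as a Latin-1 character.
            out.append(chr(b))
            i += 1
    return "".join(out)


# "TURBO CAFE-acute, registered sign", written with escapes so the example
# does not depend on the encoding of this source file.
original = "TURBO CAF\u00c9\u00ae"
raw = original.encode("latin-1")      # ends with bytes 0xC9 0xAE

decoded = naive_mixed_decode(raw)
print(raw[-2:].hex())                 # c9ae
print(f"U+{ord(decoded[-1]):04X}")    # U+026E
print(unicodedata.name(decoded[-1]))  # LATIN SMALL LETTER LEZH
print(decoded == original)            # False
```

The two Latin-1 bytes collapse into a single character, so the
round trip silently changes the text rather than reporting an error.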

-Doug


