Re: 8-bit text which is supposed to be UTF-8 but isn't

From: John Cowan (cowan@locke.ccil.org)
Date: Sun Jan 30 2000 - 18:52:49 EST


Dan scripsit:

> ISO 10646 is 31 bits. All possible values should be allowed.
> I do not know why Unicode decided to grow beyond 16 bits,
> but not to the full 31 bits of ISO 10646.

JTC1/SC2/WG2 have declared that they will not go past 0010FFFF,
except for the (de facto deprecated) private-use areas
at 00E00000-00FFFFFF and at 60000000-7FFFFFFF.

> But that is no reason not to allow the full 31 bits in UTF-8 encoded
> text.

It is, indeed, the reason.
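
To make the consequence concrete, here is a minimal sketch in Python. The
names MAX_CODE_POINT, utf8_length and within_limit are hypothetical, chosen
for illustration; the byte-length table assumes the original 31-bit UTF-8
layout of one to six bytes per character.

```python
MAX_CODE_POINT = 0x10FFFF  # the ceiling cited above

# Exclusive upper bounds for 1- to 6-byte sequences, assuming the
# original 31-bit UTF-8 layout.
_LIMITS = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)


def utf8_length(cp: int) -> int:
    """Return how many bytes 31-bit UTF-8 would need to encode cp."""
    if cp < 0:
        raise ValueError("code points are non-negative")
    for length, limit in enumerate(_LIMITS, start=1):
        if cp < limit:
            return length
    raise ValueError("value does not fit in 31 bits")


def within_limit(cp: int) -> bool:
    """True if cp does not go past the 0010FFFF ceiling."""
    return 0 <= cp <= MAX_CODE_POINT
```

Under that ceiling no character needs more than four bytes; the five- and
six-byte forms are reachable only by values past the limit, so a decoder
that honours the limit can reject them outright.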

> You should also specify that Normalization Form C from Unicode
> Technical Report #15 should be used. This will simplify much of the
> encoding/decoding and help with searching and case-insensitive comparisons.

I would even go further, to require Form KC (no compatibility characters)
as well, at least in headers if not in body text.
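
The difference between the two forms can be sketched with Python's standard
unicodedata module. The helper names to_nfc and to_nfkc are hypothetical;
only unicodedata.normalize is a real library call.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Form C: canonical decomposition, then canonical composition."""
    return unicodedata.normalize("NFC", text)


def to_nfkc(text: str) -> str:
    """Form KC: as Form C, but compatibility characters are replaced too."""
    return unicodedata.normalize("NFKC", text)


# "e" followed by a combining acute accent composes to one character
# under both forms.
composed = to_nfc("e\u0301")

# The "fi" ligature (U+FB01) is a compatibility character: Form C keeps
# it, Form KC replaces it with the two plain letters.
kept = to_nfc("\ufb01")
replaced = to_nfkc("\ufb01")
```

A header that has been through Form KC contains no compatibility
characters, which is the property asked for above.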

> And best would be if this was valid everywhere, both in the protocol
> headers and the body text. The current MIME-encodings in headers
> are terrible.

Agreed. I believe the current draft drops or deprecates those.
(This is news, not mail, remember.)

> No, case insensitivity should be available on all letters. It is
> very important for many people. For a protocol to work well,
> it should be implemented in a well-defined way, such as
> section 2.3 of Unicode Technical Report #21.

But why do case folding at all? Simply forbid the use of uppercase
characters.
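
As a rough sketch of what such a rule could look like in Python: the
function name has_uppercase is hypothetical, and treating "uppercase" as
the Unicode general category Lu is my assumption here, not something any
draft specifies.

```python
import unicodedata


def has_uppercase(text: str) -> bool:
    """True if any character has general category Lu (uppercase letter).

    Assumption: "uppercase" means category Lu only; titlecase letters
    (Lt) and other cased characters are not considered in this sketch.
    """
    return any(unicodedata.category(ch) == "Lu" for ch in text)
```

A checker built on this only rejects names; it never has to fold or
compare cases, so no folding tables are needed.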

> As there is a group working on getting international characters into
> DNS, you may wait a little and see the results from them. It may
> affect the Usenet News protocol.

Where is this group?

-- 
John Cowan                                   cowan@ccil.org
       I am a member of a civilization. --David Brin


