Re: 8-bit text which is supposed to be UTF-8 but isn't

From: Dan (Dan.Oscarsson@trab.se)
Date: Sun Jan 30 2000 - 06:42:43 EST


> > (Basically, we say: "just pass it on, and what
> > happens at presentation is undefined." If the agent settles on
> > presenting it as, say, Latin-1, that is fine with me.)
>
> Reasonable for Usenet news, where most content is read by human
> beings.

I agree.

> Here's a more precise version:
>
> UTF8-xtra-head-2 = %d192-223
> UTF8-xtra-head-3 = %d224-239
> UTF8-xtra-head-4 = %d240-247
> UTF8-xtra-tail = %d128-191
> UTF8-xtra-char = UTF8-xtra-head-2 UTF8-xtra-tail
> | UTF8-xtra-head-3 2UTF8-xtra-tail
> | UTF8-xtra-head-4 3UTF8-xtra-tail
>
> Bytes %d248-253 are technically legal but will never be needed,
> as Unicode/ISO 10646 will never grow beyond hex 0010FFFF except for
> deprecated additional private-use zones that predate Unicode,
> and bytes %d254-255 are outright illegal.
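
To make the quoted grammar concrete, here is a minimal Python sketch of a matcher for UTF8-xtra-char. The names HEAD_TAILS, is_tail and match_xtra_char are my own, not from any draft, and I am assuming each head byte is followed by exactly the stated number of tail bytes. Like the grammar, it checks only the shape of the bytes and does not reject overlong forms.

```python
# Lead-byte ranges from the quoted grammar and the number of tail bytes each needs.
HEAD_TAILS = [
    (range(192, 224), 1),  # UTF8-xtra-head-2
    (range(224, 240), 2),  # UTF8-xtra-head-3
    (range(240, 248), 3),  # UTF8-xtra-head-4
]


def is_tail(byte):
    """True if byte is in the UTF8-xtra-tail range %d128-191."""
    return 128 <= byte <= 191


def match_xtra_char(data, pos=0):
    """Return the length of a UTF8-xtra-char at data[pos], or 0 if none matches."""
    if pos >= len(data):
        return 0
    for heads, tails in HEAD_TAILS:
        if data[pos] in heads:
            tail = data[pos + 1 : pos + 1 + tails]
            if len(tail) == tails and all(is_tail(b) for b in tail):
                return 1 + tails
            return 0  # head byte without enough valid tail bytes
    return 0  # ASCII, a stray tail byte, or a byte in %d248-255
```

With this, a Latin-1 "é" followed by ASCII text (byte 233, then a byte below 128) does not match, which is the typical case of 8-bit text that claims to be UTF-8 but isn't.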

ISO 10646 is a 31-bit code space, and all possible values should be
allowed. I do not know why Unicode decided to grow beyond 16 bits
but not to the full 31 bits of ISO 10646. That is no reason to
disallow the full 31 bits in UTF-8 encoded text.

Specify UCS (ISO 10646) encoded in UTF-8 without character range limits.
Do not restrict to current limits of Unicode.
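
As an illustration of what "without character range limits" means at the byte level, here is a small Python sketch of the 1-to-6-byte UTF-8 scheme as defined in RFC 2279, covering values up to hex 7FFFFFFF. The function name encode_ucs_utf8 is hypothetical. Python's own encoder stops at hex 10FFFF, so the 5- and 6-byte forms are built by hand, and the sketch makes no attempt to reject surrogate values.

```python
def encode_ucs_utf8(code):
    """Encode one UCS value (0..0x7FFFFFFF) in the 1-to-6-byte UTF-8 scheme."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS range")
    if code < 0x80:
        return bytes([code])
    # (exclusive upper limit, lead-byte marker, number of tail bytes)
    for limit, marker, tails in (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ):
        if code < limit:
            break
    # The lead byte carries the marker plus the topmost payload bits.
    out = [marker | (code >> (6 * tails))]
    # Each tail byte carries six payload bits, most significant first.
    for shift in range(6 * (tails - 1), -1, -6):
        out.append(0x80 | ((code >> shift) & 0x3F))
    return bytes(out)
```

Values above hex 1FFFFF get lead bytes in the range 248-253, which are exactly the bytes a grammar limited to 4-byte sequences leaves out.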

You should also specify that normalisation form C from Unicode
Technical Report #15 should be used. This will greatly simplify
encoding and decoding, and will help searching and case-insensitive
comparison.
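
A short Python sketch of why a single normalisation form helps comparison: the same visible character can arrive as two different code point sequences, and they only compare equal after normalising. It uses unicodedata.normalize from the standard library; the helper name to_nfc is my own.

```python
import unicodedata


def to_nfc(text):
    """Return text in Unicode normalisation form C (composed)."""
    return unicodedata.normalize("NFC", text)


decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# The raw strings differ, so a naive search for one misses the other.
raw_equal = decomposed == precomposed
# After normalisation both are the single precomposed character.
nfc_equal = to_nfc(decomposed) == to_nfc(precomposed)
```

If every agent emits form C, the receiving side can compare strings directly without normalising first.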

It would be best if this applied everywhere, both in the protocol
headers and in the body text. The current MIME encodings in headers
are terrible.

> > Generally, we have been careful not to get into too much detail on how
> > UTF-8 will work when it comes to case equivalence and such, as there
> > is no one on the list with deeper knowledge in the area. We're just
> > assuming that there will be good libraries that programmers can use.
>
> IMHO (and other i18n types will probably agree), case folding is a bad
> idea in general. For backward compatibility, only case-fold the
> ASCII characters, and leave the others alone.

No, case insensitivity should be available for all letters, not only
ASCII. It is very important for many people. For a protocol to work
well, case-insensitive matching should be implemented in a
well-defined way, such as the one in section 2.3 of Unicode
Technical Report #21.
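
To show the difference between the two positions, here is a minimal Python sketch comparing ASCII-only folding with Unicode case folding. It uses Python's built-in str.casefold(), which I am assuming is close in spirit to what the technical report describes; it is not taken from the report itself. The function names ascii_only_fold and caseless_equal are hypothetical.

```python
def ascii_only_fold(text):
    """Fold only A-Z to a-z and leave every other character alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def caseless_equal(a, b):
    """Compare two strings using Unicode case folding on all letters."""
    return a.casefold() == b.casefold()


upper = "\u00c5NGSTR\u00d6M"   # "ANGSTROM" with A-ring and O-diaeresis, upper case
lower = "\u00e5ngstr\u00f6m"   # the same word in lower case

# ASCII-only folding leaves the two non-ASCII capitals untouched, so this fails.
ascii_match = ascii_only_fold(upper) == ascii_only_fold(lower)
# Unicode case folding treats all letters alike, so this succeeds.
full_match = caseless_equal(upper, lower)
```

With ASCII-only folding, a name containing non-ASCII letters matches only when typed in exactly the same case, which is the behaviour I am arguing against.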

As there is a group working on getting international characters into
DNS, you may want to wait a little and see their results. They may
affect the Usenet News protocol.

   Dan


