Re: 8-bit text which is supposed to be UTF-8 but isn't

From: Addison Phillips [GSC] (addison@globalsight.com)
Date: Sun Jan 30 2000 - 22:25:43 EST


> Where is this group?

One such group is iDNS (http://www.idns.org ?)

They have that odd UTF-5 proposal, which is meant to be compatible with
existing software.
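
For anyone who has not seen it, here is a rough Python sketch of UTF-5 as I
understand the draft: each character's code point is written as hex nibbles,
one 5-bit "quintet" per nibble, the first quintet of a character has its high
bit set (so it shows up as G-V), and the remaining ones are plain 0-9/A-F.
The function names are my own, and the details are from memory rather than
checked against the draft, so treat it as an illustration only.

```python
# Sketch of UTF-5 as described above; an assumption-laden illustration,
# not a reference implementation of the iDNS draft.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # 32 symbols, one per quintet


def utf5_encode(text: str) -> str:
    """Encode each code point as hex nibbles, marking the first nibble."""
    out = []
    for ch in text:
        hexits = format(ord(ch), "X")  # nibbles without leading zeros
        # The lead quintet has the 0x10 bit set, so it maps to G..V.
        out.append(DIGITS[int(hexits[0], 16) + 16])
        # Continuation quintets leave that bit clear: ordinary hex digits.
        out.append(hexits[1:])
    return "".join(out)


def utf5_decode(encoded: str) -> str:
    """Inverse of utf5_encode; raises ValueError on malformed input."""
    chars = []
    value = None
    for sym in encoded.upper():
        quintet = DIGITS.index(sym)  # ValueError for symbols outside 0-9A-V
        if quintet >= 16:
            # A lead quintet starts a new character; flush the previous one.
            if value is not None:
                chars.append(chr(value))
            value = quintet - 16
        else:
            if value is None:
                raise ValueError("continuation quintet before a lead quintet")
            value = (value << 4) | quintet
    if value is not None:
        chars.append(chr(value))
    return "".join(chars)
```

The result uses only letters and digits, which is the whole point: it passes
through software that only expects ASCII host names.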

Supposedly the IETF has a working group, but I know nothing about it. I do
know that the limit on legal domain names was recently raised to 63 bytes
(which is almost 3x the old limit, or about what one would expect to need in
order to accommodate the BMP in UTF-8).
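
To put numbers on that, a small Python sketch: a BMP character takes one to
three bytes in UTF-8, so a 63-byte budget holds as few as 21 characters. The
63-byte figure is the one quoted above; the helper names are made up for the
example.

```python
# Byte budget for a name, assuming the 63-byte limit mentioned above.
NAME_LIMIT_BYTES = 63


def utf8_len(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fits(name: str, limit: int = NAME_LIMIT_BYTES) -> bool:
    """True if the UTF-8 form of the name stays within the byte limit."""
    return utf8_len(name) <= limit


# Worst case for the BMP is three bytes per character (U+0800..U+FFFF),
# so the guaranteed capacity is the limit divided by three.
WORST_CASE_BMP_CHARS = NAME_LIMIT_BYTES // 3

if __name__ == "__main__":
    for sample in ("a", "\u00e9", "\u65e5"):
        print(repr(sample), utf8_len(sample), "byte(s)")
    print("worst-case BMP characters in the limit:", WORST_CASE_BMP_CHARS)
```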

Addison

Addison P. Phillips
Senior Globalization Consultant
Global Sight Corporation
mailto:addison@globalsight.com
================================
101 Metro Drive, Suite 750
San Jose, California 95110
(+1) 408.350.3600 - Telephone
(+1) 408.350.3601 - Fax
http://www.globalsight.com
================================

----- Original Message -----
From: John Cowan <cowan@locke.ccil.org>
To: Unicode List <unicode@unicode.org>
Sent: Sunday, January 30, 2000 6:53 PM
Subject: Re: 8-bit text which is supposed to be UTF-8 but isn't

> Dan scripsit:
>
> > ISO 10646 is 31 bits. All possible values should be allowed.
> > I do not know why Unicode have decided to grow their bits to
> > more than 16 bits, but not to all 31 bits of ISO 10646.
>
> JTC1/SC2/WG2 have declared that they will not go past 0010FFFF,
> except for the (de facto deprecated) private-use areas
> at 00E00000-00FFFFFF and at 60000000-7FFFFFFF.
>
> > But that is no reason to not allow full 31 bits in UTF-8 encoded
> > text.
>
> It is, indeed, the reason.
>
> > You should also specify that Unicode technical report #15 normalisation
> > form C should be used. This will simplify much encoding/decoding
> > and help searching and case insensitivity comparisons.
>
> I would even go further, to require Form KC (no compatibility characters)
> as well, at least in headers if not in body text.
>
> > And best would be if this was valid everywhere, both in the protocol
> > headers and the body text. The current MIME-encodings in headers
> > are terrible.
>
> Agreed. I believe the current draft drops or deprecates those.
> (This is news, not mail, remember.)
>
> > No, case insensitivity should be available on all letters. It is
> > very important for many people. For a protocol
> > to work well it should be implemented using a well defined way like
> > section 2.3 in Unicode technical report #21.
>
> But why do case folding at all? Simply forbid the use of uppercase
> characters.
>
> > As there is a group working on getting international characters into
> > DNS, you may wait a little and see the results from them. It may
> > affect the Usenet News protocol.
>
> Where is this group?
>
> --
> John Cowan cowan@ccil.org
> I am a member of a civilization. --David Brin
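
On the range question in the quoted message, a quick Python sketch of what is
at stake: under the original 31-bit UTF-8 scheme a code point can take up to
six bytes, while everything up to 0010FFFF fits in four. The function name is
mine; the thresholds are the standard UTF-8 length boundaries.

```python
# Encoded length per code point under the original 31-bit UTF-8 scheme.
UNICODE_MAX = 0x10FFFF   # the ceiling cited in the quoted message
UCS4_MAX = 0x7FFFFFFF    # full 31-bit ISO 10646 range


def utf8_length(code_point: int) -> int:
    """Bytes needed to encode one code point (1 to 6)."""
    if code_point < 0 or code_point > UCS4_MAX:
        raise ValueError("code point outside the 31-bit range")
    # Upper bound of each length class: 7, 11, 16, 21, 26 and 31 bits.
    for length, upper in enumerate(
        (0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, UCS4_MAX), start=1
    ):
        if code_point <= upper:
            return length
    raise AssertionError("unreachable")


if __name__ == "__main__":
    for cp in (0x41, 0x7FF, 0xFFFF, UNICODE_MAX, 0x200000, UCS4_MAX):
        print(f"{cp:08X}: {utf8_length(cp)} byte(s)")
```

A decoder that stops at 0010FFFF never has to handle the five- and six-byte
forms at all.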



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:58 EDT