Re: 32'nd bit & UTF-8

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Jan 19 2005 - 01:37:53 CST

Next message: Raymond Mercier: "Re: Coptic II"

Previous message: Doug Ewell: "Re: Coptic II"
In reply to: Hans Aberg: "Re: 32'nd bit & UTF-8"
Next in thread: Hans Aberg: "Re: 32'nd bit & UTF-8"
Maybe reply: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"
Reply: Hans Aberg: "Re: 32'nd bit & UTF-8"

Hans Aberg <haberg at math dot su dot se> wrote:

>>>> The old RFC you're referring to is not designating UTF-8, but
>>>> UTF-BSS, which is a transformation format,
>>>
>>> OK. Fine, so we have a name for it.
>>
>> I was not sure about the name of it when writing the message.
>
> According to <http://www.cl.cam.ac.uk/~mgk25/unicode.html>, UTF is
> short for UCS Transformation Format, where UCS stands for Universal
> Character Set. When speaking about the extensions that I speak about,
> I think they should certainly have a separate name. Perhaps UTF-8X for
> extended, or BTF-8 for "bit (byte) transformation format".

RFC 2044, the original (1996) Internet definition of UTF-8, defined up
to 6-byte sequences.

While RFC 2044 has been superseded (RFC 2279, 1998) and re-superseded
(RFC 3629, 2003), and the 5- and 6-byte sequences have been removed, the
point is that they were originally defined in an encoding scheme called
"UTF-8." It is not true that they were only defined under some other
name, such as FSS-UTF (the name used in Unicode 1.1) or "UTF-BSS,"
whatever that is.
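
For illustration only, here is a minimal Python sketch of the original
1- to 6-byte scheme as laid out in RFC 2044 and RFC 2279. The names
encode_rfc2044 and allowed_by_rfc3629 are hypothetical, invented for
this example, and the encoder makes no attempt to reject surrogates or
to match any particular library's behavior.

```python
# Upper limit of each range and the lead-byte marker bits, for
# sequence lengths 1 through 6 (the table given in RFC 2044/2279).
_RANGES = [
    (0x7F, 0x00),
    (0x7FF, 0xC0),
    (0xFFFF, 0xE0),
    (0x1FFFFF, 0xF0),
    (0x3FFFFFF, 0xF8),
    (0x7FFFFFFF, 0xFC),
]


def encode_rfc2044(value):
    """Encode a 31-bit UCS-4 value with the original up-to-6-byte scheme.

    Hypothetical helper for illustration; surrogates are not rejected.
    """
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit UCS-4 range")
    # Pick the shortest sequence length whose range contains the value.
    for length, (limit, lead) in enumerate(_RANGES, start=1):
        if value <= limit:
            break
    out = bytearray(length)
    # Fill the trail bytes from the right, six payload bits each.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (value & 0x3F)
        value >>= 6
    # The remaining high bits go into the lead byte.
    out[0] = lead | value
    return bytes(out)


def allowed_by_rfc3629(value):
    """True if RFC 3629 still permits encoding this value in UTF-8."""
    return 0 <= value <= 0x10FFFF and not 0xD800 <= value <= 0xDFFF
```

Under this sketch, any value above 0x1FFFFF comes out as a 5- or 6-byte
sequence, and allowed_by_rfc3629 returns False for every such value.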

"BTF-8" is taken; see:

http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML018/0830.html

> The Unicode standard is like Big Brother in George Orwell's "1984",
> making it possible to only speak about what is right, but not what is
> wrong.

My goodness.

> Besides, even though Unicode has declared to never use more than 21
> bits, in the track record, Unicode has reneged on such promises. It
> might be prudent to knock down a full 32-bit encoding, declaring
> UTF-8/32 to be subsets of that.

I suppose the "promise" that you are referring to, on which Unicode
"reneged," was the original 16-bit design that was extended with the use
of surrogate pairs.

The difference between finding 65,000 things that need to be encoded and
finding 1.1 million things that need to be encoded is the difference
between night and day.
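
As a rough check on those two figures, the arithmetic behind the
surrogate-pair extension can be sketched as follows. The constant names
are made up for this example; the ranges are the standard high and low
surrogate blocks.

```python
# Code space of the original 16-bit design.
BMP = 1 << 16                      # 65,536

# Surrogate blocks used by UTF-16 to reach beyond 16 bits.
HIGH = 0xDBFF - 0xD800 + 1         # 1,024 high surrogates
LOW = 0xDFFF - 0xDC00 + 1          # 1,024 low surrogates

# Each high/low pair addresses one supplementary code point.
SUPPLEMENTARY = HIGH * LOW         # 1,048,576

# Total code points, U+0000 through U+10FFFF.
CODE_POINTS = BMP + SUPPLEMENTARY  # 1,114,112

# Excluding the surrogate code points themselves.
SCALARS = CODE_POINTS - (HIGH + LOW)
```

CODE_POINTS works out to 0x110000, which is where the "1.1 million"
figure comes from and why 21 bits are enough to hold any code point.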

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 01:41:35 CST