Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 12:38:22 CST

Next message: John H. Jenkins: "Re: Subject: Re: 32'nd bit & UTF-8"

Previous message: Philippe VERDY: "Re: Unicode lexers (was:32'nd bit & UTF-8)"
In reply to: Doug Ewell: "Re: 32'nd bit & UTF-8"
Next in thread: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"
Maybe reply: Philippe VERDY: "Re: Re: 32'nd bit & UTF-8"

On 2005/01/19 08:37, Doug Ewell at dewell@adelphia.net wrote:

> Hans Aberg <haberg at math dot su dot se> wrote:
>
>>>>> The old RFC you're referring to is not designating UTF-8, but
>>>>> UTF-BSS, which is a transformation format,
>>>>
>>>> OK. Fine, so we have a name for it.
>>>
>>> I was not sure about the name of it when writing the message.
>>
>> According to <http://www.cl.cam.ac.uk/~mgk25/unicode.html>, UTF is
>> short for UCS Transformation Format, where UCS stands for Universal
>> Character Set. When speaking about the extensions that I speak about,
>> I think they should certainly have a separate name. Perhaps UTF-8X for
>> extended, or BTF-8 for "bit (byte) transformation format".
>
> RFC 2044, the original (1996) Internet definition of UTF-8, defined up
> to 6-byte sequences.
>
> While RFC 2044 has been superseded (RFC 2279, 1998) and re-superseded
> (RFC 3629, 2003), and the 5- and 6-byte sequences have been removed, the
> point is that they were originally defined in an encoding scheme called
> "UTF-8." It is not true that they were only defined under some other
> name, such as FSS-UTF (the name used in Unicode 1.1) or "UTF-BSS,"
> whatever that is.
>
> "BTF-8" is taken; see:
>
> http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML018/0830.html

Another name might be CPBTF-8 for "code point to binary transformation
format".
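To make concrete what the 5- and 6-byte sequences mentioned above look like, here is a minimal Python sketch of the original 1- to 6-byte layout (the one RFC 2044 and RFC 2279 described, which covers 31 bits). The function name encode_extended_utf8 is made up for illustration, and the sketch deliberately does not reject surrogates or values above 0x10FFFF, which RFC 3629 UTF-8 does.

```python
def encode_extended_utf8(code_point: int) -> bytes:
    """Encode an integer below 2**31 with the original 1- to 6-byte layout.

    Hypothetical helper for illustration only; this is not RFC 3629 UTF-8,
    which stops at 4 bytes and at code point 0x10FFFF.
    """
    if not 0 <= code_point < 2**31:
        raise ValueError("value does not fit in 31 bits")
    if code_point < 0x80:
        return bytes([code_point])  # plain ASCII, one byte

    # (exclusive upper limit, sequence length, lead-byte marker bits)
    layouts = [
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ]
    for limit, length, marker in layouts:
        if code_point < limit:
            break

    # Peel off 6 bits per continuation byte, lowest bits first.
    trailing = []
    value = code_point
    for _ in range(length - 1):
        trailing.append(0x80 | (value & 0x3F))
        value >>= 6

    # Whatever is left goes into the lead byte, after the marker bits.
    return bytes([marker | value] + trailing[::-1])


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x10FFFF, 0x200000, 0x7FFFFFFF):
        print(hex(cp), encode_extended_utf8(cp).hex(" "))
```

For values up to 0x10FFFF (surrogates aside) this agrees with ordinary UTF-8; everything from 0x200000 upwards needs the 5- and 6-byte forms that were later removed.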

>> The Unicode standard is like Big Brother in George Orwell's "1984",
>> making it possible to only speak about what is right, but not what is
>> wrong.
>
> My goodness.

Shocking, isn't it? :-)

>> Besides, even though Unicode has declared to never use more than 21
>> bits, in the track record, Unicode has reneged on such promises. It
>> might be prudent to knock down a full 32-bit encoding, declaring
>> UTF-8/32 to be subsets of that.
>
> I suppose the "promise" that you are referring to, on which Unicode
> "reneged," was the original 16-bit design that was extended with the use
> of surrogate pairs.

Right.

> The difference between finding 65,000 things that need to be encoded and
> finding 1.1 million things that need to be encoded is the difference
> between night and day.

I can only refer to the development of the Bison parser generator
<http://gnu.org>. There, the number of tokens, states, etc. was often
limited to 2^15. But it turns out that people want, in view of more powerful
computers, to plug in larger and larger grammars. One can then plug in really
large machine-generated grammars, with millions of tokens, for example. So
these lower limits are now being changed.

So, as long as these Unicode encodings are only used for characters
enumerated by humans, 1 million is perhaps well within the boundaries. But
if somebody comes up with some clever machine enumeration, then the limit
might be broken.
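For reference, the sizes being compared in this exchange can be checked with a few lines of Python. This is only a rough sketch of the arithmetic; the constant names and the helper to_surrogate_pair are made up for illustration.

```python
# Code-space sizes discussed in this thread (names are illustrative only).
BMP_SIZE = 2**16                              # original 16-bit design
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1         # 1024 lead values
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1          # 1024 trail values
SUPPLEMENTARY = HIGH_SURROGATES * LOW_SURROGATES
UNICODE_CODE_SPACE = BMP_SIZE + SUPPLEMENTARY  # 0x110000 code points
BISON_OLD_LIMIT = 2**15                       # the 2^15 limit mentioned above


def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000             # 20 bits remain
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    print(f"16-bit design:        {BMP_SIZE:,}")
    print(f"with surrogate pairs: {UNICODE_CODE_SPACE:,}")
    print(f"bits for 0x10FFFF:    {(0x10FFFF).bit_length()}")
    print(f"31-bit code space:    {2**31:,}")
    print([hex(unit) for unit in to_surrogate_pair(0x1F600)])
```

The "1.1 million" figure is the 1,114,112 code points reachable with surrogate pairs, which is why 21 bits suffice; a 31-bit code space is roughly 1,900 times larger.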

Hans Aberg


This archive was generated by hypermail 2.1.5 : Wed Jan 19 2005 - 12:59:56 CST