Re: Subject: Re: 32'nd bit & UTF-8

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Jan 22 2005 - 12:29:31 CST

Next message: Jon Hanna: "RE: BOM in HTML"

Previous message: Jon Hanna: "RE: Subject: Re: 32'nd bit & UTF-8"
In reply to: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"
Next in thread: Doug Ewell: "Re: Subject: Re: 32'nd bit & UTF-8"

Lars Kristan wrote:

> Richard Gillam wrote:
>> The current committee is EXTREMELY vigilant
>> and won't let these things happen.
>
> I think this vigilance (though typically warranted) has gotten too
> far. So far in fact that even though it would be possible to prevent
> losing data when encountering invalid sequences in UTF-8, it still
> looks like the cost of 128 codepoints is too much. Yes, the million
> codepoints will suffice for a very long time.

This has been explained to you already. Continuing to ignore the
explanation is not improving your credibility.

The problem with encoding 128 placeholders is not in finding 128
available code points. It is that the entities to be encoded are not
"characters"; they are binary blobs that represent characters in some
other encoding. Unicode is not intended to be an indexing mechanism
into other encodings; see http://www.unicode.org/Public/TEXT/ALLOC.TXT,
section II.A, paragraphs 1 and 2 for some historical insight on this.

In addition, encoding 128 placeholders for unconverted characters from
another encoding would be tantamount to creating stateful duplicate
encodings for these characters. If the unconverted encoding is ISO
8859-1, then the character Ä could be represented either as U+00C4 or as
U+xx44 (assuming your block of 128 blobs begins at U+xx00 and covers
bytes 0x80 through 0xFF in order). The
equivalence between these two code points could not be captured within
the concept of canonical equivalence (unlike <00C4> vs. <0041 0308>)
because the unconverted encoding would not be indicated anywhere in the
text stream.
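
To make the duplicate-encoding point concrete, here is a minimal Python
sketch. The block base U+F700 (a Private Use Area value) and the helper
placeholder_for() are hypothetical, chosen only for illustration, and the
sketch assumes bytes 0x80 through 0xFF map onto the block in order. It
shows that normalization equates U+00C4 with <0041 0308> but has no way to
equate U+00C4 with a placeholder, because nothing in the text says which
encoding the blob came from.

```python
import unicodedata

# Hypothetical base of the 128-placeholder block. U+F700 lies in the
# Private Use Area and is an assumption made only for this sketch.
PLACEHOLDER_BASE = 0xF700


def placeholder_for(byte: int) -> str:
    """Map an unconverted byte 0x80..0xFF onto the hypothetical block."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF get a placeholder")
    return chr(PLACEHOLDER_BASE + (byte - 0x80))


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw = b"\xc4"                          # the letter A with diaeresis in ISO 8859-1
converted = raw.decode("iso-8859-1")   # properly converted: U+00C4
decomposed = "A\u0308"                 # <0041 0308>
blob = placeholder_for(raw[0])         # unconverted: a placeholder code point

print(canonically_equal(converted, decomposed))  # True
print(canonically_equal(converted, blob))        # False
```

The second comparison could only succeed if the consumer knew out of band
that the blob was ISO 8859-1, which is exactly the state the text stream
does not carry.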

All Unix software that passes byte-sequence filenames to UTF-8
subsystems will be subject to this issue. Customers will not single out
your particular software for rejection because it fails to provide a
non-standard workaround for it.
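
As a rough illustration of the issue itself, the Python sketch below takes
a filename stored as ISO 8859-1 bytes and hands it to a strict UTF-8
decoder. The filename and the helper try_utf8() are made up for this
example. Any standard-conforming decoder has the same two choices shown
here: reject the sequence, or substitute U+FFFD and lose the original byte.

```python
# A filename as a Unix file system stores it: just bytes. This hypothetical
# name is "Ärger.txt" written in ISO 8859-1, so 0xC4 is not valid UTF-8 here.
name = b"\xc4rger.txt"


def try_utf8(data: bytes):
    """Return the decoded string, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


strict = try_utf8(name)                         # rejected outright
lossy = name.decode("utf-8", errors="replace")  # 0xC4 becomes U+FFFD
round_trip = lossy.encode("utf-8")              # original byte is gone

print(strict)              # None
print(round_trip == name)  # False
```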

I will continue to oppose the encoding of 128 binary placeholders, even
as I support the allocation of thousands of code points for "real"
characters.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

This archive was generated by hypermail 2.1.5 : Sat Jan 22 2005 - 12:32:47 CST