RE: 32'nd bit & UTF-8

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 13:17:19 CST

    Doug Ewell wrote:

    > This has been explained to you already. Continuing to ignore the
    > explanation is not improving your credibility.

    I have told you what I think of the explanation.

    >
    > The problem with encoding 128 placeholders is not in finding 128
    > available code points. It is that the entities to be encoded are not
    > "characters"; they are binary blobs that represent characters in some
    > other encoding. Unicode is not intended to be an indexing mechanism
    > into other encodings

    I know finding them is not a problem. What Unicode is intended for is just
    a statement. Reinterpret it. If you can't, the UTC can. And they should.

    The 128 codepoints I am proposing are not indexes. They are 128 distinct
    variants of U+FFFD.
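
    To make the idea concrete, here is a rough C sketch of the decoding side.
    The base code point is made up (a PUA value, chosen for illustration
    only), since no such block is assigned anywhere; the point is merely that
    each undecodable byte keeps its identity instead of collapsing into
    U+FFFD:

        #include <stddef.h>
        #include <stdint.h>

        /* Hypothetical base for the 128 placeholders; 0xEE00 is an
         * arbitrary PUA value used for illustration only. */
        #define PLACEHOLDER_BASE 0xEE00u

        /* Decode one UTF-8 sequence from s (n > 0 bytes available).
         * Returns the number of bytes consumed and stores the code point
         * in *cp.  An undecodable byte consumes one byte and yields one
         * of 128 distinct placeholders instead of a single U+FFFD. */
        static size_t decode_one(const uint8_t *s, size_t n, uint32_t *cp)
        {
            uint8_t b = s[0];
            uint32_t c = 0;
            size_t len, i;

            if (b < 0x80) { *cp = b; return 1; }          /* ASCII */
            else if ((b & 0xE0) == 0xC0) { c = b & 0x1F; len = 2; }
            else if ((b & 0xF0) == 0xE0) { c = b & 0x0F; len = 3; }
            else if ((b & 0xF8) == 0xF0) { c = b & 0x07; len = 4; }
            else len = 0;            /* stray trail byte or 0xF8..0xFF */

            if (len == 0 || len > n) goto bad;   /* also: truncated input */
            for (i = 1; i < len; i++) {
                if ((s[i] & 0xC0) != 0x80) goto bad;   /* bad trail byte */
                c = (c << 6) | (s[i] & 0x3F);
            }
            /* reject overlong forms, surrogates and out-of-range values */
            if ((len == 2 && c < 0x80) || (len == 3 && c < 0x800) ||
                (len == 4 && (c < 0x10000 || c > 0x10FFFF)) ||
                (c >= 0xD800 && c <= 0xDFFF))
                goto bad;
            *cp = c;
            return len;

        bad:
            /* byte 0x80 -> BASE+0x00, ..., byte 0xFF -> BASE+0x7F */
            *cp = PLACEHOLDER_BASE + (b & 0x7F);
            return 1;
        }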

    > ; see
    > http://www.unicode.org/Public/TEXT/ALLOC.TXT,
    > section II.A, paragraphs 1 and 2 for some historical insight on this.

    I can see the similarity. I also see that the proposal has been rejected for
    a number of reasons.

    --- quote ---
    The expansion of the O-Zone to 94 rows is impossible
    in any case.
    --- end quote ---

    This one is particularly interesting:

    --- quote ---
    vulnerability to irretrievable loss of character semantics makes it
    unsafe for general-purpose blind interchange of text data
    --- end quote ---

    So, rejecting the data or replacing it with a single U+FFFD is safer than
    using 128 codepoints? I fail to see how.
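
    The difference is easy to state in code. With 128 placeholders the
    replacement is reversible; with a single U+FFFD it is not. A sketch of
    the reverse step, using the same hypothetical PLACEHOLDER_BASE as above:

        /* On the way back out, a placeholder becomes the exact byte it
         * stood for, so the original byte string round-trips unchanged.
         * With a single U+FFFD, every invalid byte collapses into one
         * code point and the data is gone for good. */
        static int placeholder_to_byte(uint32_t cp, uint8_t *out)
        {
            if (cp >= PLACEHOLDER_BASE && cp < PLACEHOLDER_BASE + 0x80) {
                *out = (uint8_t)(0x80 + (cp - PLACEHOLDER_BASE));
                return 1;    /* was a placeholder: emit the raw byte */
            }
            return 0;        /* ordinary code point: encode as UTF-8 */
        }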

    > In addition, encoding 128 placeholders for unconverted characters from
    > another encoding would be tantamount to creating stateful duplicate
    > encodings for these characters. If the unconverted encoding is ISO
    > 8859-1, then the character Ä could be represented either as
    > U+00C4 or as
    > U+xx44 (assuming your block of 128 blobs begins at U+xx00). The
    > equivalence between these two code points could not be captured within
    > the concept of canonical equivalence (unlike <00C4> vs. <0041 0308>)
    > because the unconverted encoding would not be indicated
    > anywhere in the
    > text stream.

    U+xx44 in this example has nothing to do with U+00C4. The swap zone assumes
    there would be an attempt to bind the data to an encoding. My approach
    doesn't attempt that in the first place. It simply introduces 128 U+FFFDs.
    If anyone chooses to abuse them in a swap-zone way, they of course can. But
    one could just as well abuse U+00C4 to switch the keyboard into a German
    layout; that is no reason not to encode it.
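
    A small usage example, continuing the sketch above, makes the separation
    visible: an unconverted Latin-1 byte and a properly encoded Ä never meet
    (the placeholder value printed is, again, purely illustrative):

        #include <stdio.h>

        int main(void)
        {
            const uint8_t latin1[] = { 0xC4 };       /* A-umlaut in ISO
                                                        8859-1; invalid UTF-8 */
            const uint8_t utf8[]   = { 0xC3, 0x84 }; /* A-umlaut in UTF-8 */
            uint32_t cp;

            decode_one(latin1, sizeof latin1, &cp);
            printf("byte C4     -> U+%04X (placeholder)\n", (unsigned)cp);
            decode_one(utf8, sizeof utf8, &cp);
            printf("bytes C3 84 -> U+%04X (the real character)\n", (unsigned)cp);
            return 0;
        }

    The placeholder does not claim to be Ä; it only records that the byte
    0xC4 was there.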

    > All Unix software that passes byte-sequence filenames to UTF-8
    > subsystems will be subject to this issue. Customers will not
    > single out
    > your particular software for rejection because it fails to provide a
    > non-standard workaround for it.

    I've solved all my UNIX problems, using the PUA. It is Windows that gives
    me problems now. Customers want Unicode output in the console. Why doesn't
    Windows support a UTF-8 locale? Not that I'm being picky about it; UTF-16
    would also be fine, as long as I can get Unicode through stdout. Well, and
    of course be able to feed it to some other application.
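
    For what it's worth, the closest I can get are two Win32 routes that do
    exist; how well either behaves varies with the Windows version and the
    console font, so take this as a sketch rather than a recipe:

        #include <windows.h>
        #include <stdio.h>

        int main(void)
        {
            DWORD written;

            /* Route 1: switch the console output code page to UTF-8
             * (65001) and write UTF-8 bytes through stdout. */
            SetConsoleOutputCP(CP_UTF8);
            fputs("\xC3\x84\n", stdout);             /* UTF-8 for U+00C4 */

            /* Route 2: skip the code-page machinery and write UTF-16
             * directly.  WriteConsoleW only works on a real console
             * handle, not on redirected output. */
            WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE),
                          L"\x00C4\n", 2, &written, NULL);
            return 0;
        }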

    > I will continue to oppose the encoding of 128 binary
    > placeholders

    Feel free to do so.

    Lars


