RE: 32'nd bit & UTF-8

From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Jan 22 2005 - 13:17:19 CST

    Doug Ewell wrote:

    > This has been explained to you already. Continuing to ignore the
    > explanation is not improving your credibility.

    I have told you what I think of the explanation.

    >
    > The problem with encoding 128 placeholders is not in finding 128
    > available code points. It is that the entities to be encoded are not
    > "characters"; they are binary blobs that represent characters in some
    > other encoding. Unicode is not intended to be an indexing mechanism
    > into other encodings

    I know finding them is not a problem. What Unicode is intended for is just
    a statement. Reinterpret it. If you can't, the UTC can. And they should.

    The 128 codepoints I am proposing are not indexes. They are 128 distinct
    variants of U+FFFD.
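
    To make the idea concrete, here is a rough C sketch of the decoding side.
    The base code point is made up (a PUA value, chosen for illustration
    only), since no such block is assigned anywhere; the point is merely that
    each undecodable byte keeps its identity instead of collapsing into
    U+FFFD:

        #include <stddef.h>
        #include <stdint.h>

        /* Hypothetical base for the 128 placeholders; 0xEE00 is an
         * arbitrary PUA value used for illustration only. */
        #define PLACEHOLDER_BASE 0xEE00u

        /* Decode one UTF-8 sequence from s (n > 0 bytes available).
         * Returns the number of bytes consumed and stores the code point
         * in *cp.  An undecodable byte consumes one byte and yields one
         * of 128 distinct placeholders instead of a single U+FFFD. */
        static size_t decode_one(const uint8_t *s, size_t n, uint32_t *cp)
        {
            uint8_t b = s[0];
            uint32_t c = 0;
            size_t len, i;

            if (b < 0x80) { *cp = b; return 1; }          /* ASCII */
            else if ((b & 0xE0) == 0xC0) { c = b & 0x1F; len = 2; }
            else if ((b & 0xF0) == 0xE0) { c = b & 0x0F; len = 3; }
            else if ((b & 0xF8) == 0xF0) { c = b & 0x07; len = 4; }
            else len = 0;            /* stray trail byte or 0xF8..0xFF */

            if (len == 0 || len > n) goto bad;   /* also: truncated input */
            for (i = 1; i < len; i++) {
                if ((s[i] & 0xC0) != 0x80) goto bad;   /* bad trail byte */
                c = (c << 6) | (s[i] & 0x3F);
            }
            /* reject overlong forms, surrogates and out-of-range values */
            if ((len == 2 && c < 0x80) || (len == 3 && c < 0x800) ||
                (len == 4 && (c < 0x10000 || c > 0x10FFFF)) ||
                (c >= 0xD800 && c <= 0xDFFF))
                goto bad;
            *cp = c;
            return len;

        bad:
            /* byte 0x80 -> BASE+0x00, ..., byte 0xFF -> BASE+0x7F */
            *cp = PLACEHOLDER_BASE + (b & 0x7F);
            return 1;
        }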

    > ; see
    > http://www.unicode.org/Public/TEXT/ALLOC.TXT,
    > section II.A, paragraphs 1 and 2 for some historical insight on this.

    I can see the similarity. I also see that the proposal has been rejected for
    a number of reasons.

    --- quote ---
    The expansion of the O-Zone to 94 rows is impossible
    in any case.
    --- end quote ---

    This one is particularly interesting:

    --- quote ---
    vulnerability to irretrievable loss of character semantics makes it
    unsafe for general-purpose blind interchange of text data
    --- end quote ---

    So, rejecting the data or replacing it with a single U+FFFD is safer than
    using 128 codepoints? I fail to see how.
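
    The difference is easy to state in code. With 128 placeholders the
    replacement is reversible; with a single U+FFFD it is not. A sketch of
    the reverse step, using the same hypothetical PLACEHOLDER_BASE as above:

        /* On the way back out, a placeholder becomes the exact byte it
         * stood for, so the original byte string round-trips unchanged.
         * With a single U+FFFD, every invalid byte collapses into one
         * code point and the data is gone for good. */
        static int placeholder_to_byte(uint32_t cp, uint8_t *out)
        {
            if (cp >= PLACEHOLDER_BASE && cp < PLACEHOLDER_BASE + 0x80) {
                *out = (uint8_t)(0x80 + (cp - PLACEHOLDER_BASE));
                return 1;    /* was a placeholder: emit the raw byte */
            }
            return 0;        /* ordinary code point: encode as UTF-8 */
        }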

    > In addition, encoding 128 placeholders for unconverted characters from
    > another encoding would be tantamount to creating stateful duplicate
    > encodings for these characters. If the unconverted encoding is ISO
    > 8859-1, then the character Ä could be represented either as
    > U+00C4 or as
    > U+xx44 (assuming your block of 128 blobs begins at U+xx00). The
    > equivalence between these two code points could not be captured within
    > the concept of canonical equivalence (unlike <00C4> vs. <0041 0308>)
    > because the unconverted encoding would not be indicated
    > anywhere in the
    > text stream.

    U+xx44 in this example has nothing to do with U+00C4. The swap zone assumes
    there would be an attempt to bind the data to an encoding. My approach
    doesn't attempt that in the first place. It simply introduces 128 U+FFFDs.
    If anyone chooses to abuse them in a swap-zone way, they of course can. But
    one could just as well abuse U+00C4 to switch the keyboard into a German
    layout; that is no reason not to encode it.
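
    A small usage example, continuing the sketch above, makes the separation
    visible: an unconverted Latin-1 byte and a properly encoded Ä never meet
    (the placeholder value printed is, again, purely illustrative):

        #include <stdio.h>

        int main(void)
        {
            const uint8_t latin1[] = { 0xC4 };       /* A-umlaut in ISO
                                                        8859-1; invalid UTF-8 */
            const uint8_t utf8[]   = { 0xC3, 0x84 }; /* A-umlaut in UTF-8 */
            uint32_t cp;

            decode_one(latin1, sizeof latin1, &cp);
            printf("byte C4     -> U+%04X (placeholder)\n", (unsigned)cp);
            decode_one(utf8, sizeof utf8, &cp);
            printf("bytes C3 84 -> U+%04X (the real character)\n", (unsigned)cp);
            return 0;
        }

    The placeholder does not claim to be Ä; it only records that the byte
    0xC4 was there.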

    > All Unix software that passes byte-sequence filenames to UTF-8
    > subsystems will be subject to this issue. Customers will not
    > single out
    > your particular software for rejection because it fails to provide a
    > non-standard workaround for it.

    I've solved all my UNIX problems, using the PUA. It is Windows that gives
    me problems now. Customers want Unicode output in the console. Why doesn't
    Windows support a UTF-8 locale? Not that I'm being picky about it; UTF-16
    would also be fine, as long as I can get Unicode through stdout. Well, and
    of course be able to feed it to some other application.
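
    For what it's worth, the closest I can get are two Win32 routes that do
    exist; how well either behaves varies with the Windows version and the
    console font, so take this as a sketch rather than a recipe:

        #include <windows.h>
        #include <stdio.h>

        int main(void)
        {
            DWORD written;

            /* Route 1: switch the console output code page to UTF-8
             * (65001) and write UTF-8 bytes through stdout. */
            SetConsoleOutputCP(CP_UTF8);
            fputs("\xC3\x84\n", stdout);             /* UTF-8 for U+00C4 */

            /* Route 2: skip the code-page machinery and write UTF-16
             * directly.  WriteConsoleW only works on a real console
             * handle, not on redirected output. */
            WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE),
                          L"\x00C4\n", 2, &written, NULL);
            return 0;
        }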

    > I will continue to oppose the encoding of 128 binary
    > placeholders

    Feel free to do so.

    Lars


