Re: Subject: Re: 32'nd bit & UTF-8

From: Doug Ewell
Date: Sat Jan 22 2005 - 12:29:31 CST


    Lars Kristan wrote:

    > Richard Gillam wrote:
    >> The current committee is EXTREMELY vigilant
    >> and won't let these things happen.
    > I think this vigilance (though typically warranted) has gotten too
    > far. So far in fact that even though it would be possible to prevent
    > losing data when encountering invalid sequences in UTF-8, it still
    > looks like the cost of 128 codepoints is too much. Yes, the million
    > codepoints will suffice for a very long time.

    This has been explained to you already. Continuing to ignore the
    explanation is not improving your credibility.

    The problem with encoding 128 placeholders is not in finding 128
    available code points. It is that the entities to be encoded are not
    "characters"; they are binary blobs that represent characters in some
    other encoding. Unicode is not intended to be an indexing mechanism
    into other encodings; see,
    section II.A, paragraphs 1 and 2 for some historical insight on this.

    In addition, encoding 128 placeholders for unconverted characters from
    another encoding would be tantamount to creating stateful duplicate
    encodings for these characters. If the unconverted encoding is ISO
    8859-1, then the character Ä could be represented either as U+00C4 or as
    U+xx44 (assuming your block of 128 blobs begins at U+xx00). The
    equivalence between these two code points could not be captured within
    the concept of canonical equivalence (unlike <00C4> vs. <0041 0308>)
    because the unconverted encoding would not be indicated anywhere in the
    text stream.
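
The duplicate-encoding argument above can be checked with a short Python sketch. The block base `PLACEHOLDER_BASE` (a Private Use code point, U+E000) and the helper `to_placeholder` are hypothetical, picked only for illustration; no such block exists in Unicode. The sketch also assumes the 128 placeholders would cover bytes 0x80 through 0xFF in order.

```python
import unicodedata

# Hypothetical: pretend the 128 placeholders start at U+E000 (a Private Use
# code point chosen only for illustration) and cover bytes 0x80..0xFF in order.
PLACEHOLDER_BASE = 0xE000


def to_placeholder(byte: int) -> str:
    """Map an unconverted high byte (0x80..0xFF) to its hypothetical placeholder."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF get a placeholder")
    return chr(PLACEHOLDER_BASE + (byte - 0x80))


precomposed = "\u00C4"        # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "\u0041\u0308"   # A followed by COMBINING DIAERESIS
blob = to_placeholder(0xC4)   # byte 0xC4 whose source encoding is unknown

# Canonical equivalence relates the precomposed and decomposed forms...
same_canonical = unicodedata.normalize("NFC", decomposed) == precomposed

# ...but normalization has nothing to say about the placeholder.
blob_unchanged = unicodedata.normalize("NFC", blob) == blob

# The same byte stands for different characters depending on the source
# encoding, and the placeholder does not record which one was meant.
readings = {
    enc: bytes([0xC4]).decode(enc) for enc in ("iso-8859-1", "iso-8859-7")
}
```

Here `readings` maps byte 0xC4 to Ä under ISO 8859-1 and to Δ under ISO 8859-7, so a placeholder alone cannot be declared equivalent to either character.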

    All Unix software that passes byte-sequence filenames to UTF-8
    subsystems will be subject to this issue. Customers will not single out
    your particular software for rejection because it fails to provide a
    non-standard workaround for it.
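
To make the filename issue concrete, here is a minimal Python sketch of what a UTF-8 subsystem sees when it is handed a byte-sequence filename written in a legacy encoding. The filename bytes are made up for the example.

```python
# A made-up filename as raw bytes: "Ärger.txt" written in ISO 8859-1.
raw_name = b"\xc4rger.txt"

# A strict UTF-8 decoder rejects it: 0xC4 opens a two-byte sequence,
# but the next byte 0x72 ('r') is not a continuation byte.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Substituting U+FFFD accepts the name but discards the original byte,
# so the conversion cannot be reversed.
lossy = raw_name.decode("utf-8", errors="replace")
round_trip = lossy.encode("utf-8")
```

Neither outcome is specific to one program: any software that applies a conforming UTF-8 decoder to such bytes has to choose between rejecting the name and losing data.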

    I will continue to oppose the encoding of 128 binary placeholders, even
    as I support the allocation of thousands of code points for "real"
    characters.

    -Doug Ewell
     Fullerton, California
