Re: Conformance (Was: 32'nd bit & UTF-8)

From: Hans Aberg (haberg@math.su.se)
Date: Fri Jan 21 2005 - 12:47:21 CST


    [Off the list.]

    On 2005/01/21 17:49, Richard T. Gillam at rgillam@las-inc.com wrote:

    >> So deprecating it seems to be a distinct possibility.
    >
    > I really wish you'd quit saying this. This simply isn't true. Or, at
    > the very least, is EXTREMELY unlikely and very far into the future. As
    > several other people have already pointed out to you, the Unicode
    > codespace contains room for 1.1 million characters.

    This is in fact not the problem, but what would "deprecate" mean in the
    case of a character standard? ASCII, the ISO Latin encodings, etc. will
    never become deprecated, even though they may eventually become obsolete.
    For the word "deprecate" to make sense, there must be a notion of "Unicode
    conformance". If, say, a protocol required that UTF-8, UTF-16, and UTF-32
    all be supported, then UTF-16 could be deprecated in that protocol.

    As for the issue of filling up the code points, just wait and see. If they
    were to be exhausted quickly, that would perhaps take machine-generated
    encodings, and Unicode perhaps need not support such specialty use.
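
    For reference, the "1.1 million" figure quoted above can be checked with
    a little arithmetic. This is only a sketch in Python; the variable names
    are illustrative and not taken from the standard.

```python
# The Unicode codespace is 17 planes of 0x10000 (65,536) code points each.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000
MAX_CODE_POINT = 0x10FFFF

codespace_size = PLANES * CODE_POINTS_PER_PLANE

# The surrogate range U+D800..U+DFFF is reserved for UTF-16 and can never
# be assigned to characters, so it is subtracted here.
SURROGATE_COUNT = 0xE000 - 0xD800
scalar_value_count = codespace_size - SURROGATE_COUNT

# The last code point is one below the codespace size.
assert codespace_size == MAX_CODE_POINT + 1

print(codespace_size)      # 1114112
print(scalar_value_count)  # 1112064
```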

    > Again, many people have addressed this point and you're ignoring them.
    > UTF-8 HAS NO BOM. There is nothing in the Unicode standard mandating or
    > even encouraging the use of EF BB BF at the beginning of a UTF-8 file.
    > That sequence has no special meaning in UTF-8; it's just a zero-width
    > non-breaking space. FE FF at the top of a UTF-8 file is just flat
    > illegal.

    We know that. See the other post in this new thread. The formulation in the
    Unicode standard is vacuous and confusing, prone to ambiguous
    interpretation, and needs to be changed. There is no need to mention the
    BOM at all, except as a curiosity note that programs and some protocols
    may decide to give it special treatment. In that respect it is no different
    from other character sequence markers for shell scripts, PS, etc. Unicode is
    just a character encoding, and just provides the characters for use.
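
    To make the byte sequences under discussion concrete, here is a small
    Python sketch. It assumes Python 3's built-in "utf-8" and "utf-8-sig"
    codecs; the helper is_valid_utf8 is a hypothetical name introduced only
    for this illustration. The "utf-8-sig" codec is one example of a program
    choosing to give the sequence special treatment.

```python
import codecs

# U+FEFF (ZERO WIDTH NO-BREAK SPACE) encoded as UTF-8 is EF BB BF.
utf8_signature = "\ufeff".encode("utf-8")
assert utf8_signature == b"\xef\xbb\xbf" == codecs.BOM_UTF8


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# FE FF is the UTF-16 big-endian byte order mark; the bytes FE and FF
# never occur in well-formed UTF-8.
utf16_be_bom_is_utf8 = is_valid_utf8(b"\xfe\xff")

data = b"\xef\xbb\xbfhello"
# The plain codec treats U+FEFF as an ordinary character and keeps it.
kept = data.decode("utf-8")
# The "utf-8-sig" codec opts in to stripping a leading signature.
stripped = data.decode("utf-8-sig")

print(utf16_be_bom_is_utf8)  # False
print(len(kept), len(stripped))  # 6 5
```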

      Hans Aberg


