RE: Where is the First> Last> convention documented?

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Sep 12 2007 - 15:55:48 CDT

    Philippe Verdy wrote:

    > Kenneth Whistler wrote:
    > > Note, however, as regards names in particular, that some
    > > Unicode characters (e.g., noncharacters, private-use characters) don't
    > > have character names, ...)
    >
    > I won't discuss the case of CJK and Hangul ranges, because they do have
    > complete properties including standard names.
    > But I still don't understand why the assigned controls and PUAs don't have
    > at least one default character name, at least computed algorithmically (like
    > Hangul and CJK ideographs).
    >
    > For the stability of applications using these characters, it seems that
    > these controls and PUAs should still have a standard name (maybe this name
    > is "U+xxx"...)

    That is the "short identifier" (ISO/IEC 10646, Clause 6.5), not
    the "standard name".

    And short identifiers don't follow the name syntax restrictions,
    because they allow one character, "+", that is not allowed in character
    names.

    > to avoid any possible future conflicts with other characters
    > that will get their own standard names,

    How can there be a future conflict between a character that
    has no name (noncharacter, private-use character) and
    a character that gets a name in the future?

    > if the application needs to define a
    > name property for these characters instead of returning a non-unique empty
    > name or raising an exception (as if the characters were unassigned).

    Bad programming assumptions lead to bad program behavior. The
    fix for this is the test:

    if (name==NULL)
    {
      // do something interesting, instead of terminating with access fault
    }
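
A slightly fuller sketch of that test, assuming a hypothetical lookup_name() that returns NULL for code points without a character name. The two-entry table is only a stand-in for real character data, and both function names are illustrative, not part of any library. When there is no name, the caller falls back to the U+XXXX short identifier for display instead of dereferencing a null pointer:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical lookup: a toy table standing in for real character data.
   Returns NULL when the code point has no character name (here, anything
   not in the table, which includes controls and private-use characters). */
const char *lookup_name(unsigned long cp)
{
    switch (cp) {
    case 0x0041: return "LATIN CAPITAL LETTER A";
    case 0x00E9: return "LATIN SMALL LETTER E WITH ACUTE";
    default:     return NULL;
    }
}

/* Writes a display label for cp into buf and returns buf.
   Uses the character name if there is one, otherwise the short
   identifier in U+XXXX form (at least four hex digits). */
const char *display_label(unsigned long cp, char *buf, size_t size)
{
    const char *name = lookup_name(cp);
    if (name == NULL)
        snprintf(buf, size, "U+%04lX", cp);   /* label only, not a name */
    else
        snprintf(buf, size, "%s", name);
    return buf;
}
```

With this, display_label(0x001B, buf, sizeof buf) produces "U+001B", a label for display rather than a value for the name property.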

    > The most obvious missing names that we frequently encounter in texts encoded
    > with valid UTF are with controls.

    And that is a problem because... ?

    > Why Unicode still does not endorse the existing ISO 646 and ISO 8859 names
    > for these C0 and C1 controls?

    Have you read ISO 646 or ISO 8859-1 (or any other part) recently?

    They do not contain any character names for C0 or C1 controls. They
    define characters (with names) for the G0 and G1 sets, 0x20..0x7E
    and 0xA0..0xFF (in the case of ISO 646 just G0). ISO 8859-1 depends
    on ISO 2022 and ISO 4873 (normatively) for its use of control
    codes, and the control functions are defined elsewhere by other
    standards.
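
As a rough illustration of that layout, here is a minimal classifier for an 8-bit code. The G0 and G1 ranges are the ones given above. The C0 (0x00..0x1F) and C1 (0x80..0x9F) ranges are the conventional control areas and are an addition for this sketch, as are the enum and function names:

```c
/* Areas of an 8-bit code table. AREA_OTHER covers 0x7F (DELETE),
   which falls outside the four ranges used here. */
enum code_area { AREA_C0, AREA_G0, AREA_C1, AREA_G1, AREA_OTHER };

/* Classifies a byte by range only. It says nothing about which control
   function a C0 or C1 position carries; that is defined by other standards. */
enum code_area classify_byte(unsigned char b)
{
    if (b <= 0x1F)              return AREA_C0;  /* control area, no names in ISO 8859 */
    if (b >= 0x20 && b <= 0x7E) return AREA_G0;  /* named graphic characters */
    if (b >= 0x80 && b <= 0x9F) return AREA_C1;  /* control area, no names in ISO 8859 */
    if (b >= 0xA0)              return AREA_G1;  /* named graphic characters */
    return AREA_OTHER;                           /* 0x7F */
}
```

For example, classify_byte(0x1B) yields AREA_C0 and classify_byte(0xE9) yields AREA_G1.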

    In short, there is no such thing as "ISO 8859 names for ... C0 and
    C1 controls."

    > Why would it be a problem to assign such name

    Well, one problem might be that they don't exist.

    But I'll cut you some slack. Presumably you have in mind
    ISO 6429:1992 names. But even ISO 6429 doesn't have names
    for *all* C1 controls. And ISO 6429 simply specifies one
    widely-used definition of C0 and C1 controls -- it isn't
    their exclusive definition.

    > (a name is just a name, not a description of its semantic or intended use in
    > applications).

    In which case, why go down the road of specifying a name,
    when not all applications in fact use the same control
    function definitions for C0 and C1 controls? Where does
    that lead except into trouble and confusion?

    > So:
    > * instead of having just "<control>" for U+001B, why not having "<control>
    > ESC" for the ASCII escape character

    "<control>" is a metalabel used in the generation of
    code charts, just like "<reserved>" and "<not a character>"
    are. None of those are character names; they violate
    both the uniqueness requirement for character names and
    the syntax for character names -- intentionally.

    > * instead of having just "<private use>" for U+E000, why not having
    > "<private use> E000" computed algorithmically for the standard name?

    1. Because it isn't necessary.

    2. Because it violates character name syntax.

    > As an alternative, you could say that some applications could generate the
    > comment field or use it algorithmically, so that the strict compatibility
    > will be preserved for the existing name field. This would give the extended
    > names (respectively for the examples above):
    > * "<control> #ESC"
    > * "<private use> #E000"

    And "#" isn't allowed in character names, either.
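
The syntax objections above can be checked mechanically. The sketch below tests only the character repertoire of a proposed name: uppercase Latin letters A-Z, digits 0-9, SPACE, and HYPHEN-MINUS. It assumes an ASCII-compatible execution character set, the function name is illustrative, and it deliberately ignores the other naming rules (placement of spaces and hyphens, uniqueness), so passing it is necessary but not sufficient:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if s is non-empty and every character is one of
   A-Z, 0-9, SPACE, or HYPHEN-MINUS. Repertoire check only. */
bool uses_only_name_characters(const char *s)
{
    if (s == NULL || *s == '\0')
        return false;
    for (; *s != '\0'; s++) {
        char c = *s;
        bool ok = (c >= 'A' && c <= 'Z') ||
                  (c >= '0' && c <= '9') ||
                  c == ' ' || c == '-';
        if (!ok)
            return false;   /* e.g. '+', '#', '<', '>', lowercase letters */
    }
    return true;
}
```

"LATIN CAPITAL LETTER A" passes. "U+E000", "<control> ESC", and "<private use> #E000" all fail, on "+", "<", and "<" respectively.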

    > I don't see which other standard it will break.

    Well, the Unicode Standard and ISO/IEC 10646 for starters.
    See the character name specifications for both.

    --Ken


