RE: Where is the First> Last> convention documented?

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Sep 13 2007 - 18:17:47 CDT

    > I have not changed my tune nor even my intimate intuition if what I said was
    > not clear and could be interpreted differently.

    Whatever.

    > The need for stable names for C0 and C1 controls remains, and when I speak
    > about stability, it's not within the Unicode standard itself (because such
    > names are still not present), but within applications or documents needing
    > names to reference them in a more clear way than just U+00xx (which is not
    > ambiguous but not clear enough, for readers, given that even Unicode needs
    > to define "aliases" to reference them in many places in its annexes).

    I get it that you think there should be a standard list of names for
    all the C0 and C1 control codes.

    What you seem to be missing still is that the purpose of the Unicode
    control character codes, U+0000..U+001F, U+007F..U+009F is for
    interoperable mapping to the ISO-2022 framework C0 and C1 control
    codes. And in that context, any particular control code does not
    have a fixed control function or name. Usage differs by application.

    For example, Marc-8 (http://www.loc.gov/marc/specifications/speccharmarc8.html)
    makes use of non ISO-6429 C1 control function assignments, namely:

    0x88 non-sorting character(s) begin
    0x89 non-sorting character(s) end
    0x8D joiner
    0x8E nonjoiner

    Now in what way would discussing mappings and interoperating with
    Marc-8 and Unicode be clarified by referring to, for instance,

    U+0088 CHARACTER TABULATION SET
    U+0089 CHARACTER TABULATION WITH JUSTIFICATION
    U+008D REVERSE LINE FEED
    U+008E SINGLE SHIFT TWO

    ?
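
    To make the mismatch concrete, here is a minimal Python sketch that puts
    the two lists above side by side. The table names and the describe()
    helper are illustrative assumptions, not part of any standard; the values
    are copied from this message.

```python
# Illustrative only: MARC8_C1, ISO6429_C1 and describe() are hypothetical names.
# The values are copied from the two lists in this message.
MARC8_C1 = {
    0x88: "non-sorting character(s) begin",
    0x89: "non-sorting character(s) end",
    0x8D: "joiner",
    0x8E: "nonjoiner",
}

ISO6429_C1 = {
    0x88: "CHARACTER TABULATION SET",
    0x89: "CHARACTER TABULATION WITH JUSTIFICATION",
    0x8D: "REVERSE LINE FEED",
    0x8E: "SINGLE SHIFT TWO",
}


def describe(code):
    """Return (code point label, MARC-8 function, ISO 6429 function)."""
    return (f"U+{code:04X}", MARC8_C1.get(code), ISO6429_C1.get(code))


# Same code positions, two unrelated control functions.
for code in sorted(MARC8_C1):
    label, marc, iso = describe(code)
    print(f"{label}: MARC-8 = {marc!r}; ISO 6429 = {iso!r}")
```

    Neither table is "the" meaning of the code point; which one applies
    depends on the application doing the mapping.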

    > So your attempt to say that the proposed names using "<>" or "#" within names
    > were non conforming are not relevant. What applications need are stable names
    > even if those names come from another character property which does not
    > respect the current rules for existing standard character names. After all,
    > Unicode references the "na1" property

    Exegesis for those not completely steeped in the arcana of the UCD...
    There are two "name" properties carried in UnicodeData.txt in
    the UCD:

    # ================================================
    # Miscellaneous Properties
    # ================================================
    ...
    na ; Name
    na1 ; Unicode_1_Name

    The "Name" property is the normative, immutable character name
    property I have been talking about. "Unicode_1_Name" is
    an informative property that is neither complete nor
    completely consistent, as it has been put to use in part
    just to produce ISO 6429 aliases for C0 and C1 control codes,
    for printing in the charts.
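
    As a rough illustration of where the two values live, the sketch below
    splits one semicolon-separated UnicodeData.txt record. The sample line
    is my recollection of the U+0009 entry and should be checked against the
    actual file; the name_fields() helper is hypothetical. It assumes the
    usual layout in which field 1 carries the Name (or a bracketed label
    such as <control>) and field 10 carries Unicode_1_Name.

```python
# Assumed sample record for U+0009; verify against a real UnicodeData.txt.
SAMPLE_LINE = "0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;"


def name_fields(line):
    """Return (code point, Name or None, Unicode_1_Name or None) for one record."""
    fields = line.split(";")
    code_point = int(fields[0], 16)
    name_field = fields[1]
    unicode_1_name = fields[10]
    # A bracketed entry such as "<control>" is a label, not a character name.
    name = None if name_field.startswith("<") else name_field
    return code_point, name, unicode_1_name or None


code_point, name, na1 = name_fields(SAMPLE_LINE)
print(f"U+{code_point:04X} na={name!r} na1={na1!r}")
```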

    > (see the XML proposed format for the
    > UCD), and could as well have another property if it does not want to change
    > the value of existing properties. And we have lots of other properties for
    > CJK ideographs.

    Yes, it is always possible to add more properties, including more
    informative name attributes, but you would have to convince the
    UTC of the cost/benefit tradeoff in doing so. Note that the
    printed Unicode Standard (and the machine-readable NamesList.txt)
    is full of informative aliases for characters, but other than
    the few normative formal aliases, nobody has seen sufficient requirement
    to turn these into formal values of character properties.

    Furthermore, if the concern is stability of applications, having
    *another* name property isn't going to help at all. It isn't
    going to change APIs that return *the* character name for
    a Unicode character -- i.e. the normative character name.
    All it does is introduce another bunch of names in an informative
    list for people to get confused about, frankly.
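
    Python's standard unicodedata module is one example of such an API, and
    a short sketch shows the behavior in question: unicodedata.name()
    returns the normative name and has nothing to return for a control
    code. The normative_name() wrapper and its fallback string are
    assumptions made for illustration.

```python
import unicodedata


def normative_name(ch, fallback="<no name>"):
    """Return the character name, or the fallback when none is defined."""
    # unicodedata.name() raises ValueError without a default argument,
    # so a default is passed to keep the sketch simple.
    return unicodedata.name(ch, fallback)


for ch in ("A", "\u0009", "\u0088"):
    print(f"U+{ord(ch):04X} gc={unicodedata.category(ch)} name={normative_name(ch)}")
```

    Adding an informative name property to the UCD would not by itself
    change what a call like this returns.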

    > Most commonly used names are those based on 2/3 character abbreviations, so
    > these "aliases" are still the best: "NUL, ..., TAB, LF, VT, FF, CR, ... DC1,
    > ..., CSI, ...".
    >
    > I won't take the 2-characters Keld's mnemonic as they are broken even if
    > they remain in old charset definition RFCs:

    Ah, *some*thing we can agree on!

    > But at least, these names would simplify the writing of new specifications,
    > or could help disambiguate some old RFCs by making them more precise if some
    > normative reference was simply available to specify this without long lists
    > of local definitions in each document needing them (including in the Unicode
    > standard annexes where these names are needed and redefined locally).

    If you think a standard list of string labels for
    C0/C1 control codes (either short ones like "TAB" and "LF" or
    long ones like "CHARACTER TABULATION" and "LINE FEED") is
    required, then by all means, write an RFC specifying your list.

    I just don't think the UTC has any interest in going there
    for the Unicode Standard, since it is already on record as
    having specified the current scheme as printed in the standard --
    namely, control codes have no character name, but are printed
    with an (informative) alias to the ISO 6429 control function
    name (if one exists).

    --Ken


