RE: Where is the First> Last> convention documented?

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Sep 13 2007 - 15:12:19 CDT

    Philippe,

    > Regarding my comment about missing names, I was not pretending that these
    > complemented names should be defined the same way as other assigned names.

    I didn't assume that you were *pretending* that to be the case;
    I observed that you were *asserting* it to be the case.
     
    > But references to characters by name is better than reference by codepoint
    > in many documents as it makes the reference clearer.

    Ah, now you change your tune. I have no quarrel with that claim. Certainly
    being able to refer to commonly used control codes by names such
    as "tab" and "carriage return" instead of hexadecimal U+0009 and
    U+000D makes the intent clearer to everyone -- even those of us
    who spend much of our day thinking in hexadecimal.

    But in your prior contribution, you were talking about alleged
    problems of stability of applications because of characters which
    currently have no normatively defined character name attribute.

    > Even Unicode needs to assign them names locally in many places to controls
    > to make things clearer (look at the documents and standard annexes about the
    > BiDi algorithm and line/word breaking.)

    Nobody is going to complain about that. But those aren't normative
    character name attribute values, but *aliases*. See TUS 5.0, p. 573,
    to see all the aliases for C0 control codes.

    > When I spoke about ISO 8859-1 and ISO 646, I spoke about their reference to
    > the C0 and C1 subsets. But also about their definition in IANA charsets that
    > DO include the C0 and C1 subsets, not just the G0 and G1 characters.
    > (there's a difference between "ISO-8859-1", the IANA charset made of "ISO
    > 8859-1" for G0 plus C0 controls, and "ISO 8859-1"; notice the addition of the
    > hyphen; the same is true between "ISO 646" and "ISO-646".)

    O.k., there is a difference between an ISO-defined coded
    character set such as ISO/IEC 8859-1:1998, which defines
    character identity by mapping to ISO/IEC 10646, and a charset
    registered in the IANA registry, which maps code points
    to characters, depending on external references to define
    what those characters are.

    When you refer to the "IANA charset" ISO-8859-1, you are referring to this
    entry in the IANA character set registry:

    Name: ISO_8859-1:1987 [RFC1345,KXS2]
    MIBenum: 4
    Source: ECMA registry
    Alias: iso-ir-100
    Alias: ISO_8859-1
    Alias: ISO-8859-1 (preferred MIME name)
    Alias: latin1
    Alias: l1
    Alias: IBM819
    Alias: CP819
    Alias: csISOLatin1

    And the *mapping* for that charset is defined by external reference
    to this masterful example of clarity in RFC 1345:

      &charset ISO_8859-1:1987
      &rem source: ECMA registry
      &alias iso-ir-100
      &g1esc x2d41 &g2esc x2e41 &g3esc x2f41
      &alias ISO_8859-1
      &alias ISO-8859-1
      &alias latin1
      &alias l1
      &alias IBM819
      &alias CP819
      &code 0
      NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
      DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
      SP ! " Nb DO % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
      At A B C D E F G H I J K L M N O P Q R S T U V W X Y Z <( // )> '> _
      '! a b c d e f g h i j k l m n o p q r s t u v w x y z (! !! !) '? DT
      PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
      DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
      NS !I Ct Pd Cu Ye BB SE ': Co -a << NO -- Rg '-
      DG +- 2S 3S '' My PI .M ', 1S -o >> 14 12 34 ?I
      A! A' A> A? A: AA AE C, E! E' E> E: I! I' I> I:
      D- N? O! O' O> O? O: *X O/ U! U' U> U: Y' TH ss
      a! a' a> a? a: aa ae c, e! e' e> e: i! i' i> i:
      d- n? o! o' o> o? o: -: o/ u! u' u> u: y' th y:
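
    Read mechanically, the "&code 0" directive sets the starting code
    position and each mnemonic that follows takes the next position in
    turn. A minimal Python sketch of that reading, using only the first
    two rows of the table above; parse_code_block is a hypothetical
    helper, and treating the "&code" argument as a decimal integer is an
    assumption:

```python
# First two data rows of the RFC 1345 entry quoted above (C0 range only).
RFC1345_FRAGMENT = """\
&code 0
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
"""


def parse_code_block(text):
    """Map code positions to mnemonics (hypothetical helper, not from the RFC)."""
    table = {}
    code = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "&code":
            # Assumption: the argument is a decimal starting position.
            code = int(tokens[1])
            continue
        if tokens[0].startswith("&") and len(tokens[0]) > 1:
            # Other directives (&alias, &rem, ...) carry no mnemonics.
            continue
        if code is None:
            continue
        for mnemonic in tokens:
            table[code] = mnemonic
            code += 1
    return table


table = parse_code_block(RFC1345_FRAGMENT)
print(f"0x09 -> {table[0x09]}, 0x0D -> {table[0x0D]}")
```

    The result is a table of two-letter mnemonics keyed by position,
    which is all the charset entry itself supplies.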

    Now nobody would dispute that that mapping specifies that
    ISO-8859-1 maps 0x00..0x1F, 0x7F to C0 control codes
    and 0x80..0x9F to C1 control codes. And in fact that
    is how everybody implements the mapping of 8859-1,
    because to do otherwise would be silly and non-interoperable.
    See also the mapping table on the Unicode website for
    the latest published version of 8859-1:

    http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
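
    As a quick check of that behavior, the sketch below assumes that
    Python's built-in "latin-1" codec is a fair stand-in for how
    implementations treat ISO-8859-1. It decodes all 256 byte values and
    sorts the resulting control characters (General_Category Cc) into
    the C0 and C1 ranges:

```python
import unicodedata

# Decode every possible byte value with Python's latin-1 codec.
data = bytes(range(256))
text = data.decode("latin-1")

# Each byte should map to the code point with the same numeric value.
identity = all(ord(ch) == b for ch, b in zip(text, data))

# Collect the byte values whose decoded character is a control code (Cc).
controls = [b for b in data if unicodedata.category(chr(b)) == "Cc"]
c0 = [b for b in controls if b < 0x80]   # 0x00..0x1F plus 0x7F
c1 = [b for b in controls if b >= 0x80]  # 0x80..0x9F

print(identity, len(c0), len(c1))
```

    The codec maps bytes to code points; it says nothing about what
    those control codes are called.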

    But that doesn't mean that 8859-1 (neither the charset
    ISO-8859-1, nor the standard itself, ISO/IEC 8859-1:1998)
    *defines* names for the C0 or C1 control codes.

    In fact, if you read on in RFC 1345 to the mappings
    provided for other charsets, those same mapping
    lines for C0 control codes are simply copied over and
    over again. You'd be just as accurate in claiming that
    we should be using the "DEC MCS" control code names as
    the "8859-1" control code names, since the charset for
    DEC VAX/VMS in RFC 1345 also includes the mapping lines:

      NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
      DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US

    So when you ask:

    "Why Unicode still does not endorse the existing ISO 646 and ISO 8859 names
    for these C0 and C1 controls?"

    and justify the implied claim for these nonexistent names by
    reference to the IANA charset registry and ISO-8859-1,
    the only meaningful interpretation for that is that you
    must be advocating that the Unicode Consortium endorse
    the names for control codes given in the Character Mnemonic
    Table (Section 3) of RFC 1345.

    The problem with that is two-fold:

    First, there are now 7 discrepancies in detail between the
    names given in Keld's mnemonic table and the latest version
    of ISO 6429. (0x0010, 0x001D..0x001F, 0x0084, 0x008E..0x008F)

    Second, and more serious, the mnemonic table contains utterly
    bogus names for 3 control codes:

    PA 0080 PADDING CHARACTER (PAD)
    HO 0081 HIGH OCTET PRESET (HOP)
    GC 0099 SINGLE GRAPHIC CHARACTER INTRODUCER (SGCI)

    Those were proposals from early, early drafts of ISO/IEC 10646,
    and were dropped completely, well before the publication of
    ISO/IEC 10646-1:1993. But RFC 1345 was never updated to even
    annotate that, let alone remove the offending fantasy definitions.

    So what exact set of names is the Unicode Consortium supposed
    to "endorse" then? I think (most) everyone understands that at
    this point RFC 1345 is a 15-year-old archaeological relic, and
    not something to be depended on for character names.

    > Even if there are non agreed names across several references about names
    > assigned to C0 and C1 controls, at least one name should be specified
    > consistently for use in Unicode/ISO 10646 contexts.

    At least one *alias* is specified consistently for C0 and C1
    controls (except 0x0080, 0x0081, 0x0084, 0x0099). See p. 573
    and p. 578 of TUS 5.0. Where is the problem?

    > When I spoke about possible conflicts, it's because applications frequently
    > need to display names for controls. These names will preferably be those
    > assigned by Unicode and ISO 10646 when they exist, but if they are missing,
    > the names will be inferred in some way, using the historic "na1" property,
    > if available or some other legacy conventions, causing possible confusion if
    > there's no agreed convention.

    The Unicode Consortium and WG2 have no interest in legislating
    disputes between applications that refer to U+0009 as
    "CHARACTER TABULATION" (the current ISO 6429 name), "HORIZONTAL
    TABULATION" (the old ISO 6429 name), "HT", or just "tab".
    And I still don't see where the problem is here.
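
    For an application that simply needs something to display, a
    fallback chain is enough. The sketch below is illustrative only: it
    assumes Python's unicodedata module, whose name() reports no name
    for control codes, and a hypothetical application-level alias table
    (CONTROL_ALIASES) seeded with the U+0009 labels mentioned above.
    Which label comes first is the application's own choice, not
    anything normative.

```python
import unicodedata

# Hypothetical, application-chosen labels; the order is a local preference.
CONTROL_ALIASES = {
    0x0009: ["CHARACTER TABULATION", "HORIZONTAL TABULATION", "HT", "tab"],
}


def display_name(ch):
    """Return a label for ch: character name, else local alias, else U+XXXX."""
    name = unicodedata.name(ch, None)  # None for characters without a name
    if name is not None:
        return name
    aliases = CONTROL_ALIASES.get(ord(ch))
    if aliases:
        return aliases[0]
    return f"U+{ord(ch):04X}"


for ch in ("A", "\t", "\x80"):
    print(display_name(ch))
```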

    > Note that I know that not all C1 controls have names, but the names are
    > appearing in IBM references about EBCDIC, from where these controls were
    > inherited and remapped into C1 controls.

    Some were, and some were not. The IBM EBCDIC control functions
    differ significantly from the ISO 6429 C0 and C1 controls.
    See, for example:

    http://www.barrcentral.com/help/spool/B_ASCII_and_EBCDIC_Standards.htm

    which notes such EBCDIC control functions as "BYP Bypass",
    "TRN Transparent", "WUS Word Underscore", "EO Eight Ones",
    and so on -- none of which you are going to find in ISO 6429.

    > The names are used in transcoding
    > tables (that have existed since long before Unicode/ISO 10646).

    *Which* names? In which transcoding tables? I'm guessing again
    you are referring to the above-named RFC 1345.

    > I don't see why not assigning a name (possibly through a separate property)
    > for these controls would be a problem for Unicode and ISO 10646 stability.

    Ah, well, they have aliases. Where is the problem?

    > But it's clear that these names do exist in many other references, notably
    > within many RFCs and protocol specifications. You just need to choose a name
    > that matches the most common usage (even if there are other inconsistent
    > assignments in other references, which may be deprecated or never meant to
    > be normative).

    And why should the Unicode Consortium be in the business of
    providing normative names for control codes, in an
    area where it claims no jurisdiction beyond the common text format
    controls listed in Table 16-1, TUS 5.0, p. 533?

    --Ken
     


