RE: Where is the First> Last> convention documented?

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Sep 13 2007 - 15:12:19 CDT

Next message: Philippe Verdy: "RE: Where is the First> Last> convention documented?"

Previous message: Asmus Freytag: "Re: Where is the First> Last> convention documented?"
Maybe in reply to: Stephane Bortzmeyer: "Where is the First> Last> convention documented?"
Next in thread: Philippe Verdy: "RE: Where is the First> Last> convention documented?"
Reply: Philippe Verdy: "RE: Where is the First> Last> convention documented?"

Philippe,

> Regarding my comment about missing names, I was not pretending that these
> complemented names should be defined the same way as other assigned names.

I didn't assume that you were *pretending* that to be the case;
I observed that you were *asserting* it to be the case.

> But references to characters by name is better than reference by codepoint
> in many documents as it makes the reference clearer.

Ah, now you change your tune. I have no quarrel with that claim. Certainly
being able to refer to common use control codes by names such
as "tab" and "carriage return" instead of hexadecimal U+0009 and
U+000D makes the intent clearer to everyone -- even those of us
who spend much of our day thinking in hexadecimal.

But in your prior contribution, you were talking about alleged
problems of stability of applications because of characters which
currently have no normatively defined character name attribute.

> Even Unicode needs to assign them names locally in many places to controls
> to make things clearer (look at the documents and standard annexes about the
> BiDi algorithm and line/word breaking.)

Nobody is going to complain about that. But those aren't normative
character name attribute values, but *aliases*. See TUS 5.0, p. 573,
to see all the aliases for C0 control codes.

> Why I spoke about ISO 8859-1 and ISO 646 I spoke about their reference to
> the C0 and C1 subsets. But also about their definition in IANA charsets that
> DO include the C0 and C1 subsets, not just the G0 and G1 characters.
> (there's a difference between "ISO-8859-1", the IANA charset made of "ISO
> 8859-1 for G0 plus C0 controls, and "ISO 8859-1"; notice the addition of the
> hyphen; the same is true between "ISO 646" and "ISO-646".)

O.k., there is a difference between an ISO-defined coded
character set such as ISO/IEC 8859-1:1998, which defines
character identity by mapping to ISO/IEC 10646, and a charset
registered in the IANA registry, which maps code points
to characters, depending on external references to define
what those characters are.

When you refer to the "IANA charset" ISO-8859-1, you are referring to this
entry in the IANA character set registry:

Name: ISO_8859-1:1987 [RFC1345,KXS2]
MIBenum: 4
Source: ECMA registry
Alias: iso-ir-100
Alias: ISO_8859-1
Alias: ISO-8859-1 (preferred MIME name)
Alias: latin1
Alias: l1
Alias: IBM819
Alias: CP819
Alias: csISOLatin1

And the *mapping* for that charset is defined by external reference
to this masterful example of clarity in RFC 1345:

  &charset ISO_8859-1:1987
  &rem source: ECMA registry
  &alias iso-ir-100
  &g1esc x2d41 &g2esc x2e41 &g3esc x2f41
  &alias ISO_8859-1
  &alias ISO-8859-1
  &alias latin1
  &alias l1
  &alias IBM819
  &alias CP819
  &code 0
  NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
  DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
  SP ! " Nb DO % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
  At A B C D E F G H I J K L M N O P Q R S T U V W X Y Z <( // )> '> _
  '! a b c d e f g h i j k l m n o p q r s t u v w x y z (! !! !) '? DT
  PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
  DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
  NS !I Ct Pd Cu Ye BB SE ': Co -a << NO -- Rg '-
  DG +- 2S 3S '' My PI .M ', 1S -o >> 14 12 34 ?I
  A! A' A> A? A: AA AE C, E! E' E> E: I! I' I> I:
  D- N? O! O' O> O? O: *X O/ U! U' U> U: Y' TH ss
  a! a' a> a? a: aa ae c, e! e' e> e: i! i' i> i:
  d- n? o! o' o> o? o: -: o/ u! u' u> u: y' th y:
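
As a rough illustration of how such a table is read: the mnemonics after "&code 0" are assigned to consecutive byte values starting at 0. The sketch below assumes plain whitespace splitting is enough for this particular table, and it ignores every other RFC 1345 directive. The names RFC1345_LATIN1 and parse_code_table are made up for the example and do not come from the RFC.

```python
# The "&code 0" table from the ISO_8859-1:1987 entry quoted above.
RFC1345_LATIN1 = r"""
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
SP ! " Nb DO % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
At A B C D E F G H I J K L M N O P Q R S T U V W X Y Z <( // )> '> _
'! a b c d e f g h i j k l m n o p q r s t u v w x y z (! !! !) '? DT
PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
NS !I Ct Pd Cu Ye BB SE ': Co -a << NO -- Rg '-
DG +- 2S 3S '' My PI .M ', 1S -o >> 14 12 34 ?I
A! A' A> A? A: AA AE C, E! E' E> E: I! I' I> I:
D- N? O! O' O> O? O: *X O/ U! U' U> U: Y' TH ss
a! a' a> a? a: aa ae c, e! e' e> e: i! i' i> i:
d- n? o! o' o> o? o: -: o/ u! u' u> u: y' th y:
"""


def parse_code_table(text, start=0):
    """Assign consecutive byte values, beginning at `start`, to each mnemonic."""
    return {start + i: token for i, token in enumerate(text.split())}


table = parse_code_table(RFC1345_LATIN1)

# Mnemonics that land on the C0 (0x00..0x1F) and C1 (0x80..0x9F) ranges.
c0 = [table[b] for b in range(0x00, 0x20)]
c1 = [table[b] for b in range(0x80, 0xA0)]

print(len(table), table[0x09], table[0x0D], table[0x80], table[0x99])
# prints: 256 HT CR PA GC
```

Note that the table yields only two-character mnemonics for each byte value; any longer name has to come from the separate mnemonic table in the RFC.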

Now nobody would dispute that that mapping specifies that
ISO-8859-1 maps 0x00..0x1F, 0x7F to C0 control codes
and 0x80..0x9F to C1 control codes. And in fact that
is how everybody implements the mapping of 8859-1,
because to do otherwise would be silly and non-interoperable.
See also the mapping table on the Unicode website for
the latest published version of 8859-1:

http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

But that doesn't mean that 8859-1 (neither the charset
ISO-8859-1, nor the standard itself, ISO/IEC 8859-1:1998)
*defines* names for the C0 or C1 control codes.
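
The difference between mapping a control code and naming it can be seen in a current Python interpreter. This is only a sketch: it assumes Python's "latin-1" codec as a stand-in for the ISO-8859-1 charset, and the standard unicodedata module as a view of whatever version of the Unicode Character Database ships with that interpreter. Neither is mentioned in this thread.

```python
import unicodedata

# Decode every possible byte value as ISO-8859-1 (Python's "latin-1" codec).
decoded = bytes(range(256)).decode("latin-1")

# Each byte 0xNN comes out as code point U+00NN, control codes included.
identity = all(ord(ch) == i for i, ch in enumerate(decoded))

C0 = list(range(0x00, 0x20)) + [0x7F]   # 0x00..0x1F plus 0x7F
C1 = list(range(0x80, 0xA0))            # 0x80..0x9F

# Every one of these code points has General_Category Cc (control) ...
all_controls = all(unicodedata.category(chr(cp)) == "Cc" for cp in C0 + C1)

# ... and none has a character name: name() returns the supplied default.
unnamed = [cp for cp in C0 + C1 if unicodedata.name(chr(cp), None) is None]

print(identity, all_controls, len(unnamed))
# prints: True True 65
```

So the mapping is well defined for all 65 control positions, yet asking for a character name at any of them returns nothing.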

In fact, if you read on in RFC 1345 for other charset
mappings provided for other charsets, those same mapping
lines for C0 control codes are simply copied over and
over again. You'd be just as accurate in claiming that
we should be using the "DEC MCS" control code names as
the "8859-1" control code names, since the charset for
DEC VAX/VMS in RFC 1345 also includes the mapping lines:

NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US

So when you ask:

"Why Unicode still does not endorse the existing ISO 646 and ISO 8859 names
for these C0 and C1 controls?"

and justify the implied claim for these nonexistent names by
reference to the IANA charset registry and ISO-8859-1,
the only meaningful interpretation for that is that you
must be advocating that the Unicode Consortium endorse
the names for control codes given in the Character Mnemonic
Table (Section 3) of RFC 1345.

The problem with that is two-fold:

First, there are now 7 discrepancies in detail between the
names given in Keld's mnemonic table and the latest version
of ISO 6429. (0x0010, 0x001D..0x001F, 0x0084, 0x008E..0x008F)

Second, and more serious, the mnemonic table contains utterly
bogus names for 3 control codes:

PA 0080 PADDING CHARACTER (PAD)
HO 0081 HIGH OCTET PRESET (HOP)
GC 0099 SINGLE GRAPHIC CHARACTER INTRODUCER (SGCI)

Those were proposals from early, early drafts of ISO/IEC 10646,
and were dropped completely, well before the publication of
ISO/IEC 10646-1:1993. But RFC 1345 was never updated to even
annotate that, let alone remove the offending fantasy definitions.

So what exact set of names is the Unicode Consortium supposed
to "endorse" then? I think (most) everyone understands that at
this point RFC 1345 is a 15-year-old archaeological relic, and
not something to be depended on for character names.

> Even if there are non agreed names across several references about names
> assigned to C0 and C1 controls, at least one name should be specified
> consistently for use in Unicode/ISO 10646 contexts.

At least one *alias* is specified consistently for C0 and C1
controls (except 0x0080, 0x0081, 0x0084, 0x0099). See p. 573
and p. 578 of TUS 5.0. Where is the problem?
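
The code point lists quoted in this message can be cross-checked with a few lines of set arithmetic. This is a sketch that uses only the values stated above; the helper name span and the variable names are invented for the example.

```python
def span(first, last):
    """Inclusive range of code points as a set."""
    return set(range(first, last + 1))


c0 = span(0x00, 0x1F) | {0x7F}
c1 = span(0x80, 0x9F)

# Positions where the RFC 1345 mnemonic table and ISO 6429 are said to differ.
discrepancies = {0x10} | span(0x1D, 0x1F) | {0x84} | span(0x8E, 0x8F)

# Positions given names that never made it into ISO/IEC 10646.
bogus = {0x80, 0x81, 0x99}

# Positions listed above as the exceptions that have no alias.
no_alias = {0x80, 0x81, 0x84, 0x99}

print(len(discrepancies))                          # 7
print(bogus < no_alias)                            # True
print(sorted(hex(cp) for cp in no_alias - bogus))  # ['0x84']
print(len((c0 | c1) - no_alias))                   # 61
```

In other words, the four exceptions are the three never-standardized positions plus 0x0084, which also appears in the list of discrepancies, and the remaining 61 of the 65 control codes each have at least one alias.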

> When I spoke about possible conflicts, it's because applications frequently
> need to display names for controls. These names will preferably be those
> assigned by Unicode and ISO 10646 when they exist, but if they are missing,
> the names will be inferred in some way, using the historic "na1" property,
> if available or some other legacy conventions, causing possible confusion if
> there's no agreed convention.

The Unicode Consortium and WG2 have no interest in legislating
disputes between applications that refer to U+0009 as
"CHARACTER TABULATION" (the current ISO 6429 name), "HORIZONTAL
TABULATION" (the old ISO 6429 name), "HT", or just "tab".
And I still don't see where the problem is here.

> Note that I know that not all C1 controls have names, but the names are
> appearing in IBM references about EBCDIC, from where these controls were
> inherited and remapped into C1 controls.

Some were, and some were not. The IBM EBCDIC control functions
differ significantly from the ISO 6429 C0 and C1 controls.
See, for example:

http://www.barrcentral.com/help/spool/B_ASCII_and_EBCDIC_Standards.htm

which notes such EBCDIC control functions as "BYP Bypass",
"TRN Transparent", "WUS Word Underscore", "EO Eight Ones",
and so on -- none of which you are going to find in ISO 6429.

> The names are used in transcoding
> tables (that have existed since long before Unicode/ISO 10646).

*Which* names? In which transcoding tables? I'm guessing again
you are referring to the above-named RFC 1345.

> I don't see why not assigning a name (possibly through a separate property)
> for these controls would be a problem for Unicode and ISO 10646 stability.

Ah, well, they have aliases. Where is the problem?

> But it's clear that these names do exist in many other references, notably
> within many RFCs and protocol specifications. You just need to choose a name
> that matches the most common usage (even if there are other inconsistent
> assignments in other references, which may be deprecated or never meant to
> be normative).

And why should the Unicode Consortium be in the business of
providing normative names for control codes, in an
area where it claims no jurisdiction beyond the common text format
controls listed in Table 16-1, TUS 5.0, p. 533?

--Ken


This archive was generated by hypermail 2.1.5 : Thu Sep 13 2007 - 15:17:08 CDT