Re: Questions on ZWNBS - for line initial holam plus alef

From: John Cowan (cowan@mercury.ccil.org)
Date: Tue Aug 12 2003 - 23:28:06 EDT

    Peter Kirk scripsit:

    > >2) In attribute values, LF, CR, and TAB characters are normalized to
    > >spaces. Not relevant here.
    >
    > This would be relevant if it is legal for the character after LF, CR,
    > and TAB to be a combining mark. Is this legal? In this case what was
    > previously a defective (but legal) combining sequence would turn into a
    > non-defective one, but the intended whitespace would be lost.

    The point is that there is no such thing as an *intended* line break in
    an attribute value; it will *always* be translated to a space before
    the application sees it. (More exactly, line-break characters can
    be inserted into attribute values, but only with the use of a numeric
    character reference such as "&#xA;".)

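    A minimal sketch of that normalization, assuming Python's standard
    xml.etree.ElementTree parser (which is built on expat). The element
    and attribute names are hypothetical; the point is only that a literal
    line feed reaches the application as a space, while the same character
    written as "&#xA;" survives.

```python
import xml.etree.ElementTree as ET

# Attribute "a" contains a literal line feed; attribute "b" contains the
# same character written as a numeric character reference.
doc = '<e a="one\ntwo" b="one&#xA;two"/>'
elem = ET.fromstring(doc)

literal = elem.get("a")      # literal LF: normalized to a space by the parser
referenced = elem.get("b")   # referenced LF: passed through unchanged

print(repr(literal))
print(repr(referenced))
```
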
    > Not just a rendering glitch, I suspect. If the combining character is
    > combined with the separating space, the space loses many of its
    > separating functions, and perhaps keeps a confusing subset of them with
    > all sorts of possibilities of error.

    The space(s) will be used to separate individual tokens at processing
    time. No spacing diacritic (either single-character or space+combining)
    is permitted in a NMTOKEN.

    > At best tokens beginning with
    > combining characters will be unusable. At worst they will crash the
    > implementation (and count on someone trying deliberately to do that!).

    In effect, the combining character will constitute a defective combining
    sequence at the beginning of the individual token.
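
    As a rough illustration, here is a sketch that splits an
    already-normalized attribute value on spaces and flags any token
    whose first character is a combining mark. The sample value and the
    helper name are assumptions made up for the example, and "combining
    mark" is approximated by Unicode general category M*, which is not
    exactly the set that XML 1.0 enumerates.

```python
import unicodedata

def starts_with_combining_mark(token):
    """True if the token's first character has general category Mn, Mc, or Me."""
    return bool(token) and unicodedata.category(token[0]).startswith("M")

# Hypothetical normalized value: an ASCII token, then HEBREW POINT HOLAM
# (U+05B9) followed by HEBREW LETTER ALEF (U+05D0), then another ASCII token.
value = "abc \u05B9\u05D0 def"

# Spaces separate the tokens; empty strings from repeated spaces are dropped.
tokens = [t for t in value.split(" ") if t]

# Tokens that begin with a defective combining sequence.
flagged = [t for t in tokens if starts_with_combining_mark(t)]
```

    The holam no longer has the preceding space as a base once the value
    has been split, which is why the token starts with a defective
    sequence.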

    Stepping away from the letter of the standard for a moment, there is
    no real reason to begin a NMTOKEN with a combining character. It is
    only allowed as a result of the miscegenation of SGML concepts with
    Unicode ones.

    In SGML's original design of tokens, they consisted of letters and digits
    (and a few punctuation marks, which functioned as letters). There were
    four kinds: a NUMBER could contain only digits, a NAME could not begin
    with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
    restrictions. ID and IDREF had the same syntax as NAME with additional
    semantics. Later, the categories "letter" and "digit" were generalized,
    by redefining the concrete syntax, to be whatever you wanted, and were
    renamed "name-start" and "name" characters (technically, a name character
    was a letter *or* a digit).
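
    The four kinds can be sketched as regular expressions. This is an
    illustration only: it assumes "letter" means an ASCII letter and
    "digit" an ASCII digit, and it leaves out the few punctuation marks
    mentioned above. The function name is hypothetical.

```python
import re

# Assumption: letters are A-Z/a-z and digits are 0-9; the punctuation
# marks that SGML also allowed are deliberately omitted from this sketch.
PATTERNS = {
    "NUMBER":  re.compile(r"[0-9]+"),                 # digits only
    "NAME":    re.compile(r"[A-Za-z][A-Za-z0-9]*"),   # may not begin with a digit
    "NUTOKEN": re.compile(r"[0-9][A-Za-z0-9]*"),      # must begin with a digit
    "NMTOKEN": re.compile(r"[A-Za-z0-9]+"),           # no restriction on the first character
}

def kinds(token):
    """Return the name of every token kind that the whole string satisfies."""
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(token)]
```

    Every NUMBER, NAME, and NUTOKEN is also a NMTOKEN under these
    definitions, which is what makes NMTOKEN the most general kind.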

    When SGML was simplified to produce XML, only NMTOKEN, the most general
    type of token, was kept. However, in order to keep the semantics of
    "letter" and "digit" in the Unicode world, "letter" was extended to be any
    letter and "digit" to be any digit *or* combining character. That worked
    well for ID and IDREF, since treating combining characters as part of
    "digit" prevented them from appearing first, as was only sensible.

    Unfortunately, NMTOKENs, since there were no restrictions, became able
    to begin with a combining character, though that made no real sense.
    To write in a restriction would make it impossible to specify XML's
    concrete syntax in SGML terms, which did not allow for three different
    classes of characters within tokens. So we wound up with a basically
    useless capability that if used will only cause trouble.
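
    To make the two-class model concrete, here is a sketch that
    approximates "letter" by Unicode general category L* and "digit" by
    decimal digits plus combining marks. These categories are an
    assumption for illustration; XML 1.0 lists explicit character ranges
    and also admits some punctuation, so this is not the actual grammar.
    All function names are hypothetical.

```python
import unicodedata

def is_letter(ch):
    """Approximation of the "letter" (name-start) class: any Unicode letter."""
    return unicodedata.category(ch).startswith("L")

def is_digit_or_combining(ch):
    """Approximation of the extended "digit" class: decimal digit or combining mark."""
    category = unicodedata.category(ch)
    return category == "Nd" or category.startswith("M")

def is_name_char(ch):
    """A name character is a "letter" or a "digit"."""
    return is_letter(ch) or is_digit_or_combining(ch)

def is_name(s):
    """NAME-like token (also ID and IDREF): first character must be a "letter"."""
    return bool(s) and is_letter(s[0]) and all(is_name_char(c) for c in s)

def is_nmtoken(s):
    """NMTOKEN: any non-empty run of name characters, no first-character rule."""
    return bool(s) and all(is_name_char(c) for c in s)

holam_alef = "\u05B9\u05D0"   # combining mark first, then letter
alef_holam = "\u05D0\u05B9"   # letter first, then combining mark
```

    Under these assumptions holam_alef is rejected as a name but accepted
    as a NMTOKEN, while alef_holam is accepted as both.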

    -- 
    John Cowan  jcowan@reutershealth.com  www.reutershealth.com  ccil.org/~cowan
    Dievas dave dantis; Dievas duos duonos          --Lithuanian proverb
    Deus dedit dentes; deus dabit panem             --Latin version thereof
    Deity donated dentition;
      deity'll donate doughnuts                     --English version by Muke Tever
    God gave gums; God'll give granary              --Version by Mat McVeagh
    

