Re: Questions on ZWNBS - for line initial holam plus alef

From: Peter Kirk
Date: Wed Aug 13 2003 - 08:09:05 EDT

    On 12/08/2003 20:28, John Cowan wrote:

    >Peter Kirk scripsit:
    >>>2) In attribute values, LF, CR, and TAB characters are normalized to
    >>>spaces. Not relevant here.
    >>This would be relevant if it is legal for the character after LF, CR,
    >>and TAB to be a combining mark. Is this legal? In this case what was
    >>previously a defective (but legal) combining sequence would turn into a
    >>non-defective one, but the intended whitespace would be lost.
    >The point is that there is no such thing as an *intended* line break in
    >an attribute value; it will *always* be translated to a space before
    >the application sees it. (More exactly, line-break characters can
    >be inserted into attribute values, but only with the use of a numeric
    >character reference such as "&#xA;".)
    Sorry, I'm confused. Are you saying that the input processing will
    translate line breaks into spaces within attribute values, unless
    inserted as &#xA;? Well, I suppose this is fair enough as it is up to
    the user not to enter garbage.
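
The normalization discussed above can be observed with a minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser; the element name `a`, the attribute name `b`, and the helper `attr_value` are made up for illustration. A literal LF or TAB inside the attribute value should reach the application as a space, while the same character written as a numeric character reference should survive.

```python
import xml.etree.ElementTree as ET


def attr_value(xml_text: str, name: str = "b") -> str:
    """Parse a one-element document and return the named attribute value."""
    return ET.fromstring(xml_text).get(name)


# Literal line feed and tab typed directly inside the attribute value.
literal_lf = attr_value('<a b="x\ny"/>')
literal_tab = attr_value('<a b="x\ty"/>')

# The same line feed, written as a numeric character reference.
referenced_lf = attr_value('<a b="x&#xA;y"/>')

print(repr(literal_lf))
print(repr(literal_tab))
print(repr(referenced_lf))
```

So a combining mark placed right after a literal line break in an attribute value ends up right after a space by the time the application sees it.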

    >>Not just a rendering glitch, I suspect. If the combining character is
    >>combined with the separating space, the space loses many of its
    >>separating functions, and perhaps keeps a confusing subset of them with
    >>all sorts of possibilities of error.
    >The space(s) will be used to separate individual tokens at processing
    >time. No spacing diacritic (either single-character or space+combining)
    >is permitted in a NMTOKEN.
    OK if this is clearly illegal, but this might restrict use of some
    languages in NMTOKEN. Would NBSP + combining be allowed?
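
A rough sketch of the tokenization step, assuming the attribute value has already been normalized so that only U+0020 separates tokens. The helper names `split_tokens` and `starts_with_combining` are hypothetical, and "combining character" is approximated here by Unicode general category M*, which is not the exact XML 1.0 CombiningChar production.

```python
import unicodedata


def split_tokens(value: str) -> list[str]:
    """Split a normalized attribute value on runs of U+0020 only."""
    return [token for token in value.split(" ") if token]


def starts_with_combining(token: str) -> bool:
    """True if the first character is a mark (general category Mn, Mc or Me)."""
    return unicodedata.category(token[0]).startswith("M")


# U+05B9 is HEBREW POINT HOLAM, U+05D0 is HEBREW LETTER ALEF.
value = "foo  \u05b9\u05d0 bar"
tokens = split_tokens(value)
flagged = [token for token in tokens if starts_with_combining(token)]

print(tokens)
print(flagged)
```

A process that wanted to be defensive could reject or flag any token for which `starts_with_combining` is true, rather than pass it on.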

    >>At best tokens beginning with
    >>combining characters will be unusable. At worst they will crash the
    >>implementation (and count on someone trying deliberately to do that!).
    >In effect, the combining character will constitute a defective combining
    >sequence at the beginning of the individual token.
    >Stepping away from the letter of the standard for a moment, there is
    >no real reason to begin a NMTOKEN with a combining character. It is
    >only allowed as a result of the miscegenation of SGML concepts with
    >Unicode ones.
    >In SGML's original design of tokens, they consisted of letters and digits
    >(and a few punctuation marks, which functioned as letters). There were
    >four kinds: a NUMBER could contain only digits, a NAME could not begin
    >with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
    >restrictions. ID and IDREF had the same syntax as NAME with additional
    >semantics. Later, the categories "letter" and "digit" were generalized,
    >by redefining the concrete syntax, to be whatever you wanted, and were
    >renamed "name-start" and "name" characters (technically, a name character
    >was a letter *or* a digit).
    >When SGML was simplified to produce XML, only NMTOKEN, the most general
    >type of token, was kept. However, in order to keep the semantics of
    >"letter" and "digit" in the Unicode world, "letter" was extended to be any
    >letter and "digit" to be any digit *or* combining character. That worked
    >well for ID and IDREF, since treating combining characters as part of
    >"digit" prevented them from appearing first, as was only sensible.
    >Unfortunately, NMTOKENs, since there were no restrictions, became able
    >to begin with a combining character, though that made no real sense.
    >To write in a restriction would make it impossible to specify XML's
    >concrete syntax in SGML terms, which did not allow for three different
    >classes of characters within tokens. So we wound up with a basically
    >useless capability that if used will only cause trouble.
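
The two-class scheme described above can be sketched as follows. This is only an approximation under stated assumptions: "letter" is taken to be Unicode general category L* plus `_` and `:`, and "digit" to be decimal digits, combining marks (category M*), `.` and `-`. The real XML 1.0 productions are explicit code-point ranges, and the function names here are hypothetical.

```python
import unicodedata


def is_name_start(ch: str) -> bool:
    """Approximate "letter" class: may appear anywhere, including first."""
    return ch in "_:" or unicodedata.category(ch).startswith("L")


def is_name_char(ch: str) -> bool:
    """Approximate name character: a "letter" or a "digit" (digits, marks, '.', '-')."""
    category = unicodedata.category(ch)
    return (
        is_name_start(ch)
        or ch in ".-"
        or category == "Nd"
        or category.startswith("M")
    )


def is_name(token: str) -> bool:
    """NAME-like token (as for ID and IDREF): must begin with a "letter"."""
    return bool(token) and is_name_start(token[0]) and all(map(is_name_char, token))


def is_nmtoken(token: str) -> bool:
    """NMTOKEN-like token: any non-empty run of name characters."""
    return bool(token) and all(map(is_name_char, token))


# U+0301 is COMBINING ACUTE ACCENT.
print(is_name("\u0301abc"), is_nmtoken("\u0301abc"))
print(is_name("1abc"), is_nmtoken("1abc"))
```

With only two classes available, the combining mark is excluded from the first position of a NAME exactly as a digit is, and is admitted at the start of a NMTOKEN exactly as a digit is.
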
    There is some potential for real trouble here, if one process outputs an
    NMTOKEN starting with a combining character preceded by a separating
    space, or something else which is changed into a space, and another
    process takes the new space plus combining character as a unit and so
    doesn't recognise the separation. Any hackers and virus programmers
    reading this will soon start flooding the Internet with tokens beginning
    with combining characters in the hope of crashing implementations or
    finding back doors. Of course this wouldn't have been a problem if
    Unicode had never defined space plus combining character as legal and
    meaningful. But this is not my problem!
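
The disagreement between two processes can be sketched like this. It is a deliberately simplified model, not a real implementation: `clusters` attaches each mark (general category M*) to the preceding character, which is far cruder than full Unicode text segmentation, and both splitter functions are hypothetical.

```python
import unicodedata


def clusters(text: str) -> list[str]:
    """Group each character with the combining marks that follow it."""
    out: list[str] = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch  # attach the mark to the previous unit, even a space
        else:
            out.append(ch)
    return out


def split_by_space(text: str) -> list[str]:
    """Process A: split on U+0020 at the code point level."""
    return [token for token in text.split(" ") if token]


def split_by_cluster(text: str) -> list[str]:
    """Process B: only a bare space cluster counts as a separator."""
    tokens, current = [], ""
    for unit in clusters(text):
        if unit == " ":
            if current:
                tokens.append(current)
            current = ""
        else:
            current += unit
    if current:
        tokens.append(current)
    return tokens


# "foo", a space, then HEBREW POINT HOLAM (U+05B9) plus HEBREW LETTER ALEF (U+05D0).
value = "foo \u05b9\u05d0"
print(split_by_space(value))
print(split_by_cluster(value))
```

Process A sees two tokens, while process B sees one, because the space has been absorbed into a space-plus-holam unit and no longer acts as a separator.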

    Peter Kirk
