From: Mark Davis (firstname.lastname@example.org)
Date: Wed Aug 13 2003 - 10:04:12 EDT
Peter, in XML you really don't want to use attributes for any general
text; there are too many restrictions on the content. For example, we
never put translatable text into them. Attributes should really be
treated more like sequences of symbols, with a constrained syntax.
This is also not in violation of the Unicode conformance clause. A
"space plus combining character" is a unit in some sense. That is, it
is a combining character sequence (and grapheme cluster). However,
there is no clause
that says that such units cannot be changed, or that any particular
sequence of characters cannot be changed; operations such as case
mapping or normalization do just that: they change characters.
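As a small illustration of operations that legitimately change
characters, here is a sketch in Python using only the standard
unicodedata module and str.upper. The variable names are mine, chosen
for illustration only.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: a two-character
# combining character sequence.
decomposed = "e\u0301"

# NFC normalization replaces the pair with the single precomposed
# character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Full case mapping can change the number of characters as well:
# U+00DF LATIN SMALL LETTER SHARP S uppercases to "SS".
upper = "\u00df".upper()

print(len(decomposed), len(composed), upper)
```

Neither operation claims to leave the text unmodified, so neither runs
into the restriction discussed next.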
There are restrictions on what can be changed *if* a process purports
to not modify the text (C10). But an XML parser is certainly capable
of interpreting a sequence A B, and deciding that it wants to change A
to C. If the parser interpreted the 0x0041 in UTF-16 as a Z or a Greek
Alpha, *that* would be a violation of C7. But interpreting a space as
a space, then deciding to modify it, is perfectly legit.
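To make the attribute case concrete, here is a rough sketch, assuming
Python's standard xml.etree.ElementTree (which is built on expat). The
element name "w" and the attribute names "lit" and "ref" are
hypothetical, and I use HEBREW POINT HOLAM only because of the subject
line. It shows a literal line feed in front of a combining mark being
normalized to a space, while the same line feed written as a character
reference survives.

```python
import unicodedata
import xml.etree.ElementTree as ET

HOLAM = "\u05b9"  # HEBREW POINT HOLAM, a combining mark (category Mn)

# "lit" holds a literal line feed before the mark; "ref" holds the
# same line feed written as the character reference &#10;.
doc = '<w lit="a\n' + HOLAM + '" ref="a&#10;' + HOLAM + '"/>'
attrs = ET.fromstring(doc).attrib

lit = attrs["lit"]  # literal LF was normalized to U+0020
ref = attrs["ref"]  # LF from the character reference is preserved

print(repr(lit), repr(ref))
print(unicodedata.category(HOLAM), unicodedata.combining(HOLAM) != 0)
```

In the "lit" value the parser has interpreted the line feed correctly
and then replaced it, so the application sees space plus combining
mark.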
► “Eppur si muove” ◄
----- Original Message -----
From: "Peter Kirk" <email@example.com>
To: "John Cowan" <firstname.lastname@example.org>
Sent: Wednesday, August 13, 2003 05:09
Subject: Re: Questions on ZWNBS - for line initial holam plus alef
> On 12/08/2003 20:28, John Cowan wrote:
> >Peter Kirk scripsit:
> >>>2) In attribute values, LF, CR, and TAB characters are normalized to
> >>>spaces. Not relevant here.
> >>This would be relevant if it is legal for the character after LF, CR,
> >>and TAB to be a combining mark. Is this legal? In this case what was
> >>previously a defective (but legal) combining sequence would turn into a
> >>non-defective one, but the intended whitespace would be lost.
> >The point is that there is no such thing as an *intended* line break in
> >an attribute value; it will *always* be translated to a space before
> >the application sees it. (More exactly, line-break characters can
> >be inserted into attribute values, but only with the use of a
> >character reference such as "
> Sorry, I'm confused. Are you saying that the input processing will
> translate line breaks into spaces within attribute values, unless
> inserted as a character reference? Well, I suppose this is fair enough
> as it is up to the user not to enter garbage.
> >>Not just a rendering glitch, I suspect. If the combining character is
> >>combined with the separating space, the space loses many of its
> >>separating functions, and perhaps keeps a confusing subset of them, with
> >>all sorts of possibilities of error.
> >The space(s) will be used to separate individual tokens at
> >time. No spacing diacritic (either single-character or
> >is permitted in a NMTOKEN.
> OK if this is clearly illegal, but this might restrict use of some
> languages in NMTOKEN. Would NBSP + combining be allowed?
> >>At best tokens beginning with
> >>combining characters will be unusable. At worst they will crash
> >>implementation (and count on someone trying deliberately to do
> >In effect, the combining character will constitute a defective
> >sequence at the beginning of the individual token.
> >Stepping away from the letter of the standard for a moment, there is
> >no real reason to begin a NMTOKEN with a combining character. It is
> >only allowed as a result of the miscegenation of SGML concepts with
> >Unicode ones.
> >In SGML's original design of tokens, they consisted of letters and digits
> >(and a few punctuation marks, which functioned as letters). There were
> >four kinds: a NUMBER could contain only digits, a NAME could not begin
> >with a digit, a NUTOKEN had to begin with a digit, and a NMTOKEN had no
> >restrictions. ID and IDREF had the same syntax as NAME with
> >semantics. Later, the categories "letter" and "digit" were
> >by redefining the concrete syntax, to be whatever you wanted, and
> >renamed "name-start" and "name" characters (technically, a name character
> >was a letter *or* a digit).
> >When SGML was simplified to produce XML, only NMTOKEN, the most
> >type of token, was kept. However, in order to keep the semantics of
> >"letter" and "digit" in the Unicode world, "letter" was extended to be any
> >letter and "digit" to be any digit *or* combining character. That worked
> >well for ID and IDREF, since treating combining characters as part of
> >"digit" prevented them from appearing first, as was only sensible.
> >Unfortunately, NMTOKENs, since there were no restrictions, became
> >to begin with a combining character, though that made no real sense.
> >To write in a restriction would make it impossible to specify XML's
> >concrete syntax in SGML terms, which did not allow for three
> >classes of characters within tokens. So we wound up with a
> >useless capability that if used will only cause trouble.
> There is some potential for real trouble here, if one process
> NMTOKEN starting with a combining character preceded by a separating
> space, or something else which is changed into a space, and another
> process takes the new space plus combining character as a unit and
> doesn't recognise the separation. Any hackers and virus programmers
> reading this will soon start flooding the Internet with tokens
> with combining characters in the hope of crashing implementations or
> finding back doors. Of course this wouldn't have been a problem if
> Unicode had never defined space plus combining character as legal and
> meaningful. But this is not my problem!
> Peter Kirk
> email@example.com (personal)
> firstname.lastname@example.org (work)