From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 13 2003 - 10:50:26 EDT
----- Original Message -----
From: "Peter Kirk" <peter.r.kirk@ntlworld.com>
To: "Jon Hanna" <jon@spin.ie>
Cc: <unicode@unicode.org>
Sent: Wednesday, August 13, 2003 3:05 PM
Subject: Re: Questions on ZWNBS - for line initial holam plus alef
> On 13/08/2003 04:44, Jon Hanna wrote:
>
> >No, the safe thing to do (and the thing that is done) is to treat the
> >space as a space ignoring the fact that the NMTOKEN contains a combining
> >character, this is even safer than your suggestion since it can't
> >mis-identify the combining properties of a character.
> >
> >
> OK, it's safe, but it is a misuse of Unicode. As space plus combining
> character is a unit in Unicode, it should be treated as a unit by higher
> level protocols. If higher level protocols are allowed to do arbitrary
> things within Unicode units, there is no end to the possible confusion.
> See for example, from Unicode 4.0 chapter 3:
>
> C7 A process shall interpret a coded character representation according
> to the character semantics established by this standard, if that process
> does interpret that coded character representation.
OK, but XML inherits this behavior from SGML, and you won't change it.
The only way around it would be to use entity references to encode the
base space required by the Unicode convention. That makes it a matter for
what Unicode calls a higher-level protocol, needed here to work around
the limitations of plain text. However, this still leaves a problem inside
CDATA sections, where entity references are not recognized. There, one
would have to combine the XML CDATA mechanism with a second escaping
scheme specific to CDATA sections (which are formally equivalent to
anonymous text elements).
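
To make the CDATA point concrete, here is a minimal sketch using Python's
standard xml.etree.ElementTree parser. It only illustrates how references
are resolved in element content but left as literal text in a CDATA section;
it does not show NMTOKEN tokenization. The element name "w", the helper
text_of, and the choice of U+05B9 (holam) as the sample combining mark are
assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# U+05B9 HEBREW POINT HOLAM, used here only as a sample combining mark.
HOLAM = "\u05b9"


def text_of(xml_source: str) -> str:
    """Parse a one-element document and return its character data."""
    return ET.fromstring(xml_source).text


# In ordinary element content the character references are resolved,
# so the application sees SPACE followed by the combining mark.
in_content = text_of("<w>&#x20;&#x5B9;</w>")

# Inside a CDATA section the same characters are literal text:
# the ampersand is not markup there, so nothing is resolved.
in_cdata = text_of("<w><![CDATA[&#x20;&#x5B9;]]></w>")

print(repr(in_content))
print(repr(in_cdata))
```

Under these assumptions the first value is the two-character sequence
space plus holam, while the second is the unresolved reference text, which
is why a reference-based workaround cannot carry over into CDATA sections.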