From: Jon Hanna (email@example.com)
Date: Wed Aug 06 2003 - 08:58:57 EDT
> In the context of XML processing, where strings should (must?) be
FYI: it's "should" for XML 1.1, and the specification states quite explicitly that normalisation is not required for a document to be well-formed. XML 1.0 doesn't mention Unicode normalisation at all, although plenty of applications built on top of it do, sometimes with "must", or with "must" in certain circumstances. (Of course, that character normalisation is always based on the W3C character model, which uses NFC.)
Therefore you cannot assume that XML is in NFC. This is potentially problematic in the few contexts where non-NFC XML is fed to an application requiring NFC (and, as I said, some applications do require it), notably in cases where whitespace normalisation may change the result of NFC normalisation. Hence NFC normalisation should occur before any other processing (which is the obvious logical way to do it anyway, but people sometimes do things the wrong way around, as the occasional UTF-8-related security hole shows). It also means that an application that receives data in XML format still has to take the precaution of checking that spaces aren't followed by combining characters before treating them as breaking characters, unless it either insists on NFC or performs normalisation itself.
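To make that precaution concrete, here is a minimal sketch in Python, using only the standard unicodedata module. The function names are made up for illustration, and treating every character in a Mark category (Mn, Mc, Me) as "combining" is an assumption of this sketch rather than anything the XML specifications prescribe.

```python
import unicodedata


def is_combining(ch):
    # Assumption: any Mark category (Mn, Mc, Me) counts as "combining".
    return unicodedata.category(ch).startswith("M")


def split_on_safe_spaces(text):
    """Normalise to NFC first, then split on U+0020 only where the
    space is not followed by a combining character."""
    text = unicodedata.normalize("NFC", text)
    tokens = []
    start = 0
    for i, ch in enumerate(text):
        if ch != " ":
            continue
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt and is_combining(nxt):
            # The space is the base of a combining sequence: not a break.
            continue
        tokens.append(text[start:i])
        start = i + 1
    tokens.append(text[start:])
    return tokens
```

Normalising inside the function reflects the "normalise before anything else" order argued for above; a caller that already insists on NFC input could skip that step.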
A more interesting potential pitfall with the position NFC has in the XML world (you're probably already aware of this, but maybe someone else here isn't) is that an application which doesn't insist on NFC will happily output U+0338 COMBINING LONG SOLIDUS OVERLAY as the first character of an element's content. NFC normalisation of the document would then combine the U+003E GREATER-THAN SIGN (>) of the markup with the combining solidus to produce U+226F NOT GREATER-THAN, so the start tag is no longer closed. For the most part this is only an issue with applications that output the results of substring operations, as there is little sense in beginning a string with U+0338 anyway, but it's a strange thing to get a support request about if someone somehow gets it in there.
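The effect is easy to reproduce. The following is a small illustration in Python with the standard unicodedata module; the element name "a" and the helper starts_with_combining are invented for the example, and the helper's definition of "combining" (any Mark category) is an assumption.

```python
import unicodedata

# Element content that begins with U+0338 COMBINING LONG SOLIDUS OVERLAY.
doc = "<a>\u0338x</a>"

# Normalising the serialised document as a whole composes the ">" of the
# start tag with the solidus into U+226F NOT GREATER-THAN.
normalised = unicodedata.normalize("NFC", doc)


def starts_with_combining(content):
    # Hypothetical guard an application could run on element content
    # before serialising it.
    return bool(content) and unicodedata.category(content[0]).startswith("M")
```

After normalisation the string is "<a" followed by U+226F and "x</a>", so the start tag has lost its closing delimiter. Checking content with something like starts_with_combining before output is one way to catch the case early.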
There's a similar problem with U+0338 as the first character in an element name, although such a document wouldn't be well-formed.