NFC Normalization of whitespace+nonspacing combining in XML

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 20 2003 - 18:28:11 EDT


    I have a problem, and I don't know whether it is an XML issue or a Unicode issue. I would like to know if there is a reference that resolves it:

    According to Unicode, non-starter characters never compose with a preceding control character or format control character. So the NFC form of <TAB, COMBINING ACUTE ACCENT> is the same string.
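
    The claim above can be checked against one implementation. The following is a minimal sketch using Python's standard unicodedata module (the choice of Python and the variable names are assumptions for illustration, not part of the original message); it normalizes a control character followed by U+0301 and compares the result with the input.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT, a non-starter (combining class 230)

for name, ch in [("TAB", "\t"), ("CR", "\r"), ("LF", "\n")]:
    s = ch + ACUTE
    nfc = unicodedata.normalize("NFC", s)
    # Print the code points after NFC and whether the string is unchanged.
    print(name, [f"U+{ord(c):04X}" for c in nfc], nfc == s)

# For contrast, an ordinary base letter does compose with the accent.
composed = unicodedata.normalize("NFC", "e" + ACUTE)
print([f"U+{ord(c):04X}" for c in composed])
```

    The result depends on the Unicode data bundled with the Python build in use, so treat it as a check of one implementation rather than a statement about the standard.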

    However, the whitespace behavior of XML (and also of HTML 4 and earlier versions, and of SGML) can cause a problem in some well-defined cases. XML indicates that documents should be created and handled in the DOM structure in their NFC form, but XML/HTML/SGML also define ways in which runs of whitespace collapse as if they were a single SPACE (in fact, according to the spacing style of the containing element).
    This causes a problem with CDATA sections, notably text elements, which may contain occurrences of a CR, LF or TAB followed by a non-spacing combining diacritic (or another non-starter character that can combine with SPACE).

    According to Unicode, CR + COMBINING ACUTE ACCENT is in NFC form, and so complies with the XML requirement(?) for handling in the DOM (where everything should be processed in NFC). But according to XML (or HTML) the parsed document must then be converted (interpreted) as if it were SPACE + COMBINING ACUTE ACCENT, which is not NFC.

    If the document is canonicalized, the pair will become a single non-combining ACUTE ACCENT, and the CDATA section (or text element) will then be incorrectly interpreted as containing no whitespace. This may cause problems in places where the COMBINING ACUTE ACCENT was expected to be kept, for example if it starts the value of the text element. In critical cases, this may prevent the document from parsing correctly with an XML validator.
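
    The scenario above can be tried out directly. The sketch below is an illustration under stated assumptions: collapse_whitespace is a hypothetical stand-in for XML/HTML whitespace collapsing (it is not an API of any XML library), and Python's unicodedata module stands in for the normalizer. With the Unicode data bundled in recent Python versions, the SPACE + accent pair comes back unchanged from NFC, because U+00B4 ACUTE ACCENT carries only a compatibility decomposition to that pair. That suggests the merge into a single spacing accent may not arise from NFC alone, which is worth verifying against the normalization specification.

```python
import re
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT


def collapse_whitespace(text: str) -> str:
    """Hypothetical stand-in for XML/HTML whitespace collapsing:
    each run of SPACE, TAB, CR or LF becomes a single SPACE."""
    return re.sub(r"[ \t\r\n]+", " ", text)


def codepoints(s: str) -> list[str]:
    """Render a string as a list of U+XXXX labels for inspection."""
    return [f"U+{ord(c):04X}" for c in s]


original = "\r" + ACUTE + "abc"                  # CR + accent starts the text
collapsed = collapse_whitespace(original)        # CR replaced by SPACE
renormalized = unicodedata.normalize("NFC", collapsed)

print(codepoints(original))
print(codepoints(collapsed))
print(codepoints(renormalized))

# Decomposition recorded for the spacing ACUTE ACCENT (U+00B4);
# a "<compat>" tag means NFC never composes to this character.
print(unicodedata.decomposition("\u00b4"))
```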

    Is this issue (related to normalized and unified whitespace) discussed somewhere? What is its impact on XML security? Thanks for any pointers or advice on this interpretation issue. I am sure that this possible conflict of interpretation has been discussed somewhere (and one could think this is a problem in XML/HTML/SGML but not in Unicode).

    After all, there is probably no issue if NFC is NOT really mandatory in XML/DOM, and if the DOM allows non-normalized strings (denormalized documents are still possible in their "text/xml" representation, which supports many encodings including Unicode UTF-*). But then this creates other security issues, because canonically equivalent strings (notably in selectors or XPath) could be considered different in XML but equivalent for Unicode...
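
    The last point, canonically equivalent strings comparing as different, can be illustrated with a short sketch. It assumes Python's xml.etree.ElementTree and its limited XPath subset as the selector engine; the element and attribute names (list, item, name) and the helper find_item are invented for the example. The attribute is stored precomposed, and the query is issued first in decomposed form and then after NFC normalization.

```python
import unicodedata
import xml.etree.ElementTree as ET

precomposed = "caf\u00e9"    # "café" with U+00E9
decomposed = "cafe\u0301"    # "café" as e + COMBINING ACUTE ACCENT

# Hypothetical document; the parser keeps the attribute value as written.
doc = ET.fromstring(f'<list><item name="{precomposed}"/></list>')


def find_item(root, name):
    """Look up an item by attribute value with a simple XPath predicate."""
    return root.find(f"./item[@name='{name}']")


# The raw query compares code points, not canonical equivalence.
raw_hit = find_item(doc, decomposed)

# Normalizing the query to NFC first makes both sides comparable.
nfc_hit = find_item(doc, unicodedata.normalize("NFC", decomposed))

print(raw_hit is not None, nfc_hit is not None)
```

    Normalizing both the stored text and the query to the same form before comparison is one possible mitigation; whether a given XML toolchain does this on its own is something to verify per implementation.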

    -- Philippe.


