RE: NFC Normalization of whitespace+nonspacing combining in XML

From: Francois Yergeau (FYergeau@alis.com)
Date: Wed May 21 2003 - 14:37:30 EDT

    Philippe Verdy wrote:
    > According to Unicode, CR+ACUTE is in NFC form, and so
    > complies with the XML requirement(?) for handling in the DOM
    > (where everything should be performed using NFC). But according
    > to XML (or HTML) the parsed document must then be converted
    > (interpreted) as if it were SPACE+COMBINING ACUTE ACCENT,
    > which is not NFC.

    It is NFC.

    > When the document is canonicalized, it will become a single
    > non-combining ACUTE ACCENT...

    From the UCD:

    00B4;ACUTE ACCENT;Sk;0;ON;<compat> 0020 0301;;;;N;SPACING ACUTE;;;;

    This is only a <compat> decomposition, and NFC applies canonical
    mappings only, so SPACE+COMBINING ACUTE ACCENT remains unchanged in NFC.
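
    A minimal sketch of that check, assuming Python's standard unicodedata
    module (the variable names are illustrative only, and the exact UCD
    version depends on the Python build):

```python
import unicodedata

# SPACE followed by COMBINING ACUTE ACCENT (U+0020 U+0301).
space_acute = "\u0020\u0301"
# The spacing character ACUTE ACCENT (U+00B4).
spacing_acute = "\u00B4"

# The UCD decomposition field for U+00B4 is tagged <compat>.
decomposition = unicodedata.decomposition(spacing_acute)

# NFC uses canonical mappings only, so the sequence is left alone.
nfc_of_sequence = unicodedata.normalize("NFC", space_acute)

# Only the compatibility forms (NFKD/NFKC) relate U+00B4 to the sequence.
nfkd_of_spacing = unicodedata.normalize("NFKD", spacing_acute)

print(decomposition)
print(nfc_of_sequence == space_acute)
print(nfkd_of_spacing == space_acute)
```

    The compatibility mapping only runs from U+00B4 to the sequence;
    no normalization form turns SPACE+COMBINING ACUTE ACCENT into U+00B4.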

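    A more interesting case is that of U+0338 COMBINING LONG SOLIDUS OVERLAY,
    which combines canonically with > (U+003E) to give U+226F NOT GREATER-THAN.
    If U+0338 directly follows the > that closes a tag, normalizing the file
    to NFC replaces that delimiter, which can damage XML files.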
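
    A short sketch of the damage, assuming Python's unicodedata and
    xml.etree.ElementTree modules; the sample document and the helper
    is_well_formed are hypothetical and serve only as illustration:

```python
import unicodedata
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Element content that begins with U+0338, right after the closing ">".
doc = "<a>\u0338</a>"

# NFC applied to the serialized document, markup included.
normalized = unicodedata.normalize("NFC", doc)

# The tag-closing ">" and U+0338 have been composed into U+226F.
print(ascii(normalized))
print(is_well_formed(doc), is_well_formed(normalized))
```

    The start tag loses its closing delimiter, so the normalized text no
    longer parses. The same composition applies to < (giving U+226E) and
    = (giving U+2260).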

    -- 
    François Yergeau
    

