Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 10 2004 - 20:23:07 CST

    From: "John Cowan" <jcowan@reutershealth.com>
    > Marcin 'Qrczak' Kowalczyk scripsit:
    >
    >> http://www.w3.org/TR/2000/REC-xml-20001006#charsets
    >> implies that the appropriate level for parsing XML is code points.
    >
    > You are reading the XML Recommendation incorrectly. It is not defined
    > in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of
    > characters. XML processors are required to process UTF-8 and UTF-16,
    > and may process other character encodings or not. But the internal
    > model is that of characters. Thus surrogate code points are not
    > allowed.

    I have a different reading, because the "character" in XML is not the same as
    the "character" in Unicode. For XML, U+10FFFF is a valid character (even if
    its use is explicitly not recommended, it is perfectly valid); for Unicode
    it's a non-character... For XML, U+0001 is *sometimes* a valid character,
    sometimes not.
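
    To make the mismatch concrete, here is a minimal sketch that compares the
    Char productions of XML 1.0 and XML 1.1 with the Unicode set of
    non-characters. The function names are made up for illustration, and it
    assumes that "sometimes" above refers to the difference between the two
    XML versions (XML 1.1 admits U+0001, but only as a character reference).

```python
def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """XML 1.1 Char production (C0 controls other than NUL are admitted)."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_unicode_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Columns: code point, XML 1.0 Char, XML 1.1 Char, Unicode non-character
for cp in (0x0, 0x1, 0xFDD0, 0xFFFE, 0x10FFFF):
    print(f"U+{cp:04X}", is_xml10_char(cp), is_xml11_char(cp),
          is_unicode_noncharacter(cp))
```

    U+FDD0 and U+10FFFF come out as XML characters and Unicode non-characters
    at the same time, which is the point: the two sets are simply different.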

    And I disagree with you about the claim that U+0000 can't be used in XML
    documents. It can be used in a URI through the URI escaping mechanism, as
    explicitly indicated in the XML specification...
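
    As a small illustration of the escaping mechanism only (not of any
    particular XML processor), Python's urllib.parse shows how a NUL ends up
    in a URI as three ordinary ASCII characters, all of which are legal in an
    XML document. The sample string is invented for the example.

```python
from urllib.parse import quote, unquote

raw = "a\x00b"            # hypothetical resource name containing U+0000
escaped = quote(raw)      # percent-encode the UTF-8 bytes of the string

print(escaped)
print("NUL present in escaped form:", "\x00" in escaped)
print("round trip restores NUL:", unquote(escaped) == raw)
```

    The escaped form carries the NUL without the document itself ever
    containing the code point U+0000.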

    And the fact that the various character productions, which are normally
    normative, have been changed so often, sometimes through errata that were
    forgotten in the text of the next edition of the standard and then
    reintroduced in a later erratum, shows that these productions are less
    reliable than the descriptive *definitions* which ARE normative in XML...

    The only thing on which I can agree is that XML forbids surrogates
    and U+FFFE and U+FFFF, but I wouldn't say that an XML parser that does not
    reject NULs or other non-characters or "disallowed" C0 controls is all that
    buggy. I do think that these restrictions are a defect of XML...

    But all of this is also proof that XML documents are definitely NOT
    plain-text documents, so you can't apply Unicode encoding rules at the
    level of the encoded XML document, only at the finest plain-text nodes
    (these are the levels that the productions in the XML standard are trying,
    with more or less success, to standardize).

    As a consequence, any process that blindly applies a plain-text
    normalization to a complete XML document is bogus, because it breaks the
    most basic XML conformance, i.e. the core document structure...
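
    One case where this can be seen, sketched with Python's standard
    library: text content that begins with U+0338 COMBINING LONG SOLIDUS
    OVERLAY. Under NFC the ">" that closes the start tag composes with it
    into U+226F NOT GREATER-THAN, and the tag is never closed. The sample
    document and the helper name are invented for the illustration.

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


doc = "<a>\u0338</a>"  # element whose text starts with a combining mark

# Blind normalization of the whole serialized document:
# '>' + U+0338 composes to U+226F, destroying the tag delimiter.
blind = unicodedata.normalize("NFC", doc)

# Normalization applied only to the text node, after parsing.
root = ET.fromstring(doc)
root.text = unicodedata.normalize("NFC", root.text)
per_node = ET.tostring(root, encoding="unicode")

print("original well-formed:", is_well_formed(doc))
print("blindly normalized well-formed:", is_well_formed(blind))
print("node-level normalized well-formed:", is_well_formed(per_node))
```

    Normalizing per text node leaves the markup alone, since the parser has
    already separated structure from character data.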


