Re: Nicest UTF

From: John Cowan (jcowan@reutershealth.com)
Date: Fri Dec 10 2004 - 21:22:11 CST


    Philippe Verdy scripsit:

    > And I disagree with you about the fact the U+0000 can't be used in XML
    > documents. It can be used in URI through URI escaping mechanism, as
    > explicitly indicated in the XML specification...

    You have a hold of the right stick but at the wrong end. U+0000 can be
    encoded in a URI as %00, but that does not mean that the IRIs in system ids
    and namespace names (and potentially other places) can contain explicit
    U+0000 characters or &#0; escapes either. Both of those are illegal,
    and documents that contain them are not well-formed.
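
    The distinction can be checked with a short sketch. It assumes Python's
    standard-library expat-based parser as a stand-in for a conforming XML
    processor; the helper name is_well_formed and the example.com namespace
    name are made up for illustration.

```python
from urllib.parse import unquote
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if the parser accepts doc (hypothetical helper)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# At the URI level, %00 is three ASCII characters that decode to U+0000.
decoded = unquote("%00")

# A namespace name carrying the percent-escape holds only ordinary characters.
escaped_ok = is_well_formed('<a xmlns="http://example.com/%00"/>')

# A character reference to U+0000 is rejected by the parser.
charref_ok = is_well_formed('<a xmlns="http://example.com/&#0;"/>')

# A literal U+0000 character is rejected as well.
literal_ok = is_well_formed('<a xmlns="http://example.com/\x00"/>')
```

    Only the first document is accepted: the percent-escape lives at the URI
    layer, while the XML layer never sees a U+0000.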

    In character content and attribute values, U+0000 cannot appear at all,
    either literally or as a character reference.

    > And the fact that the various character productions, that are normally
    > normative, have been changed so often, sometimes through erratas that
    > were forgotten in the text of the next edition of the standard,

    Do you have evidence for this claim?

    > The only thing about which I can agree is that XML will forbid surrogates
    > and U+FFFE and U+FFFF, but I won't say that a XML parser that does not
    > reject NULs or other non-characters or "disallowed" C0 controls is so
    > much buggy.

    You are of course entitled to your uninformed opinion.

    > But all these is also a proof that XML documents are definitely NOT
    > plain-text documents, so you can't use Unicode encoding rules at the
    > encoded XML document level, only at the finest plain-text nodes (these
    > are the levels that the productions in the XML standard are trying, with
    > more or less success, to standardize).

    You can't blindly do *normalization* of XML documents as if they were
    plain text. *Encoding* XML documents according to Unicode is of course
    possible and desirable.
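
    As a rough illustration of the encoding half of that statement, the
    sketch below serializes one document in two Unicode encoding forms and
    parses both. It assumes Python's standard-library parser, which detects
    UTF-16 from the byte order mark; the sample document is invented.

```python
import xml.etree.ElementTree as ET

# One document, as a sequence of Unicode characters.
doc = "<a>caf\u00e9</a>"

# Two different byte-level encodings of the same character sequence.
as_utf8 = doc.encode("utf-8")
as_utf16 = doc.encode("utf-16")  # Python's codec prepends a byte order mark

# Both byte streams parse to the same character content.
text_from_utf8 = ET.fromstring(as_utf8).text
text_from_utf16 = ET.fromstring(as_utf16).text
```

    The bytes differ, but the parsed content is identical, which is why the
    choice of encoding form is harmless to the document structure.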

    > As a consequence any process that blindly applies a plain-text
    > normalization to a complete XML document is bogus, because it breaks the
    > most basic XML conformance, i.e. the core document structure...

    In one extraordinarily unlikely case, yes: the appearance of a
    combining overlay slash following the ">" that closes a tag will
    damage the document if it is NFC-normalized.
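
    That corner case can be reproduced in a few lines. This is a sketch,
    assuming the overlay slash meant here is U+0338 COMBINING LONG SOLIDUS
    OVERLAY and using Python's unicodedata and standard-library XML parser;
    the helper name parses is hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if the parser accepts doc (hypothetical helper)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Character content that begins with a combining overlay slash, directly
# after the ">" that closes the start tag.
original = "<a>\u0338text</a>"

# NFC composes ">" + U+0338 into U+226F NOT GREATER-THAN, so the tag
# delimiter disappears from the normalized text.
normalized = unicodedata.normalize("NFC", original)

ok_before = parses(original)
ok_after = parses(normalized)
```

    The original parses, but the normalized form no longer has a ">" to
    close the start tag, so it is not well-formed.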

    -- 
    You are a child of the universe no less         John Cowan
    than the trees and all other acyclic            http://www.reutershealth.com
    graphs; you have a right to be here.            http://www.ccil.org/~cowan
      --DeXiderata by Sean McGrath                  jcowan@reutershealth.com
    

