Re: Nicest UTF

From: John Cowan (jcowan@reutershealth.com)
Date: Fri Dec 10 2004 - 19:38:59 CST


    Philippe Verdy scripsit:

    > If you look at the XML 1.0 Second Edition

    The Second Edition has been superseded by the Third.

    > Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
    > [#x10000-#x10FFFF]

    That is normative.
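
    As a rough illustration, the production quoted above can be turned
    into a membership test on code points. This is only a sketch: the
    names XML10_CHAR_RANGES and is_xml10_char are made up for this
    example and do not come from any XML library.

```python
# Ranges copied from the XML 1.0 Char production quoted above.
# The names here are hypothetical, chosen for illustration only.
XML10_CHAR_RANGES = (
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml10_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return any(lo <= cp <= hi for lo, hi in XML10_CHAR_RANGES)
```

    For instance, is_xml10_char(0x9) holds, while U+0000, the
    surrogates, U+FFFE and U+FFFF are all rejected.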

    > But the comment following it specifies:

    That comment is not normative and not meant to be precise.

    > the restrictive
    > definition of "Char" above also includes the whole range of C1 controls

    By oversight.

    > (#x80..#x9F), so I can't understand why the Char definition is so
    > restrictive on controls; in addition the definition of Char also
    > *includes* many non-characters (it only excludes surrogates, and U+FFFE
    > and U+FFFF, but forgets to exclude U+1FFFE and U+1FFFF, U+2FFFE and
    > U+2FFFF, ..., U+10FFFE and U+10FFFF).

    By oversight again.
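
    The two oversights can be made concrete by counting what the Char
    production lets through. The sketch below is illustrative only; the
    helper names are hypothetical. It also counts U+FDD0..U+FDEF, which
    Unicode designates as noncharacters but which the quoted text does
    not mention, so treat that range as an addition made here.

```python
# Hypothetical helpers for illustration; not taken from any XML library.
def is_xml10_char(cp: int) -> bool:
    """XML 1.0 Char production, as quoted earlier in this message."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_noncharacter(cp: int) -> bool:
    """Unicode noncharacters: U+FDD0..U+FDEF and the last two code
    points of every plane (U+nFFFE and U+nFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Noncharacters that the Char production nevertheless accepts.
accepted_noncharacters = [
    cp for cp in range(0x110000) if is_noncharacter(cp) and is_xml10_char(cp)
]

# C1 controls (U+0080..U+009F) that the Char production accepts.
accepted_c1_controls = [cp for cp in range(0x80, 0xA0) if is_xml10_char(cp)]
```

    Under these definitions only U+FFFE and U+FFFF are filtered out of
    the noncharacters, and every C1 control passes.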

    > Note however that nearly all XML parsers don't seem to honor this
    > constraint (like SGML parsers...)!

    Please specify the parsers that do and don't honor this. Any which
    don't honor it are buggy, and any documents which exploit those bugs
    are not XML.

    > What is even worse is that XML 1.1 now reallows NUL for system
    > identifiers and URIs, through escaping mechanisms.

    Not true. U+0000 is absolutely excluded in both XML 1.0 and XML 1.1.
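
    A small check of that claim, as a sketch only. The XML 1.0 ranges
    are the ones quoted earlier in this message; the XML 1.1 ranges are
    taken from the Char production of the XML 1.1 Recommendation and
    are an addition here, not something quoted in this thread. The
    function names are hypothetical.

```python
# Hypothetical helpers for illustration; not taken from any XML library.
def is_xml10_char(cp: int) -> bool:
    """XML 1.0: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
    | [#x10000-#x10FFFF]"""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """XML 1.1: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# U+0000 matches neither production, although XML 1.1 does admit the
# other C0 controls such as U+0001.
nul_allowed_anywhere = is_xml10_char(0x0) or is_xml11_char(0x0)
```

    Character references must also match Char, so writing the
    character as a numeric reference does not get around the exclusion
    in either version.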

    -- 
    "I could dance with you till the cows           John Cowan
    come home.  On second thought, I'd              http://www.ccil.org/~cowan
    rather dance with the cows when you             http://www.reutershealth.com
    came home."  --Rufus T. Firefly                 jcowan@reutershealth.com
    

