Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 10 2004 - 19:18:42 CST


    From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    > From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
    >> "Philippe Verdy" <verdy_p@wanadoo.fr> writes:
    >>
    >>> The XML/HTML core syntax is defined with fixed behavior of some
    >>> individual characters like '&', '<', quotation marks, and with special
    >>> behavior for spaces.
    >>
    >> The point is: what "characters" mean in this sentence. Code points?
    >> Combining character sequences? Something else?
    >
    > See the XML character model document... XML ignores combining sequences.
    > But for Unicode and for XML a character is an abstract character with a
    > single code allocated in a *finite* repertoire. The repertoire of all
    > possible combining characters sequences is already infinite in Unicode, as
    > well as the number of "default grapheme clusters" they can represent.

    Note that there are several differently relaxed definitions of what
    constitutes a "character" for XML.
    If you look at XML 1.0 Second Edition, it specifies that a document is a
    "text" (defined only as a sequence of "characters", which may represent
    markup or character data) that will only contain characters in this set:
    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
    [#x10000-#x10FFFF]

    But the comment following it specifies:
    "any Unicode character, excluding the surrogate blocks, FFFE, and FFFF."
    which is considerably weaker (because it would include ALL basic controls in
    the range #x0 to #x1F, and not only TAB, LF, CR). The restrictive definition
    of "Char" above nevertheless includes the whole range of C1 controls
    (#x80..#x9F), so I can't understand why the Char definition is so
    restrictive on the C0 controls. In addition, the definition of Char also
    *includes* many non-characters: it only excludes surrogates, U+FFFE and
    U+FFFF, but forgets to exclude U+1FFFE and U+1FFFF, U+2FFFE and U+2FFFF,
    ..., U+10FFFE and U+10FFFF.
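
    The gap between the formal production and its comment can be made concrete
    with a small sketch. The function names below are hypothetical, and
    "matches_comment" is only one literal reading of the comment text, not
    something the specification defines:

```python
def is_xml10_char(cp: int) -> bool:
    """Formal XML 1.0 production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    """
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def matches_comment(cp: int) -> bool:
    """Literal reading of the comment (an assumption): any code point
    except the surrogate blocks, FFFE and FFFF."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    return not (0xD800 <= cp <= 0xDFFF) and cp not in (0xFFFE, 0xFFFF)


# Code points the comment would accept but the production rejects.
only_in_comment = [
    cp for cp in range(0x110000)
    if matches_comment(cp) and not is_xml10_char(cp)
]

# Non-characters at the end of planes 1..16 that the production still accepts.
accepted_noncharacters = [
    cp
    for plane in range(1, 0x11)
    for cp in ((plane << 16) | 0xFFFE, (plane << 16) | 0xFFFF)
    if is_xml10_char(cp)
]
```

    Under these assumptions, "only_in_comment" holds exactly the C0 controls
    other than TAB, LF and CR (NUL included), and "accepted_noncharacters"
    holds U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF.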

    So XML does allow Unicode/ISO10646 non-characters... But not all. Apparently
    many XML parsers seem to ignore the restriction of Char above, notably in
    CDATA sections....

    The alternative is then to use numeric character references, as defined by
    this even weaker production (in 4.1. Character and Entity References):

    CharRef ::= '&#' [0-9]+ ';'
             | '&#x' [0-9a-fA-F]+ ';'

    but with this definition:
    "A character reference refers to a specific character in the ISO/IEC 10646
    character set, for example one not directly accessible from available input
    devices."

    Which is exactly the purpose of encoding something like "&#1;" to encode a
    SOH character U+0001 (which after all is a valid Unicode/ISO/IEC10646
    "character"), or even a NUL character.

    The "CharRef" production however is annotated by a Well-Formedness
    Constraint, "Legal Character":
    "Characters referred to using character references must match the production
    for Char."

    Note however that nearly all XML parsers don't seem to honor this constraint
    (like SGML parsers...)!
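
    To illustrate the difference between honoring and ignoring the "Legal
    Character" constraint, here is a minimal sketch of character reference
    resolution. The function "resolve_char_ref" and its "strict" flag are
    hypothetical and do not correspond to any particular parser's API:

```python
import re

# CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def is_xml10_char(cp: int) -> bool:
    """XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def resolve_char_ref(ref: str, strict: bool = True) -> str:
    """Resolve one numeric character reference.

    strict=True applies the 'Legal Character' constraint (XML 1.0 Char).
    strict=False models a permissive parser that accepts any code point
    up to U+10FFFF (an assumption about such parsers, for illustration).
    """
    match = CHAR_REF.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a character reference: {ref!r}")
    hex_digits, dec_digits = match.groups()
    cp = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
    if cp > 0x10FFFF:
        raise ValueError(f"beyond the code space: {ref!r}")
    if strict and not is_xml10_char(cp):
        raise ValueError(f"violates 'Legal Character': {ref!r}")
    return chr(cp)
```

    With strict=True, "&#1;" is rejected; with strict=False it yields the SOH
    character U+0001, which is the behavior described above for permissive
    parsers.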

    This was later amended by an erratum for XML 1.0, which now says that the
    list of code points whose use is *discouraged* (but explicitly *not*
    forbidden) for the "Char" production is:
    [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
    [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
    [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
    [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
    [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
    [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
    [#x10FFFE-#x10FFFF].
    This clause is not really normative, but just adds to the confusion...

    Then comes XML 1.1, which extends the restrictive "Char" production:

    Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

    with the same comment:
    "any Unicode character, excluding the surrogate blocks, FFFE, and FFFF."

    So in XML 1.0, the comment was accurate, not the formal production...

    In XML 1.1, all C0 and C1 controls (except NUL) are now allowed, but the use
    of some of them is restricted in some cases:

    RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] |
    [#x86-#x9F]
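
    The two XML 1.1 productions can be combined into a small classifier. This
    is only a sketch: the function names and the three category labels are
    hypothetical, chosen for illustration:

```python
def is_xml11_char(cp: int) -> bool:
    """XML 1.1: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_restricted(cp: int) -> bool:
    """XML 1.1: RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F]
                                 | [#x7F-#x84] | [#x86-#x9F]"""
    return (
        0x1 <= cp <= 0x8
        or 0xB <= cp <= 0xC
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


def classify_xml11(cp: int) -> str:
    """Return 'forbidden', 'restricted' or 'unrestricted' for a code point."""
    if not is_xml11_char(cp):
        return "forbidden"      # NUL, surrogates, U+FFFE, U+FFFF, out of range
    if is_xml11_restricted(cp):
        return "restricted"     # allowed by Char, but with restrictions
    return "unrestricted"
```

    For example, classify_xml11(0x1) gives "restricted", classify_xml11(0x85)
    gives "unrestricted" and classify_xml11(0x0) gives "forbidden".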

    What is even worse is that XML 1.1 now re-allows NUL in system identifiers
    and URIs, through escaping mechanisms. Clearly, the XML specification is
    inconsistent there. This would explain why most XML parsers are more
    permissive than the "Char" production of the XML specification: they simply
    refer to the definition of valid code points for Unicode and ISO/IEC 10646,
    excluding only surrogate code points (a valid code point can be a
    non-character, and can also be NUL...). The XML parser will accept those
    code points, but will leave validity checking to the application using the
    parsed XML data, or will offer some tuning options to enable this "Char"
    filter (which depends on the XML version...).

    See also the various errata for XML 1.1 related to "RestrictedChar"...
    or to the list of characters whose use is discouraged (meaning explicitly
    not forbidden, so allowed...):

    [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
    [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
    [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
    [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
    [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
    [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
    [#x10FFFE-#x10FFFF].

    (these are: the DEL control of ASCII, all C1 controls except U+0085 (NEL),
    and most non-characters that are not forbidden in the "Char" production)
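
    The discouraged list follows a regular pattern: three fixed ranges, then
    the last two code points of each supplementary plane. As a sketch (all
    names below are hypothetical), it can be generated rather than typed out:

```python
def discouraged_ranges() -> list[tuple[int, int]]:
    """Rebuild the 'discouraged' list quoted above as (low, high) pairs."""
    ranges = [(0x7F, 0x84), (0x86, 0x9F), (0xFDD0, 0xFDDF)]
    # Planes 1 through 16: the two code points ending in FFFE and FFFF.
    for plane in range(1, 0x11):
        base = plane << 16
        ranges.append((base | 0xFFFE, base | 0xFFFF))
    return ranges


def is_discouraged(cp: int) -> bool:
    """True if cp falls in one of the discouraged (not forbidden) ranges."""
    return any(low <= cp <= high for low, high in discouraged_ranges())


def format_ranges() -> str:
    """Render the list in the bracketed notation used in the errata."""
    return ", ".join(f"[#x{low:X}-#x{high:X}]"
                     for low, high in discouraged_ranges())
```

    Note that U+0085 is absent from the list, and that U+FFFE and U+FFFF do not
    appear either, since "Char" already excludes them outright.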

    For those still using XML 1.0, note that the current specification is in a
    "Third Edition"... just to complicate things:
    http://www.w3.org/TR/2004/REC-xml-20040204/


