Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 10 2004 - 19:18:42 CST

Next message: John Cowan: "Re: Nicest UTF"

Previous message: D. Starner: "Re: Nicest UTF"
Maybe in reply to: Theodore H. Smith: "Nicest UTF"
Next in thread: John Cowan: "Re: Nicest UTF"
Reply: John Cowan: "Re: Nicest UTF"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"

From: "Philippe Verdy" <verdy_p@wanadoo.fr>
> From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
>> "Philippe Verdy" <verdy_p@wanadoo.fr> writes:
>>
>>> The XML/HTML core syntax is defined with fixed behavior of some
>>> individual characters like '&', '<', quotation marks, and with special
>>> behavior for spaces.
>>
>> The point is: what "characters" mean in this sentence. Code points?
>> Combining character sequences? Something else?
>
> See the XML character model document... XML ignores combining sequences.
> But for Unicode and for XML a character is an abstract character with a
> single code allocated in a *finite* repertoire. The repertoire of all
> possible combining characters sequences is already infinite in Unicode, as
> well as the number of "default grapheme clusters" they can represent.

Note that there are several differently relaxed definitions of what
constitutes a "character" for XML.
If you look at XML 1.0 Second Edition, it specifies that the document is
a "text" (defined only as a sequence of "characters", which may represent
markup or character data) that will only contain characters in this set:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]

But the comment following it specifies:
"any Unicode character, excluding the surrogate blocks, FFFE, and FFFF."
which is considerably weaker (because it would include ALL basic controls in
the range #x0 to #x1F, and not only TAB, LF, CR). The restrictive definition
of "Char" above nevertheless includes the whole range of C1 controls
(#x80..#x9F), so I can't understand why it is so restrictive on C0 controls
only. In addition, the definition of Char also *includes* many non-characters:
it only excludes surrogates, U+FFFE and U+FFFF, but forgets to exclude
U+1FFFE and U+1FFFF, U+2FFFE and U+2FFFF, ..., U+10FFFE and U+10FFFF.
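
A minimal sketch in Python of the point above, transcribing the XML 1.0 "Char"
production as a range table. The names is_xml10_char and
unicode_noncharacters are hypothetical helpers, not taken from any XML
library; the list of noncharacters (U+FDD0..U+FDEF plus the last two code
points of each plane) comes from Unicode, not from the XML specification.

```python
# Ranges transcribed from the XML 1.0 "Char" production quoted above.
XML10_CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]


def is_xml10_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 Char production."""
    return any(lo <= cp <= hi for lo, hi in XML10_CHAR_RANGES)


def unicode_noncharacters():
    """Yield the 66 Unicode noncharacters (assumption: Unicode's own list)."""
    yield from range(0xFDD0, 0xFDF0)      # U+FDD0..U+FDEF
    for plane in range(0x11):             # planes 0..16
        yield (plane << 16) | 0xFFFE
        yield (plane << 16) | 0xFFFF


# Noncharacters that the Char production still lets through.
admitted = [cp for cp in unicode_noncharacters() if is_xml10_char(cp)]
# C0 and C1 controls accepted by the production.
c0_allowed = [cp for cp in range(0x20) if is_xml10_char(cp)]
c1_allowed = [cp for cp in range(0x80, 0xA0) if is_xml10_char(cp)]

print(len(admitted))                      # 64
print([hex(cp) for cp in c0_allowed])     # ['0x9', '0xa', '0xd']
print(len(c1_allowed))                    # 32
```

So of the 66 noncharacters only U+FFFE and U+FFFF are rejected, only three C0
controls pass, and every C1 control passes.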

So XML does allow Unicode/ISO 10646 non-characters... but not all. Many XML
parsers seem to ignore the restriction of Char above, notably in
CDATA sections...

The alternative is then to use numeric character references, as defined by
this even weaker production (in 4.1. Character and Entity References):

CharRef ::= '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';'

but with this definition:
"A character reference refers to a specific character in the ISO/IEC 10646
character set, for example one not directly accessible from available input
devices."

Which is exactly the purpose of writing something like "&#1;" to encode a
SOH character U+0001 (which after all is a valid Unicode/ISO/IEC 10646
"character"), or even a NUL character.

The "CharRef" production however is annotated by a Well-Formedness
Constraint, "Legal Character":
"Characters referred to using character references must match the production
for Char."
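
As a rough illustration of what honoring this constraint would look like, here
is a small Python sketch that parses a reference with the "CharRef" production
and then applies the XML 1.0 "Char" production to the result. The function
names and the choice of ValueError are assumptions made for the example, not
the behavior of any particular parser.

```python
import re

# CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
CHAR_REF = re.compile(r"&#(?:([0-9]+)|x([0-9a-fA-F]+));")

# Ranges transcribed from the XML 1.0 "Char" production.
XML10_CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]


def is_xml10_char(cp: int) -> bool:
    return any(lo <= cp <= hi for lo, hi in XML10_CHAR_RANGES)


def resolve_char_ref(ref: str) -> str:
    """Resolve one character reference, enforcing "Legal Character"."""
    m = CHAR_REF.fullmatch(ref)
    if m is None:
        raise ValueError(f"not a character reference: {ref!r}")
    decimal, hexadecimal = m.groups()
    cp = int(decimal, 10) if decimal is not None else int(hexadecimal, 16)
    if not is_xml10_char(cp):
        # This is the check that permissive parsers skip.
        raise ValueError(f"U+{cp:04X} does not match Char")
    return chr(cp)
```

With this check, "&#65;" and "&#x41;" both resolve to "A", while a reference
to U+0001 or to NUL is rejected as not well-formed.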

Note however that nearly all XML parsers don't seem to honor this constraint
(like SGML parsers...)!

This was later amended in an erratum for XML 1.0, which says that the list
of code points whose use is *discouraged* (but explicitly *not* forbidden)
for the "Char" production is:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
This clause is not really normative, but just adds to the confusion...

Then comes XML 1.1, which extends the restrictive "Char" production:

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

with the same comment:
"any Unicode character, excluding the surrogate blocks, FFFE, and FFFF."

So in XML 1.0, the comment was accurate, not the formal production...

In XML 1.1, all C0 and C1 controls (except NUL) are now allowed, but the use
of some of them is restricted in some cases:

RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] |
[#x86-#x9F]
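
To make the XML 1.1 split concrete, here is a short Python sketch that sorts a
code point into three buckets using the two XML 1.1 productions quoted here
("Char" and "RestrictedChar"). The function name classify_xml11 and the three
labels are invented for the example.

```python
# Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]   (XML 1.1)
XML11_CHAR_RANGES = [(0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

# RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84]
#                  | [#x86-#x9F]
XML11_RESTRICTED_RANGES = [
    (0x1, 0x8), (0xB, 0xC), (0xE, 0x1F), (0x7F, 0x84), (0x86, 0x9F),
]


def _in_ranges(cp: int, ranges) -> bool:
    return any(lo <= cp <= hi for lo, hi in ranges)


def classify_xml11(cp: int) -> str:
    """Return 'forbidden', 'restricted' or 'plain' (labels are hypothetical)."""
    if not _in_ranges(cp, XML11_CHAR_RANGES):
        return "forbidden"      # NUL, surrogates, U+FFFE, U+FFFF, out of range
    if _in_ranges(cp, XML11_RESTRICTED_RANGES):
        return "restricted"     # matches Char, but also RestrictedChar
    return "plain"


for cp in (0x0, 0x1, 0x9, 0x7F, 0x85, 0xD800):
    print(f"U+{cp:04X}", classify_xml11(cp))
```

Note that TAB, LF, CR and NEL (U+0085) fall outside RestrictedChar, so they
come out as "plain".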

What is even worse is that XML 1.1 now re-allows NUL for system identifiers
and URIs, through escaping mechanisms. Clearly, the XML specification is
inconsistent there. This would explain why most XML parsers are more
permissive than the "Char" production of the XML specification: they simply
refer to the definition of valid code points for Unicode and ISO/IEC 10646,
excluding only surrogate code points (a valid code point can be a
non-character, and can also be NUL...). The XML parser will accept those code
points, but will leave validity checking to the application using the parsed
XML data, or will offer some tuning options to enable this "Char" filter
(which depends on the XML version...).

See also the various errata for XML 1.1 related to "RestrictedChar"...
Or to the list of characters whose use is discouraged (meaning explicitly
not forbidden, so allowed...):

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].

(these are: the DEL control of ASCII, all C1 controls except NEL (#x85),
and most non-characters that are not forbidden in the "Char" production)
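
An application that wants to warn about these code points, rather than reject
them, could use something like the following Python sketch. The table is built
from the list above; is_discouraged is a hypothetical helper name, and
emitting a warning is just one possible policy.

```python
# [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], then the last two code points
# of planes 1..16 ([#x1FFFE-#x1FFFF] ... [#x10FFFE-#x10FFFF]).
DISCOURAGED_RANGES = [(0x7F, 0x84), (0x86, 0x9F), (0xFDD0, 0xFDDF)] + [
    ((plane << 16) | 0xFFFE, (plane << 16) | 0xFFFF)
    for plane in range(1, 0x11)
]


def is_discouraged(cp: int) -> bool:
    """True if cp is in the 'discouraged' list (allowed, but best avoided)."""
    return any(lo <= cp <= hi for lo, hi in DISCOURAGED_RANGES)


def discouraged_in(text: str):
    """Return the discouraged code points found in text, in order."""
    return [ord(ch) for ch in text if is_discouraged(ord(ch))]


print([hex(cp) for cp in discouraged_in("a\x7fb\x85c\ufdd0")])
```

Here U+0085 is not reported, since NEL is the one C1 control left out of the
list; U+FFFE and U+FFFF are not reported either, because "Char" forbids them
outright.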

For those still using XML 1.0, note that the current specification is in a
"Third Edition"... just to complicate things:
http://www.w3.org/TR/2004/REC-xml-20040204/


This archive was generated by hypermail 2.1.5 : Fri Dec 10 2004 - 19:20:55 CST