Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Dec 10 2004 - 20:23:07 CST

Next message: Mark Davis: "Re: US-ASCII (was: Re: Invalid UTF-8 sequences)"

Previous message: D. Starner: "Re: Nicest UTF"
In reply to: John Cowan: "Re: Nicest UTF"
Next in thread: John Cowan: "Re: Nicest UTF"
Reply: John Cowan: "Re: Nicest UTF"

From: "John Cowan" <jcowan@reutershealth.com>
> Marcin 'Qrczak' Kowalczyk scripsit:
>
>> http://www.w3.org/TR/2000/REC-xml-20001006#charsets
>> implies that the appropriate level for parsing XML is code points.
>
> You are reading the XML Recommendation incorrectly. It is not defined
> in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of
> characters. XML processors are required to process UTF-8 and UTF-16,
> and may process other character encodings or not. But the internal
> model is that of characters. Thus surrogate code points are not
> allowed.

I have a different reading, because a "character" in XML is not the same as
a "character" in Unicode. For XML, U+10FFFF is a valid character (its use is
explicitly not recommended, but it is perfectly valid); for Unicode
it's a non-character... For XML, U+0001 is *sometimes* a valid character,
sometimes not.
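
To make the difference concrete, here is a small Python sketch of the Char
productions as I read them in XML 1.0 (the edition linked above) and in
XML 1.1, next to Unicode's definition of a noncharacter. The function names
are mine, chosen for illustration; they come from no specification or
library, and the ranges should be checked against the current texts and
errata before anyone relies on them.

```python
def is_xml10_char(cp: int) -> bool:
    """Char production of XML 1.0, as I read it (assumption: no later errata)."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(cp: int) -> bool:
    """Char production of XML 1.1, as I read it: most C0 controls are allowed."""
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_unicode_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Print how each test treats the code points discussed in this thread.
for cp in (0x0000, 0x0001, 0xD800, 0xFFFE, 0x10FFFF):
    print(f"U+{cp:04X}",
          "xml1.0" if is_xml10_char(cp) else "-",
          "xml1.1" if is_xml11_char(cp) else "-",
          "nonchar" if is_unicode_noncharacter(cp) else "-")
```

Under these assumptions U+10FFFF passes both XML tests while being a Unicode
noncharacter, and U+0001 passes only the XML 1.1 test, which is the
"sometimes" above.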

And I disagree with you that U+0000 can't be used in XML
documents. It can be used in URIs through the URI escaping mechanism, as
explicitly indicated in the XML specification...

And the fact that the various character productions, which are normally
normative, have been changed so often, sometimes through errata that were
forgotten in the text of the next edition of the standard and then
reintroduced in a later erratum, shows that these productions are less
reliable than the descriptive *definitions*, which ARE normative in XML...

The only thing on which I can agree is that XML will forbid surrogates
and U+FFFE and U+FFFF, but I won't say that an XML parser that does not
reject NULs or other non-characters or "disallowed" C0 controls is all that
buggy. I do think that these restrictions are a defect of XML...

But all this is also proof that XML documents are definitely NOT
plain-text documents, so you can't use Unicode encoding rules at the level
of the encoded XML document, only at the finest plain-text nodes (these are
the levels that the productions in the XML standard are trying, with more or
less success, to standardize).

As a consequence, any process that blindly applies a plain-text normalization
to a complete XML document is bogus, because it breaks the most basic XML
conformance, i.e. the core document structure...
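
One concrete case, sketched in Python with only the standard library: text
content that begins with U+0338 COMBINING LONG SOLIDUS OVERLAY. NFC composes
the ">" that closes the start tag with that combining mark into U+226F
NOT GREATER-THAN, so the tag delimiter disappears. The helper name
is_well_formed is mine, and the document is a made-up minimal example, not
taken from any real data.

```python
import unicodedata
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Element content starts with a combining mark right after the start tag.
original = "<a>\u0338</a>"

# Blind plain-text normalization of the whole document, markup included.
normalized = unicodedata.normalize("NFC", original)

print(is_well_formed(original))
print(is_well_formed(normalized))
print([f"U+{ord(c):04X}" for c in normalized])
```

The safer design is to normalize only the character data of each text node
after parsing, and to leave the markup delimiters alone.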


This archive was generated by hypermail 2.1.5 : Fri Dec 10 2004 - 20:24:10 CST