Re: latin1 decoder implementation

From: Philippe Verdy <>
Date: Tue, 20 Nov 2012 03:46:51 +0100

2012/11/19 "Martin J. Dürst" <>

>> Note also that the W3C
>> does not automatically endorse the Unicode and ISO/IEC 10646 standards
>> either (there's a delay before accepting newer releases of TUS and ISO/IEC
>> 10646, and the W3C now frequently adds several restrictions).
> Can you give examples? As far as I'm aware, the W3C has always tried to
> make sure that e.g. new characters encoded in Unicode can be used as soon
> as possible. There are some cases where this has been missed in the past
> (e.g. XML naming rules), but where corrective action has been taken.

I did not speak about the characters themselves: the whole UCS is
accessible, but with restrictions of use (or incompatibilities of behavior
in the context of HTML). XML is more relaxed about this, and this will not
change, because XML is not just a standard for transporting text but also a
lot of other kinds of data (even so, some data requires a specific syntax,
and there are restricted characters for which you need an alternate
representation, handled not at the DOM level itself but at a higher,
application-specific level of protocol).
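
To make the point about restricted characters concrete, here is a minimal
sketch in Python. It assumes the standard library's xml.etree.ElementTree
parser as a representative XML 1.0 implementation. The element name "blob",
the attribute "enc" and the helpers wrap/unwrap are hypothetical, chosen only
for illustration, and base64 is just one possible application-level
representation.

```python
import base64
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if the parser accepts `doc` as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# U+0001 is not an allowed XML 1.0 character, and a numeric character
# reference does not help: the document is rejected either way.
literal_ok = parses("<blob>\x01</blob>")
reference_ok = parses("<blob>&#1;</blob>")


def wrap(data: bytes) -> str:
    """Carry arbitrary bytes in XML using an application-level encoding."""
    el = ET.Element("blob", {"enc": "base64"})  # hypothetical names
    el.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(el, encoding="unicode")


def unwrap(doc: str) -> bytes:
    """Undo wrap(): the decoding happens above the DOM, in the application."""
    el = ET.fromstring(doc)
    return base64.b64decode(el.text)


payload = b"\x00\x01abc"
restored = unwrap(wrap(payload))
```

The tree itself only ever sees ASCII text here; the knowledge that the text
stands for other data lives in the application, not in the parser or the DOM.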

The most important difference is in how Unicode character properties are
handled, and in the tricky details of Unicode algorithms. We also have
differences in the subset of characters usable for identifiers (XML and
HTML are more restricted, or require an escaping mechanism to work at the
DOM level, since some identifiers are not directly encodable in the XML
syntax without that escaping mechanism).
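
As a rough illustration that the identifier sets do not coincide, the sketch
below probes a parser for what it accepts as an element name and compares
that with Python's str.isidentifier(). Using Python's identifier rule as a
stand-in for "a Unicode identifier" is an assumption made here, and the
helper is_xml_name is hypothetical. The sketch only shows that the two sets
differ for a few ASCII samples; it does not show which set is larger for any
given Unicode version.

```python
import xml.etree.ElementTree as ET


def is_xml_name(name: str) -> bool:
    """Probe the parser: is `name` accepted as an element name?

    The parser is namespace-aware, so names containing ':' are not
    handled meaningfully by this probe.
    """
    try:
        return ET.fromstring(f"<{name}/>").tag == name
    except ET.ParseError:
        return False


# Each row: (candidate, accepted as XML name, accepted as Python identifier)
rows = [
    (candidate, is_xml_name(candidate), candidate.isidentifier())
    for candidate in ["abc", "a-b", "a.b", "1a"]
]

for row in rows:
    print(row)
```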

HTML is not perfect either, because there are also differences of
implementation in the transform between the XML/HTML syntax level and the
resulting data accessible at the DOM level. The transform is not bijective
when you start from the XML syntax, because alternate representations are
possible and are part of the standard. The reverse direction is not
bijective either, but there the cause is implementation bugs, which are
still found everywhere: they are frequent in XML and HTML parsers, and they
also occur, more rarely, in XML/HTML encoders, where the encoded data cannot
be decoded exactly as it was at the initial DOM level.

There are also various interpretations still existing in the handling of
whitespace (according to the xml:space="*" pseudo-attribute, which is
frequently not honored exactly as it should be; such bugs are detected when
trying to implement document signatures). Other variations of
interpretation are caused by named entities (the difference exists between
"validating" and "non-validating" parsers, and even among the validating
ones when there are external document entities, and in the specifications
of data schemas).
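
The asymmetry between the syntax level and the DOM level described above can
be sketched in a few lines of Python. This assumes the standard library's
xml.etree.ElementTree as the parser and serializer; other implementations may
behave differently, which is rather the point. The element name "a" and the
attribute "x" are arbitrary.

```python
import xml.etree.ElementTree as ET

# Four spellings allowed by the XML syntax for the same character data.
SPELLINGS = [
    "<a>\u00e9</a>",              # literal character
    "<a>&#233;</a>",              # decimal character reference
    "<a>&#xE9;</a>",              # hexadecimal character reference
    "<a><![CDATA[\u00e9]]></a>",  # CDATA section
]

# All spellings collapse to one value, so the original spelling cannot be
# recovered from the parsed tree.
texts = {ET.fromstring(doc).text for doc in SPELLINGS}

# Attribute-value normalization: a literal newline becomes a space, while
# the character reference &#10; survives as a newline.
literal_nl = ET.fromstring('<a x="1\n2"/>').get("x")
escaped_nl = ET.fromstring('<a x="1&#10;2"/>').get("x")

# Reverse direction: this serializer does not check that the text is legal
# XML, so the tree below is written out but cannot be read back.
el = ET.Element("a")
el.text = "\x01"
serialized = ET.tostring(el, encoding="unicode")
try:
    ET.fromstring(serialized)
    round_trips = True
except ET.ParseError:
    round_trips = False
```

The first two parts are behavior required by the XML specification; the last
part is the kind of encoder-side failure where data valid at the tree level
does not survive a write followed by a read.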
Received on Mon Nov 19 2012 - 20:51:30 CST
