Date: Thu Jul 03 1997 - 08:34:20 EDT

At 09:37 02/07/97 -0400, Chris Lilley wrote:

>8 bit character sets of course have their place, as efficient encodings
>for transferring HTML documents. Document processing happens in the document
>character set, which for HTML is 10646. Provided the character set is
>correctly indicated, there is no problem with using an 8bit encoding if
>an entire document happens to fit into a small character repertoire.
>On viewing the source I discovered illegal numeric character references
>> support French fully are "œ", "Œ" and "Ÿ".
>The document character set for HTML is 10646, thus all numeric character
>references refer to this regardless of the character encoding used to
>transmit the document. This document is unlabelled, so in accordance with
>the HTTP specification is assumed to be in ISO-8859-1. The three NCRs do
>not refer to printable characters, but to the C1 control characters.
>I deduce that the document was written on a Windows machine and that the
>character set was CP-1252. To quote from Chris Wendt of Microsoft:
>> Microsoft Windows platforms use Code Page 1252 to display Latin-1 text
>> such as HTML pages. This character set contains graphics characters in
>> the C1 control area. Problems arise when document authors or
>> authoring systems use these extra characters while still labelling the
>> page as Latin-1. Problems also arise when numeric entity references are
>> generated in the C1 zone, because numeric entity references are
>> resolved relative to the document character set (Unicode) rather than
>> relative to the character encoding used for a particular document.
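The mismatch described in the quoted text can be reproduced in a few lines. This is only a sketch under an assumption: the three reference values are not legible in the quoted message, so 156, 140 and 159 (the CP-1252 byte positions of œ, Œ and Ÿ) are used here as the likely candidates. The function names are hypothetical.

```python
import unicodedata

# Assumed values: the CP-1252 byte positions of the small OE ligature,
# the capital OE ligature and the capital Y with diaeresis.
CODES = [156, 140, 159]


def as_document_character(code: int) -> str:
    """Resolve a numeric character reference against ISO 10646 / Unicode."""
    return chr(code)


def as_cp1252_byte(code: int) -> str:
    """Interpret the same number as a single byte in Windows code page 1252."""
    return bytes([code]).decode("cp1252")


for code in CODES:
    doc_char = as_document_character(code)
    win_char = as_cp1252_byte(code)
    # Category "Cc" marks a control character (here, the C1 range).
    print(
        f"&#{code}; -> U+{ord(doc_char):04X} "
        f"(category {unicodedata.category(doc_char)}); "
        f"CP-1252 byte {code} -> U+{ord(win_char):04X} "
        f"({unicodedata.name(win_char)})"
    )
```

Resolved against the document character set, each number lands on a C1 control character. Only when the number is read as a CP-1252 byte does it yield the intended letter, which is why such references are not portable.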

[Alain] :
You have a nice illustration of the problem we're trying to solve, and
which I am confident we will solve.

We modified the "Latin 0" proposal on Wednesday night, ahead of the
SC2/WG3 meeting and the SC2 Plenary, which are held later this week and
next week. Apart from the EURO, the French ligatures OE, the upper case
Y DIAERESIS and the characters missing for full support of Finnish in
Latin 1 (a language which, like French, was supposed to be fully covered
but was not) will be added to this new Latin table. The table is intended
to replace Latin 1 within 5 years at the latest, in parallel with the
transition to UNICODE, of course (and, importantly, to serve as a standard
reference for private codes such as EBCDIC code tables on mainframes, to
allow character integrity across the full range of platforms).
Btw the missing Finnish characters are upper case and lower case S CARON
and Z CARON.
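The repertoire listed above can be checked against today's codecs. This sketch rests on an assumption that is not stated in this message: that the "Latin 0" table corresponds to what was later standardized as ISO 8859-15. The helper name `encodable` and the `PROPOSED` table are illustrative only.

```python
# Characters named in the proposal, keyed by their Unicode names.
PROPOSED = {
    "EURO SIGN": "\u20ac",
    "LATIN CAPITAL LIGATURE OE": "\u0152",
    "LATIN SMALL LIGATURE OE": "\u0153",
    "LATIN CAPITAL LETTER Y WITH DIAERESIS": "\u0178",
    "LATIN CAPITAL LETTER S WITH CARON": "\u0160",
    "LATIN SMALL LETTER S WITH CARON": "\u0161",
    "LATIN CAPITAL LETTER Z WITH CARON": "\u017d",
    "LATIN SMALL LETTER Z WITH CARON": "\u017e",
}


def encodable(char: str, encoding: str) -> bool:
    """Return True if `char` has a code position in `encoding`."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for name, char in PROPOSED.items():
    in_latin1 = encodable(char, "latin_1")
    in_latin9 = encodable(char, "iso8859_15")  # assumed successor table
    print(f"{name}: Latin 1 = {in_latin1}, ISO 8859-15 = {in_latin9}")
```

Under that assumption, none of the eight characters can be encoded in Latin 1, while all of them have a position in the later table.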

The proposal was initially a joint France-Canada proposal. It is now
officially a France-Canada-Finland-Ireland-Denmark proposal. Other
countries' support is very likely, judging by the atmosphere in the
corridors here.

It remains a table intended for Western European languages. There could
have been competition among other languages if we had tried to add 2 more
characters, so we stopped there. It is limited to correcting the current
problems internal to Latin 1, taking the opportunity of the urgent need
for EURO SYMBOL support to make the change (a problem not to be
underestimated in all its implications, human, economic, technical and
psychological, and considered in Europe as important as the YEAR 2000
transition).

Alain LaBonté
Iraklion, Ellas

