Latin00 (was Re: MES as an ISO standard?)

From: Chris Lilley (Chris.Lilley@sophia.inria.fr)
Date: Wed Jul 02 1997 - 09:37:02 EDT


On Jul 1, 11:31pm, Markus G. Kuhn wrote:

> What I am worried about is that there is still such a lot of interest in
> 8-bit character sets. Look at the new French ISO 8859 Latin-0 proposal for
> instance:
>
> http://www.indigo.ie/egt/standards/iso8859/latin00.html
>

8-bit character sets of course have their place, as efficient encodings
for transferring HTML documents. Document processing happens in the document
character set, which for HTML is 10646. Provided the character set is
correctly indicated, there is no problem with using an 8-bit encoding if
an entire document happens to fit into a small character repertoire.
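
As a rough illustration of that last point, here is a minimal Python sketch
that picks an 8-bit encoding only when the whole document fits its repertoire.
The function name choose_encoding and the UTF-8 fallback are my assumptions
for the example, not something any specification prescribes.

```python
def choose_encoding(text: str) -> str:
    """Return a charset label for transmitting `text` (hypothetical helper)."""
    try:
        # Succeeds only if every character is in the Latin-1 repertoire.
        text.encode("iso-8859-1")
        return "iso-8859-1"
    except UnicodeEncodeError:
        # Assumed fallback: an encoding that covers all of ISO 10646.
        return "utf-8"


print(choose_encoding("café"))  # iso-8859-1
print(choose_encoding("cœur"))  # utf-8, since "œ" is not in Latin-1
```

Whichever label comes back must then be sent with the document, otherwise
the receiver has to guess.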

The rest of this message is primarily intended for Michael Everson but
may be of interest to the lists also; it concerns use of numeric character
references in HTML.

=========

I was reading http://www.indigo.ie/egt/standards/iso8859/latin00.html
and was puzzled to see:

> The 3 characters missing in Latin 1 to support French fully are
> " ", " " and " ".

On viewing the source I discovered illegal numeric character references

> support French fully are " &#156; ", "
> &#140; " and " &#159; ".

The document character set for HTML is 10646, thus all numeric character
references refer to this regardless of the character encoding used to
transmit the document. This document is unlabelled, so in accordance with
the HTTP specification is assumed to be in ISO-8859-1. The three NCRs do
not refer to printable characters, but to the C1 control characters.
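
The difference between the two readings can be checked with a short Python
sketch. It assumes the three references in the page source carry the values
156, 140 and 159 (the CP-1252 positions of the oe ligatures and Y diaeresis);
the helper name interpretations is made up for the example.

```python
import unicodedata


def interpretations(n: int) -> tuple[str, str]:
    """Return (Unicode category of code point n, character at byte n in CP-1252)."""
    as_10646 = chr(n)                        # what &#n; denotes in the document character set
    as_cp1252 = bytes([n]).decode("cp1252")  # what the author presumably meant
    return unicodedata.category(as_10646), as_cp1252


# Assumed values of the three references found in the page source.
for n in (156, 140, 159):
    category, intended = interpretations(n)
    print(n, category, intended, ord(intended))
# 156 Cc œ 339
# 140 Cc Œ 338
# 159 Cc Ÿ 376
```

Category "Cc" is the control-character class, so a conforming browser has
nothing printable to show for any of the three.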

I deduce that the document was written on a Windows machine and that the
character set was CP-1252. To quote from Chris Wendt of Microsoft:

> Microsoft Windows platforms use Code Page 1252 to display Latin-1 text
> such as HTML pages. This character set contains graphics characters in
> the C1 control area. Problems arise when document authors or
> authoring systems use these extra characters while still labelling the
> page as Latin-1. Problems also arise when numeric entity references are
> generated in the C1 zone, because numeric entity references are
> resolved relative to the document character set (Unicode) rather than
> relative to the character encoding used for a particular document.

These NCRs should either be replaced by the correct NCRs or alternatively
by the correct named entities [1].

The correct NCRs are &#339; &#338; and &#376;

The correct entity names are &oelig; &OElig; and &Yuml;
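
For pages that contain many such references, the repair could be automated
along the following lines. This is only a sketch under the assumption that
every decimal NCR in the range 128-159 was meant as a CP-1252 byte; the
function name fix_c1_ncrs is hypothetical, and hexadecimal references are
not handled.

```python
import re

_DECIMAL_NCR = re.compile(r"&#(\d+);")


def fix_c1_ncrs(html: str) -> str:
    """Rewrite decimal NCRs in the C1 range to the code point CP-1252 assigns."""

    def repl(match: re.Match) -> str:
        n = int(match.group(1))
        if 128 <= n <= 159:
            try:
                code_point = ord(bytes([n]).decode("cp1252"))
            except UnicodeDecodeError:
                # A few positions (e.g. 129) are unassigned in CP-1252; leave as-is.
                return match.group(0)
            return f"&#{code_point};"
        return match.group(0)  # references outside the C1 range are untouched

    return _DECIMAL_NCR.sub(repl, html)


print(fix_c1_ncrs('" &#156; ", " &#140; " and " &#159; "'))
# " &#339; ", " &#338; " and " &#376; "
```

References such as &#233; pass through unchanged, since they already mean
the same thing in the document character set.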

[1] http://www.w3.org/TR/WD-entities

-- 
Chris Lilley, W3C                          [ http://www.w3.org/ ]
Graphics and Fonts Guy            The World Wide Web Consortium
http://www.w3.org/people/chris/              INRIA,  Projet W3C
chris@w3.org                       2004 Rt des Lucioles / BP 93
+33 (0)4 93 65 79 87       06902 Sophia Antipolis Cedex, France


