Re: Latin00 (was Re: MES as an ISO standard?)

At 06:35 -0700 1997-07-02, Unicode Discussion wrote:
>On Jul 1, 11:31pm, Markus G. Kuhn wrote:
>> What I am worried about is that there is still such a lot of interest in
>> 8-bit character sets. Look at the new French ISO 8859 Latin-0 proposal for
>> instance:
>8-bit character sets of course have their place, as efficient encodings
>for transferring HTML documents. Document processing happens in the document
>character set, which for HTML is 10646. Provided the character set is
>correctly indicated, there is no problem with using an 8-bit encoding if
>an entire document happens to fit into a small character repertoire.

Well, people still use 8-bit character sets, and, like it or not, will do
so for many years to come.

>The rest of this message is primarily intended for Michael Everson but
>may be of interest to the lists also; it concerns use of numeric character
>references in HTML.

>> The 3 characters missing in Latin 1 to support French fully are
>> "œ", "Œ" and "Ÿ".

>On viewing the source I discovered illegal numeric character references:
>> support French fully are "&#156;", "&#140;" and "&#159;".

I did not encode numeric character references. I write my documents with
the Macintosh character set, and save them in 8-bit in Latin 1 encoding. I
believed that the encoding used by PageSpinner was the Windows superset of
Latin 1, which contains the French characters. Could your source reader
have converted them to numeric references?

>I deduce that the document was written on a Windows machine and that the
>character set was CP-1252. To quote from Chris Wendt of Microsoft:
>> Microsoft Windows platforms use Code Page 1252 to display Latin-1 text
>> such as HTML pages. This character set contains graphics characters in
>> the C1 control area. Problems arise when document authors or
>> authoring systems use these extra characters while still labelling the
>> page as Latin-1. Problems also arise when numeric entity references are
>> generated in the C1 zone, because numeric entity references are
>> resolved relative to the document character set (Unicode) rather than
>> relative to the character encoding used for a particular document.
>These NCRs should either be replaced by the correct NCRs or alternatively
>by the correct named entities [1].
>The correct NCRs are &#339; &#338; and &#376;
>The correct entity names are &oelig; &OElig; and &Yuml;

Ick. Nothing less pleasant than those long entity names. But maybe I should
use them for those characters.....
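
For anyone who wants to repair such pages mechanically, here is a rough
sketch in Python of the remapping described in the quoted text. It assumes
that any decimal NCR in the C1 range (128-159) was really meant as a CP-1252
byte value, which is only a guess about the author's intent. The function
names are made up for illustration and come from no particular tool.

```python
import re

# Matches decimal numeric character references such as &#156;
NCR = re.compile(r"&#(\d+);")


def fix_c1_ncr(match):
    """Reinterpret an NCR in the C1 range (128-159) as a CP-1252 byte.

    Assumption: the author's tool emitted the CP-1252 byte value as the
    reference number instead of the Unicode (ISO 10646) code point.
    """
    code = int(match.group(1))
    if 128 <= code <= 159:
        try:
            char = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # This byte is not assigned in CP-1252; leave the NCR untouched.
            return match.group(0)
        return "&#%d;" % ord(char)
    return match.group(0)


def repair_ncrs(html):
    """Rewrite C1-range NCRs in `html` to the corresponding Unicode NCRs."""
    return NCR.sub(fix_c1_ncr, html)


print(repair_ncrs("&#156;, &#140; and &#159;"))
# prints: &#339;, &#338; and &#376;
```

References outside 128-159 are passed through unchanged, so legitimate
Latin 1 NCRs such as &#233; are not affected.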

