Re: HTML - i18n / NCR & charsets

From: Martin J. Duerst (mduerst@ifi.unizh.ch)
Date: Thu Nov 28 1996 - 06:07:22 EST


On Tue, 26 Nov 1996, Misha Wolf wrote:

> Indeed, ISO 8859-1 is a *strict* subset of Unicode, hence there are *no*
> differences between the two.
>
> Microsoft's Windows Code Page 1252 (often called Windows Latin 1) has
> characters in the range 80-9F (decimal 128-159), unlike either the ISO
> 8859-X family of standards or Unicode.
>
> The NBSP is at A0 (decimal 160) and so presents no problems. WCP 1252
> has a bullet at 95 (decimal 149), not (as far as I can see) at decimal
> 143. The numeric character reference &#149; is illegal.
>
> Chris Wendt, from Microsoft, agreed at Seville that the use of illegal
> numeric character references was unfortunate and asked for suggestions.
> The consensus was that entity names should be used instead. As entity
> names do not (appear to) exist for most of Microsoft's extra chars, it
> was suggested that some enterprising person write them up in an RFC.
> I believe there was at least one volunteer: Chris Lilley of W3C.

I was there, but don't remember this part of the discussion.
Defining entity names for things such as "..." may not be such
a bad idea.
However, one has to be aware of a few related facts before
actually doing this:

- Using 8-bit data directly and correctly labeling the page as
        being encoded in Windows Code Page 1252 is an existing
        solution (to the extent that browsers support CP 1252,
        and bearing in mind that starting to use all kinds of
        proprietary encodings is not really ideal).
- Using the correct numeric character reference (i.e. the
        character's Unicode code point, such as 8226 for the
        bullet) is also a solution. As this uses decimal values
        beyond 255, and I have not yet heard of any pages using
        such values for anything other than Unicode, it should
        not cause compatibility problems. It works on all
        browsers that support this part of the i18n spec.
- When we developed the i18n draft, we were repeatedly asked
        by various parties to include more entities, covering
        all kinds of areas. We decided to complete Latin-1, but
        not to go beyond it, so as not to delay our work
        further. I guess that anybody who starts to work on
        additional character entities won't be able to stop
        at the few characters that are in CP 1252. The list
        may also quickly become too long to be feasible as a
        single list.
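
To illustrate the second point, here is a minimal sketch of how the
illegal references in the range 128-159 could be rewritten as the
corresponding Unicode references. It is not part of any spec or tool
discussed here: the function names cp1252_to_ncr and fix_illegal_ncrs
are hypothetical, and it assumes that the page author meant the
CP 1252 character and that Python's built-in "cp1252" codec is an
acceptable mapping table.

```python
import re

def cp1252_to_ncr(byte_value):
    """Return the Unicode numeric character reference for one CP 1252 byte.

    Returns None if the codec has no character assigned to that byte.
    (Hypothetical helper, for illustration only.)
    """
    try:
        ch = bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        return None
    return "&#%d;" % ord(ch)

# Matches decimal references from 120 to 159; the range check below
# narrows this to 128-159, the range ISO 8859-1 leaves to control codes.
_CANDIDATE_NCR = re.compile(r"&#(1[2-5][0-9]);")

def fix_illegal_ncrs(html):
    """Rewrite references such as &#149; to their Unicode equivalents."""
    def repl(match):
        n = int(match.group(1))
        if 128 <= n <= 159:
            fixed = cp1252_to_ncr(n)
            if fixed is not None:
                return fixed
        # Leave everything else (including unassigned bytes) untouched.
        return match.group(0)
    return _CANDIDATE_NCR.sub(repl, html)

if __name__ == "__main__":
    print(fix_illegal_ncrs("one &#149; two &#133; three &#160;"))
```

References that are already legal, such as the one for NBSP at 160,
pass through unchanged, as do bytes that CP 1252 leaves unassigned.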

Regards, Martin.


