Re: HTML - i18n / NCR & charsets

From: Misha Wolf (MISHA.WOLF@reuters.com)
Date: Wed Nov 27 1996 - 12:39:28 EST


We have three representations:
(a) raw octets
(b) numeric character references
(c) entity names.

Numeric character references are, of course, supposed to refer to code
positions in Unicode/ISO 10646, whatever the charset of the document.

The charset, whether specified via HTTP or HTML or a menu, should affect
the interpretation of (a). It should *not* affect the interpretation of
(b) or (c). The major browsers were broken in this regard and are being
gradually fixed.
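
To make the rule above concrete, here is a minimal sketch in Python. The
function name decode_document and the sample octets are my own
illustration, not taken from any browser; html.unescape from the standard
library stands in for reference resolution. The charset is applied to the
raw octets only, and the references are resolved afterwards against
Unicode.

```python
import html

def decode_document(raw: bytes, charset: str) -> str:
    """Hypothetical helper: decode octets, then resolve references."""
    # Step 1: the charset governs only the raw octets, case (a).
    text = raw.decode(charset)
    # Step 2: numeric references (b) and entity names (c) are resolved
    # against Unicode, independent of the charset used in step 1.
    return html.unescape(text)

# One raw octet 0xE9, one numeric reference and one entity name.
raw = b"\xe9 &#233; &eacute;"

latin1 = decode_document(raw, "iso-8859-1")
cyrillic = decode_document(raw, "iso-8859-5")

print(latin1)    # all three come out as U+00E9
print(cyrillic)  # raw octet becomes U+0449; the references stay U+00E9
```

Only the first character changes when the charset changes. A browser
that let the charset alter the other two would be showing the bug
described above.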

An example of a "cheesy little editor" that created lots of polluted Web
pages was FrontPage 1.0. Though Microsoft sold it as suitable only for
Code Page 1252, lots of people used it on other Code Pages. FP 1.0 simply
exports stuff as if it were CP 1252, hence a Russian Web page ends up full
of Latin 1 entity names! FP 2.0 (aka 97) has, I believe, fixed this.
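
The effect can be reproduced with a few lines of Python. This is only a
sketch of the failure mode, not FrontPage's actual code: the function
export_as_cp1252 is hypothetical, and I am assuming the Russian text was
typed on a system using Code Page 1251.

```python
from html.entities import codepoint2name

def export_as_cp1252(raw: bytes) -> str:
    """Hypothetical stand-in for the faulty export step."""
    # The mistake: the octets are read as CP 1252 regardless of the
    # code page they were actually typed in.
    text = raw.decode("cp1252")
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            # Non-ASCII characters are written out as entity names.
            out.append("&" + name + ";")
        else:
            out.append(ch)
    return "".join(out)

russian = "\u043c\u0438\u0440"    # a three-letter Russian word
raw = russian.encode("cp1251")    # assumption: author's system used CP 1251
polluted = export_as_cp1252(raw)

print(raw)       # b'\xec\xe8\xf0'
print(polluted)  # &igrave;&egrave;&eth;
```

Once the text is in this form, no charset label can rescue it, since
entity names are not subject to the charset. The page has to be
regenerated or converted back.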

The various Internet Assistants did the same foul thing. I hope they've
been fixed.

The pages created using these tools will presumably (?) get fixed when
their authors pass them through the new versions of the tools. Can anyone
confirm/deny this?

Misha


