Re: HTML - i18n / NCR & charsets

From: Misha Wolf (MISHA.WOLF@reuters.com)
Date: Tue Nov 26 1996 - 19:39:14 EST


Indeed, ISO 8859-1 is a *strict* subset of Unicode (its 256 codes are
the first 256 codes of Unicode), hence there are *no* differences
between the two over that range.

Microsoft's Windows Code Page 1252 (often called Windows Latin 1) has
characters in the range 80-9F (decimal 128-159), unlike either the ISO
8859-X family of standards or Unicode.

The NBSP is at A0 (decimal 160) and so presents no problems. WCP 1252
has a bullet at 95 (decimal 149), not (as far as I can see) at decimal
143. The numeric character reference &#149; is illegal.
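
The code-point claims above can be checked with a short Python sketch.
It assumes that Python's built-in "iso-8859-1" and "cp1252" codecs
follow the published mappings; the helper name cp1252_char is made up
for this illustration.

```python
# ISO 8859-1 maps every byte value to the Unicode code point of the same number.
all_bytes = bytes(range(256))
assert all_bytes.decode("iso-8859-1") == "".join(chr(i) for i in range(256))


def cp1252_char(byte_value):
    """Return the character cp1252 assigns to a byte, or None if unassigned."""
    try:
        return bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        return None


bullet = cp1252_char(0x95)      # decimal 149
nbsp = cp1252_char(0xA0)        # decimal 160
unassigned = cp1252_char(0x8F)  # decimal 143

print(f"149 -> U+{ord(bullet):04X}")  # the bullet lives at U+2022 in Unicode
print(f"160 -> U+{ord(nbsp):04X}")    # NBSP keeps the same number, U+00A0
print("143 ->", unassigned)           # no character assigned by this codec
```

So a document that wants the bullet and is read as Unicode needs
code point 8226 (hex 2022), not 149.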

Chris Wendt, from Microsoft, agreed at Seville that the use of illegal
numeric character references was unfortunate and asked for suggestions.
The consensus was that entity names should be used instead. As entity
names do not (appear to) exist for most of Microsoft's extra chars, it
was suggested that some enterprising person write them up in an RFC.
I believe there was at least one volunteer: Chris Lilley of W3C.

Misha

---

There are just a few differences, mainly in the empty block which has the funny chars such as the bullet (143) and non-breaking space (160), to name the popular offenders.

---

Hmmm... Is there actually a difference between the first 256 codes of Unicode and ISO 8859-1? I thought they were identical over that range?

---

A small bit of text on i18n HTML and possible problems with numeric character references (NCRs) being taken as indexes into Unicode rather than into the charset announced in HTTP, and the lack of out-of-band signalling of this break with current practice.

As HTML is often transported using HTTP, the current proposal for an internationalized version of HTML conflicts in several ways with widespread existing practice and with the charset information communicated 'out-of-HTML-band' at the HTTP level, or with the default latin1 assumption.

In the HTTP header, a resource sent out can be labeled with a charset. This label is not part of the document stream, but is sent separately in the MIME header of HTTP. If no charset is defined in this way, latin1 is to be assumed.
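
As a rough illustration of that labelling rule, here is a minimal Python sketch that reads the charset from a Content-Type value and falls back to latin1 when none is announced. The function name announced_charset and the sample header values are assumptions made for this example; the parsing relies on the standard library's email.message.Message.

```python
from email.message import Message

DEFAULT_CHARSET = "iso-8859-1"  # the latin1 default described above


def announced_charset(content_type_value):
    """Return the charset named in a Content-Type value, or the latin1 default."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    # get_content_charset() lower-cases the parameter value it finds.
    return msg.get_content_charset(failobj=DEFAULT_CHARSET)


labelled = announced_charset("text/html; charset=ISO-8859-5")
unlabelled = announced_charset("text/html")

print(labelled)
print(unlabelled)
```

Note that the charset here comes only from the header; nothing inside the document stream is consulted.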

In the actual world people have taken to using so-called Numerical Glyph/Character references within their HTML documents, such as &#160;, which are simply indexes into the 'defined' character set.

In the i18n proposal these numerical references are taken to be indexes into the Unicode set, so-called 'code points'. This is regardless of the character set announced in the header (or in an http-equiv in the actual body).

Currently most of these numerical references are intended by their authors to be indexes into latin1 or, if a charset is announced in the HTTP header by the server, as an index into that set.
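
The two readings of a numerical reference can be put side by side in a small Python sketch. The function name expand_ncrs is hypothetical, and "cp1252" is used only as an example of an announced single-byte charset in which the readings disagree; values above 255 are not handled in the charset reading.

```python
import re

NCR = re.compile(r"&#([0-9]+);")


def expand_ncrs(text, charset=None):
    """Replace decimal NCRs in text.

    charset=None  : the number is a Unicode code point (the i18n reading).
    charset="..." : the number is a byte value in that single-byte charset
                    (the existing-practice reading); only 0-255 is supported.
    """
    def replace(match):
        number = int(match.group(1))
        if charset is None:
            return chr(number)
        return bytes([number]).decode(charset)

    return NCR.sub(replace, text)


sample = "one &#149; two"
as_unicode = expand_ncrs(sample)            # U+0095, a control code
as_cp1252 = expand_ncrs(sample, "cp1252")   # U+2022, the bullet

print(repr(as_unicode))
print(repr(as_cp1252))
```

For latin1 the two readings coincide for every value from 0 to 255, which is why the break only shows up for charsets whose numbering differs from Unicode.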

Effectively HTML has been upgraded to a new and better version, which most certainly addresses, and has solved, some of the issues related to internationalized publishing.

Although the i18n proposal is most certainly the way to go, and superior in every respect; it does break some widespread current practice.

I acknowledge that the cases where it breaks practice are few and far between, and mainly concern just a few pi-font symbols such as the bullet, but the principle is just as important. Also, I do realize that there is a 'Gödel' problem in that the actual message cannot know about the charset representation, and that thus the Content-type announcement of the charset in the HTTP header is dubious when it comes to NCRs.

Some possible solutions are proposed:

1. An extended Content-type header is used.
   Content-type: text/html.i18n
   Content-type: text/html-i18n

2. An additional attribute to the charset is used.
   Content-type: text/html; charset=iso-8859-1; ncr=iso-104..

3. An additional (level) attribute to the text/html is used.
   Content-type: text/html; level=2; charset=iso8859-1
   Content-type: text/html; version=2.0/i; charset=iso8859-1

4. An additional DTD specifier in the HTML is insisted upon.
   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 2.0i//EN">

5. An additional header is added to signal that the site is internationalised.
   Content-Quality: i18n/v1.02

Please note that the effect accomplished by each of the above techniques is similar; they serve to inform the receiving end about the way any in-line numerical character references are to be treated.
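
To make the receiving end's side of this concrete, here is a hedged Python sketch of how a client might test for the header-based signals (options 1, 2, 3 and 5). None of these signals is standardized; the function name signals_i18n_ncrs is invented for this example, the "ncr=example" value in the usage lines is a placeholder rather than a value taken from the proposal, and the 'level' attribute of option 3 is left out because its meaning is not spelled out above. Option 4 lives in the document body and is not covered.

```python
from email.message import Message


def signals_i18n_ncrs(headers):
    """Return True if any of the proposed header signals is present.

    `headers` is a dict of response header names to values. All signals
    checked here are proposals only, not standardized behaviour.
    """
    msg = Message()
    for name, value in headers.items():
        msg[name] = value

    # Option 1: an extended media subtype.
    if msg.get_content_subtype() in ("html.i18n", "html-i18n"):
        return True
    # Option 2: an extra 'ncr' attribute next to the charset.
    if msg.get_param("ncr") is not None:
        return True
    # Option 3: a version attribute carrying an '/i' suffix.
    version = msg.get_param("version")
    if version is not None and version.endswith("/i"):
        return True
    # Option 5: a separate header announcing i18n.
    if msg.get("Content-Quality", "").startswith("i18n/"):
        return True
    return False


plain = signals_i18n_ncrs({"Content-Type": "text/html; charset=iso-8859-1"})
by_subtype = signals_i18n_ncrs({"Content-Type": "text/html-i18n"})
by_param = signals_i18n_ncrs({"Content-Type": "text/html; charset=iso-8859-1; ncr=example"})
by_header = signals_i18n_ncrs({"Content-Type": "text/html", "Content-Quality": "i18n/v1.02"})
```

A client that gets False would keep treating NCRs as indexes into the announced charset (or latin1); one that gets True would treat them as Unicode code points.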

Option number 1 is by far the easiest to implement, and some of the deployed server and browser code is able to treat this as an 'html' resource with an 'i18n' flavouring.

If HTML-i18n is to go ahead without any signalling of the change in the NCRs' target charset (i.e. Unicode rather than the announced charset), then IMHO this should at least be mentioned in the draft, as it breaks existing, widespread practice, which prior to this i18n draft could not be signalled as 'wrong' or 'illegal'.

Dw.


