Re: Usage of CP1252 characters on www.msnbc.com

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Tue Jul 08 1997 - 15:23:03 EDT


Lars Henrik Mathiesen wrote on 1997-07-08 12:41 UTC:
> The user interface that I would prefer is:
> ...
> 1) Use Unicode numerical character references: ...
> 2) Use Unicode UTF-8: ...
> 3) Use only ISO Latin-1 characters: ...
> 4) Use native Windows character set (CP1252): ...
>
> What happened to the idea of using named character entities, as in
> <http://www.w3.org/pub/WWW/TR/WD-entities>? Someone did mention them,
> but no notice seemed to be taken...

For good reason. They

  - are even less widely supported
  - need yet another table in the implementation
  - do not scale beyond CP1252
  - do not even support MES or WGL4
  - are just yet another small and arbitrarily chosen subset of
    Unicode that will contribute to the uncontrollable
    Unicode subset inflation

and therefore do not look that attractive at all to me.
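
To illustrate the "yet another table" and scaling points, here is a
minimal Python sketch. It assumes Python's html.entities table (the
HTML 4 entity set) as a stand-in for the entity draft discussed here;
the function names to_ncr and to_named are made up for this example.
An NCR is computed directly from the code point, while a named entity
only exists if someone put the character into the table.

```python
from html.entities import codepoint2name  # HTML 4 entity table (stand-in, an assumption)


def to_ncr(ch):
    """Numeric character reference: derived from the code point, no table needed."""
    return "&#%d;" % ord(ch)


def to_named(ch):
    """Named entity: only available if the character is in the lookup table."""
    name = codepoint2name.get(ord(ch))
    return "&%s;" % name if name is not None else None


em_dash = "\u2014"    # EM DASH, part of CP1252, has a named entity
ligature = "\ufbf9"   # an Arabic presentation form, no named entity

for ch in (em_dash, ligature):
    print("U+%04X" % ord(ch), to_ncr(ch), to_named(ch))
```

The NCR form works for any code point, whereas to_named returns None
for every character outside the table.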

The named character entities - like NCRs - are just a mechanism that allows
you to stay in the ASCII world. In my opinion the ultimate solution will
be UTF-8 or UCS-2, because then ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA
ABOVE WITH ALEF MAKSURA ISOLATED FORM will be as much a normal member of
the base character set as LATIN CAPITAL LETTER A. I just do not recommend
UTF-8 yet, because editors for it are not yet really widespread (except
under Plan 9).
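
As a rough sketch of what "normal base character set member" means in
practice, the following Python snippet encodes both characters with the
same call and no special casing. The helper name encodings is invented
for this example, and Python's "utf-16-be" codec is used as a stand-in
for UCS-2, which is an assumption that holds only for characters in the
Basic Multilingual Plane such as these two.

```python
def encodings(ch):
    """Return the UTF-8 and UCS-2 byte sequences of one BMP character as hex."""
    return {
        "utf8": ch.encode("utf-8").hex(),
        # utf-16-be equals big-endian UCS-2 for BMP characters (assumption noted above)
        "ucs2": ch.encode("utf-16-be").hex(),
    }


latin_a = "A"         # U+0041
ligature = "\ufbf9"   # U+FBF9, the Arabic ligature named above

for ch in (latin_a, ligature):
    print("U+%04X" % ord(ch), encodings(ch))
```

The only difference between the two characters is the length of the
UTF-8 sequence (one byte versus three); no table or escape mechanism is
involved for either.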

NCRs are just a simpler and somewhat less convenient step toward the
ultimate solution. Providing intermediate solutions that are too
convenient but incomplete is the greater danger in the long term: it
will delay the ultimate solution and create mechanisms that have to be
supported ad infinitum for backwards compatibility.

Markus

-- 
Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu


