Re: Usage of CP1252 characters on www.msnbc.com

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Mon Jul 07 1997 - 23:15:38 EDT


Chris Pratley wrote on 1997-07-08 02:33 UTC:
> Although a configurable option is a possible solution, we know that the
> typical user (representing around 95-98% of users) never changes
> defaults in a program, especially something as obscure as encoding
> options. As you may know it is very popular to attack Microsoft for "UI
> bloat", and this would no doubt add to that IMHO. But assuming we have
> options, "which one do you default to?" is the $64000 question.

Well, it certainly will not do any harm to offer all possible
options in a somewhat hidden way, say by letting users select the option
in the Windows Registry or some configuration file. This would at
least allow people like MSNBC who have already identified and understood
the problem to make the appropriate switch in a minute instead of
having to "work hard on a fix for the problem". In the MSNBC case,
the optimal choice is certainly Latin-1 downconversion.

> If you did have options, you could label the options you list as:
> a) compatible with 1997 browsers and later
> b) compatible with 1997 browsers and later
> c) modify contents of document to be readable in all browsers.
> Warning: some contents may appear different from your original document

And no one would understand any more what these options are about.
It is not possible to understand the difference between these options
if they are not labeled with precise terminology (Unicode, numeric
character reference, ISO 8859-1, etc.). The label texts you suggest
are a user interface nightmare that I have encountered much too often
on Windows systems: by suppressing precise vocabulary, you give the
inexperienced user the impression that she knows what is going on
(without actually affecting in any way the level of understanding),
while giving at the same time the expert user a very hard time
figuring out what these "user friendly" options stand for.

The user interface that I would prefer is:

  Character Set Compatibility Options

  Advanced configuration: You normally do *not* want to change these
  settings unless you have a specific requirement for the way certain
  Windows-specific characters are represented such that they can be
  processed on old or non-Windows browsers.

  How shall Windows encode CP1252 characters in the code range 128-159
  that are not part of ISO 8859-1, the classical HTML character set
  (e.g., the smart quotes and the trademark sign)?

  1) Use Unicode numeric character references: this is the encoding that
     strictly follows the HTML standard. This will not display some
     characters on old browsers without Unicode support.

  2) Use Unicode UTF-8: this is a modern, more compact encoding that
     strictly follows the HTML standard and allows easier editing on some
     Unix systems. This will not display some characters on old browsers
     without Unicode support.

  3) Use only ISO Latin-1 characters: Replace some Windows-specific
     characters with similar characters that are guaranteed to
     be displayable on even the oldest Web browser.

  4) Use native Windows character set (CP1252): This option will encode
     all characters such that they are correctly displayed on even the
     oldest Windows browser, but most likely not on other platforms.
     Use this option only when you know that only Windows browsers
     will view the file (e.g., on Intranets) and Option 1) is not
     acceptable because some of them are old pre-Unicode versions
     that have not yet been updated.

  Default is 1). If you get complaints from people with old browsers,
  we recommend 3), unless you do not want characters to be changed
  and are sure that all browsers are running on Windows, in which
  case we recommend 4). Option 2) is available for special applications
  and experimental purposes; we recommend not using it unless you know
  that you want a UTF-8 file in order to edit it on another platform.
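
To make the four options concrete, here is a rough Python sketch of what
each one would write to the file for a smart-quoted word followed by a
trademark sign. The names SIMILAR_LATIN1 and encode_for_option are
invented for this illustration, the replacement table is only a small
sample, and escaping of HTML markup characters is left out. It is not
meant to describe how any Microsoft product does the conversion.

```python
# Illustrative only: SIMILAR_LATIN1 and encode_for_option are made-up names,
# and the replacement table covers just a few of the CP1252-only characters.
SIMILAR_LATIN1 = {
    "\u2018": "'",     # left single smart quote
    "\u2019": "'",     # right single smart quote
    "\u201c": '"',     # left double smart quote
    "\u201d": '"',     # right double smart quote
    "\u2122": "(TM)",  # trademark sign
}


def encode_for_option(text: str, option: int) -> bytes:
    """Encode text according to options 1) to 4) of the proposed dialog."""
    if option == 1:
        # Numeric character references for everything outside ISO 8859-1.
        return text.encode("latin-1", errors="xmlcharrefreplace")
    if option == 2:
        # UTF-8 bytes for the whole text.
        return text.encode("utf-8")
    if option == 3:
        # Swap in similar characters first, then emit plain ISO 8859-1.
        downgraded = "".join(SIMILAR_LATIN1.get(ch, ch) for ch in text)
        return downgraded.encode("latin-1", errors="replace")
    if option == 4:
        # Raw CP1252 bytes, which fall in the range 128-159 for these characters.
        return text.encode("cp1252")
    raise ValueError(f"unknown option: {option}")


sample = "\u201cHi\u201d\u2122"  # smart-quoted Hi followed by a trademark sign
for option in (1, 2, 3, 4):
    print(option, encode_for_option(sample, option))
```

Only option 3) changes the characters themselves; the other three
represent the same characters in different ways.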

If you are concerned about the default, you can still implement this
menu now, so that customers like MSNBC can select option 3), and use
option 4) as the default for the moment. Two releases later, you make
1) the default, once 95% of your customers have Unicode browsers.

If you are concerned about the amount of text, you can move
all of this into a help screen easily accessible from the menu.

> Now, if your competitor offered this option:
> d) Compatible with all browsers used _in your company_
> you would have a hard time competing. (Note the emphasis on "in your
> company" in the fourth option, meaning the customer's company. You could
> even go on to say "most browsers on the Internet", but that got me in
> trouble last time :-))
>
> Erik raised an option of writing the actual byte value of the characters
> in the file. It was my understanding that this can cause trouble in some
> Unix servers that are not expecting byte values in the 0x80-0x9F range.
> Can someone comment here?

If you check my reply again, you'll find that I suggested exactly the
same solution there (see option 4 above):

>> - output directly in CP1252 bytes (not NCR!) and make sure that the
>> IANA registry contains a reasonable MIME entry for CP1252 and that
>> the HTTP server will announce CP1252 as the encoding.

It is not really in the interest of finding a simple common denominator
among all platforms, but it is formally better than using the
CP1252 NCRs.

I would be surprised if Unix servers have problems with bytes in the
C1 range. They should normally just pass these values on transparently.
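
As a small illustration of why the announced encoding matters more than
the transport, the Python sketch below copies CP1252 bytes unchanged (as
an 8-bit clean server would) and then decodes them under two different
charset labels. The sample bytes and variable names are assumptions
chosen for this example.

```python
import unicodedata

# CP1252 bytes for a smart-quoted "Hi" followed by a trademark sign.
raw = bytes([0x93, 0x48, 0x69, 0x94, 0x99])

# An 8-bit clean server or proxy simply copies the bytes.
relayed = bytes(raw)

# The receiver's interpretation depends on the charset it is told to use.
as_cp1252 = relayed.decode("windows-1252")  # printable smart quotes and trademark sign
as_latin1 = relayed.decode("iso-8859-1")    # bytes 0x80-0x9F become C1 control characters

print([unicodedata.name(ch) for ch in as_cp1252])
print([unicodedata.category(ch) for ch in as_latin1])  # "Cc" marks a control character
```

The bytes survive the transfer either way; what goes wrong on the
receiving side is the interpretation when the file is labeled (or
assumed to be) ISO 8859-1.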

Markus

-- 
Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu


