RE: Usage of CP1252 characters on www.msnbc.com

From: Chris Pratley (chrispr@microsoft.com)
Date: Tue Jul 08 1997 - 02:36:19 EDT


Thanks for the ideas. What you describe is close to what I was planning
to do for the advanced settings, including the help file page with
descriptions of the ramifications of each choice. The labels I gave in
my example were not what I was suggesting as actual labels - that is
just what the options mean in reality (e.g. #1 and #2 require newest
browsers). But it is pretty hopeless to solve the general problem that
way. Your design is great for a well-educated technical person who may
not be familiar with this exact issue, but has a good head for software.

I've spent several years doing usability studies of real consumer and
corporate users. I think you are overestimating the average
non-technical person's tolerance for jargon and technical details. I'm
not against having buried options (even registry entries or config
files!) for the technical user, but expecting any normal person to have
any patience for having to mess with "encodings" is asking for it. It is
the kind of thing where people do not accept any explanation - it should
just work, and if it doesn't work, then the software is at fault.

For example, in your text you say, "This will not display some
characters on old browser without Unicode support.". Right away you've
lost the majority of people. Unicode? What is that? Browser? How do I
know what an old one is, and which ones won't "characters" show up on?
By the way, what's a "character"? What part of my web page won't be
displayed? Explain to me why your software doesn't work properly...

In testing controls like the one you described, I've found people fail
utterly to use them if they are not serious technical people. This is
why I am looking for some other ideas, and to get a feel for what other
designers are doing.

As an aside, you're leaving out most of the rest of the world with your
design. Many other encodings are in use (Shift-JIS, JIS X 0208, EUC-JP,
Big5, GB2312, KOI8-R, etc.), and people in each local market expect our
software to support them. So it gets a little more complicated.
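
To illustrate the point about local-market encodings, here is a minimal Python sketch showing that the same text becomes different bytes under each legacy encoding. The sample strings, the LEGACY_CODECS table and the encode_all helper are assumptions made up for this illustration; only the codec names come from Python's standard library.

```python
# Illustrative sketch only: sample strings and table layout are assumptions.
# Each market's text must be written with the encoding that market expects.
LEGACY_CODECS = {
    "Japanese": ("日本語", ["shift_jis", "euc_jp", "iso2022_jp"]),
    "Traditional Chinese": ("中文", ["big5"]),
    "Simplified Chinese": ("中文", ["gb2312"]),
    "Russian": ("Привет", ["koi8_r"]),
}


def encode_all(table):
    """Return {(market, codec): encoded bytes} for every sample/codec pair."""
    encoded = {}
    for market, (sample, codecs) in table.items():
        for codec in codecs:
            encoded[(market, codec)] = sample.encode(codec)
    return encoded


ENCODED = encode_all(LEGACY_CODECS)

if __name__ == "__main__":
    # Print the byte sequences side by side to show how they differ.
    for (market, codec), data in ENCODED.items():
        print(f"{market:20} {codec:12} {data.hex()}")
```

The same two Chinese characters, for example, come out as different byte pairs under Big5 and GB2312, so a browser that guesses the wrong encoding shows garbage.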

The sad fact is that if we default to a solution like #1, then we invite
a huge number of calls to technical support asking why things don't look
the same in the browser as they did in the authoring tool. Each call
costs something like $25 on average, which very rapidly reduces profit
and hence the whole point of making the software. It's not an easy
decision, and if you do the math, it's a lot of money.

I really hope new browsers proliferate quickly so it will be possible to
default to #1 soon. At the moment, defaulting to #1 would cause trouble
for more people, in absolute numbers, than using the illegal
encodings (#4) does. Is anyone ready to take the plunge and break
backward compatibility _by default_ in order to conform to the emerging
standard?

Cheers,
Chris

        -----Original Message-----
        From: Unicode Discussion [SMTP:unicode@unicode.org]
        Sent: Monday, July 07, 1997 8:14 PM
        To: Multiple Recipients of
        Subject: Re: Usage of CP1252 characters on www.msnbc.com

        Chris Pratley wrote on 1997-07-08 02:33 UTC:
> Although a configurable option is a possible solution, we know that the
> typical user (representing around 95-98% of users) never changes
> defaults in a program, especially something as obscure as encoding
> options. As you may know it is very popular to attack Microsoft for "UI
> bloat", and this would no doubt add to that IMHO. But assuming we have
> options, "which one do you default to?" is the $64000 question.

        Well, it certainly will not do any harm to offer all possible
        options in a somewhat hidden way, say by letting the user select the
        option in the Windows Registry or some configuration file. This would
        at least allow people like MSNBC who have already identified and
        understood the problem to make the appropriate switch in a minute
        instead of having to "work hard on a fix for the problem". In the
        MSNBC case, the optimal choice is certainly Latin-1 downconversion.

> If you did have options, you could label the options you list as:
> a) compatible with 1997 browsers and later
> b) compatible with 1997 browsers and later
> c) modify contents of document to be readable in all browsers.
> Warning: some contents may appear different from your original document

        And no one would understand any more what these options are about.
        It is not possible to understand the difference between these options
        if they are not labeled with precise terminology (Unicode, numeric
        character reference, ISO 8859-1, etc.). The label texts you suggest
        are a user interface nightmare that I have encountered much too often
        on Windows systems: By suppressing precise vocabulary, you give the
        inexperienced user the impression that she knows what is going on
        (without actually affecting in any way the level of understanding),
        while giving at the same time the expert user a very hard time
        figuring out what these "user friendly" options stand for.

        The user interface that I would prefer is:

          Character Set Compatibility Options

          Advanced configuration: You normally do *not* want to change these
          settings unless you have a specific requirement for the way certain
          Windows specific characters are represented such that they can be
          processed on old or non-Windows browsers.

          How shall Windows encode CP1252 characters in the code range 128-159
          that are not part of ISO 8859-1, the classical HTML character set
          (e.g., the smart quotes and the trademark sign)?

          1) Use Unicode numerical character references: this is the encoding
             that follows strictly the HTML standard. This will not display
             some characters on old browser without Unicode support.

          2) Use Unicode UTF-8: this is a modern more compact encoding that
             follows strictly the HTML standard and allows easier editing on
             some Unix systems. This will not display some characters on old
             browser without Unicode support.

          3) Use only ISO Latin-1 characters: Replace some Windows specific
             characters by similar replacements that are guaranteed to
             be displayable on even the oldest Web browser.

          4) Use native Windows character set (CP1252): This option will
             encode all characters such that they are correctly displayed on
             even the oldest Windows browser, but most likely not on other
             platforms. Use this option only when you know that only Windows
             browsers will view the file (e.g., on Intranets) and Option 1) is
             not acceptable because some of them are old pre-Unicode versions
             that have not yet been updated.

          Default is 1). If you get complaints from people with old browsers,
          we recommend 3), except if you do not want characters to be changed
          and are sure that all browsers are running on Windows, in which
          case we recommend 4). Option 2) is available for special
          applications and experimental purposes; we recommend not using it
          unless you know that you want a UTF-8 file in order to edit it on
          another platform.
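
The four options above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any authoring tool implements them: the function names are hypothetical, and the DOWNCONVERT table for option 3 is a small hand-picked subset of replacements, not a complete or authoritative mapping.

```python
# Hypothetical sketch of the four output strategies. Function names and the
# DOWNCONVERT table are assumptions made up for this illustration.

# Partial Latin-1 fallbacks for option 3 (assumption: a real table would
# cover every printable character CP1252 defines in the range 128-159).
DOWNCONVERT = {
    "\u2018": "'", "\u2019": "'",    # single smart quotes
    "\u201c": '"', "\u201d": '"',    # double smart quotes
    "\u2013": "-", "\u2014": "--",   # en dash, em dash
    "\u2122": "(TM)",                # trademark sign
    "\u2026": "...",                 # horizontal ellipsis
}


def option1_ncr(text: str) -> bytes:
    """Option 1: Latin-1 bytes, other characters as Unicode NCRs (&#NNNN;)."""
    return text.encode("latin-1", errors="xmlcharrefreplace")


def option2_utf8(text: str) -> bytes:
    """Option 2: encode the whole document as UTF-8."""
    return text.encode("utf-8")


def option3_latin1(text: str) -> bytes:
    """Option 3: replace Windows-specific characters with Latin-1 look-alikes."""
    for original, fallback in DOWNCONVERT.items():
        text = text.replace(original, fallback)
    # Anything still outside Latin-1 becomes "?" rather than raising.
    return text.encode("latin-1", errors="replace")


def option4_cp1252(text: str) -> bytes:
    """Option 4: write the native Windows bytes (0x80-0x9F for these characters)."""
    return text.encode("cp1252")


SAMPLE = "\u201cAcme\u2122\u201d"  # smart-quoted "Acme" plus trademark sign

if __name__ == "__main__":
    for fn in (option1_ncr, option2_utf8, option3_latin1, option4_cp1252):
        print(f"{fn.__name__:16} {fn(SAMPLE)!r}")
```

Only options 1 and 3 produce pure 7-bit output for this sample; option 4 produces the bytes in the 128-159 range that the whole discussion is about.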

        If you are concerned about the default, you can still implement this
        menu now (such that customers like MSNBC can select option 3) and use
        option 4) as a default at the moment. Two releases later, you make
        1) the default when 95% of your customers have Unicode browsers.

        If you are concerned about the amount of text, you can easily move
        all of this into a help screen easily accessible from the menu.

> Now, if your competitor offered this option:
> d) Compatible with all browsers used _in your company_
> you would have a hard time competing. (Note the emphasis on "in your
> company" in the fourth option, meaning the customer's company. You could
> even go on to say "most browsers on the Internet", but that got me in
> trouble last time :-))
>
> Erik raised an option of writing the actual byte value of the characters
> in the file. It was my understanding that this can cause trouble in some
> Unix servers that are not expecting byte values in the 0x80-0x9F range.
> Can someone comment here?

        If you check my reply again, you'll find that I also suggested exactly
        the same solution there, too (see option 4 above):

>> - output directly in CP1252 bytes (not NCR!) and make sure that the
>> IANA registry contains a reasonable MIME entry for CP1252 and that
>> the HTTP server will announce CP1252 as the encoding.

        It is not really in the interest of finding a simple common denominator
        among all platforms, but it is formally better than using the
        CP1252 NCRs.

        I would be surprised if Unix servers have problems with bytes in the
        C1 range. They should normally just pass these values on transparently.
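
For readers unfamiliar with the C1 range, the following Python sketch shows what the bytes 0x80-0x9F mean under each interpretation. It relies on the mappings in Python's standard `cp1252` and `latin-1` codecs; the helper names and the CONTENT_TYPE constant are assumptions for illustration, and the header value is simply what "announcing CP1252 as the encoding" could look like, not something quoted from this thread.

```python
import codecs
import unicodedata


def classify_c1_range():
    """Split bytes 0x80-0x9F into those CP1252 defines and those it leaves unused."""
    defined, undefined = {}, []
    for value in range(0x80, 0xA0):
        try:
            defined[value] = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec treats these positions as unassigned.
            undefined.append(value)
    return defined, undefined


def is_c1_control(value: int) -> bool:
    """Under ISO 8859-1 the same byte decodes to a control character."""
    return unicodedata.category(bytes([value]).decode("latin-1")) == "Cc"


DEFINED, UNDEFINED = classify_c1_range()

# Hypothetical header a server could send so that browsers know the bytes
# are CP1252 and not ISO 8859-1 (assumption, not taken from the thread).
CONTENT_TYPE = "text/html; charset=windows-1252"

if __name__ == "__main__":
    for value, char in DEFINED.items():
        print(f"0x{value:02X} -> U+{ord(char):04X} {unicodedata.name(char)}")
    print("unassigned in CP1252:", [f"0x{v:02X}" for v in UNDEFINED])
```

A server that passes the bytes through untouched but labels the page as ISO 8859-1 (or not at all) leaves the browser free to treat them as control characters, which is why the announced charset matters as much as the bytes.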

        Markus

        --
        Markus G. Kuhn, Computer Science grad student, Purdue
        University, Indiana, USA -- email: kuhn@cs.purdue.edu


