RE: Usage of CP1252 characters on www.msnbc.com

From: Chris Pratley (chrispr@microsoft.com)
Date: Mon Jul 07 1997 - 15:49:25 EDT


Actually Microsoft has a huge incentive to support Unicode, which is why
Unicode is at the core of NT, Office97, FrontPage, and all major new
applications coming from Microsoft. IE4 also supports UTF-8, UTF-7, UCS
(both flavors) and so on. Bill Gates even mentioned how important
Unicode was in his recent New York Times column.

You raise an excellent point about fidelity of Web pages on different
browsers, and the difficulty of moving immediately to Unicode even
though we all want to.

Regarding the problem of &#128 through &#159, I have a few questions for
this alias. Imagine you have a customer base who insists on perfect
backward compatibility with their existing solutions (surprise!). For
example, they want to take an existing Word document and save it to
HTML. How would you deal with this issue:

They have many documents that use "smart quotes":
0x93 0x201c #LEFT DOUBLE QUOTATION MARK
0x94 0x201d #RIGHT DOUBLE QUOTATION MARK

How to represent these in HTML?

1. Convert to dumb quotes
They would be hopping mad if these turned into "dumb" straight quotes.
This may seem like a reasonable degradation to the average technical
person, but to customers this is known as "document corruption".

2. Write out as Unicode &# NCRs. (the "correct" way)
Unless they are using a Unicode enabled browser, these are ignored as
noted in this mail stream. On a corporate Intranet it is conceivable
that you could tell the customer they are required to upgrade their
browsers to the newest ones, but they really don't like that kind of
thing. On the Internet itself, it is obvious that only a fraction of
people upgrade to newer browsers as you noted. Fortunately this is a
growing fraction.

3. Write out as &#147 and &#148.
Oh look, on all of the customer's machines these display just fine. It
turns out that virtually all old browsers can understand these
characters. A small percentage cannot (e.g. some Unix browsers).
This is a problem for the external web site, but all the home users they
are trying to reach can read those characters fine.
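
As a rough illustration of the three options above, here is a minimal Python sketch. The sample text and the function names are hypothetical and do not come from any Microsoft product. It assumes the document text is already available as a Unicode string, and it writes NCRs with a trailing semicolon.

```python
# Hypothetical sample: the CP1252 bytes 0x93 / 0x94 decode to U+201C / U+201D.
text = b"\x93Hello\x94".decode("cp1252")

def to_dumb_quotes(s):
    """Option 1: degrade curly double quotes to straight quotes."""
    return s.replace("\u201c", '"').replace("\u201d", '"')

def to_unicode_ncrs(s):
    """Option 2: write every non-ASCII character as a Unicode NCR."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in s)

def to_cp1252_ncrs(s):
    """Option 3: write non-ASCII characters using their CP1252 byte value."""
    out = []
    for c in s:
        if ord(c) < 128:
            out.append(c)
            continue
        try:
            out.append(f"&#{c.encode('cp1252')[0]};")
        except UnicodeEncodeError:
            # Not representable in CP1252: fall back to the Unicode NCR.
            out.append(f"&#{ord(c)};")
    return "".join(out)

print(to_dumb_quotes(text))   # "Hello"
print(to_unicode_ncrs(text))  # &#8220;Hello&#8221;
print(to_cp1252_ncrs(text))   # &#147;Hello&#148;
```

Only the second output is correct HTML as far as the character references go; the third is what the older browsers discussed here happen to display.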

When authoring new pages, you could tell the author to use only "dumb"
straight quotes. That might work for new content, but people want smart
quotes for a reason - they look better.

It's just my personal opinion, but I suspect that the range &#128-&#159
is not going away anytime soon in real use. I believe that this should
become a kind of deprecated compatibility area that is not recommended
and may have support removed in the future. In time, the vast majority
of browsers in use will be Unicode capable, and we can stop using these
values. In the short term, if I were a canny Unix browser writer, I would
add support for mapping &#128-&#159 to the Unicode equivalents rather
than be unable to read this "illegal" HTML.
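
The kind of mapping suggested above could be sketched roughly as follows in Python. This is an illustration under assumptions, not how any particular browser does it: the function names are hypothetical, the CP1252 table is taken from Python's built-in cp1252 codec, and values that codec leaves unassigned (such as 129) are passed through unchanged.

```python
import re

# Matches decimal NCRs, with or without the trailing semicolon.
NCR = re.compile(r"&#(\d+);?")

def _remap(match):
    n = int(match.group(1))
    if 128 <= n <= 159:
        try:
            ch = bytes([n]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)  # unassigned in CP1252: leave as written
        return f"&#{ord(ch)};"
    return match.group(0)  # outside the compatibility range: untouched

def remap_cp1252_ncrs(html):
    """Rewrite NCRs in the range 128-159 to their Unicode equivalents."""
    return NCR.sub(_remap, html)

print(remap_cp1252_ncrs("&#147Hi&#148;"))  # &#8220;Hi&#8221;
```

A browser would more likely do this inside its parser than as a text rewrite, but the table lookup is the same.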

The same problem exists with UTF-8, of course. It's easy to take the
moral high ground and proclaim that everything should be Unicode, but
reality tends to rear its ugly head, and a real customer solution has to
be adopted that hopefully provides a clean migration to the "correct"
solution. If you want to keep those customers, that is.

I am not speaking for Microsoft in any way. Just personal opinion.

Cheers,
Chris

        -----Original Message-----
        From: Unicode Discussion [SMTP:unicode@unicode.org]
        Sent: Saturday, July 05, 1997 8:24 AM
        To: Multiple Recipients of
        Subject: Re: Usage of CP1252 characters on www.msnbc.com

>It would be nice if you could fix this quickly.

        Good luck. It's a noble effort, but what incentive does Microsoft
        have to support Unicode today? This isn't just idle Microsoft
        bashing - I'm curious if people in the Unicode consortium have
        thought about the political issue of actually getting people to
        adopt Unicode. MSNBC's web pages look fine on 80% of the web
        browsers - what percentage would the Unicode pages look right on?

        I sent some mail to Brock Meeks saying much the same thing about
        journalistic integrity - he's working for MSNBC now, a
        trouble-causing journalist who understands things like why Unicode
        is important. It doesn't look like he's been able to get his own
        editors to publish his articles correctly.


