Usage of CP1252 characters on www.msnbc.com

From: Markus Kuhn (kuhn@cs.purdue.edu)
Date: Fri Jul 04 1997 - 23:57:22 EDT


Hello, MSNBC Webmaster!

I'd like to report a serious technical problem with your Web pages.
If you are not in charge of the maintenance of www.msnbc.com,
please forward this to your local HTML expert.

In many places, you use characters in the range 128-159 on your
Web pages.

For example, I saw today in one of your HTML files

  “Robotic exploration,” he says, “is just
  the first step.

The problem is the following: the HTML standard specifies that
the character set to be used on Web pages is ISO 8859-1.
You are instead using the Microsoft Windows character set known
as "Code Page 1252". The only difference between CP1252 and
ISO 8859-1 is that CP1252 contains 24 additional characters in the
byte range 128-159, where ISO 8859-1 defines no printable characters.
These characters are *only* available under MS-Windows. They are not
available under *any* other operating system (Linux, Solaris, HP-UX,
IRIX, Amiga, Macintosh, BSDI, etc.), and these systems are used by a
significant fraction of Internet users, especially at universities.
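
To check a page for the problem, it is enough to scan its raw bytes
for values in the range 128-159. The following is only a minimal
sketch in Python; the function name find_c1_bytes and the sample
bytes are made up for illustration and are not taken from any
existing tool or from your actual files.

```python
def find_c1_bytes(data: bytes) -> list:
    """Return (offset, byte value) pairs for every byte in the range 128-159."""
    return [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]


# Hypothetical sample: a sentence fragment wrapped in CP1252 double quotes.
sample = b"\x93Robotic exploration,\x94 he says"

for offset, value in find_c1_bytes(sample):
    print(f"offset {offset}: 0x{value:02x}")
```

A page that is clean ISO 8859-1 should produce no output at all.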

The annoying aspect of the problem is that the left and right
quotation marks that you use are among the non-ISO-8859-1 characters.
These characters just disappear on most systems other than MS-Windows.
Disappearing quotation marks are not just a typographical annoyance.
They can also seriously endanger journalistic integrity: it becomes
much less obvious to the reader what is quoted and what are
the author's own words.

It would be nice if you could fix this quickly. There are several
solutions. In the case of the left and right quotation marks, just
replace them with the normal undirected ASCII quotation marks.
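
This replacement could be done mechanically. Here is a rough sketch
in Python, assuming the page is available as raw bytes; the names
ASCII_QUOTES and flatten_quotes are hypothetical, and only the four
quotation-mark bytes 0x91-0x94 are handled.

```python
# CP1252 bytes 0x91/0x92 (single quotes) and 0x93/0x94 (double quotes)
# are mapped to the undirected ASCII apostrophe and quotation mark.
ASCII_QUOTES = bytes.maketrans(b"\x91\x92\x93\x94", b"''\"\"")


def flatten_quotes(data: bytes) -> bytes:
    """Replace directional CP1252 quotation marks with ASCII ones."""
    return data.translate(ASCII_QUOTES)


print(flatten_quotes(b"\x93is just the first step.\x94"))
```

All other bytes pass through unchanged, so the remaining characters
in the 128-159 range would still need separate treatment.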

Alternatively: do not use the CP1252 code numbers, but use the
corresponding Unicode code numbers instead (see the table below).
This is then fully supported by the HTML standard.
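
One way to do this, sketched here in Python, is to write each
affected character as a decimal numeric character reference such
as &#8220;. That this is the form you would want is my assumption;
the mapping is copied from the table later in this message, and the
names CP1252_TO_UNICODE and to_numeric_references are made up for
this example.

```python
# CP1252 byte -> Unicode code number, for the 24 assigned positions
# in the 128-159 range (copied from the table in this message).
CP1252_TO_UNICODE = {
    0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
    0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030,
    0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152,
    0x91: 0x2018, 0x92: 0x2019, 0x93: 0x201C, 0x94: 0x201D,
    0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014, 0x98: 0x02DC,
    0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9F: 0x0178,
}


def to_numeric_references(data: bytes) -> str:
    """Turn CP1252-only bytes into &#N; references; keep other bytes as-is."""
    parts = []
    for b in data:
        if b in CP1252_TO_UNICODE:
            parts.append(f"&#{CP1252_TO_UNICODE[b]};")
        else:
            # ISO 8859-1 bytes have the same value as their Unicode code number.
            parts.append(chr(b))
    return "".join(parts)


print(to_numeric_references(b"\x93Robotic exploration,\x94 he says"))
```

The positions marked NOT USED in the table are deliberately left
untouched by this sketch; a real converter would probably want to
report them as errors.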

I do not know what software you use to author HTML Web pages,
but using the character codes 128-159 on Web pages is clearly
a bug in this software. If you did not develop the software
that generates your HTML pages yourselves, this problem should
ultimately be fixed by the developer of that software. Please
feel free to forward this mail to the manufacturer.

If any of what I wrote is unclear, please recognize that it is
still a serious problem and forward this message to a computer
expert knowledgeable about character sets and HTML.

Below, I append as a reference the list of characters in the
128-159 range of CP1252. Each line gives the CP1252 byte value,
the Unicode code number (both in hexadecimal), and the character name:

0x80 0x0080 #NOT USED
0x81 0x0081 #NOT USED
0x82 0x201a #SINGLE LOW-9 QUOTATION MARK
0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK
0x84 0x201e #DOUBLE LOW-9 QUOTATION MARK
0x85 0x2026 #HORIZONTAL ELLIPSIS
0x86 0x2020 #DAGGER
0x87 0x2021 #DOUBLE DAGGER
0x88 0x02c6 #MODIFIER LETTER CIRCUMFLEX ACCENT
0x89 0x2030 #PER MILLE SIGN
0x8a 0x0160 #LATIN CAPITAL LETTER S WITH CARON
0x8b 0x2039 #SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x8c 0x0152 #LATIN CAPITAL LIGATURE OE
0x8d 0x008d #NOT USED
0x8e 0x008e #NOT USED
0x8f 0x008f #NOT USED
0x90 0x0090 #NOT USED
0x91 0x2018 #LEFT SINGLE QUOTATION MARK
0x92 0x2019 #RIGHT SINGLE QUOTATION MARK
0x93 0x201c #LEFT DOUBLE QUOTATION MARK
0x94 0x201d #RIGHT DOUBLE QUOTATION MARK
0x95 0x2022 #BULLET
0x96 0x2013 #EN DASH
0x97 0x2014 #EM DASH
0x98 0x02dc #SMALL TILDE
0x99 0x2122 #TRADE MARK SIGN
0x9a 0x0161 #LATIN SMALL LETTER S WITH CARON
0x9b 0x203a #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x9c 0x0153 #LATIN SMALL LIGATURE OE
0x9d 0x009d #NOT USED
0x9e 0x009e #NOT USED
0x9f 0x0178 #LATIN CAPITAL LETTER Y WITH DIAERESIS

These are the characters that you should not use in HTML
pages.

Thanks for your attention. Please do not hesitate to contact me if
you have any questions. I'd be happy to assist. My complaint is
not based just on my personal opinion: the characters on the
www.msnbc.com Web pages have recently been criticized by many
Web experts and mentioned as a negative example in long
discussions on various Internet mailing lists and newsgroups.

Markus

Technical References:

http://curly.cc.utexas.edu/~churchh/latin1.html
ftp://ds.internic.net/rfc/rfc1866.txt
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

-- 
Markus Kuhn, Computer Science grad student, Purdue
University, Indiana, US, email: kuhn@cs.purdue.edu


