RE: Designing a multilingual web site

From: Chris Pratley (
Date: Wed Jul 19 2000 - 03:25:03 EDT

This makes sense once you realize that, in a legacy codepage, text is just
bytes, and you can play around with the interpretation of those bytes as
much as you like until one interpretation works. Data in Unicode, however,
breaks the (rough) rule that bytes can be interpreted as any codepage. (Note
that this is a good thing about Unicode, although during the transition to
Unicode it can often seem like a pain.)
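
To make the "bytes can be interpreted as any codepage" rule concrete, here
is a minimal Python sketch. The sample word and the choice of the three code
pages are illustrative assumptions, not taken from the original mail. It
shows the same bytes decoding without complaint under several single-byte
code pages, while a strict UTF-8 decoder rejects them.

```python
# Five bytes that spell an Arabic word in windows-1256 (illustrative sample).
raw = "مرحبا".encode("cp1256")

# Single-byte code pages accept the same bytes and simply yield different text.
readings = {cp: raw.decode(cp) for cp in ("cp1252", "cp1256", "cp1251")}
for codepage, text in readings.items():
    print(codepage, repr(text))

# UTF-8 has internal structure, so arbitrary high bytes are usually invalid.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("valid UTF-8:", utf8_ok)
```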

So what happened in your message was that you received a mislabelled Arabic
win-1256-encoded mail from someone containing bytes such as A6 B2 9F, etc.

Outlook initially tries to display this message as ASCII. Of course, bytes
over 7F don't exist in ASCII, so Outlook chooses to interpret these
miscreants as 1252 - as good a guess as any, given that the mail is labelled
as ASCII. Note that Outlook mail "read notes" use Unicode internally, so in
fact the message was converted to Unicode using the 1252->Unicode mapping
before it was displayed.

When you save the message contents as ANSI, you are effectively saving the
bytes of the message UNCHANGED into a file. (Outlook maps from the internal
Unicode back to 1252 losslessly, assuming 1252 is your system's "ANSI"
codepage.) You can then rename the file to .htm and open it in IE; IE's
auto-detection will kick in, look at the bytes, and very possibly decide
correctly that these bytes are Arabic encoded in win-1256.
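
The round trip described above can be sketched in a few lines of Python.
This only models the codepage mappings, not Outlook itself, and the sample
word is an assumption chosen so that every byte is defined in both tables
(Python's cp1252 codec rejects a handful of bytes that it treats as
undefined).

```python
# Bytes as they arrived in the mislabelled mail (illustrative sample).
original = "مرحبا".encode("cp1256")

# Display step: the bytes are mapped to Unicode through the 1252 table.
shown = original.decode("cp1252")

# "Save as ANSI" on a 1252 system: Unicode is mapped back through 1252.
saved = shown.encode("cp1252")

# The file holds the original bytes, so a 1256 reading recovers the Arabic.
recovered = saved.decode("cp1256")
print(saved == original)
print(recovered)
```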

In the Notepad clipboard case, you copy the UNICODE text from Outlook to
Notepad. Now when you save as any encoding of Unicode, the text is still
garbage frozen into incorrect Unicode positions. If you save as 1252 however
(labelled "ANSI" in Notepad if your system uses 1252 as the "ANSI"
codepage), you will revert the Unicode garbage to a set of bytes that are
probably identical to the original message.
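
As a rough illustration of the Notepad case, the Python sketch below
contrasts saving the pasted text in a Unicode encoding with saving it as
1252. The sample word and the variable names are assumptions made for the
example; UTF-8 stands in for "any encoding of Unicode".

```python
original = "مرحبا".encode("cp1256")     # bytes of the original message
clipboard = original.decode("cp1252")   # Unicode text as copied from Outlook

as_utf8 = clipboard.encode("utf-8")     # saved in a Unicode encoding
as_ansi = clipboard.encode("cp1252")    # saved as "ANSI" on a 1252 system

# The Unicode file faithfully stores the wrong code points, so reading it
# back correctly still yields the same garbage text.
frozen = as_utf8.decode("utf-8")
print(frozen == clipboard)

# Only the 1252 file contains the original bytes again.
print(as_utf8 == original, as_ansi == original)
print(as_ansi.decode("cp1256"))
```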

IE5.x guesses the encoding of a page using an algorithm that involves
knowledge of encoding sequences, state machines, statistical analysis, and
some knowledge of common words in the languages most likely to be encoded in
certain encodings. This is a little much to explain here, but it involves a
few patented algorithms beyond the obvious ones. IE4 uses a much weaker
version of detection. The good news is that other developers can use the
IE5 detection as well: it is built into the MLANG.DLL that comes with
Windows (the later versions have IE5.x integrated), or you can rely on IE5
being on the machine you are running on. Most of the Office 2000 apps use
this MLANG service to provide the same auto-detection features when loading
HTML.
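
The "common words" idea can be illustrated with a toy Python sketch. This is
not MLANG's algorithm, and the word lists, candidate code pages, and the
function name guess_codepage are all hypothetical. It decodes the bytes
under each candidate and picks the one that produces the most known words.

```python
# Hypothetical, tiny word lists; a real detector uses far richer statistics.
COMMON_WORDS = {
    "cp1252": {"the", "and", "de", "la", "und"},
    "cp1251": {"и", "в", "не", "на"},
    "cp1256": {"في", "من", "على", "إلى"},
}

def guess_codepage(data: bytes) -> str:
    """Return the candidate whose decoding contains the most common words."""
    def score(codepage: str) -> int:
        words = data.decode(codepage, errors="replace").split()
        return sum(word in COMMON_WORDS[codepage] for word in words)
    # On a tie, max() keeps the first candidate, so cp1252 is the fallback.
    return max(COMMON_WORDS, key=score)

sample = "الكتاب في البيت".encode("cp1256")
print(guess_codepage(sample))
```

With so few words the guess is fragile; the point is only that byte values
alone cannot separate these code pages, while language knowledge can.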


Sent with office10 build 1917ship wordmail on

-----Original Message-----
From: Munzir Taha []
Sent: July 18, 2000 5:58 PM
To: Unicode List
Subject: RE: Designing a multilingual web site

>You should explicitly set the encoding in the header of your page, and not
>leave it for the browser to guess. The following should go all in one line
>at the very top of the header:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=UTF-8">

Yes, I understand the point of putting the header in each page explicitly.
But my question is: how did the browser guess it?

Another question: I received a message in Arabic through Outlook 2000. It
didn't appear correctly until I changed the encoding to Arabic (Windows). I
copied the (garbage) text (which is labelled US-ASCII) and pasted it into
Notepad in Win2k. I saved the file in every available format and renamed
each file to .htm. I then tried the different encodings, but to no avail:
the garbage text didn't change at all.

I then went back to Outlook, changed the encoding of the message to
Western European (Windows), saved it as ANSI text, renamed it to .htm,
changed the encoding to Arabic (Windows), and it was OK. Can you please
explain to me why the first attempt failed whereas the second succeeded?

Thanks in advance for your quick reply


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:06 EDT