Re: Question about some MS IE options

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Tue Dec 04 2001 - 04:46:03 EST


Robert M. Gerlach wrote:
> When saving a webpage from within Microsoft Internet Explorer,

Which version? I've tested your issue with version 5 SP2 (more
precisely: 5.00.3314.2101).

> there are a few notable options...

...for the encoding of the file.

> and I'm really unsure as to what the differences are,

- "Unicode" saves the HTML source in UTF-16-LE encoding with BOM,
   cf. <http://www.unicode.org/unicode/faq/utf_bom.html>,
- "Unicode (UTF-8)" saves it in -- guess what? -- UTF-8 encoding,
   cf. <http://www.unicode.org/unicode/faq/utf_bom.html>,
- "Western European (ISO)" saves it in ISO 8859-1 encoding,
   cf. <http://czyborra.com/charsets/iso8859.html#ISO-8859-1>,
- "Western European (Windows)" saves it in MS CP 1252 encoding,
   cf. <http://czyborra.com/charsets/codepages.html#CP1252>.
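
To make the four options concrete, here is a minimal Python sketch that
encodes one short string in each of them. The sample text and the euro
sign check are my own assumptions for illustration; the sketch only
mirrors the encodings named above, it is not what IE does internally.

```python
import codecs

# Hypothetical sample text; "ü" and "ß" exist in both ISO 8859-1 and CP 1252.
sample = "Grüße"

encoded = {
    # UTF-16 little-endian, preceded by the byte order mark FF FE
    "Unicode": codecs.BOM_UTF16_LE + sample.encode("utf-16-le"),
    "Unicode (UTF-8)": sample.encode("utf-8"),
    "Western European (ISO)": sample.encode("iso-8859-1"),
    "Western European (Windows)": sample.encode("cp1252"),
}

for option, data in encoded.items():
    print(f"{option:27} {len(data):2d} bytes  {data.hex()}")

# The two "Western European" options differ for characters such as the
# euro sign: CP 1252 has it at byte 0x80, ISO 8859-1 does not have it.
euro_cp1252 = "€".encode("cp1252")
try:
    "€".encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
```

For text that stays within ISO 8859-1, the two "Western European"
results are byte-for-byte identical; they only diverge for the extra
characters CP 1252 places in the range 0x80 to 0x9F.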

> which is "better," etc.

Some thoughts:

- CP 1252 is a proprietary encoding (though widely understood);
   I'd prefer a standard encoding for the sake of portability.

- Both ISO 8859-1 and CP 1252 comprise a limited character
   set; if your HTML source contains characters outside this set,
   the UTFs are preferable. IE 5 SP2 does not warn you of this
   situation; rather, it replaces every single character not
   representable in the encoding chosen with the pertinent NCR
   (cf. <http://www.w3.org/TR/html401/charset.html#h-5.3.1>).
   Drawbacks:
   · NCRs are hard to edit.
   · NCRs take excessive storage (6 to 7 bytes per character).
   · NCRs outside the current encoding are not displayed
     correctly by Netscape 4.7x browsers.

- UTF-8 is more common for HTML sources than UTF-16.

- UTF-8 does not suffer from the BE vs. LE issue.

- For all alphabetic scripts, a UTF-8 encoded HTML source
   takes less storage than a UTF-16 encoded one:
   UTF-8 takes 1 byte per ASCII character (used for the HTML
   tags, and in Latin-based scripts also for the bulk of the
   text); it takes two bytes per character for most other
   alphabetic scripts (e.g. Greek and Cyrillic). UTF-16 takes two
   bytes per character for both ASCII and non-ASCII characters
   from alphabetic scripts.

- Both ISO 8859-1 and CP 1252 are handled easily by all text
   editors; for the UTFs, you will need a Unicode-capable
   text editor (which is no big deal in Win 2000 and Win XP,
   otherwise cf. <http://www.hclrss.demon.co.uk/unicode/>
   and <http://www.unicode.org/unicode/onlinedat/products.html>).
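
The NCR fallback and the storage comparison above can be reproduced
with a small Python sketch. It uses Python's "xmlcharrefreplace" error
handler as a stand-in for the replacement IE performs; the helper name
encode_with_ncrs and the sample snippet are assumptions made for
illustration only.

```python
def encode_with_ncrs(text: str, encoding: str) -> bytes:
    """Encode text; characters the encoding lacks become decimal NCRs."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# Hypothetical snippet: ASCII markup around three Greek letters.
html = "<p>αβγ</p>"

as_latin1 = encode_with_ncrs(html, "iso-8859-1")
as_utf8 = html.encode("utf-8")
as_utf16 = html.encode("utf-16-le")  # without the 2-byte BOM

print(as_latin1)  # each Greek letter has become a 6-byte NCR
print(len(as_latin1), len(as_utf8), len(as_utf16))
```

For this snippet the NCR version is the largest of the three, and the
UTF-8 version is smaller than the UTF-16 one because the markup itself
is pure ASCII.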

So, if your HTML source is in a "Western" language, I'd
recommend "Western European (ISO)", otherwise "Unicode (UTF-8)".
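
As a rough sketch of that rule of thumb in Python: the function below
treats "Western" as "every character fits into ISO 8859-1", which is an
assumption of this sketch rather than a precise definition, and the
name pick_encoding is hypothetical.

```python
def pick_encoding(html: str) -> str:
    """Return "iso-8859-1" if the whole text fits, otherwise "utf-8"."""
    try:
        html.encode("iso-8859-1")
    except UnicodeEncodeError:
        # At least one character is outside ISO 8859-1; saving in that
        # encoding would turn it into an NCR, so prefer UTF-8.
        return "utf-8"
    return "iso-8859-1"


print(pick_encoding("<p>café</p>"))
print(pick_encoding("<p>Привет</p>"))
```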

Best wishes,
   Otto Stolz


