Markus Scherer replied to,
| > Don't we need some conventional file extensions for both plain
| > text and HTML encoded in UTF-8, UTF-16, etc? E.g.
| > ".utf" => text/plain; charset = utf-8
| > ".uni" => text/plain; charset = utf-16
| > ".utfml" => text/html; charset = utf-8
| > ".uniml" => text/html; charset = utf-16
| It is not feasible to have a different extension per encoding, and
| is - luckily - not necessary with HTML and XML pages since they are
You must mean something I don't understand by "self-describing".
If I have 3 HTML files side-by-side in a directory, one in UTF-8,
another in, say, big-endian Unicode, and a third in Shift-JIS,
there is no way they can be self-describing, because in order to
parse the HTML, you have to understand the encoding already.
The server could open the file and read some of it, and guess that
if every alternate byte is a 0---or a lot of them are---then it might
be Unicode; and that if it has a lot of characters with bit 7 clear,
and otherwise obeys the syntax of UTF-8, that it might be UTF-8;
you could even hope for a BOM. But these are only heuristics. Why
should the server have to examine the file in order to be able to
serve it? There seems to be a category error of some sort here ...
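
To make the guessing game above concrete, here is a minimal sketch in
Python of what such a server-side check might look like. The function
name guess_encoding and the 40% zero-byte threshold are my own
assumptions for illustration, not anything a real server is known to
do; the point is only that every branch is a guess.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Guess an encoding label from raw bytes (hypothetical helper).

    Returns "unknown" when no heuristic fires.
    """
    # 1. A byte order mark is the strongest hint, when present.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE) or data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"

    # 2. Many zero bytes in alternating positions suggest 16-bit Unicode.
    #    The 0.4 threshold is arbitrary (an assumption of this sketch).
    half = len(data) // 2
    if half > 0:
        even_zeros = data[0::2].count(0)
        odd_zeros = data[1::2].count(0)
        if even_zeros > half * 0.4:
            return "utf-16-be"
        if odd_zeros > half * 0.4:
            return "utf-16-le"

    # 3. If the bytes obey UTF-8 syntax, call it UTF-8 (plain ASCII lands here too).
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"
```

Note that a Shift-JIS file simply falls through to "unknown" here,
and 16-bit text with few ASCII-range characters would defeat step 2.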
And in any case, even if HTML were self-describing, and we
didn't mind opening the file and checking the contents before serving
it, what about plain text? Near-arbitrary byte sequences are legal in
plain text---I imagine a short document could be constructed that is
legal, and even plausible, as both Unicode and UTF-8.
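
As a small illustration of that ambiguity, the sketch below (Python;
the sample bytes are my own choice) decodes the same eight bytes both
ways. Read as UTF-8 they are an English word; read as big-endian
16-bit Unicode they are four code points that all fall in the CJK
ideograph range, so neither decoder complains.

```python
# The same eight bytes, read under two different encodings.
data = b"Unicode!"

as_utf8 = data.decode("utf-8")       # eight ASCII characters
as_utf16 = data.decode("utf-16-be")  # four 16-bit code units

print(as_utf8)
print([f"U+{ord(c):04X}" for c in as_utf16])

# Every resulting code point lies in the CJK Unified Ideographs block
# (U+4E00..U+9FFF), so the UTF-16 reading is not obviously garbage.
all_cjk = all(0x4E00 <= ord(c) <= 0x9FFF for c in as_utf16)
print(all_cjk)
```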
| You should provide your HTTP server with the
| information about your pages that you have it serve up. This
| information would include charset, language, and maybe more.
We are in complete agreement---but the way this is usually done is
via the file name extension, so I've been waiting for ".uni", ".utf"
etc to start appearing, and they haven't yet. I think the answer is
probably just that Unicode technology is still a little way away from
| If you don't provide this information, then the browser can still
| get it out of the HTML page's <meta> tags.
So what *should* the server put in the charset field of the
header? Something like
Content-Type: text/html; charset = unknown
(or, equivalently, just remain silent on the matter) and let the
browser figure it out? It might work---it seems that it is the status
quo---but I don't see it working for plain text.
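
For comparison, here is a rough sketch in Python of the
extension-driven approach. The table uses the extensions proposed in
the quoted message plus two ordinary ones; the function name
content_type_header and the fallback type are assumptions of mine, not
the behaviour of any particular server. When the extension says
nothing about the encoding, the sketch stays silent on charset rather
than inventing a value.

```python
import os

# Hypothetical mapping: extension -> (media type, charset or None).
# The first four entries are the extensions proposed earlier in this thread.
EXTENSION_TYPES = {
    ".utf": ("text/plain", "utf-8"),
    ".uni": ("text/plain", "utf-16"),
    ".utfml": ("text/html", "utf-8"),
    ".uniml": ("text/html", "utf-16"),
    ".html": ("text/html", None),   # encoding not knowable from the name
    ".txt": ("text/plain", None),
}

def content_type_header(filename: str) -> str:
    """Build a Content-Type header line from the file name alone."""
    ext = os.path.splitext(filename)[1].lower()
    media_type, charset = EXTENSION_TYPES.get(
        ext, ("application/octet-stream", None)
    )
    if charset is None:
        # Remain silent on the charset and leave it to the browser.
        return f"Content-Type: {media_type}"
    return f"Content-Type: {media_type}; charset={charset}"

print(content_type_header("page.utfml"))
print(content_type_header("notes.txt"))
```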
| By the way, the default charset for HTML is ISO8859-1, not US-
| ASCII, I hope.
The default *document* character set is now ISO10646 (Unicode). It
used to be ISO8859-1. But this only specifies how to interpret
numeric entity references. The encoding by which the characters in
the file themselves are represented is another matter entirely. Shift-
JIS is not illegal, as far as I know, as long as it's announced
properly. Which is where I came in ...
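
The distinction can be shown in a few lines of Python (a sketch using
only the standard library; the choice of character is mine). The
numeric reference names a Unicode code point whatever the file's
encoding is, while the bytes used to store that same character differ
between Shift-JIS and UTF-8.

```python
import html

# The character U+65E5, written as an escape so this file stays ASCII.
ch = "\u65e5"

# A numeric character reference is interpreted against the document
# character set (Unicode), independent of how the file is encoded.
from_reference = html.unescape("&#26085;")   # 26085 == 0x65E5

# The raw bytes for the same character depend on the encoding.
sjis_bytes = ch.encode("shift_jis")
utf8_bytes = ch.encode("utf-8")

print(from_reference == ch)
print(sjis_bytes.hex(), utf8_bytes.hex())
print(sjis_bytes.decode("shift_jis") == utf8_bytes.decode("utf-8"))
```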
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:01 EDT