> If I have 3 HTML files side-by-side in a directory, one in UTF-8,
> another in, say, big-endian Unicode, and a third in shift-JIS,
> there is no way they can be self describing, because in order to
> parse the HTML, you have to understand the encoding already.
HTML files are not just filled with completely unstructured data -- there is
a header, and it is supposed to be in some well-known format. Otherwise,
the situation devolves to precisely what we have today -- unstructured text
or data files filled with unmarked data in a variety of encodings.
If a format is to be "self describing", the header should be in something like
ASCII, up to the point where one has enough information to know the encoding
of the rest of the document... I believe that's the case with HTML. Once
you have looked at the first part of the header, you should have discovered
the encoding and be able to parse the rest of the file (or know that it's
unparsable) without having to guess.
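
As a rough sketch of what I mean (Python; the names sniff_encoding and
decode_html are made up for illustration, and this is not the detection
procedure of any particular browser or spec): look for a byte-order mark
first, since that is what covers the big-endian Unicode file, then scan the
first chunk of bytes for an ASCII "charset=" declaration. The 1024-byte
prescan window is an assumption on my part.

```python
import re

# Byte-order marks cover encodings whose header bytes are not ASCII-compatible.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Matches charset=NAME, quoted or not, anywhere in the scanned bytes.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)


def sniff_encoding(data, prescan=1024):
    """Return the declared encoding name, or None if nothing was declared."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    match = CHARSET_RE.search(data[:prescan])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


def decode_html(data):
    """Decode using only the declared encoding; never guess."""
    encoding = sniff_encoding(data)
    if encoding is None:
        raise ValueError("no encoding declared; refusing to guess")
    text = data.decode(encoding)
    # Drop the BOM character if the decoder left it in place.
    return text[1:] if text.startswith("\ufeff") else text
```

The shift-JIS case works because the declaration itself sits in the ASCII
range, so it can be found before anything is known about the rest of the file.
A file with neither a BOM nor a declaration is reported as such rather than
guessed at.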
I really hope we don't start seeing lots of new file extensions... especially
if they're going to be limited to 3 or 4 letters and collide with everything
else that's 3 or 4 letters...
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:01 EDT