RE: UTF-8N?

From: Ayers, Mike (Mike_Ayers@bmc.com)
Date: Tue Jun 20 2000 - 16:20:50 EDT


> From: Juliusz Chroboczek [mailto:jec@dcs.ed.ac.uk]
> Sent: Tuesday, June 20, 2000 12:02 PM
>
> Of course, no mismatch happens if the OS keeps track of file types.
> Splitting in the octet manner a text/plain file leads to two
> octet-stream files, and the OS should ensure that you cannot merge
> them in the wrong way.

<soapbox>

        Bad OS. Bad, bad, bad!!! The problem with this is that the OS
will tend to make "assumptions" (I think we all know the translation for
that word) about files which are not locally created. While this seems
harmless most of the time, some very interesting problems tend to arise. As
an example, I once had to remove RealPlayer from my NT system (fortunately I
didn't want it anyway) because it claimed the 'SMI' extension, and we use that
here for something entirely un-RealPlayer-like. I won't give some
hypothetical situation where an OS ends the world by trying to identify a
file type - rather, I'll just point out that it is bad form to rely on
another layer to solve your problems. Processing of text files occurs at
the application layer, and there it should stay. This means that
applications which split and join files should be responsible for their own
consistency (yes, I know that these apps often ship as part of the OS, but
they should, IMHO, remain only part of it).

</soapbox>

        In any case, it seems that there is now almost enough information
to write a filetype parser. I've got:

00 00 FE FF: UCS-4, big-endian machine (1234 order)
FF FE 00 00: UCS-4, little-endian machine (4321 order)
FE FF 00 ##: UTF-16, big-endian
FF FE ## 00: UTF-16, little-endian
EF BB BF: UTF-8
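
The table above can be sketched as a small signature sniffer. This is only an illustration, assuming Python; the names SIGNATURES and sniff_signature are made up for the example. The four-byte UCS-4 signatures are tested first because FF FE 00 00 also begins with the UTF-16 little-endian signature FF FE (so a UTF-16 file whose first character is U+0000 would be misread as UCS-4).

```python
# Signatures from the table, longest first so that FF FE 00 00 is not
# mistaken for the shorter UTF-16 little-endian signature FF FE.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UCS-4, big-endian"),
    (b"\xff\xfe\x00\x00", "UCS-4, little-endian"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16, big-endian"),
    (b"\xff\xfe", "UTF-16, little-endian"),
]


def sniff_signature(data):
    """Return the encoding named by a leading signature, or None if absent."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

A result of None means "no header", which is the case the next paragraph deals with.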

        Otherwise, you've got either UTF-8 with no header, one of the ISO 8859
sets, or possibly a non-Latin multibyte alphabet, for which (as I understand
it) you need to construct a parser for each supported encoding scheme (I
didn't look up my terms - is that the correct one?). Hopefully, it will
parse without error in one and only one encoding scheme, at which point
you've (probably) identified the type.
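
The trial-parse idea can be sketched as follows, again assuming Python and using its built-in codecs as the "parsers"; candidate_encodings and its default list are hypothetical. One caveat shows up immediately: the ISO 8859-1 codec maps every byte value to a character, so it never reports an error, and the "one and only one" outcome is less common than hoped.

```python
def candidate_encodings(data, encodings=("utf-8", "iso8859-1", "shift_jis")):
    """Return the encodings from the list that decode data without error."""
    matches = []
    for encoding in encodings:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        matches.append(encoding)
    return matches
```

For example, the bytes 63 61 66 E9 ("caf" plus a lone E9) are rejected by the UTF-8 decoder but accepted as ISO 8859-1, while plain ASCII text is accepted by both, so a single surviving candidate is a hint rather than proof.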

        Do I have the right idea here (non-soapbox part)?

/|/|ike



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:04 EDT