From: Philippe Verdy (firstname.lastname@example.org)
Date: Mon Jan 12 2004 - 08:54:01 EST
From: "Peter Kirk" <email@example.com>
> On 12/01/2004 03:09, Marco Cimarosti wrote:
> > ...
> >It is extremely unlikely that a text file encoded in any single- or
> >multi-byte encoding (including UTF-8) would contain a zero byte, so the
> >presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
> Is it not dangerous to assume that U+0000 is not used? This is a valid
> character and is commonly used e.g. as a string terminator. Perhaps it
> should not be used in truly plain text. But it is likely to occur in
> files which are basically text but include certain kinds of markup.
This character is invalid at least in HTML, XML, XHTML, SGML and text/plain
files. Its presence in a file simply indicates that this is not a plain-text
file, so it could have any arbitrary supplementary content which does not use
any relevant text encoding.
More precisely, I think it's safer to consider that any file that seems to
contain NUL characters is not a text file or, if it really is one, that it uses
a non-8-bit Unicode encoding scheme like UTF-16 or UTF-32, or a legacy 16-bit
charset.
Any attempt to match a file containing any NUL byte as a plain-text
file with an 8-bit charset should fail (at least if the autodetection is being
used to parse an HTML or XML text file in a browser). Note that this check can
be extended to the byte 0x01, which also unambiguously indicates that the file,
if it's really plain text, cannot use a legacy 8-bit charset but could be
encoded with UTF-16, UTF-32, SCSU or a legacy 16-bit charset.
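The rejection rule above can be sketched in a few lines; this is a minimal
illustration (the function name and its use are my own, not from any existing
detector), assuming we only want to rule out legacy 8-bit charsets when a
0x00 or 0x01 byte appears in the raw data:

```python
def could_be_8bit_text(data: bytes) -> bool:
    """Heuristic sketch: a buffer containing a 0x00 or 0x01 byte is
    almost certainly not plain text in a legacy 8-bit charset (or in
    UTF-8); it may instead be binary, or text encoded with UTF-16,
    UTF-32, SCSU, or a legacy 16-bit charset."""
    return 0x00 not in data and 0x01 not in data

# Plain ASCII or Latin-1 text passes the check...
print(could_be_8bit_text(b"plain text"))          # True
# ...but UTF-16 text fails it, since most of its code units
# contain a zero high byte (plus the 0xFF 0xFE BOM here).
print(could_be_8bit_text("hi".encode("utf-16")))  # False
```

As the original quote notes, this is only a strong hint, not a proof: a
text file that legitimately embeds U+0000 (e.g. as a string terminator in
markup-like data) would be misclassified.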
(However I can't remember if this applies to VISCII: does it encode a
Unicode character at position 0x01, instead of a C0 control?)
My opinion is that most C0 and C1 controls are used as part of an encoding
or transmission protocol, and they are not valid and should not be present in
plain text once it has been decoded and converted to Unicode, where only a few
should remain: TAB, LF, FF, CR, NEL. Some controls are needed in encoded
plain-text files only for some encoding schemes, but they do not encode
actual characters after the encoding scheme has been parsed: BS, SO, SI,
ESC, DLE, SS2, SS3... If there's no specific precise support for these
legacy encoding schemes, there should not be any attempt to "detect" them by
assuming they could be present in a plain-text file.
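The whitelist described above can be expressed as a small check on already
decoded text; this is a hypothetical sketch (the names are mine), assuming
the only controls tolerated in decoded plain text are TAB, LF, FF, CR and
NEL:

```python
# TAB, LF, FF, CR, NEL: the formatting controls the text above
# says may legitimately remain after decoding to Unicode.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0C, 0x0D, 0x85}

def has_stray_controls(text: str) -> bool:
    """Return True if the decoded text contains a C0 (U+0000..U+001F,
    U+007F) or C1 (U+0080..U+009F) control outside the whitelist."""
    for ch in text:
        cp = ord(ch)
        is_control = cp < 0x20 or 0x7F <= cp <= 0x9F
        if is_control and cp not in ALLOWED_CONTROLS:
            return True
    return False

print(has_stray_controls("line1\nline2\t"))  # False
print(has_stray_controls("esc\x1b[0m"))      # True: ESC left over
                                             # from an escape sequence
```

Controls like ESC, SO, SI or SS2 failing this check would suggest, per the
argument above, an undecoded legacy encoding scheme rather than plain text.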
This archive was generated by hypermail 2.1.5 : Mon Jan 12 2004 - 09:36:29 EST