Re: Detecting encoding in Plain text

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jan 12 2004 - 08:54:01 EST

Next message: Mark Davis: "Re: Detecting encoding in Plain text"

Previous message: Peter Kirk: "Re: Detecting encoding in Plain text"
In reply to: Peter Kirk: "Re: Detecting encoding in Plain text"
Next in thread: Doug Ewell: "Re: Detecting encoding in Plain text"

From: "Peter Kirk" <peterkirk@qaya.org>
> On 12/01/2004 03:09, Marco Cimarosti wrote:
>
> > ...
> >
> >It is extremely unlikely that a text file encoded in any single- or
> >multi-byte encoding (including UTF-8) would contain a zero byte, so the
> >presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
> >UTF-32.
> >
> >
> >
> Is it not dangerous to assume that U+0000 is not used? This is a valid
> character and is commonly used e.g. as a string terminator. Perhaps it
> should not be used in truly plain text. But it is likely to occur in
> files which are basically text but include certain kinds of markup.

This character is invalid at least in HTML, XML, XHTML, SGML and text/plain
files. Its presence in a file just indicates that this is not a plain-text
file, so it could have any arbitrary supplementary content which does not
use any relevant text encoding.

More precisely, I think it's safer to consider that any file that seems to
contain NUL characters is not a text file or, if it is really so, it uses a
non-8-bit Unicode encoding scheme like UTF-16 or UTF-32 or a legacy
16-bit charset.

Any attempt to match a file containing any NUL byte as a plain-text file
with an 8-bit charset should fail (at least if the autodetection is needed
to parse an HTML or XML text file in a browser). Note that this check
extends to the byte 0x01, which also unambiguously indicates that the file,
if it's really plain text, cannot use a legacy 8-bit charset but could be
matched with UTF-16, UTF-32, SCSU or a legacy 16-bit charset.
(However I can't remember if this applies to VISCII: does it encode a
plain-text Unicode character at position 0x01, instead of a C0 control?)
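
To make the rule above concrete, here is a rough Python sketch of such a
first-pass check. The function name and the returned labels are invented
for illustration, SCSU and legacy 16-bit charsets are not attempted, and
the UTF-16/UTF-32 test is only "decodes cleanly and leaves no U+0000 or
U+0001 behind", so treat it as a hint and not as a reliable detector.

```python
def classify_sample(data: bytes) -> str:
    """Guess a broad encoding family from the presence of 0x00/0x01 bytes.

    Labels returned here are hypothetical, chosen only for this sketch.
    """
    # No 0x00 or 0x01 byte: an 8-bit charset (or UTF-8) remains possible.
    if b"\x00" not in data and b"\x01" not in data:
        return "8-bit-candidate"

    # Otherwise refuse every 8-bit charset and try the wide Unicode
    # encoding schemes. UTF-32 goes first because it is the stricter test.
    for codec in ("utf-32-le", "utf-32-be", "utf-16-le", "utf-16-be"):
        try:
            text = data.decode(codec)
        except UnicodeDecodeError:
            continue
        # A decoded NUL or U+0001 means this is still not plain text.
        if "\x00" not in text and "\x01" not in text:
            return codec

    # Nothing matched: treat the file as binary, not as text.
    return "not-text"


if __name__ == "__main__":
    print(classify_sample(b"plain ascii"))
    print(classify_sample("hi".encode("utf-16-le")))
    print(classify_sample(b"abc\x00def"))
```

A real detector would need more evidence than this (BOMs, character
statistics, declared charsets), since many even-length binary blobs also
decode as UTF-16 without error.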

My opinion is that most C0 and C1 controls are used as part of an
out-of-band protocol, and they are not valid and should not be present in
plain text files once they have been decoded and converted to Unicode,
where only a few should remain: TAB, LF, FF, CR, NEL. Some controls are
needed in encoded plain-text files only for some encoding schemes, but they
do not encode actual characters after the encoding scheme has been parsed:
BS, SO, SI, ESC, DLE, SS2, SS3... If there's no specific precise support
for these legacy encoding schemes, there should not be any attempt to
"detect" them by assuming they could be present in a plain-text file.
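
As an illustration of that last point, the following Python sketch scans
already-decoded text for C0 and C1 controls other than the five listed
above. The names are made up for this example, and the allowed set is just
my list from this message, not something taken from a standard.

```python
# Controls expected to survive in decoded plain text: TAB, LF, FF, CR, NEL.
ALLOWED_CONTROLS = {"\t", "\n", "\x0c", "\r", "\x85"}


def unexpected_controls(text: str) -> list:
    """Return (index, "U+XXXX") pairs for disallowed C0/C1 controls."""
    found = []
    for index, ch in enumerate(text):
        code = ord(ch)
        is_c0 = code <= 0x1F            # C0 range: U+0000..U+001F
        is_c1 = 0x80 <= code <= 0x9F    # C1 range: U+0080..U+009F
        if (is_c0 or is_c1) and ch not in ALLOWED_CONTROLS:
            found.append((index, "U+%04X" % code))
    return found


if __name__ == "__main__":
    # ESC (U+001B) is reported; TAB and LF are accepted.
    print(unexpected_controls("a\tb\n\x1bc"))
```

A non-empty result suggests either that the file is not plain text or that
an encoding scheme using those controls (ISO 2022 style shifts, for
example) was not parsed before the check.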


This archive was generated by hypermail 2.1.5 : Mon Jan 12 2004 - 09:36:29 EST