Re: Detecting encoding in Plain text

From: Philippe Verdy (
Date: Mon Jan 12 2004 - 08:54:01 EST

  • Next message: Mark Davis: "Re: Detecting encoding in Plain text"

    From: "Peter Kirk" <>
    > On 12/01/2004 03:09, Marco Cimarosti wrote:
    > > ...
    > >
    > >It is extremely unlikely that a text file encoded in any single- or
    > >multi-byte encoding (including UTF-8) would contain a zero byte, so the
    > >presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
    > >UTF-32.
    > >
    > >
    > >
    > Is it not dangerous to assume that U+0000 is not used? This is a valid
    > character and is commonly used e.g. as a string terminator. Perhaps it
    > should not be used in truly plain text. But it is likely to occur in
    > files which are basically text but include certain kinds of markup.

    This character is invalid at least in HTML, XML, XHTML, SGML and text/plain
    files. Its presence in a file simply indicates that this is not a plain-text
    file, so it could contain arbitrary supplementary content that does not use
    any relevant text encoding.

    More precisely, I think it's safer to consider that any file that seems to
    contain NUL characters is not a text file or, if it really is one, that it
    uses a non-8-bit Unicode encoding scheme like UTF-16 or UTF-32, or a legacy
    16-bit charset.
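
    The heuristic described above can be sketched roughly as follows; this is
    a minimal illustration, not anyone's actual detector, and the function
    name, return values and thresholds are this example's assumptions:

    ```python
    def guess_wide_encoding(data: bytes):
        """Sketch of the NUL-byte heuristic: zero bytes suggest a 16-
        or 32-bit encoding rather than any 8-bit or multi-byte charset
        (including UTF-8, which never produces a 0x00 byte except for
        U+0000 itself). Returns None when no NUL is found."""
        if b"\x00" not in data:
            return None  # plausibly an 8-bit or multi-byte text encoding
        # Runs of three or more NULs strongly suggest UTF-32.
        if b"\x00\x00\x00" in data:
            return "utf-32"
        # Otherwise assume UTF-16; NULs at even offsets imply big-endian
        # (high byte first for ASCII-range characters), odd offsets
        # imply little-endian.
        even_nuls = sum(1 for i in range(0, len(data), 2) if data[i] == 0)
        odd_nuls = sum(1 for i in range(1, len(data), 2) if data[i] == 0)
        return "utf-16-be" if even_nuls >= odd_nuls else "utf-16-le"
    ```

    A real detector would of course also check for a BOM first and weigh
    statistics over the whole file, but the principle is the same.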

    Any attempt to match a file containing a NUL byte as a plain-text file
    with an 8-bit charset should fail (at least when the autodetection is
    meant to parse an HTML or XML text file in a browser). Note that this
    check extends to the byte 0x01, which also unambiguously indicates that
    the file, if it really is plain text, cannot use a legacy 8-bit charset
    but could use UTF-16, UTF-32, SCSU or a legacy 16-bit charset.
    (However, I can't remember whether this applies to VISCII: does it encode
    a Unicode character at position 0x01 instead of a C0 control?)

    My opinion is that most C0 and C1 controls are used as part of a
    protocol; they are not valid and should not be present in plain text
    once it has been decoded and converted to Unicode, where only a few
    should remain: TAB, LF, FF, CR, NEL. Some controls are needed in encoded
    plain-text files only for certain encoding schemes, but they do not encode
    actual characters after the encoding scheme has been parsed: BS, SO, SI,
    ESC, DLE, SS2, SS3... If there is no specific, precise support for these
    legacy encoding schemes, there should be no attempt to "detect" them by
    assuming they could be present in a plain-text file.
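
    That rule for decoded text can be expressed as a small validity check;
    again a sketch, where the function name and the exact whitelist of
    surviving controls (TAB, LF, FF, CR, NEL, as listed above) are this
    example's assumptions:

    ```python
    def looks_like_decoded_plain_text(text: str) -> bool:
        """After conversion to Unicode, only TAB, LF, FF, CR and NEL
        should remain among the control characters; any other C0
        (U+0000..U+001F), DEL or C1 (U+007F..U+009F) control means the
        data is not plain text (or was not fully decoded)."""
        allowed = {"\t", "\n", "\x0c", "\r", "\x85"}  # TAB, LF, FF, CR, NEL
        for ch in text:
            cp = ord(ch)
            if (cp < 0x20 or 0x7F <= cp <= 0x9F) and ch not in allowed:
                return False
        return True
    ```

    Scheme-specific bytes such as ESC or SS2 would already have been consumed
    by the decoder (e.g. for ISO 2022), so they never reach this check.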

    This archive was generated by hypermail 2.1.5 : Mon Jan 12 2004 - 09:36:29 EST