Re: Detecting encoding in Plain text

From: Philippe Verdy (
Date: Mon Jan 12 2004 - 08:54:01 EST

  • Next message: Mark Davis: "Re: Detecting encoding in Plain text"

    From: "Peter Kirk" <>
    > On 12/01/2004 03:09, Marco Cimarosti wrote:
    > > ...
    > >
    > >It is extremely unlikely that a text file encoded in any single- or
    > >multi-byte encoding (including UTF-8) would contain a zero byte, so the
    > >presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
    > >UTF-32.
    > >
    > >
    > >
    > Is it not dangerous to assume that U+0000 is not used? This is a valid
    > character and is commonly used e.g. as a string terminator. Perhaps it
    > should not be used in truly plain text. But it is likely to occur in
    > files which are basically text but include certain kinds of markup.

    This character is invalid at least in HTML, XML, XHTML, SGML and text/plain
    files. Its presence in a file simply indicates that this is not a plain-text
    file, so it could contain arbitrary supplementary content that does not use
    any relevant text encoding.

    More precisely, I think it's safer to consider that any file that seems to
    contain NUL characters is not a text file or, if it really is one, that it
    uses a non-8-bit Unicode encoding scheme like UTF-16 or UTF-32, or a legacy
    16-bit charset.
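
    The heuristic described above can be sketched roughly as follows; this is
    a minimal illustration, not anyone's actual detector, and the function
    name, return values and thresholds are this example's assumptions:

    ```python
    def guess_wide_encoding(data: bytes):
        """Sketch of the NUL-byte heuristic: zero bytes suggest a 16-
        or 32-bit encoding rather than any 8-bit or multi-byte charset
        (including UTF-8, which never produces a 0x00 byte except for
        U+0000 itself). Returns None when no NUL is found."""
        if b"\x00" not in data:
            return None  # plausibly an 8-bit or multi-byte text encoding
        # Runs of three or more NULs strongly suggest UTF-32.
        if b"\x00\x00\x00" in data:
            return "utf-32"
        # Otherwise assume UTF-16; NULs at even offsets imply big-endian
        # (high byte first for ASCII-range characters), odd offsets
        # imply little-endian.
        even_nuls = sum(1 for i in range(0, len(data), 2) if data[i] == 0)
        odd_nuls = sum(1 for i in range(1, len(data), 2) if data[i] == 0)
        return "utf-16-be" if even_nuls >= odd_nuls else "utf-16-le"
    ```

    A real detector would of course also check for a BOM first and weigh
    statistics over the whole file, but the principle is the same.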

    Any attempt to match a file containing a NUL byte as a plain-text file
    with an 8-bit charset should fail (at least when the autodetection is
    meant to parse an HTML or XML text file in a browser). Note that this
    check extends to the byte 0x01, which also unambiguously indicates that
    the file, if it really is plain text, cannot use a legacy 8-bit charset
    but could use UTF-16, UTF-32, SCSU or a legacy 16-bit charset.
    (However, I can't remember whether this applies to VISCII: does it encode
    a Unicode character at position 0x01 instead of a C0 control?)

    My opinion is that most C0 and C1 controls are used as part of a
    protocol; they are not valid and should not be present in plain text
    once it has been decoded and converted to Unicode, where only a few
    should remain: TAB, LF, FF, CR, NEL. Some controls are needed in encoded
    plain-text files only for certain encoding schemes, but they do not encode
    actual characters after the encoding scheme has been parsed: BS, SO, SI,
    ESC, DLE, SS2, SS3... If there is no specific, precise support for these
    legacy encoding schemes, there should be no attempt to "detect" them by
    assuming they could be present in a plain-text file.
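
    That rule for decoded text can be expressed as a small validity check;
    again a sketch, where the function name and the exact whitelist of
    surviving controls (TAB, LF, FF, CR, NEL, as listed above) are this
    example's assumptions:

    ```python
    def looks_like_decoded_plain_text(text: str) -> bool:
        """After conversion to Unicode, only TAB, LF, FF, CR and NEL
        should remain among the control characters; any other C0
        (U+0000..U+001F), DEL or C1 (U+007F..U+009F) control means the
        data is not plain text (or was not fully decoded)."""
        allowed = {"\t", "\n", "\x0c", "\r", "\x85"}  # TAB, LF, FF, CR, NEL
        for ch in text:
            cp = ord(ch)
            if (cp < 0x20 or 0x7F <= cp <= 0x9F) and ch not in allowed:
                return False
        return True
    ```

    Scheme-specific bytes such as ESC or SS2 would already have been consumed
    by the decoder (e.g. for ISO 2022), so they never reach this check.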

    This archive was generated by hypermail 2.1.5 : Mon Jan 12 2004 - 09:36:29 EST