RE: Detecting encoding in Plain text

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Mon Jan 12 2004 - 06:09:04 EST


    Doug Ewell wrote:
    > In UTF-16 practically any sequence of bytes is valid, and since you
    > can't assume you know the language, you can't employ distribution
    > statistics. Twelve years ago, when most text was not Unicode and all
    > Unicode text was UTF-16, Microsoft documentation suggested a heuristic
    > of checking every other byte to see if it was zero, which of course
    > would only work for Latin-1 text encoded in UTF-16.

    I beg to differ. IMHO, analyzing zero bytes is a viable method for
    detecting BOM-less UTF-16 and UTF-32. BTW, I didn't know (and don't much
    care) that this method was first suggested by Microsoft: to me, it seems
    quite self-evident.

    It is extremely unlikely that a text file encoded in any single- or
    multi-byte encoding (including UTF-8) would contain a zero byte, so the
    presence of zero bytes is a strong hint that the file is in UTF-16 (or
    UCS-2) or UTF-32.
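
    In C, this first test is a few lines (a sketch; the function name is of
    course arbitrary):

        #include <stddef.h>

        /* Returns 1 if the buffer contains at least one zero byte --
           a strong hint that the text is UTF-16/UCS-2 or UTF-32
           rather than a single- or multi-byte encoding. */
        int has_zero_bytes(const unsigned char *buf, size_t len)
        {
            size_t i;
            for (i = 0; i < len; i++)
                if (buf[i] == 0)
                    return 1;
            return 0;
        }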

    The next step is distinguishing between UTF-16 and UTF-32. A bullet-proof
    negative heuristic for UTF-32 is that a text file *cannot* be UTF-32 unless
    at least 1/4 of its bytes are zero: code points never exceed U+10FFFF, so
    the most significant byte of every 32-bit code unit is zero. A positive
    heuristic for UTF-32 is detecting two consecutive zero bytes, the first of
    which at an even offset (counting byte offsets from 0): as it is very
    unlikely that a UTF-16 file would contain a NULL character, a zero 16-bit
    word must be part of a UTF-32 code unit. The combination of these two
    methods is reliable enough to tell UTF-16 and UTF-32 apart.
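
    In C, the two tests could be combined along these lines (a sketch; the
    function name and the 16/32 return convention are of course arbitrary):

        #include <stddef.h>

        /* Guesses the code unit width of a buffer already known to
           contain zero bytes: returns 32 for UTF-32, 16 for UTF-16. */
        int guess_utf_width(const unsigned char *buf, size_t len)
        {
            size_t zeros = 0, zero_words = 0, i;

            for (i = 0; i < len; i++)
                if (buf[i] == 0)
                    zeros++;

            /* Negative test: UTF-32 requires at least 1/4 zero bytes,
               because the top byte of every code unit is zero. */
            if (zeros * 4 < len)
                return 16;

            /* Positive test: a zero 16-bit word (two zero bytes
               starting at an even offset) would be a NULL character
               in UTF-16, so it almost certainly belongs to UTF-32. */
            for (i = 0; i + 1 < len; i += 2)
                if (buf[i] == 0 && buf[i + 1] == 0)
                    zero_words++;

            return zero_words > 0 ? 32 : 16;
        }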

    Once you have determined whether the file is in UTF-16 or in UTF-32, a
    statistical analysis of the *offsets* of zero bytes should suffice to
    determine the UTF's endianness. UTF-16 is likely to be little-endian if
    zero bytes are more frequent at odd offsets than at even offsets, and vice
    versa. This is due to the fact that, in any language, the shared characters
    in the Latin-1 range (controls, space, digits, punctuation, etc.), whose
    high byte is zero, should be more frequent than the occasional code points
    of the form <U+??00>. For UTF-32, determining endianness is even simpler:
    if *all* bytes at offsets 3, 7, 11, ... (the last byte of each 32-bit
    unit) are zero, then it is little-endian, else it is big-endian.
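
    Both endianness tests, sketched in the same style (again, the names are
    arbitrary; a nonzero return means little-endian):

        #include <stddef.h>

        /* UTF-16: zero bytes at odd offsets are the high bytes of
           little-endian code units; Latin-1-range characters make
           them the majority in little-endian text. */
        int utf16_is_le(const unsigned char *buf, size_t len)
        {
            size_t even = 0, odd = 0, i;
            for (i = 0; i < len; i++)
                if (buf[i] == 0) {
                    if (i % 2 == 0)
                        even++;
                    else
                        odd++;
                }
            return odd > even;
        }

        /* UTF-32: code points never exceed U+10FFFF, so the most
           significant byte of every code unit is zero; in the
           little-endian case it sits at offset 3 of each unit. */
        int utf32_is_le(const unsigned char *buf, size_t len)
        {
            size_t i;
            for (i = 3; i < len; i += 4)
                if (buf[i] != 0)
                    return 0;
            return 1;
        }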

    Of course, all this works only under the basic assumption that the file is
    a plain text file: these heuristics are not sufficient for telling text
    files apart from binary files.

    _ Marco


