Re: Detecting encoding in Plain text

From: Philippe Verdy (
Date: Mon Jan 12 2004 - 05:58:40 EST

  • Next message: Marco Cimarosti: "RE: Detecting encoding in Plain text"

    From: "Doug Ewell" <>
    > In UTF-16 practically any sequence of bytes is valid, and since you
    > can't assume you know the language, you can't employ distribution
    > statistics. Twelve years ago, when most text was not Unicode and all
    > Unicode text was UTF-16, Microsoft documentation suggested a heuristic
    > of checking every other byte to see if it was zero, which of course
    > would only work for Latin-1 text encoded in UTF-16. If you need to
    > detect the encoding of non-Western-European text, you would have to be
    > more sophisticated than this.

    Here I completely disagree: even though almost any 16-bit value in UTF-16
    is valid, the values are NOT uniformly distributed. You'll see immediately
    that even and odd bytes have very distinct distributions: the bytes
    representing the least significant bits of the code units have a flatter
    distribution over a wider range, while the other bytes are concentrated
    in very few byte values (rarely more than 2 or 3 for European languages,
    or a mostly flat distribution over some limited ranges for Korean or for
    Chinese).
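    The statistic described above can be sketched in a few lines of Python
    (the function name and sample text are mine, not from the original post;
    this is an illustration of the idea, not a full detector):

```python
from collections import Counter

def utf16_byte_stats(data: bytes):
    """Count byte-value frequencies separately at even and odd offsets.

    For UTF-16 text in a European language, the stream carrying the most
    significant bytes of the code units collapses onto very few distinct
    values, while the stream of least significant bytes is spread over a
    much wider range.
    """
    even = Counter(data[0::2])
    odd = Counter(data[1::2])
    return even, odd

# Hypothetical sample: French text, all characters below U+0100.
text = "Détection d'encodage en texte brut : un exemple."
be = text.encode("utf-16-be")

even, odd = utf16_byte_stats(be)
# In big-endian order the even offsets hold the high bytes; since every
# character here is Latin-1, that stream is a single value (0x00), while
# the odd-offset stream has many distinct values.
print(len(even), "distinct high-byte values")
print(len(odd), "distinct low-byte values")
```

    Swapping which offset class shows the concentrated distribution would
    likewise distinguish UTF-16BE from UTF-16LE for such text.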

    Even today, now that Unicode uses more than one plane, UTF-16 is still
    easy to identify, because you'll see sequences of bytes where any byte
    between 0xD8 and 0xDB is followed, two bytes later, by a byte between
    0xDC and 0xDF (a high surrogate followed by a low surrogate). The low bit
    of the positions of these two bytes reveals whether the text is UTF-16BE
    or UTF-16LE, and then you can look at the effective ranges of the decoded
    UTF-16 code units to detect unassigned or illegal codepoints, which would
    invalidate the UTF-16 hypothesis.

    This archive was generated by hypermail 2.1.5 : Mon Jan 12 2004 - 06:30:36 EST