Re: Detecting encoding in Plain text

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jan 12 2004 - 05:58:40 EST

  • Next message: Marco Cimarosti: "RE: Detecting encoding in Plain text"

    From: "Doug Ewell" <dewell@adelphia.net>
    > In UTF-16 practically any sequence of bytes is valid, and since you
    > can't assume you know the language, you can't employ distribution
    > statistics. Twelve years ago, when most text was not Unicode and all
    > Unicode text was UTF-16, Microsoft documentation suggested a heuristic
    > of checking every other byte to see if it was zero, which of course
    > would only work for Latin-1 text encoded in UTF-16. If you need to
    > detect the encoding of non-Western-European text, you would have to be
    > more sophisticated than this.

    Here I completely disagree: even though almost any 16-bit value is valid
    in UTF-16, the values are NOT uniformly distributed. You will see
    immediately that bytes at even and odd offsets have very distinct
    distributions: the bytes holding the least significant bits of the code
    units have a flatter distribution over a wider range, while the other
    bytes are concentrated in very few values (rarely more than 2 or 3 for
    European languages, or a roughly flat distribution over a few limited
    ranges for Korean or Chinese).
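    The even/odd distribution test above can be sketched roughly as follows.
    This is a minimal illustration, not a production detector: the function
    names and the 4x threshold separating the "narrow" high-byte stream from
    the "wide" low-byte stream are my own assumptions, chosen only to make
    the idea concrete.

    ```python
    # Sketch of the even/odd-offset distribution heuristic described above.
    # Names and the 4x threshold are illustrative assumptions.
    from collections import Counter

    def offset_distributions(data: bytes):
        """Count distinct byte values at even and at odd offsets."""
        even = Counter(data[0::2])  # offsets 0, 2, 4, ...
        odd = Counter(data[1::2])   # offsets 1, 3, 5, ...
        return even, odd

    def looks_like_utf16(data: bytes) -> bool:
        """Heuristic: in UTF-16 text, one offset class (the high bytes of
        the code units) uses far fewer distinct byte values than the other
        (the low bytes)."""
        even, odd = offset_distributions(data)
        if not even or not odd:
            return False
        narrow, wide = sorted((len(even), len(odd)))
        return narrow * 4 <= wide  # high-byte stream is much narrower

    sample = "Detecting the encoding of plain text is a heuristic problem."
    print(looks_like_utf16(sample.encode("utf-16-le")))  # narrow odd offsets
    print(looks_like_utf16(sample.encode("utf-16-be")))  # narrow even offsets
    ```

    For Latin-script text the "narrow" stream collapses to one or two values
    (mostly 0x00), which is exactly the pattern the old Microsoft heuristic
    exploited; the point here is that the same even/odd asymmetry persists,
    in weaker form, for Korean or Chinese text as well.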

    Even today, when Unicode has more than one plane, UTF-16 is still easy to
    detect, because you will see byte sequences in which any byte between
    0xD8 and 0xDB is followed, two bytes later, by a byte between 0xDC and
    0xDF. The parity of the offsets of these two bytes reveals whether the
    text is coded as UTF-16BE or UTF-16LE, and you can then look at the
    effective ranges of the decoded UTF-16 code units to detect unassigned or
    illegal code points, which would rule out the UTF-16 possibility.
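    The surrogate-pair parity test can be sketched as below, assuming the
    byte-value ranges given above (0xD8-0xDB for the lead byte of a high
    surrogate, 0xDC-0xDF for a low surrogate); the function name is
    illustrative.

    ```python
    # Sketch of the surrogate-pair byte-order heuristic described above.
    # In UTF-16BE the high byte of each code unit sits at an even offset;
    # in UTF-16LE it sits at an odd offset.

    def guess_utf16_byte_order(data: bytes):
        """Return 'UTF-16BE', 'UTF-16LE', or None if no surrogate pair
        is found (fall back to other statistics in that case)."""
        for i in range(len(data) - 2):
            # High-surrogate lead byte followed, two bytes later,
            # by a low-surrogate lead byte.
            if 0xD8 <= data[i] <= 0xDB and 0xDC <= data[i + 2] <= 0xDF:
                return "UTF-16BE" if i % 2 == 0 else "UTF-16LE"
        return None

    clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
    print(guess_utf16_byte_order(clef.encode("utf-16-be")))  # UTF-16BE
    print(guess_utf16_byte_order(clef.encode("utf-16-le")))  # UTF-16LE
    ```

    Note this only fires on text containing supplementary-plane characters;
    BMP-only text yields None, so the distribution statistics above remain
    the primary signal.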



    This archive was generated by hypermail 2.1.5 : Mon Jan 12 2004 - 06:30:36 EST