Re: Autodetection of CP437 vs. Latin-1

From: Addison Phillips
Date: Sat Feb 10 2007 - 14:17:31 CST

    Mike wrote:
    >>> One problem is detecting text with the MS-DOS box-drawing characters
    > You could look for long runs of the single and/or double horizontal
    > box drawing characters. If you want to be extra careful, look at
    > the previous/next character to see if it's a corner or T.

    I suspect that Doug's real problem isn't so much with the box drawing
    characters per se. They're rare in any real user-entered text.
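
For reference, the run heuristic Mike describes could be sketched roughly as follows in Python. The function name and the minimum run length of 4 are assumptions made for illustration; the byte values 0xC4 (single horizontal line) and 0xCD (double horizontal line) are the CP437 assignments.

```python
import re

# CP437 byte values: 0xC4 is the single horizontal line, 0xCD the double one.
# The threshold of 4 repeated bytes is an arbitrary, illustrative choice.
HORIZONTAL_RUN = re.compile(rb"(?:\xC4{4,}|\xCD{4,})")


def has_box_drawing_run(data: bytes) -> bool:
    """Return True if data contains a long run of one horizontal line byte."""
    return HORIZONTAL_RUN.search(data) is not None
```

A lone 0xC4 (which would be A-diaeresis in CP 1252) does not trigger the check; only a repeated run does. Checking the neighbouring bytes for corners or T pieces, as suggested, would tighten it further.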

    The problem with the DOS code pages is that they use different byte
    values from the more modern Windows encodings (which tend to be based on
    standards such as the ISO 8859 series). In some ways, they kind of
    resemble Shift-JIS, with the box drawing gunk in the middle of the
    "extended ASCII" range and the accented letters appearing to one side or
    the other of that range.

    The problem here is more likely to be with letter pairs when guessing
    the encoding. For most Western European languages, the majority of the
    data will be 7-bit ASCII, and a smallish run of data might have only one
    or two non-ASCII characters embedded in it to assist in guessing.
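
To see how little evidence a short sample offers, one could pull out each non-ASCII byte together with its neighbours. This is only a sketch; `high_byte_contexts` is a hypothetical helper name, not part of any library.

```python
def high_byte_contexts(data: bytes):
    """Return (previous, byte, next) for every byte >= 0x80 in data.

    Neighbours are None at the start or end of the data.
    """
    contexts = []
    for i, b in enumerate(data):
        if b >= 0x80:
            prev = data[i - 1] if i > 0 else None
            nxt = data[i + 1] if i + 1 < len(data) else None
            contexts.append((prev, b, nxt))
    return contexts
```

For a sample such as `"Voilà, merci".encode("cp1252")` this returns a single triple, which is all a guesser has to go on.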

    For example, in CP 850, U+00C8 (capital E with grave) is represented by
    the byte value 0xD4. In CP 1252, this same character is represented by
    the byte 0xC8, and 0xD4 represents U+00D4 (capital O with circumflex).
    Finally, in CP 850, the byte 0xC8 represents a box drawing character.
    The question is: given that I have a byte 0xD4, is it more likely to be
    an E-grave or an O-circumflex? If I guess CP 850, then any 0xC8 bytes
    that appear must be box drawing characters, which are rare in real text
    (that is, the "guess" is quite likely to be wrong).
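
The byte assignments above can be checked with Python's built-in `cp850` and `cp1252` codecs. This is a small sketch; `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(byte: int, code_page: str) -> str:
    """Decode one byte under the given code page and return its Unicode name."""
    char = bytes([byte]).decode(code_page)
    return unicodedata.name(char)


for code_page in ("cp850", "cp1252"):
    for byte in (0xD4, 0xC8):
        print(f"{code_page} 0x{byte:02X}: {describe(byte, code_page)}")
```

The same two byte values yield three different characters across the two code pages, which is why one or two high bytes are weak evidence either way.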


    Addison Phillips
    Globalization Architect -- Yahoo! Inc.
    Internationalization is an architecture.
    It is not a feature.
