RE: Detecting encoding in Plain text

From: Tom Emerson
Date: Mon Jan 12 2004 - 11:57:47 EST


    Perhaps a meta question is this: how often are you going to encounter
    unBOMed UTF-32 or UTF-16 text? It's pretty rare --- certainly I've never
    seen it during the development of our language/encoding identifier.

    Sure, it's an interesting thought problem, but it doesn't happen.
    And fortunately detecting UTF-8 is relatively easy.
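
    To illustrate why the Unicode cases are the easy ones, here is a
    minimal Python sketch: check for a BOM first, then see whether the
    bytes decode as strict UTF-8. The function name sniff_unicode and the
    returned labels are made up for this example; this is not the
    identifier mentioned above.

```python
import codecs

# Check the UTF-32 BOMs before UTF-16: the UTF-32 LE BOM begins with
# the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_unicode(data: bytes):
    """Return a Unicode encoding label for data, or None if undecided.

    Hypothetical helper: a BOM wins outright; otherwise the bytes must
    decode as strict UTF-8.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not UTF-8: some legacy encoding, left to other methods
    # Pure ASCII is valid UTF-8 but carries no evidence either way.
    if all(b < 0x80 for b in data):
        return "ascii"
    return "utf-8"
```

    Multi-byte UTF-8 sequences have a rigid lead-byte/continuation-byte
    structure, so text in a legacy 8-bit encoding that contains accented
    characters rarely passes the strict decode by accident.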

    The real problem is differentiating between the ISO 8859-x family and
    EUC-CN vs. EUC-KR. These are wonderfully ambiguous.

    The key to doing this right is having _a_lot_ of valid training data.
    You also have to deal with oddities of language: I tried one open
    source implementation of the Cavnar and Trenkle algorithm that claimed

    It's difficult to separate the language detection from the encoding
    detection when dealing with non-Unicode text.
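
    A rough sketch of the rank-order n-gram idea, applied to raw bytes so
    that each profile is labelled with a (language, encoding) pair rather
    than a language alone. Everything here is an assumption made for
    illustration: the names profile, out_of_place, classify, and TOP, the
    n-gram length limit, and the three tiny samples, which stand in for
    the large training corpus a real identifier needs.

```python
from collections import Counter

TOP = 300  # profile size; also the penalty for an n-gram missing from a profile


def profile(data: bytes, max_n: int = 3) -> dict:
    """Map the TOP most frequent byte n-grams (length 1..max_n) to their rank."""
    counts = Counter(
        data[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(data) - n + 1)
    )
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(TOP))}


def out_of_place(doc: dict, cat: dict) -> int:
    """Sum of rank differences between a document profile and a category profile."""
    return sum(abs(rank - cat[g]) if g in cat else TOP for g, rank in doc.items())


def classify(data: bytes, profiles: dict):
    """Return the label whose profile is closest to the document's profile."""
    doc = profile(data)
    return min(profiles, key=lambda label: out_of_place(doc, profiles[label]))


# Toy training data (hypothetical): the same Russian sentence in two
# encodings, plus a German sentence in Latin-1.
SAMPLES = {
    ("ru", "koi8-r"): "это простой пример русского текста".encode("koi8-r"),
    ("ru", "cp1251"): "это простой пример русского текста".encode("cp1251"),
    ("de", "latin-1"): "dies ist ein einfaches Beispiel für deutschen Text".encode("latin-1"),
}
PROFILES = {label: profile(data) for label, data in SAMPLES.items()}
```

    Because the profiles are built from bytes, the same language in two
    encodings produces two distinct profiles, which is one way the
    language and encoding questions end up being answered together.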


    Tom Emerson                                          Basis Technology Corp.
    Software Architect                       
      "Beware the lollipop of mediocrity: lick it once and you suck forever" 
