[SUM] encoding checker

From: Sebastian Hofer (sebastian.hofer@gistec-online.de)
Date: Wed May 14 2003 - 04:19:49 EDT

    Hi List:
    Thanks to all who answered. Because the hints and links take different
    approaches, it is hard to give a general statement, so give them a try.

    #####################################
    Thanks to:
    Edward Trager (Linux solution)
    T. "Kuro" Kurosaka (basistech)
    Marco Cimarosti (languageidentifier)
    Ben Dougall (mlmassociates/Dcpcmd)
    #####################################

    Here are the links and solutions:
    -----------------------------
    > On Linux there is the command line utility called "file" which will
    > certainly segregate ASCII and UTF-8. Although it doesn't go very
    > far in detecting other unicode encoding possibilities, I'm sure one could
    > combine this with a little bit of Perl to meet your specific needs:
    > $> file *
    > images: directory
    > index.html: HTML document text
    > java.data: ASCII text
    > ucs2.data: MP3, 56 kBits2, 64 kBits, 48 kHz, Stereo
    > utf-16-be.data: data
    > utf-16-le.data: data
    > utf-7.data: ASCII text
    > utf8.data: UTF-8 Unicode text
    > utf8.data.png: PNG image data, 914 x 676, 2-bit colormap, non-interlaced
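    The ASCII / UTF-8 split that "file" reports above can be approximated
    with a strict trial decode. The following is only a minimal sketch in
    Python (not what "file" actually does internally); the function name
    classify_bytes is made up for illustration.

```python
def classify_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'unknown' for a byte string.

    Hypothetical helper: it tries a strict ASCII decode first, then a
    strict UTF-8 decode. Anything that fails both is reported as
    'unknown' (it could be UTF-16, Latin-1, binary data, ...).
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


if __name__ == "__main__":
    import sys

    # Usage: python classify.py file1 file2 ...
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(f"{path}: {classify_bytes(handle.read())}")
```

    As in the listing above, UTF-7 data comes out as ASCII, because UTF-7
    uses only 7-bit bytes. A successful decode is evidence, not proof: many
    byte sequences are valid in more than one encoding.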

    ===============

    http://www.basistech.com/products/text-processing/euclid.html
    This is good, although it is expensive. There is a free online demo.

    ===============

    http://www.languageidentifier.com/

    ===============

    have a look at the very recent thread on this list, in the archives:
    "suggestions for strategy on dealing with plain text in potentially any
    (unspecified) encoding?" there's a lot of useful stuff in that.
    basically nearly all text encodings just go ahead and use their
    encoding without first stating "i'm 7bit ascii" or whatever (even
    unicode, when it doesn't use a bom). so, often the required info simply
    isn't there. some html, most (maybe all) xml, some unicode (via a bom)
    and most (maybe all) emails carry information about which encoding is
    being used.
    so it seems if anything is going to tell you explicitly which encoding
    is being used, it's going to be the text format rather than the
    encoding itself (apart from unicode and its boms). if the text or the
    encoding itself does not specify the encoding, i don't think there is
    any absolute, sure way to find out. but there are various methods to
    make good, educated guesses (see the thread i mentioned).
    also someone on this list pointed me to this which you might find
    useful:
    <http://www.mlmassociates.cc/dl-win32.htm>
    Dcpcmd is a command line program that illustrates using the Windows
    IMultiLanguage interface to detect a code page.
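
    The one case mentioned above where the encoding does announce itself,
    a Unicode BOM, is easy to check for. Below is a minimal sketch in
    Python; the name sniff_bom and the returned labels are assumptions for
    illustration, and a missing BOM says nothing about the encoding.

```python
import codecs

# Longer BOMs are listed first: the UTF-32-LE BOM (FF FE 00 00) begins
# with the UTF-16-LE BOM (FF FE), so the order of the checks matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent.

    Caveat: UTF-16-LE text whose first character is U+0000 looks exactly
    like a UTF-32-LE BOM, so even this check is a guess in rare cases.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

    When sniff_bom returns None, one has to fall back on the format's own
    declaration (HTML meta tag, XML declaration, email headers) or on a
    statistical guess.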

    Cheers!
    Seb


