Re: suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?

From: Ben Dougall (bend@freenet.co.uk)
Date: Sat May 10 2003 - 09:10:25 EDT


    > >It would appear to be a three step process:
    > >
    > >(1) First, detect ...
    > >(2) Second, compare ...
    > >(3) Third, ... test
    >
    > (4) Give the user a chance to correct your program's guess -- some
    > users actually know!

    this is all very useful information, details included, as is the
    emacs-related info (will definitely follow that up) - thanks very
    much.

    what should the default be, though? after encoding detection, after
    fuzzy logic, after whatever other tricks, but before the user gets a
    chance to change it themselves, the program may still not know. so how
    should that particular decision be made (given that the user's main
    language is known)?

    if the user's main language is any latin-based one, an 8-bit
    extension of ascii (iso-8859-1, say) would be the obvious choice.
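
A minimal sketch of why an 8-bit superset of ASCII is a convenient last resort, assuming "8bit extended ascii" is read as ISO-8859-1 (one possible reading, not the only one). Strict ASCII rejects any byte above 0x7F, whereas ISO-8859-1 assigns a character to every byte value, so decoding cannot fail, although the result may still be the wrong text. The helper name `decodes_cleanly` is made up for this illustration.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` without an error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# 0xE9 is outside the 7-bit ASCII range; in ISO-8859-1 it is "é".
sample = b"caf\xe9"

print(decodes_cleanly(sample, "ascii"))       # strict ASCII rejects byte 0xE9
print(decodes_cleanly(sample, "iso-8859-1"))  # ISO-8859-1 accepts every byte
print(sample.decode("iso-8859-1"))

# Every possible byte value survives a round trip through ISO-8859-1.
all_bytes = bytes(range(256))
assert all_bytes.decode("iso-8859-1").encode("iso-8859-1") == all_bytes
```

Decoding without an error is not the same as decoding correctly, which is why the user still needs a chance to override the choice.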

    but what if the user's main language is written in a script other than
    latin? would falling back to an encoding other than extended ascii be
    in order in those cases? if so, which base encodings are there other
    than ascii? i'm guessing there aren't going to be many (viewing ascii
    as the one for latin-based scripts). OR should it not fall back to an
    alternative at all, and just default to 8-bit extended ascii
    regardless of the language setting?
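
One way to picture the language-dependent option is a small lookup table from the user's language to a legacy encoding, with an 8-bit Latin encoding as the last resort. The sketch below is only an illustration under stated assumptions: the table entries are example pairings chosen for this sketch, not recommendations from the thread, and the names `LEGACY_DEFAULTS`, `LAST_RESORT`, `default_encoding` and `decode_with_default` are hypothetical. Several languages have more than one common legacy encoding, so any such table is a guess the user should be able to override.

```python
# Illustrative pairings only (assumptions for this sketch).
# Many languages have several competing legacy encodings.
LEGACY_DEFAULTS = {
    "ru": "koi8_r",
    "el": "iso8859_7",
    "he": "iso8859_8",
    "ja": "shift_jis",
    "ko": "euc_kr",
    "zh-cn": "gb2312",
    "zh-tw": "big5",
}

# Assumed last resort: ISO-8859-1 decodes any byte sequence.
LAST_RESORT = "iso-8859-1"


def default_encoding(language_tag: str) -> str:
    """Pick a default encoding from a language tag such as 'ru' or 'zh_TW'."""
    tag = language_tag.lower().replace("_", "-")
    # Try the full tag first, then only the primary language subtag.
    for key in (tag, tag.split("-")[0]):
        if key in LEGACY_DEFAULTS:
            return LEGACY_DEFAULTS[key]
    return LAST_RESORT


def decode_with_default(data: bytes, language_tag: str) -> tuple[str, str]:
    """Decode with the language default; fall back if the bytes do not fit."""
    encoding = default_encoding(language_tag)
    try:
        return data.decode(encoding), encoding
    except UnicodeDecodeError:
        return data.decode(LAST_RESORT), LAST_RESORT


text, used = decode_with_default("Привет".encode("koi8_r"), "ru")
print(used, text)
```

Returning the encoding that was actually used lets the caller show it to the user, who can then correct the guess.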


