suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?

From: Ben Dougall (bend@freenet.co.uk)
Date: Thu May 08 2003 - 16:50:16 EDT


    this is about all text encodings including unicode and converting that
    text into unicode:

    i'm starting to make an app in mac os x. i want to enable it to take in
    any plain text, in any encoding - at least the standard-ish ones in use
    - any language though, and convert that text to unicode. there's a
    whole array of encoding mappings already specified within os x, and i
    guess it wouldn't be too hard to add more if any turned out to be
    missing. the problem is ascertaining which encoding the text
    being input into the app is in. i've found out that if there's no
    encoding specified in the text in some way, there's no sure
    programmatic way to do so - at least not confidently - you can
    programmatically guess, but no more. i can easily ascertain the
    preferred / main language of the user though. but i'm not sure how
    useful that is.
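
A minimal sketch of the "programmatic guess" described above, in Python rather than any Mac OS X API. It assumes three steps that are not in the original post: look for a byte order mark, then try strict UTF-8 (text in a legacy encoding rarely happens to be valid UTF-8 once it contains non-ASCII bytes), then fall back to a caller-supplied default. The function name `guess_and_decode` and the `mac_roman` default are illustrative choices only, and the result is still a guess, not a certainty.

```python
import codecs

# Byte order marks, longest first so that UTF-32 LE is not mistaken for
# UTF-16 LE (the UTF-16 LE mark is a prefix of the UTF-32 LE mark).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_and_decode(data: bytes, fallback: str = "mac_roman"):
    """Return (text, encoding_name) for untagged bytes. This is only a guess."""
    # 1. A byte order mark is the closest thing plain text has to a tag.
    for bom, name in BOMS:
        if data.startswith(bom):
            return data.decode(name), name
    # 2. Strict UTF-8: pure ASCII also passes this step unchanged.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Assumed default (hypothetical choice); the user should be able
    #    to override it when the result looks garbled.
    return data.decode(fallback, errors="replace"), fallback
```

For example, `guess_and_decode(b"caf\x8e")` is not valid UTF-8, so it falls through to the Mac Roman default, where byte 0x8E is "é".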

    the text that may be input into my app could be from anywhere - from the
    net, wherever. plain text though (as opposed to rich / fancy text).
    this is hard for me to judge as i'm an english speaking user who
    hasn't bothered with other languages :/ (apart from a bit of c) so i've
    not really had any personal experience of how good or bad software is
    at dealing with this situation - it's very likely to get it right for
    me because i'm a default user - i use and expect english text. if i were
    chinese, say, would it be a different story? does a chinese user, for
    example, often get confronted with text that initially looks completely
    garbled until they tell the app to use a chinese encoding to display
    it, at which point it becomes readable?

    i'm aware that most html, all xml and some unicode text have encoding tags
    of one sort or another. it's not the already tagged text i'm asking
    about, but the untagged.

    i got this about the same kind of question from apple's site:

    > If no encoding is found then to fall back to the default C string
    > encoding. This encoding would be Mac Roman for an English system,
    > possibly something else for an Asian-localized system; the idea is to
    > have the greatest chance of correctly interpreting existing files
    > generated by previous systems, and to require the user to override for
    > anything else.

    possibly something else for an asian system? any ideas what that should
    be? and what about other localisations? and then, what if the text is
    in, say, ascii: would defaulting to a localisation that is not latin
    based render it garbled? i don't know - i really cannot
    see how to deal with this. any suggestions for a general way of dealing
    with this? bear in mind i will know the user's main language, if that
    helps?
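
One way to explore the questions above is sketched below in Python. The language-to-encoding table is purely hypothetical (it is not what Mac OS X or Apple's documentation specifies), and the names `LEGACY_BY_LANGUAGE`, `fallback_encoding` and `decodes_same_everywhere` are invented for illustration. The second function checks whether a piece of text would come out identically under every candidate default, which is the case for ordinary ASCII letters and digits because these legacy encodings keep the ASCII range for them.

```python
# Hypothetical mapping from the user's main language to a likely legacy
# encoding. These pairings are assumptions made for illustration only.
LEGACY_BY_LANGUAGE = {
    "en": "mac_roman",
    "ja": "shift_jis",
    "zh-hans": "gb2312",
    "zh-hant": "big5",
    "ko": "euc_kr",
    "ru": "koi8_r",
}


def fallback_encoding(language: str) -> str:
    """Pick a default encoding from the user's main language."""
    return LEGACY_BY_LANGUAGE.get(language, "mac_roman")


def decodes_same_everywhere(data: bytes) -> bool:
    """True if every candidate default yields exactly the same text."""
    results = set()
    for encoding in set(LEGACY_BY_LANGUAGE.values()):
        try:
            results.add(data.decode(encoding))
        except UnicodeDecodeError:
            # Record a failure as its own outcome so it counts as a mismatch.
            results.add(None)
    return len(results) == 1
```

Under these assumptions, `decodes_same_everywhere(b"hello world")` is true, so plain ASCII text is not garbled by a non-Latin default, while bytes above 0x7F such as `b"caf\x8e"` decode differently (or fail) depending on the default chosen. That is where knowing the user's main language, plus a manual override, would matter.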


