RE: suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?

From: Maurice Bauhahn (bauhahnm@clara.net)
Date: Sat May 10 2003 - 07:40:31 EDT

Next message: John Delacour: "Re: suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?"

Previous message: Ben Dougall: "suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?"
In reply to: Ben Dougall: "suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?"
Next in thread: Jungshik Shin: "RE: suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?"
Reply: Jungshik Shin: "RE: suggestions for strategy on dealing with plain text in potentially any (unspecified) encoding?"

It would appear to be a three-step process:

(1) First, detect whether the byte patterns suggest a single-byte or a
multi-byte encoding, and separate the text into apparent units. From that,
work out the probabilities that the patterns are UTF-8, UTF-16LE, or UTF-16BE
(or look for a BOM for the last two). I'm not familiar with the byte
patterns of Shift-JIS, Big5, or EUC, but presumably each has some
characteristic sequences. The units could then be arranged in an array by
order of frequency.

(2) Second, compare this list against a hash of reference frequencies of
Unicode characters in various languages. The languages whose frequency
patterns match could then be joined against the encodings likely to be used
to represent those languages.

(3) Third, with a generous bit of fuzzy logic (!!), test against the most
likely encodings (normalising the assumed code points to Unicode) and run
language-specific Unicode spell checkers against the results. The
spell-check language and encoding that percolate to the top as the most
accurate might be right ;-)
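
A rough Python sketch of steps (1) and (3), for illustration only. It
checks for a BOM, then trial-decodes the bytes with each candidate encoding
and keeps the result that scores best against a word list. The function
names, the candidate list and the tiny word list are all made up for this
example; the word list is a toy stand-in for a real spell checker, and the
frequency comparison of step (2) is left out.

```python
import codecs

# Step (1): byte order marks for the Unicode encodings mentioned above.
# "utf-8-sig" and "utf-16" are Python codec names that consume the BOM.
# (UTF-32 is ignored here; its little-endian BOM also starts with FF FE.)
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if the data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def spell_score(text, wordlist):
    """Toy 'spell checker': fraction of words found in the word list."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in wordlist for w in words) / len(words)


def guess_encoding(data, candidates, wordlists):
    """Return (encoding, text) for the best guess, or (None, None)."""
    bom_enc = sniff_bom(data)
    if bom_enc is not None:
        try:
            return bom_enc, data.decode(bom_enc)
        except UnicodeDecodeError:
            pass  # looked like a BOM but the rest is invalid; keep guessing

    best = None  # (score, encoding, text)
    for enc in candidates:
        # Step (3): normalise to Unicode; skip encodings that cannot decode.
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        # Score against every language's word list and keep the best match.
        score = max(spell_score(text, wl) for wl in wordlists.values())
        if best is None or score > best[0]:
            best = (score, enc, text)

    if best is None:
        return None, None
    return best[1], best[2]


# Hypothetical inputs: a French sentence saved as Mac Roman.
WORDLISTS = {"fr": {"le", "café", "est", "très", "bon"}}
CANDIDATES = ["utf-8", "latin-1", "mac_roman"]
sample = "le café est très bon".encode("mac_roman")
encoding, text = guess_encoding(sample, CANDIDATES, WORDLISTS)
```

With a real dictionary per language and a longer text the scores separate
much more clearly; on short inputs the guess stays just that, a guess.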

Cheers,

Maurice

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
Behalf Of Ben Dougall
Sent: 10 May 2003 10:56
To: unicode@unicode.org
Subject: suggestions for strategy on dealing with plain text in
potentially any (unspecified) encoding?

this is about all text encodings including unicode and converting that
text to unicode:

i'm starting to make an app in mac os x. i want to give it the ability
to take in plain text, in any encoding - at least the standard-ish ones
in use - any language though, and convert that text to unicode. there's
a whole array of encoding mappings already specified within os x. the
problem is ascertaining which encoding the text being input into the
app is in. i've found out that if there's no encoding specified in the
text in some way, there's no sure programmatic way to do so - at least
not confidently - you can programmatically guess, but no more. i can
easily ascertain the preferred / main language of the user though. but
i'm not sure how useful that is.

the text that may be input into my app could be from anywhere - from the
net, wherever. plain text though (as opposed to rich / fancy text).
this is hard for me to judge as i'm an english speaking user that
hasn't had much experience of other languages :/ (apart from a bit of
c) so i've not really had any personal experience of how good or bad
software is at dealing with text encoding recognition - it's very
likely to get it right for me because i'm a default user - i use
english text. if i was chinese say, would that be a different story?
does a chinese user for example often get confronted with initially
completely garbled looking text until they tell the app to use chinese
encoding to display the text, then it becomes readable?

i'm aware that most html, all xml and some unicode have encoding tags
of one sort or another. it's not the already tagged text i'm asking
about, but the untagged.

i got this about the same kind of question from apple's site:

> If no encoding is found then to fall back to the default C string
> encoding. This encoding would be Mac Roman for an English system,
> possibly something else for an Asian-localized system; the idea is to
> have the greatest chance of correctly interpreting existing files
> generated by previous systems, and to require the user to override for
> anything else.

possibly something else for an asian system? any ideas what that should
be? and what about other localisations? and then, what about if it is
in say ascii, then defaulting to a localisation that was not a latin
based character set would render the ascii garbled? i don't know - i
really can not see how to deal with this.
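
On the ascii worry specifically, a small Python check may help. It is only
a sketch: the sample text and the list of encodings are picked for
illustration, and the names are Python's codec names rather than the os x
ones. It shows that bytes made up of ascii letters, digits and basic
punctuation decode to the same characters under several common legacy
encodings, so a non-latin default would not garble that kind of text.

```python
# Sample made only of ASCII letters, digits, a comma and spaces.
ASCII_SAMPLE = b"plain ascii text, 12345"

# A few encodings a localised system might default to (illustrative list).
LEGACY = ["mac_roman", "shift_jis", "euc_jp", "big5", "euc_kr", "gb2312", "utf-8"]


def is_pure_ascii(data):
    """True if every byte is below 0x80."""
    return all(b < 0x80 for b in data)


def decodes_like_ascii(data, encoding):
    """True if decoding with `encoding` gives the same text as ASCII does."""
    return data.decode(encoding) == data.decode("ascii")


# Collect the result for each encoding in the list.
results = {enc: decodes_like_ascii(ASCII_SAMPLE, enc) for enc in LEGACY}
```

Every entry in `results` comes out true for this sample. UTF-16 is the
obvious exception, since it uses two bytes per unit, so it has to be
detected separately (for instance by its BOM).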

anyone have any suggestions for a strategy for dealing with text that
could be any encoding? bear in mind i will know the user's main
language, if that helps?

This archive was generated by hypermail 2.1.5 : Sat May 10 2003 - 08:15:51 EDT