Re: Problem facing while dealing with full width alpha numeric characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Sep 21 2006 - 16:21:36 CDT

Maybe in reply to: Rakesh Sharma: "Problem facing while dealing with full width alpha numeric characters"

Rakesh asked:

> I am facing a problem. We are using Unicode ICU 3.4 library.

Ordinarily, issues in the use of the ICU library should be taken
up with the folks at icu-project.org, particularly if you want
to file bugs against the library, but...

> When I use
> uscript_getScript to get unicode script of full width alpha numeric
> characters, it returns me Latin script for these characters.

That is actually the correct result, and is not a bug.

> But they are
> Japanese-specific characters.

That assumption is incorrect. Fullwidth alphanumeric characters
in the range U+FF01..U+FF5E are also used with Chinese text and
occur on Chinese (and Korean) legacy code pages.

> They can't lie in latin block.

Script assignments are not necessarily the same as block assignments
in the Unicode Standard, and you cannot rely exclusively on either
of those values to determine the language of strings.
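
To illustrate the script-versus-block distinction without ICU, here is a
minimal sketch using only Python's standard unicodedata module. The helper
names (describe, in_fullwidth_ascii_range) are hypothetical and are not part
of ICU or any library. The sketch shows that U+FF21, although it sits in the
fullwidth range mentioned above, is named as a Latin letter and folds to the
ordinary ASCII "A" under NFKC compatibility normalization.

```python
import unicodedata


def describe(ch):
    """Return (character name, East Asian Width, NFKC form) for one character.

    Hypothetical helper for illustration only.
    """
    return (
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),
        unicodedata.normalize("NFKC", ch),
    )


def in_fullwidth_ascii_range(ch):
    """True if ch lies in U+FF01..U+FF5E, the fullwidth variants of ASCII."""
    return 0xFF01 <= ord(ch) <= 0xFF5E


# U+FF21 is in the fullwidth range, yet it is a Latin letter:
# its name says LATIN, its width class is F (Fullwidth), and
# NFKC maps it to the ordinary ASCII letter.
info = describe("\uFF21")
# info == ("FULLWIDTH LATIN CAPITAL LETTER A", "F", "A")
```

Nothing in these properties identifies a language. They only say that the
character is a compatibility variant of a Latin letter, which is consistent
with the Latin script value reported by uscript_getScript.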

> My case is that i need to assign language to each character inserted in our
> text box.

Well, that sounds like a design problem to me. Language cannot
be reliably assigned on a per character basis.

> If it is hiragana character i assign it japanese, similarly i have
> to assign japanese language to full width roman characters but this API
> doesn't give me correct results.

There are reasonably reliable heuristics for East Asian text, which
can serve to distinguish Japanese from Korean from Chinese, and
any of those from European languages, but you need to work on
strings, not on a per-character basis.

The presence of a string of full-width Latin alphabetic characters
is a very good heuristic to indicate that you are dealing
with East Asian data, but is not itself sufficient to determine
whether the rest of the data is Japanese or Chinese, for example.
And the full-width Latin character string itself could be
in English (most likely) or some other language.
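
As a rough illustration of working on strings rather than on single
characters, the following sketch counts a few character classes over a whole
string before guessing anything. It is not one of the heuristic algorithms
referred to in this message and is not an ICU routine: the function name,
the return labels, the code point ranges chosen, and the order of the checks
are all assumptions made for this example.

```python
def guess_east_asian_language(text):
    """Very rough string-level guess; hypothetical, for illustration only.

    Returns one of: "ja", "ko", "zh-or-ja", "east-asian-context", "unknown".
    """
    kana = hangul = han = fullwidth_latin = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:        # Hiragana and Katakana blocks
            kana += 1
        elif 0xAC00 <= cp <= 0xD7A3:      # precomposed Hangul syllables
            hangul += 1
        elif 0x4E00 <= cp <= 0x9FFF:      # CJK Unified Ideographs (main block)
            han += 1
        elif 0xFF21 <= cp <= 0xFF3A or 0xFF41 <= cp <= 0xFF5A:
            fullwidth_latin += 1          # fullwidth A-Z and a-z

    # Assumed priority: kana is a strong sign of Japanese, Hangul of Korean.
    if kana:
        return "ja"
    if hangul:
        return "ko"
    if han:
        # Ideographs alone cannot separate Chinese from Japanese.
        return "zh-or-ja"
    if fullwidth_latin:
        # Fullwidth Latin suggests East Asian data but names no language.
        return "east-asian-context"
    return "unknown"
```

For example, a string of only fullwidth Latin letters yields
"east-asian-context", while the same letters embedded in a sentence with
hiragana yield "ja". The fullwidth letters themselves are never assigned a
language; only the surrounding string provides evidence.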

This general topic has been discussed at several of the
International Unicode Conferences. See, for example:

http://www.unicode.org/iuc/iuc21/a343.html

There are people who have good heuristic algorithms
for what you are trying to do here, and you might be able to
get a tip on a canned library routine that would work well for
what you need.

--Ken

> Could you please look into this?
>
> Regards,
> ~Rakesh

