Re: Problem facing while dealing with full width alpha numeric characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Sep 21 2006 - 16:21:36 CDT


    Rakesh asked:

    > I am facing a problem. We are using Unicode ICU 3.4 library.

    Ordinarily, issues in the use of the ICU library should be taken
    up with the folks at icu-project.org, particularly if you want
    to file bugs against the library, but...

    > When I use
    > uscript_getScript to get the Unicode script of fullwidth alphanumeric
    > characters, it returns Latin script for these characters.

    That is actually the correct result, and is not a bug.

    > But they are
    > Japanese-specific characters.

    That assumption is incorrect. Fullwidth alphanumeric characters
    in the range U+FF01..U+FF5E are also used with Chinese text and
    occur on Chinese (and Korean) legacy code pages.

    > They can't lie in the Latin block.

    Script assignments are not necessarily the same as block assignments
    in the Unicode Standard, and you cannot rely exclusively on either
    of those values to determine the language of strings.
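
    As a rough illustration of that point, here is a small sketch in
    Python rather than ICU. Python's standard unicodedata module does
    not expose the Script property, so the character name, the East
    Asian Width value, and the NFKC form are used here only as stand-ins
    to show that U+FF21 is a Latin letter that happens to be encoded
    in a compatibility block. The helper name describe is hypothetical.

```python
import unicodedata

def describe(ch):
    """Return (name, East Asian Width, NFKC form) for one character.

    Illustration only: this is not ICU's uscript_getScript, and the
    standard library does not report the Script property directly.
    """
    return (
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),
        unicodedata.normalize("NFKC", ch),
    )

# U+FF21 lives in the Halfwidth and Fullwidth Forms block, but it is
# named as a Latin letter and folds to plain "A" under NFKC.
name, width, folded = describe("\uFF21")
print(name, width, folded)
```

    The block tells you where the character was encoded; the name and
    the compatibility mapping show that it is still a Latin letter.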

    > My case is that I need to assign a language to each character inserted
    > in our text box.

    Well, that sounds like a design problem to me. Language cannot
    be reliably assigned on a per character basis.

    > If it is a hiragana character I assign it Japanese; similarly I have
    > to assign Japanese to fullwidth roman characters, but this API
    > doesn't give me correct results.

    There are reasonably reliable heuristics for East Asian text, which
    can serve to distinguish Japanese from Korean from Chinese, and
    any of those from European languages, but you need to work on
    strings, not on a per-character basis.

    The presence of a string of fullwidth Latin alphabetic characters
    is a very good heuristic to indicate that you are dealing
    with East Asian data, but is not in itself sufficient to determine
    whether the rest of the data is Japanese or Chinese, for example.
    And the fullwidth Latin character string itself could be
    in English (most likely) or some other language.
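
    A deliberately simplistic sketch of what a string-level heuristic
    might look like, in Python. This is not one of the algorithms
    presented at the conferences, and the function name, the code point
    ranges chosen, and the returned labels are all assumptions made for
    illustration. It only counts kana, Hangul syllables, basic CJK
    ideographs, and fullwidth ASCII variants across the whole string.

```python
def guess_language(text):
    """Very rough whole-string guess; hypothetical helper, not an ICU API."""
    kana = hangul = han = fullwidth = 0
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:        # Hiragana and Katakana blocks
            kana += 1
        elif 0xAC00 <= cp <= 0xD7A3:      # precomposed Hangul syllables
            hangul += 1
        elif 0x4E00 <= cp <= 0x9FFF:      # CJK Unified Ideographs (basic block)
            han += 1
        elif 0xFF01 <= cp <= 0xFF5E:      # fullwidth ASCII variants
            fullwidth += 1
    # Assumption: any kana suggests Japanese, any Hangul suggests Korean,
    # and ideographs alone are treated as Chinese.
    if kana:
        return "ja"
    if hangul:
        return "ko"
    if han:
        return "zh"
    if fullwidth:
        # East Asian data is likely, but the language is still unknown.
        return "east-asian-unknown"
    return "unknown"

print(guess_language("これはＡＢＣです"))
print(guess_language("ＡＢＣ"))
```

    Note that the fullwidth letters never decide the answer on their
    own; they take on whatever the surrounding string suggests, which
    is why a per-character assignment cannot work.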

    This general topic has been discussed at several of the
    International Unicode Conferences. See, for example:

    http://www.unicode.org/iuc/iuc21/a343.html

    There are people who have good heuristic algorithms
    for what you are trying to do here, and you might be able to
    get a tip on a canned library routine that would work well for
    what you need.

    --Ken

    > Could you please look into this?
    >
    > Regards,
    > ~Rakesh


