Unicode and Java Questions

From: Matt Chu (matt.chu@gmail.com)
Date: Wed Oct 01 2008 - 19:55:11 CDT


    For a Java internationalization project I've been looking into the Unicode
    standard and the ICU4J library, and I'm confused about a few things.
    Hopefully somebody on this list can help me with one or two of the questions
    below.

    1) I want to standardize on a normalization form, but this sentence in Annex
    15 (Unicode Normalization Forms) gave me pause:

    "Normalization Forms KC and KD must not be blindly applied to arbitrary
    text. Because they erase many formatting distinctions, they will prevent
    round-trip conversion to and from many legacy character sets, and unless
    supplanted by formatting markup, they may remove distinctions that are
    important to the semantics of the text."

    For example, suppose that I use NFKC on text that has both halfwidth "ｶ" and
    fullwidth "カ". Thanks to NFKC, both are now converted to fullwidth "カ"
    (let's say). When I want to convert back to a Japanese-specific encoding, I
    no longer know which ones were halfwidth and which were fullwidth. The
    question is, how big of a deal is this in real-world, normal usage?
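
    To make the loss concrete, here is a minimal sketch using the JDK's own
    java.text.Normalizer (not ICU4J, which has an equivalent facility). The
    class and method names are just placeholders for illustration. It shows
    that NFKC folds the halfwidth form into the ordinary katakana letter,
    while NFC leaves it alone:

```java
import java.text.Normalizer;

final class WidthFolding {
    // Compatibility composition: folds width variants, ligatures, and so on.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Canonical composition: keeps compatibility distinctions such as width.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String halfwidthKa = "\uFF76"; // HALFWIDTH KATAKANA LETTER KA
        String katakanaKa = "\u30AB";  // KATAKANA LETTER KA (the usual form)

        // After NFKC the two inputs are indistinguishable.
        System.out.println(nfkc(halfwidthKa).equals(katakanaKa));
        // After NFC the halfwidth form is still there.
        System.out.println(nfc(halfwidthKa).equals(halfwidthKa));
    }
}
```

    If round-tripping to a legacy Japanese encoding matters, one option would
    be to store the text in NFC and apply NFKC only to a derived copy used for
    searching or matching.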

    2) Can string equivalence be both locale-agnostic and locale-sensitive? That
    is, can two code points be equal in one language and not equal in another?
    I am assuming some normalization form here. If this is possible, doesn't
    that mean String.equals(...) should have a parameter for locale? Also,
    doesn't this mean that all Collections should take a Collator as an
    argument?
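
    For what it's worth, here is a small sketch of what I mean, using the JDK's
    java.text.Collator rather than ICU4J. The helper names are made up for this
    example, and I am assuming English collation rules treat case as a
    tertiary-level difference. String.equals(...) compares code units, while a
    Collator decides equality according to its locale and strength, and sorted
    collections such as TreeSet already accept one as a Comparator:

```java
import java.text.Collator;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class CollatorEquality {
    // Builds a collator that ignores case and accent differences
    // (PRIMARY strength) under the rules of the given locale.
    static Collator primaryCollator(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    // Collator implements Comparator, so a sorted set can use it to decide
    // which strings count as duplicates.
    static Set<String> distinctUnder(Collator collator, String... values) {
        Set<String> set = new TreeSet<String>(collator);
        for (String value : values) {
            set.add(value);
        }
        return set;
    }

    public static void main(String[] args) {
        Collator primary = primaryCollator(Locale.ENGLISH);

        // Code-unit comparison says the strings differ...
        System.out.println("abc".equals("ABC"));
        // ...while the collator treats them as equal at PRIMARY strength.
        System.out.println(primary.compare("abc", "ABC") == 0);
        // The set keeps only one of the two.
        System.out.println(distinctUnder(primary, "abc", "ABC").size());
    }
}
```

    Hash-based collections are a different matter, since they rely on
    equals(...) and hashCode(); I believe a CollationKey could serve as the
    map key there, but I have not tried it.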

    3) Does it make sense to have locale-agnostic case conversion? Currently I'm
    using ICU4J's Transliterator.getInstance("Any-Lower") and
    Transliterator.getInstance("Any-Upper"). Is this correct?
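
    To illustrate why I am asking, here is a sketch that uses the JDK's
    String.toLowerCase(Locale) instead of the ICU4J Transliterator (so it is
    not the code I am actually running; the class and method names are
    placeholders). The Turkish dotted and dotless "i" is the classic case
    where the locale changes the result:

```java
import java.util.Locale;

final class CaseMapping {
    // Locale-neutral lowercasing: Locale.ROOT applies only the default
    // Unicode case mappings, with no language-specific tailoring.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Locale-sensitive lowercasing for the given locale.
    static String lowerIn(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Default mapping: "I" becomes "i".
        System.out.println(lowerRoot("TITLE"));
        // Turkish mapping: "I" becomes dotless "\u0131".
        System.out.println(lowerIn("TITLE", turkish));
    }
}
```

    So a locale-agnostic conversion looks reasonable for things like protocol
    keywords or identifiers, while text shown to users probably needs the
    locale-aware variant.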

    Thanks for any help you can provide!

    Matt Chu
