Re: Unicode and Java Questions

From: Naoto Sato (Naoto.Sato@Sun.COM)
Date: Thu Oct 02 2008 - 13:58:26 CDT


    As of JDK 6, the Locale class has a Locale.ROOT constant, so you don't
    have to create a new Locale("") each time you need a locale-independent
    operation.
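
A minimal sketch of what that looks like, assuming JDK 6 or later. The class and method names (RootLocaleDemo, upper, rootEqualsEmptyLocale) are made up for illustration; Locale.ROOT and String.toUpperCase(Locale) are standard API.

```java
import java.util.Locale;

public class RootLocaleDemo {
    // Locale-independent upper-casing using the shared constant.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // Locale.ROOT has empty language, country, and variant, so it compares
    // equal to a locale built from the empty string.
    @SuppressWarnings("deprecation") // Locale constructors are deprecated in recent JDKs
    static boolean rootEqualsEmptyLocale() {
        return Locale.ROOT.equals(new Locale(""));
    }

    public static void main(String[] args) {
        System.out.println(upper("title"));          // TITLE
        System.out.println(rootEqualsEmptyLocale()); // true
    }
}
```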

    Naoto

    Phillips, Addison wrote:
    >
    > 1) I want to standardize on a normalization form, but this sentence in
    > Annex 15 (Unicode Normalization Forms) gave me pause:
    >
    > "Normalization Forms KC and KD must not be blindly applied to arbitrary
    > text. Because they erase many formatting distinctions, they will prevent
    > round-trip conversion to and from many legacy character sets, and unless
    > supplanted by formatting markup, they may remove distinctions that are
    > important to the semantics of the text."
    >
    >
    >
    >
    >
    > If you are going to standardize on a normalization form, you should
    > standardize on NFC. You should know that there are cases in which NFC
    > alters data in ways incompatible with the best usage for certain
    > (relatively rare, minority) languages. But, generally, NFC is safe.
    >
    >
    >
    > NFKC is NOT safe. It alters data in a variety of ways. It can be very
    > useful in situations in which you mean to eliminate any ambiguity
    > (namespaces are a good example), but you cannot apply it blindly.
    > Legacy encodings are just one example.
    >
    >
    >
    > For example, suppose that I use NFKC on text that has both halfwidth "ｶ"
    > and fullwidth "カ". Thanks to NFKC, both are now converted to fullwidth
    > "カ" (let's say). When I want to convert back to a Japanese-specific
    > encoding, we no longer know which ones are halfwidth and which ones are
    > fullwidth. The question is, how big of a deal is this in real-world,
    > normal usage?
    >
    >
    >
    > That might not be a big deal (although it also can be), but other KC
    > normalizations deeply alter the text. For example, a circled digit
    > becomes just a number. Or the vulgar fractions like ½ become a sequence
    > (1/2, so Plan9’s old windowing system, 8½, would be 81/2 :-). And so on
    > and so forth. You should NEVER apply NFKC to data blindly. It’s a big
    > deal.
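
The difference between NFC and NFKC described above can be checked with java.text.Normalizer (available since JDK 6). This is only a sketch: the class name NormalizationDemo and the helper names nfc/nfkc are invented for the example, and the sample characters are the ones mentioned in the discussion (halfwidth katakana KA, a circled digit, the fraction one half).

```java
import java.text.Normalizer;

public class NormalizationDemo {
    static String nfc(String s)  { return Normalizer.normalize(s, Normalizer.Form.NFC); }
    static String nfkc(String s) { return Normalizer.normalize(s, Normalizer.Form.NFKC); }

    public static void main(String[] args) {
        String halfwidthKa = "\uFF76";   // U+FF76 HALFWIDTH KATAKANA LETTER KA
        String circledOne  = "\u2460";   // U+2460 CIRCLED DIGIT ONE
        String eightHalf   = "8\u00BD";  // "8" followed by U+00BD VULGAR FRACTION ONE HALF
        String decomposedE = "e\u0301";  // "e" followed by U+0301 COMBINING ACUTE ACCENT

        // NFC composes canonical equivalents and leaves compatibility characters alone.
        System.out.println(nfc(decomposedE).equals("\u00E9"));    // true
        System.out.println(nfc(halfwidthKa).equals(halfwidthKa)); // true

        // NFKC also folds compatibility characters, which loses information.
        System.out.println(nfkc(halfwidthKa).equals("\u30AB"));   // true (ordinary katakana KA)
        System.out.println(nfkc(circledOne));                     // 1
        // The fraction becomes "1", U+2044 FRACTION SLASH, "2".
        System.out.println(nfkc(eightHalf).equals("81" + "\u2044" + "2")); // true
    }
}
```

After NFKC there is no way to tell which KA was originally halfwidth, or that "81⁄2" was ever "8½", which is the round-trip problem the Annex 15 quote warns about.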
    >
    >
    >
    > 2) Can string equivalence be both locale-agnostic and locale-sensitive?
    > That is, can two code points be equal in some language and not equal in
    > another; I am assuming some normalization form here. If this is
    > possible, doesn't that mean String.equals(...) should have a parameter
    > for locale?
    >
    >
    >
    > It helps to read the Javadoc here. String.equals() is a
    > code-point-by-code-point comparison. A given code point, considered in
    > isolation, is equal to itself in all locales. In addition, any two
    > strings that contain the same code points in the same order are equal
    > regardless of locale, even in Collator. But String’s comparisons are
    > purely at the code point level.
    >
    >
    >
    > What Collator can do is treat strings as equal at a given strength
    > even when they are NOT equivalent code point sequences. For example,
    > case differences or accent differences might be ignored (at some
    > strength level), so you might consider strings such as
    > Muenchen/München or STRASSE/straße as equivalent.
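
A rough illustration of strength levels with java.text.Collator. The class name CollatorStrengthDemo and the helper collator() are hypothetical. Whether pairs such as Muenchen/München compare equal depends on the collation rules in use, so this sketch sticks to a plain case and accent difference.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrengthDemo {
    // Build a collator for the given locale at the requested strength.
    static Collator collator(Locale locale, int strength) {
        Collator c = Collator.getInstance(locale);
        c.setStrength(strength);
        // Decompose precomposed letters so that accents are weighed separately.
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return c;
    }

    public static void main(String[] args) {
        String plain    = "resume";
        String accented = "R\u00E9sum\u00E9"; // capital R, e with acute accent (U+00E9)

        Collator primary  = collator(Locale.ENGLISH, Collator.PRIMARY);
        Collator tertiary = collator(Locale.ENGLISH, Collator.TERTIARY);

        System.out.println(plain.equals(accented));           // false: different code points
        System.out.println(primary.equals(plain, accented));  // true: case and accents ignored
        System.out.println(tertiary.equals(plain, accented)); // false: case and accents count
    }
}
```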
    >
    >
    >
    > Also, doesn't this mean that all Collections should take a Collator as
    > an argument?
    >
    >
    >
    > No. Sometimes you’ll want a strict code-point based comparison. It
    > depends on what you’re doing with a Collection as to whether using a
    > Collator is a good idea. Collators are expensive compared to String’s
    > comparators, so if your code’s main purpose is merely to do things in
    > **some** deterministic order (but not necessarily for presentation to
    > users), .compareTo() or .compareToIgnoreCase() may very well be good
    > enough. If you are, by contrast, sorting someone’s address book, yes,
    > you’ll need a Collator.
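
To make the trade-off concrete, here is a small sketch that sorts the same list both ways. SortOrderDemo and its method names are invented for the example; Collections.sort and Collator.getInstance are standard API.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class SortOrderDemo {
    // Cheap, deterministic order: String.compareTo compares UTF-16 code unit values.
    static List<String> codeUnitOrder(List<String> in) {
        List<String> out = new ArrayList<String>(in);
        Collections.sort(out);
        return out;
    }

    // Order intended for presentation to users of the given locale.
    static List<String> collatedOrder(List<String> in, Locale locale) {
        List<String> out = new ArrayList<String>(in);
        Collections.sort(out, Collator.getInstance(locale));
        return out;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("apple", "Zoo", "banana");
        System.out.println(codeUnitOrder(names));                 // [Zoo, apple, banana]
        System.out.println(collatedOrder(names, Locale.ENGLISH)); // [apple, banana, Zoo]
    }
}
```

The first order is fine for something like a stable key order in a map; the second is what a person reading an address book would expect.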
    >
    > 3) Does it make sense to have locale-agnostic case conversion? Currently
    > I'm using ICU4J's Transliterator.getInstance("Any-Lower") and
    > Transliterator.getInstance("Any-Upper"). Is this correct?
    >
    >
    >
    > “It depends”
    >
    >
    >
    > Using Transliterator is probably overkill for most case-insensitive
    > comparisons. There are equalsIgnoreCase() methods right in String that
    > use default case folding. Non-default case folding is very important
    > in some locales (notably Turkic languages, Lithuanian, and a few
    > others; see SpecialCasing.txt in the UCD). But for many programmatic
    > operations, you do not want locale-sensitive case folding. It depends
    > on why you are doing the case folding. Is it for a language-specific
    > presentation? Then, probably, you want to use the proper folding. Even
    > then, I would REALLY question using ICU4J. I mean, isn’t String’s
    > toUpperCase(Locale) good enough for you?
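
A short sketch of the two String-level options mentioned here, with no ICU4J involved. The class name CaseCompareDemo and its helpers are hypothetical. Note that equalsIgnoreCase() works character by character, so it does not treat "ß" and "SS" as the same, while toUpperCase(Locale) applies the full case mapping.

```java
import java.util.Locale;

public class CaseCompareDemo {
    // Locale-independent, character-by-character case-insensitive comparison.
    static boolean sameIgnoringCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    // Upper-casing for presentation in an explicitly chosen locale.
    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        String strasse = "stra\u00DFe"; // contains U+00DF LATIN SMALL LETTER SHARP S

        System.out.println(sameIgnoringCase("TITLE", "title")); // true
        System.out.println(sameIgnoringCase("STRASSE", strasse)); // false: lengths differ
        System.out.println(upper(strasse, Locale.GERMAN));        // STRASSE
    }
}
```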
    >
    >
    >
    > Now, turning it around for a second, you definitely should NEVER use
    > String.toUpperCase() or String.toLowerCase() without passing a locale
    > argument (new Locale("", "") is a good locale to use for default
    > behavior). Without an argument, these methods use the system default
    > locale. If you expect a locale-insensitive operation to follow, you’ll
    > have peculiar code failures in locales such as Turkish, where both
    > dotless and dotted “i” exist.
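
The Turkish failure mode is easy to reproduce by passing the locale explicitly instead of changing the system default. This is a sketch only: TurkishCaseDemo is a made-up name, and Locale.forLanguageTag assumes JDK 7 or later (on JDK 6 one would write new Locale("tr") instead).

```java
import java.util.Locale;

public class TurkishCaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String upper(String s, Locale locale) { return s.toUpperCase(locale); }
    static String lower(String s, Locale locale) { return s.toLowerCase(locale); }

    public static void main(String[] args) {
        // With the root (empty) locale the ASCII letters map as most code expects.
        System.out.println(upper("title", Locale.ROOT)); // TITLE

        // In Turkish, "i" upper-cases to U+0130 (capital I with dot above)...
        System.out.println(upper("title", TURKISH).equals("T\u0130TLE")); // true
        // ...and "I" lower-cases to U+0131 (dotless i).
        System.out.println(lower("TITLE", TURKISH).equals("t\u0131tle")); // true

        // So a keyword check like this fails if the Turkish locale was used.
        System.out.println(upper("title", TURKISH).equals("TITLE")); // false
    }
}
```

The no-argument toUpperCase() behaves like the TURKISH calls above whenever the JVM's default locale happens to be Turkish, which is why the failure usually shows up only on a customer's machine.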
    >
    >
    >
    > Addison
    >
    >
    >
    >
    >
    > Addison Phillips
    >
    > Globalization Architect -- Lab126
    >
    > Chair -- W3C Internationalization Core WG
    >
    >
    >
    > Internationalization is not a feature.
    >
    > It is an architecture.
    >
    >
    >
    >
    >

    -- 
    Naoto Sato
    

