Re: Unicode and Java Questions

From: Matt Chu (matt.chu@gmail.com)
Date: Thu Oct 02 2008 - 18:37:45 CDT


    Hey everybody,

    Thanks so much for all your answers, it's given me a lot to consider. Let me
    make sure I understand correctly:

    1) There DOES exist language-dependent string equivalence, as well as Java's
    built-in language-independent string equivalence. That is, the following
    situation exists:

    x = "\uXXXX";
    y = "\uYYYY";
    if (locale == A) then x == y else x != y

    2) Given that (1) is true and .equals changes based on locale, then doesn't
    that mean I have to override .hashCode in order to maintain the Java
    equals/hashcode contract (i.e. make sure my Collections don't break)? That
    is, I want _something_ like the following:

    Map<String, Boolean> map = new HashMap<String, Boolean>(Locale.GERMAN);
    map.put("STRASSE", true);
    map.put("STRAßE", true);
    System.out.println("size = " + map.size()); // I want this to print ONE, not two

    Map<String, Boolean> map = new HashMap<String, Boolean>(Locale.ENGLISH);
    map.put("STRASSE", true);
    map.put("STRAßE", true);
    System.out.println("size = " + map.size()); // I want this to print TWO, not one
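A minimal sketch of one way to get map behavior along these lines, assuming a sorted map is acceptable: a TreeMap decides key identity through its comparator, so passing a java.text.Collator sidesteps the equals/hashCode contract instead of overriding it. The class and method names here are hypothetical. Whether "STRASSE" and "STRAßE" collapse into one key depends on the collation rules for the chosen locale and strength, so the example below only relies on case differences being ignored at PRIMARY strength.

```java
import java.text.Collator;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class CollatorKeyedMap {
    // Builds a map whose notion of "same key" comes from a Collator
    // rather than from String.equals()/hashCode().
    static Map<String, Boolean> newMap(Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        // Collator implements Comparator<Object>, so it can order String keys.
        return new TreeMap<String, Boolean>(collator);
    }

    public static void main(String[] args) {
        // PRIMARY strength ignores case differences, so both puts hit one key.
        Map<String, Boolean> loose = newMap(Locale.GERMAN, Collator.PRIMARY);
        loose.put("Strasse", true);
        loose.put("STRASSE", true);
        System.out.println("size = " + loose.size());

        // IDENTICAL strength keeps the two spellings apart.
        Map<String, Boolean> strict = newMap(Locale.GERMAN, Collator.IDENTICAL);
        strict.put("Strasse", true);
        strict.put("STRASSE", true);
        System.out.println("size = " + strict.size());
    }
}
```

The trade-off is that lookups cost a collation comparison per tree level instead of a hash, which matters for large maps.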

    3) So I know that there exist some values locale1, locale2, and s such
    that:

    Locale locale1 = ...;
    Locale locale2 = ...;
    String s = "...";
    s.toLowerCase(locale1) != s.toLowerCase(locale2)

    is true.

    And I know that .toLowerCase()/.toUpperCase() is inherently
    language-dependent, where the locale is inferred from the JVM/environment.

    I'm trying to ask if *language-independent* case *conversions* (not
    case-folding) exist. That is:

    s.toLowerCase(Locale.NULL)

    or something like that. I guess I'm not sure how to use the algorithms
    for case-folding with case conversion, and whether or not it's even
    appropriate. If case conversion is not appropriate, would I be correct
    that the right way to do it is to wrap the string in ICU4J's
    CaseInsensitiveString class?

    Also, I'm on JDK5, so I don't have Locale.ROOT, but I don't fully understand
    what new Locale("") does in toUpperCase/toLowerCase; is this the
    language-independent case conversion I'm looking for?
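A small sketch of the difference between a language-specific and a root-locale case conversion, using Turkish as the language with tailored mappings. The class and helper names are hypothetical. The code uses Locale.ROOT and Locale.forLanguageTag, which need JDK6 and JDK7 respectively; on JDK5 the equivalents would be new Locale("") and new Locale("tr").

```java
import java.util.Locale;

class CaseConversionDemo {
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Root locale: plain Unicode default mappings, "I" lowercases to "i".
        System.out.println(lower("TITLE", Locale.ROOT));

        // Turkish: "I" lowercases to dotless i (U+0131).
        System.out.println(lower("TITLE", turkish));

        // Turkish: "i" uppercases to capital I with dot above (U+0130).
        System.out.println(upper("title", turkish));
    }
}
```

Passing the root locale explicitly also protects against the JVM default locale changing between machines.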

    I hope my questions are clear, thanks for everybody's help.

    Matt

    On Thu, Oct 2, 2008 at 2:58 PM, Naoto Sato <Naoto.Sato@sun.com> wrote:

    > As of JDK6, Locale class has Locale.ROOT constant, so that you don't have
    > to new Locale("") each time for locale independent operation.
    >
    > Naoto
    >
    >
    > Phillips, Addison wrote:
    >
    >>
    >> 1) I want to standardize on a normalization form, but this sentence in
    >> Annex 15 (Unicode Normalization Forms) gave me pause:
    >>
    >> "Normalization Forms KC and KD must not be blindly applied to arbitrary
    >> text. Because they erase many formatting distinctions, they will prevent
    >> round-trip conversion to and from many legacy character sets, and unless
    >> supplanted by formatting markup, they may remove distinctions that are
    >> important to the semantics of the text."
    >>
    >>
    >>
    >> If you are going to standardize on a normalization form, you should
    >> standardize on NFC. You should know that there are cases in which NFC alters
    >> data in ways incompatible with the best usage for certain (relatively rare,
    >> minority) languages. But, generally, NFC is safe.
    >>
    >>
    >> NFKC is NOT safe. It alters data in a variety of ways. It can be very
    >> useful in situations in which you mean to eliminate any ambiguity—namespaces
    >> are a good example—but you cannot apply it blindly. Legacy encodings are
    >> just one example.
    >>
    >>
    >>
    >> For example, suppose that I use NFKC on text that has both halfwidth "ｶ"
    >> and fullwidth "カ". Thanks to NFKC, both are now converted to fullwidth "カ"
    >> (let's say). When I want to convert back to a Japanese-specific encoding, we
    >> no longer know which ones are halfwidth and which ones are fullwidth. The
    >> question is, how big of a deal is this in real-world, normal usage?
    >>
    >>
    >> That might not be a big deal (although it also can be), but other KC
    >> normalizations deeply alter the text. For example, a circled digit becomes
    >> just a number. Or vulgar fractions like ½ become a sequence (1/2, so
    >> Plan 9's old windowing system, 8½, would be 81/2 :-) ). And so on and so forth. You
    >> should NEVER apply NFKC to data blindly. It's a big deal.
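The NFC/NFKC contrast described above can be checked with a short sketch. It assumes java.text.Normalizer, which only exists from JDK6 onward (on JDK5, ICU4J's Normalizer would be the likely substitute). The class and helper names are hypothetical, and results are printed as code points so they do not depend on console encoding.

```java
import java.text.Normalizer;

class NormalizationDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Renders a string as a space-separated list of U+XXXX code points.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i = s.offsetByCodePoints(i, 1)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", s.codePointAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Digit 8 plus vulgar fraction one half, circled digit one,
        // and halfwidth katakana KA.
        String[] samples = { "8\u00BD", "\u2460", "\uFF76" };
        for (String s : samples) {
            System.out.println(hex(s)
                    + " | NFC: " + hex(nfc(s))
                    + " | NFKC: " + hex(nfkc(s)));
        }
    }
}
```

NFC leaves all three samples untouched, while NFKC rewrites each of them, which is the loss of distinction the text warns about.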
    >>
    >>
    >>
    >> 2) Can string equivalence be both locale-agnostic and locale-sensitive?
    >> That is, can two code points be equal in some language and not equal in
    >> another; I am assuming some normalization form here. If this is possible,
    >> doesn't that mean String.equals(...) should have a parameter for locale?
    >>
    >>
    >> It helps to read the Javadoc here. String.equals() is about
    >> code-point-by-code-point comparison. Any two code points, considered in
    >> isolation, will always be equal in all locales. In addition, any two strings
    >> that contain the same code points in the same order are equal, regardless of
    >> locale (sensitivity or not), even in Collator. But String's comparisons are
    >> purely at the code point level.
    >>
    >>
    >> What Collator can do is compare strings as equal for a given weight that
    >> are NOT equivalent code point sequences. For example, case differences or
    >> accent differences might be ignored (at some weighting level), so you might
    >> consider strings such as Muenchen/München or STRASSE/straße as equivalent.
    >>
    >>
    >> Also, doesn't this mean that all Collections should take a Collator as an
    >> argument?
    >>
    >>
    >> No. Sometimes you'll want a strict code-point based comparison. It depends
    >> on what you're doing with a Collection as to whether using a Collator is a
    >> good idea. Collators are expensive compared to String's comparators, so if
    >> your code's main purpose is merely to do things in **some** deterministic
    >> order (but not necessarily for presentation to users), .compareTo() or
    >> .compareToIgnoreCase() may very well be good enough. If you are, by
    >> contrast, sorting someone's address book, yes, you'll need a Collator.
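To illustrate the difference between a merely deterministic order and a presentation order, here is a hedged sketch that sorts the same list twice: once with String's natural ordering (UTF-16 code unit order) and once with a Collator. The class and method names are hypothetical, and Locale.ENGLISH is only an example locale.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class SortOrderDemo {
    // Cheap and deterministic, but uppercase letters sort before lowercase.
    static List<String> byCodeUnit(List<String> input) {
        List<String> copy = new ArrayList<String>(input);
        Collections.sort(copy);
        return copy;
    }

    // More expensive, but follows the alphabetical order users expect.
    static List<String> byCollator(List<String> input, Locale locale) {
        List<String> copy = new ArrayList<String>(input);
        Collections.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Zebra", "apple", "Mango");
        System.out.println(byCodeUnit(names));                 // [Mango, Zebra, apple]
        System.out.println(byCollator(names, Locale.ENGLISH)); // [apple, Mango, Zebra]
    }
}
```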
    >>
    >> 3) Does it make sense to have locale-agnostic case conversion? Currently
    >> I'm using ICU4J's Transliterator.getInstance("Any-Lower") and
    >> Transliterator.getInstance("Any-Upper"). Is this correct?
    >>
    >>
    >> "It depends"
    >>
    >>
    >> Using Transliterator is probably overkill for most case insensitive
    >> comparisons. There are equalsIgnoreCase() methods right in String that use
    >> default case folding. Non-default case folding is very important in some
    >> locales (notably Turkic languages, Lithuanian, and a few others; see
    >> SpecialCasing.txt in the UCD). But for many programmatic operations, you do
    >> not want locale-sensitive case folding. It depends on why you are doing the
    >> case folding. Is it for a language specific presentation? Then, probably,
    >> you want to use the proper folding. Even then, I would REALLY question using
    >> ICU4J. I mean, isn't String's toUpperCase(#locale) good enough for you?
    >>
    >>
    >> Now, turning it around for a second, you definitely should NEVER use
    >> String.toUpperCase() or String.toLowerCase() without passing a locale
    >> argument (new Locale("","") is a good locale to use for default behavior).
    >> These methods use the system default locale. If you expect a
    >> locale-insensitive operation to follow, you'll have peculiar code failures
    >> in locales such as Turkish, where dotless/dotted "i" exists.
    >>
    >>
    >>
    >> Addison
    >>
    >>
    >>
    >> Addison Phillips
    >>
    >> Globalization Architect -- Lab126
    >>
    >> Chair -- W3C Internationalization Core WG
    >>
    >>
    >> Internationalization is not a feature.
    >>
    >> It is an architecture.
    >>
    >>
    >>
    >>
    >
    >
    > --
    > Naoto Sato
    >


