Re: Unicode and Java Questions

From: Matt Chu (matt.chu@gmail.com)
Date: Thu Oct 02 2008 - 18:37:45 CDT

Next message: Phillips, Addison: "RE: Unicode and Java Questions"

Previous message: Naoto Sato: "Re: Unicode and Java Questions"
In reply to: Naoto Sato: "Re: Unicode and Java Questions"
Next in thread: Phillips, Addison: "RE: Unicode and Java Questions"
Reply: Phillips, Addison: "RE: Unicode and Java Questions"
Reply: John W Kennedy: "Re: Unicode and Java Questions"

Hey everybody,

Thanks so much for all your answers; they've given me a lot to consider. Let
me make sure I understand correctly:

1) There DOES exist language-dependent string equivalence, as well as Java's
built-in language-independent string equivalence. That is, the following
situation exists:

x = "\uXXXX";
y = "\uYYYY";
if (locale == A) then x == y else x != y
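
For reference, a minimal sketch of how the two kinds of equivalence look in plain JDK code, assuming java.text.Collator is the locale-sensitive side. The class and method names are made up for illustration. The example only relies on case differences being ignored at PRIMARY strength; which other strings compare equal depends on the collation rules for the chosen locale.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper: contrasts String.equals() with Collator-based equivalence.
class CollatorEquivalenceSketch {

    // True when a and b compare as equal under the locale's collation
    // rules at the given strength (Collator.PRIMARY, SECONDARY, TERTIARY).
    static boolean equivalent(String a, String b, Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String x = "strasse";
        String y = "STRASSE";

        // Language-independent: compares the character sequences only.
        System.out.println(x.equals(y)); // false

        // Locale-sensitive: case is a tertiary difference, so it is
        // ignored at PRIMARY strength and significant at TERTIARY.
        System.out.println(equivalent(x, y, Locale.GERMAN, Collator.PRIMARY));  // true
        System.out.println(equivalent(x, y, Locale.GERMAN, Collator.TERTIARY)); // false
    }
}
```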

2) Given that (1) is true and .equals changes based on locale, then doesn't
that mean I have to override .hashCode in order to maintain the Java
equals/hashcode contract (i.e. make sure my Collections don't break)? That
is, I want _something_ like the following:

Map<String, Boolean> map = new HashMap<String, Boolean>(Locale.GERMAN);
map.put("STRASSE", true);
map.put("STRAßE", true);
System.out.println("size = " + map.size()); // I want this to print ONE, not two

Map<String, Boolean> map = new HashMap<String, Boolean>(Locale.ENGLISH);
map.put("STRASSE", true);
map.put("STRAßE", true);
System.out.println("size = " + map.size()); // I want this to print TWO, not one
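
HashMap has no constructor that takes a Locale, so the snippets above are wishful. One way to get close with the JDK alone, sketched here as an assumption rather than a recommendation, is to key the map on java.text.CollationKey objects. A CollationKey's equals() and hashCode() are derived from the key the collator generates, so the equals/hashCode contract holds without overriding anything on String. The class name and the distinctKeys helper are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: a HashMap whose notion of "same key" comes from a Collator.
class CollationKeyMapSketch {

    // Puts every value into a map keyed by its collation key and
    // returns how many distinct keys remain.
    static int distinctKeys(Locale locale, int strength, String... values) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        Map<CollationKey, Boolean> map = new HashMap<CollationKey, Boolean>();
        for (String value : values) {
            map.put(collator.getCollationKey(value), Boolean.TRUE);
        }
        return map.size();
    }

    public static void main(String[] args) {
        // Case is ignored at PRIMARY strength, so these collapse to one key.
        System.out.println(distinctKeys(Locale.GERMAN, Collator.PRIMARY, "strasse", "STRASSE"));
        // At TERTIARY strength case matters, so two keys remain.
        System.out.println(distinctKeys(Locale.GERMAN, Collator.TERTIARY, "strasse", "STRASSE"));
        // Whether "STRASSE" and "STRA" + U+00DF + "E" collapse depends on the
        // collation rules shipped with the runtime, so no result is claimed here.
        System.out.println(distinctKeys(Locale.GERMAN, Collator.PRIMARY, "STRASSE", "STRA\u00DFE"));
    }
}
```

Collation keys from different Collators are not comparable with each other, so a map like this should stick to a single collator for its whole lifetime.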

3) So I know that there exist some values locale1, locale2, and s such
that:

Locale locale1 = ...;
Locale locale2 = ...;
String s = "...";
!s.toLowerCase(locale1).equals(s.toLowerCase(locale2))

is true.

And I know that .toLowerCase()/.toUpperCase() are inherently
language-dependent, where the locale is inferred from the JVM/environment.

I'm trying to ask if *language-independent* case *conversions* (not
case-folding) exist. That is:

s.toLowerCase(Locale.NULL)

or something like that. I guess I'm not sure how to use the case-folding
algorithms for case conversion, or whether that's even appropriate. If case
conversion is not appropriate, would I be correct that the right way to do
it is to wrap the string in ICU4J's CaseInsensitiveString class?
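
To make the wrapper idea concrete, here is a rough JDK-only sketch. It is NOT ICU4J's CaseInsensitiveString and it does not implement real Unicode case folding: it approximates folding by upper-casing and then lower-casing with the root locale, which misses some characters that true folding handles. FoldedKey and its distinct helper are hypothetical names. Locale.ROOT needs JDK6; on JDK5 the assumption is that new Locale("") stands in for it.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical wrapper: equals() and hashCode() both use the same folded
// form, so the equals/hashCode contract holds inside hash-based collections.
final class FoldedKey {
    private final String original;
    private final String folded;

    FoldedKey(String original) {
        this.original = original;
        // Approximation of case folding (an assumption, not the Unicode
        // algorithm): upper-case first so that sharp s expands to "SS",
        // then lower-case, both with the locale-independent root locale.
        this.folded = original.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof FoldedKey
                && folded.equals(((FoldedKey) other).folded);
    }

    @Override
    public int hashCode() {
        return folded.hashCode();
    }

    @Override
    public String toString() {
        return original;
    }

    // Returns how many distinct keys the given strings produce.
    static int distinct(String... values) {
        Map<FoldedKey, Boolean> map = new HashMap<FoldedKey, Boolean>();
        for (String value : values) {
            map.put(new FoldedKey(value), Boolean.TRUE);
        }
        return map.size();
    }

    public static void main(String[] args) {
        // "STRASSE" and "STRA" + U+00DF + "E" share the folded form "strasse".
        System.out.println(distinct("STRASSE", "STRA\u00DFE")); // 1
        System.out.println(distinct("Title", "TITLE", "title")); // 1
    }
}
```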

Also, I'm on JDK5, so I don't have Locale.ROOT, but I don't fully understand
what new Locale("") does in toUpperCase/toLowerCase; is this the
language-independent case conversion I'm looking for?
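
A small sketch of what the locale argument changes, assuming a modern JDK so that Locale.ROOT and Locale.forLanguageTag are available (on JDK5 the equivalents would be new Locale("") and new Locale("tr")). The class and method names are made up. With the root locale only the default, language-neutral case mappings apply; with Turkish, capital "I" lower-cases to dotless "ı" (U+0131).

```java
import java.util.Locale;

// Hypothetical helper: compares root-locale and language-specific lower-casing.
class RootLocaleCaseSketch {

    // Lower-case using only the language-neutral default mappings.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Lower-case using the rules of the given language, e.g. "tr".
    static String lowerIn(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    // new Locale("") has empty language, country and variant, which is
    // exactly what Locale.ROOT is.
    @SuppressWarnings("deprecation")
    static boolean emptyLocaleIsRoot() {
        return new Locale("").equals(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(lowerRoot("TITLE"));      // title
        System.out.println(lowerIn("TITLE", "tr"));  // t, U+0131 (dotless i), tle
        System.out.println(emptyLocaleIsRoot());     // true
    }
}
```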

I hope my questions are clear. Thanks for everybody's help.

Matt

On Thu, Oct 2, 2008 at 2:58 PM, Naoto Sato <Naoto.Sato@sun.com> wrote:

> As of JDK6, Locale class has Locale.ROOT constant, so that you don't have
> to new Locale("") each time for locale independent operation.
>
> Naoto
>
>
> Phillips, Addison wrote:
>
>>
>> 1) I want to standardize on a normalization form, but this sentence in
>> Annex 15 (Unicode Normalization Forms) gave me pause:
>>
>> "Normalization Forms KC and KD must not be blindly applied to arbitrary
>> text. Because they erase many formatting distinctions, they will prevent
>> round-trip conversion to and from many legacy character sets, and unless
>> supplanted by formatting markup, they may remove distinctions that are
>> important to the semantics of the text."
>>
>>
>>
>> If you are going to standardize on a normalization form, you should
>> standardize on NFC. You should know that there are cases in which NFC alters
>> data in ways incompatible with the best usage for certain (relatively rare,
>> minority) languages. But, generally, NFC is safe.
>>
>>
>> NFKC is NOT safe. It alters data in a variety of ways. It can be very
>> useful in situations in which you mean to eliminate any ambiguity (namespaces
>> are a good example), but you cannot apply it blindly. Legacy encodings are
>> just one example.
>>
>>
>>
>> For example, suppose that I use NFKC on text that has both halfwidth "ｶ"
>> and fullwidth "カ". Thanks to NFKC, both are now converted to fullwidth "カ"
>> (let's say). When I want to convert back to a Japanese-specific encoding, we
>> no longer know which ones are halfwidth and which ones are fullwidth. The
>> question is, how big of a deal is this in real-world, normal usage?
>>
>>
>> That might not be a big deal (although it also can be), but other KC
>> normalizations deeply alter the text. For example, a circled digit becomes
>> just a number. Or vulgar fractions like ½ become a sequence (1/2, so
>> Plan 9's old windowing system, 8½, would become 81/2 :-) ). And so on and so
>> forth. You should NEVER apply NFKC to data blindly. It's a big deal.
>>
>>
>>
>> 2) Can string equivalence be both locale-agnostic and locale-sensitive?
>> That is, can two code points be equal in some language and not equal in
>> another; I am assuming some normalization form here. If this is possible,
>> doesn't that mean String.equals(...) should have a parameter for locale?
>>
>>
>> It helps to read the Javadoc here. String.equals() is about
>> code-point-by-code-point comparison. Any two code points, considered in
>> isolation, will always be equal in all locales. In addition, any two strings
>> that contain the same code points in the same order are equal, regardless of
>> locale (sensitivity or not), even in Collator. But String's comparisons are
>> purely at the code point level.
>>
>>
>> What Collator can do is compare strings as equal for a given weight that
>> are NOT equivalent code point sequences. For example, case differences or
>> accent differences might be ignored (at some weighting level), so you might
>> consider strings such as Muenchen/München or STRASSE/straße as equivalent.
>>
>>
>> Also, doesn't this mean that all Collections should take a Collator as an
>> argument?
>>
>>
>> No. Sometimes you'll want a strict code-point based comparison. It depends
>> on what you're doing with a Collection as to whether using a Collator is a
>> good idea. Collators are expensive compared to String's comparators, so if
>> your code's main purpose is merely to do things in **some** deterministic
>> order (but not necessarily for presentation to users), .compareTo() or
>> .compareToIgnoreCase() may very well be good enough. If you are, by
>> contrast, sorting someone's address book, yes, you'll need a Collator.
>>
>> 3) Does it make sense to have locale-agnostic case conversion? Currently
>> I'm using ICU4J's Transliterator.getInstance("Any-Lower") and
>> Transliterator.getInstance("Any-Upper"). Is this correct?
>>
>>
>> "It depends"
>>
>>
>> Using Transliterator is probably overkill for most case-insensitive
>> comparisons. There are equalsIgnoreCase() methods right in String that use
>> default case folding. Non-default case folding is very important in some
>> locales (notably Turkic languages, Lithuanian, and a few others; see
>> SpecialCasing.txt in the UCD). But for many programmatic operations, you do
>> not want locale-sensitive case folding. It depends on why you are doing the
>> case folding. Is it for a language-specific presentation? Then, probably,
>> you want to use the proper folding. Even then, I would REALLY question using
>> ICU4J. I mean, isn't String's toUpperCase(#locale) good enough for you?
>>
>>
>> Now, turning it around for a second, you definitely should NEVER use
>> String.toUpperCase() or String.toLowerCase() without passing a locale
>> argument (new Locale("","") is a good locale to use for default behavior).
>> These methods use the system default locale. If you expect a
>> locale-insensitive operation to follow, you'll have peculiar code failures
>> in locales such as Turkish, where dotless/dotted "i" exists.
>>
>>
>>
>> Addison
>>
>>
>>
>> Addison Phillips
>>
>> Globalization Architect -- Lab126
>>
>> Chair -- W3C Internationalization Core WG
>>
>>
>> Internationalization is not a feature.
>>
>> It is an architecture.
>>
>>
>>
>>
>
>
> --
> Naoto Sato
>
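
On the normalization caveats quoted above, the difference between NFC and NFKC is easy to see with java.text.Normalizer. This is only a sketch: Normalizer was added in JDK6, so it assumes a newer runtime than JDK5, and the class and method names are made up. NFC composes "e" plus a combining acute accent but leaves the compatibility characters alone, while NFKC rewrites the fraction, the circled digit and the halfwidth katakana.

```java
import java.text.Normalizer;

// Hypothetical helper: shows what NFC keeps and what NFKC rewrites.
class NormalizationFormsSketch {

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Prints each code point as U+XXXX so the result is readable on any console.
    static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.append(String.format("U+%04X ", cp));
            i += Character.charCount(cp);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String[] samples = {
            "e\u0301", // e + combining acute accent
            "\u00BD",  // vulgar fraction one half
            "\u2460",  // circled digit one
            "\uFF76"   // halfwidth katakana KA
        };
        for (String sample : samples) {
            System.out.println(codePoints(sample)
                    + " | NFC: " + codePoints(nfc(sample))
                    + " | NFKC: " + codePoints(nfkc(sample)));
        }
    }
}
```

After NFKC the fraction is the three characters "1", U+2044 (fraction slash) and "2", and nothing in the output records that it used to be a single character, which is why the round trip to a legacy encoding is lost.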


This archive was generated by hypermail 2.1.5 : Thu Oct 02 2008 - 18:43:05 CDT