Re: Unicode and Java Questions

From: Naoto Sato ([email protected])
Date: Thu Oct 02 2008 - 13:58:26 CDT

Next message: Matt Chu: "Re: Unicode and Java Questions"

Previous message: Mike: "Re: Unicode and Java Questions"
In reply to: Phillips, Addison: "RE: Unicode and Java Questions"
Next in thread: Matt Chu: "Re: Unicode and Java Questions"
Reply: Matt Chu: "Re: Unicode and Java Questions"

As of JDK 6, the Locale class has a Locale.ROOT constant, so you don't
have to create a new Locale("") each time you need a locale-independent
operation.
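
A minimal sketch of how that looks in code. The class and helper names
are made up for illustration; only Locale.ROOT and the Locale
constructor come from the JDK.

```java
import java.util.Locale;

final class RootLocaleDemo {
    // Hypothetical helper: lower-cases a protocol-style key without
    // consulting the JVM's default locale.
    static String normalizeKey(String key) {
        return key.toLowerCase(Locale.ROOT);
    }

    // Locale.ROOT has empty language, country and variant, so it equals
    // a Locale built from the empty string.
    @SuppressWarnings("deprecation") // the constructor is deprecated in recent JDKs
    static boolean rootMatchesEmptyLocale() {
        return Locale.ROOT.equals(new Locale(""));
    }

    public static void main(String[] args) {
        System.out.println(rootMatchesEmptyLocale());
        System.out.println(normalizeKey("CONTENT-TYPE"));
    }
}
```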

Naoto

Phillips, Addison wrote:
>
> 1) I want to standardize on a normalization form, but this sentence in
> Annex 15 (Unicode Normalization Forms) gave me pause:
>
> "Normalization Forms KC and KD must not be blindly applied to arbitrary
> text. Because they erase many formatting distinctions, they will prevent
> round-trip conversion to and from many legacy character sets, and unless
> supplanted by formatting markup, they may remove distinctions that are
> important to the semantics of the text."
>
>
>
>
>
> If you are going to standardize on a normalization form, you should
> standardize on NFC. You should know that there are cases in which NFC
> alters data in ways incompatible with the best usage for certain
> (relatively rare, minority) languages. But, generally, NFC is safe.
>
>
>
> NFKC is NOT safe. It alters data in a variety of ways. It can be very
> useful in situations in which you mean to eliminate any ambiguity
> (namespaces are a good example), but you cannot apply it blindly.
> Round-tripping to legacy encodings is just one example of what it
> breaks.
>
>
>
> For example, suppose that I use NFKC on text that has both halfwidth "ｶ"
> and fullwidth "カ". Thanks to NFKC, both are now converted to fullwidth
> "カ" (let's say). When I want to convert back to a Japanese-specific
> encoding, we no longer know which ones are halfwidth and which ones are
> fullwidth. The question is, how big of a deal is this in real-world,
> normal usage?
>
>
>
> That might not be a big deal (although it also can be), but other KC
> normalizations deeply alter the text. For example, a circled digit
> becomes just a number. Or the vulgar fractions like ½ become a sequence
> (digit 1, fraction slash, digit 2; so Plan 9’s old windowing system
> "8½" would be "81/2" :-). And so on and so forth. You should NEVER
> apply NFKC to data blindly. It’s a big deal.
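
The difference between NFC and NFKC described above can be checked with
java.text.Normalizer (available since JDK 6). This is only a sketch; the
class and method names are illustrative, and the sample strings are the
ones from the discussion.

```java
import java.text.Normalizer;

final class NfkcLossDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String halfwidthKa = "\uFF76"; // HALFWIDTH KATAKANA LETTER KA
        String circledOne = "\u2460";  // CIRCLED DIGIT ONE
        String eightAndHalf = "8\u00BD"; // digit 8 followed by VULGAR FRACTION ONE HALF

        // NFC leaves compatibility characters alone.
        System.out.println(nfc(halfwidthKa).equals(halfwidthKa));

        // NFKC folds them, and the original distinction cannot be recovered.
        System.out.println(nfkc(halfwidthKa).equals("\u30AB")); // fullwidth KA
        System.out.println(nfkc(circledOne));                   // plain digit
        System.out.println(nfkc(eightAndHalf));                 // 8, 1, FRACTION SLASH, 2
    }
}
```

After NFKC there is no way to tell which "カ" was originally halfwidth,
which is the round-trip problem the annex warns about.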
>
>
>
> 2) Can string equivalence be both locale-agnostic and locale-sensitive?
> That is, can two code points be equal in some language and not equal in
> another; I am assuming some normalization form here. If this is
> possible, doesn't that mean String.equals(...) should have a parameter
> for locale?
>
>
>
> It helps to read the Javadoc here. String.equals() is a
> code-point-by-code-point comparison. Any two code points, considered in
> isolation, compare the same way in every locale. In addition, any two
> strings that contain the same code points in the same order are equal
> whether or not the comparison is locale-sensitive, even in Collator.
> But String’s comparisons are purely at the code point level.
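
To illustrate: String.equals() sees a precomposed "é" and the sequence
"e" plus combining acute accent as different strings, in any locale,
until both are put into the same normalization form. A small sketch,
with an illustrative helper name:

```java
import java.text.Normalizer;

final class EqualsDemo {
    // Hypothetical helper: compares two strings after bringing both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        String decomposed = "e\u0301"; // e + COMBINING ACUTE ACCENT

        System.out.println(precomposed.equals(decomposed));           // different sequences
        System.out.println(canonicallyEqual(precomposed, decomposed)); // same after NFC
    }
}
```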
>
>
>
> What Collator can do is treat strings as equal at a given strength
> even when they are NOT identical code point sequences. For example,
> case differences or accent differences might be ignored (at some
> weighting level), so you might consider strings such as
> Muenchen/München or STRASSE/straße as equivalent.
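
A sketch of the strength idea with java.text.Collator, assuming the
JDK's English rules. Whether pairs like Muenchen/München come out equal
depends on the collation rules for the locale in use, so this example
sticks to plain case and accent differences. Names are illustrative.

```java
import java.text.Collator;
import java.util.Locale;

final class CollatorStrengthDemo {
    // Builds an English collator at the requested strength. Canonical
    // decomposition is switched on so accented letters are handled properly.
    static Collator collator(int strength) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        c.setStrength(strength);
        return c;
    }

    public static void main(String[] args) {
        Collator primary = collator(Collator.PRIMARY);
        Collator tertiary = collator(Collator.TERTIARY);

        // PRIMARY strength ignores case and accent differences.
        System.out.println(primary.compare("resume", "RESUME") == 0);
        System.out.println(primary.compare("resume", "r\u00E9sum\u00E9") == 0);

        // TERTIARY strength distinguishes them again.
        System.out.println(tertiary.compare("resume", "RESUME") == 0);
    }
}
```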
>
>
>
> Also, doesn't this mean that all Collections should take a Collator as
> an argument?
>
>
>
> No. Sometimes you’ll want a strict code-point based comparison. It
> depends on what you’re doing with a Collection as to whether using a
> Collator is a good idea. Collators are expensive compared to String’s
> comparators, so if your code’s main purpose is merely to do things in
> **some** deterministic order (but not necessarily for presentation to
> users), .compareTo() or .compareToIgnoreCase() may very well be good
> enough. If you are, by contrast, sorting someone’s address book, yes,
> you’ll need a Collator.
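
The two kinds of ordering can be put side by side. This is a sketch
only; the class and method names are invented, and the exact collated
order depends on the locale's rules.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class SortOrderDemo {
    // Cheap, deterministic order based on String.compareTo().
    static List<String> codeUnitOrder(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Order meant for showing to users; Collator is itself a Comparator.
    static List<String> collatedOrder(List<String> names, Locale locale) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = List.of("b", "a", "B", "A");
        System.out.println(codeUnitOrder(names));                 // capitals sort before lowercase
        System.out.println(collatedOrder(names, Locale.ENGLISH)); // a/A grouped before b/B
    }
}
```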
>
> 3) Does it make sense to have locale-agnostic case conversion? Currently
> I'm using ICU4J's Transliterator.getInstance("Any-Lower") and
> Transliterator.getInstance("Any-Upper"). Is this correct?
>
>
>
> “It depends”
>
>
>
> Using Transliterator is probably overkill for most case insensitive
> comparisons. There are equalsIgnoreCase() methods right in String that
> use default case folding. Non-default case folding is very important
> in some locales (notably Turkic languages, Lithuanian, and a few
> others; see SpecialCasing.txt in the UCD). But for many programmatic
> operations, you do not want locale-sensitive case folding. It depends
> on why you are doing the case folding. Is it for a language-specific
> presentation? Then, probably, you want to use the proper folding. Even
> then, I would REALLY question using ICU4J. I mean, isn’t String’s
> toUpperCase(Locale) good enough for you?
>
>
>
> Now, turning it around for a second, you definitely should NEVER use
> String.toUpperCase() or String.toLowerCase() without passing a locale
> argument (new Locale("", "") is a good locale to use for default
> behavior). Without an argument, these methods use the system default
> locale. If you expect a locale-insensitive operation to follow, you’ll
> have peculiar code failures in locales such as Turkish, where both a
> dotless and a dotted "i" exist.
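
The Turkish failure is easy to reproduce by passing the locale
explicitly instead of relying on the machine's default. A minimal
sketch; the keyword check is a made-up example of a "programmatic"
comparison.

```java
import java.util.Locale;

final class TurkishCaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Hypothetical check: does the input match the keyword "title" after
    // lower-casing under the given locale?
    static boolean matchesKeyword(String input, Locale locale) {
        return input.toLowerCase(locale).equals("title");
    }

    public static void main(String[] args) {
        // Locale-insensitive casing: I lower-cases to i.
        System.out.println(matchesKeyword("TITLE", Locale.ROOT));

        // Turkish casing: I lower-cases to dotless i (U+0131), so the match fails.
        System.out.println(matchesKeyword("TITLE", TURKISH));

        // And i upper-cases to dotted capital I (U+0130).
        System.out.println("title".toUpperCase(TURKISH).equals("T\u0130TLE"));
    }
}
```

Calling toLowerCase() with no argument on a machine whose default locale
is Turkish behaves like the second call.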
>
>
>
> Addison
>
>
>
>
>
> Addison Phillips
>
> Globalization Architect -- Lab126
>
> Chair -- W3C Internationalization Core WG
>
>
>
> Internationalization is not a feature.
>
> It is an architecture.
>
>
>
>
>

-- 
Naoto Sato


This archive was generated by hypermail 2.1.5 : Thu Oct 02 2008 - 14:00:59 CDT