From: Matt Chu (email@example.com)
Date: Wed Oct 01 2008 - 19:55:11 CDT
For a Java internationalization project I've been looking into the Unicode
standard and the ICU4J library, and I'm confused about a few things.
Hopefully somebody on this list can help me with one or two of these questions.
1) I want to standardize on a normalization form, but this sentence in Annex
15 (Unicode Normalization Forms) gave me pause:
"Normalization Forms KC and KD must not be blindly applied to arbitrary
text. Because they erase many formatting distinctions, they will prevent
round-trip conversion to and from many legacy character sets, and unless
supplanted by formatting markup, they may remove distinctions that are
important to the semantics of the text."
For example, suppose that I use NFKC on text that has both halfwidth "ｶ" and
fullwidth "カ". Thanks to NFKC, both are now converted to fullwidth "カ"
(let's say). When I want to convert back to a Japanese-specific encoding, we
no longer know which ones are halfwidth and which ones are fullwidth. The
question is, how big of a deal is this in real-world, normal usage?
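To make the folding concrete, here is what I mean, using only the JDK's java.text.Normalizer (I assume ICU4J's Normalizer2 produces the same result for NFKC):

```java
import java.text.Normalizer;

public class NfkcDemo {
    public static void main(String[] args) {
        // U+FF76 HALFWIDTH KATAKANA LETTER KA
        String half = "\uFF76";
        // NFKC applies the compatibility decomposition, folding the
        // halfwidth form into U+30AB KATAKANA LETTER KA
        String nfkc = Normalizer.normalize(half, Normalizer.Form.NFKC);
        System.out.println(nfkc.equals("\u30AB")); // true
        // After this, there is no trace of the original halfwidth form,
        // so a round-trip back to a halfwidth-aware legacy encoding
        // cannot restore it.
    }
}
```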
2) Can string equivalence be both locale-agnostic and locale-sensitive? That
is, can two code points be equal in one language and not equal in another?
(I am assuming some normalization form here.) If this is possible, doesn't
that mean String.equals(...) should have a parameter for locale? Also,
doesn't this mean that all Collections should take a Collator as an argument?
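As an illustration of why I'm asking, here is the JDK's java.text.Collator disagreeing with String.equals (I assume ICU4J's com.ibm.icu.text.Collator behaves analogously):

```java
import java.text.Collator;
import java.util.Locale;
import java.util.TreeSet;

public class CollatorDemo {
    public static void main(String[] args) {
        // A Collator compares strings under locale-specific rules.
        Collator fr = Collator.getInstance(Locale.FRENCH);
        fr.setStrength(Collator.PRIMARY); // ignore accent and case differences

        // Equal at PRIMARY strength (accent is only a secondary difference)...
        System.out.println(fr.compare("café", "cafe") == 0); // true
        // ...but String.equals is locale-agnostic and disagrees:
        System.out.println("café".equals("cafe"));           // false

        // A sorted collection parameterized by a Collator, as the
        // question suggests:
        TreeSet<String> set = new TreeSet<>(fr);
        set.add("cafe");
        set.add("café"); // a duplicate under PRIMARY-strength comparison
        System.out.println(set.size()); // 1
    }
}
```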
3) Does it make sense to have locale-agnostic case conversion? Currently I'm
using ICU4J's Transliterator.getInstance("Any-Lower") and
Transliterator.getInstance("Any-Upper"). Is this correct?
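The kind of thing that worries me here is the well-known Turkish dotted/dotless i, which I can reproduce with just the JDK (my assumption is that "Any-Lower"/"Any-Upper" apply root-locale rules and would miss this):

```java
import java.util.Locale;

public class CaseDemo {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr");
        // Turkish has a dotless lowercase ı, so lowercasing "I" is
        // locale-sensitive:
        System.out.println("I".toLowerCase(Locale.ROOT)); // "i"
        System.out.println("I".toLowerCase(turkish));     // "ı" (U+0131)
        System.out.println("i".toUpperCase(turkish));     // "İ" (U+0130)
    }
}
```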
Thanks for any help you can provide!
This archive was generated by hypermail 2.1.5 : Wed Oct 01 2008 - 20:29:17 CDT