From: Richard T. Gillam (rgillam@las-inc.com)
Date: Thu Mar 03 2005 - 09:47:29 CST
>Thus I'm looking at ways to implement NFC that don't require me to drag
>around the Unicode data files or substantial chunks thereof. I notice
>that java.lang.Character has a getType method that returns the Unicode
>character class for each character. This is built-in to Java since 1.1.
>It lets me tell if a character is one of the following types:
>
>...
>
>Is this sufficient information to perform NFC normalization? Or is that
>a pipe dream and I'm just going to need to drag along the Unicode data
>file or part thereof?

If I'm reading your message right, this is definitely not enough to do
NFC. It isn't enough to know the general categories of the characters;
you have to have their actual decomposition mappings, plus their
canonical combining classes. Knowing you have a base character next to a
combining character only tells you that this is a spot where you might
be able to combine characters; it doesn't tell you which character
replaces the two, or even whether such a character exists.
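To make that concrete, here is a minimal sketch. It assumes java.text.Normalizer (which only arrived in Java 6, so it is not something the original question could rely on) as a reference NFC implementation; the class and method names are made up for illustration. Both inputs look identical to getType, a lowercase letter followed by a non-spacing mark, yet only one of them has a precomposed form.

```java
import java.text.Normalizer;
import java.util.Arrays;

public class TypeVersusMapping {
    // General categories of the first two chars, as reported by Character.getType.
    static int[] types(String s) {
        return new int[] { Character.getType(s.charAt(0)), Character.getType(s.charAt(1)) };
    }

    // Reference NFC, using the JDK's own data tables.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String eAcute = "e\u0301"; // e + COMBINING ACUTE ACCENT
        String qAcute = "q\u0301"; // q + COMBINING ACUTE ACCENT

        // Same category pair for both: LOWERCASE_LETTER, NON_SPACING_MARK.
        System.out.println(Arrays.equals(types(eAcute), types(qAcute)));

        // Different outcomes: U+00E9 exists, but there is no precomposed q-acute.
        System.out.println(nfc(eAcute).length()); // 1
        System.out.println(nfc(qAcute).length()); // 2
    }
}
```

The category data alone cannot distinguish the two cases; the difference lives entirely in the composition data.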

You could pare down the NFC algorithm by restricting its domain to just
the characters you expect to see (this might, for example, get you out
of dealing with non-BMP combining characters [yes, they do exist]). You
could also get rid of some of the more obscure code. For example, you
might have <B C1 C2>, where B is a base character and C1 and C2 are
combining characters with different nonzero combining classes. There's
no single Unicode character for B+C1 or for B+C1+C2, but there is one
for B+C2, so NFC would result in <B+C2 C1>, combining B with C2 even
though C1 is in the way. You could eliminate this.
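As an illustration of that <B C1 C2> case, the sketch below picks one concrete instance: B = a, C1 = U+0316 COMBINING GRAVE ACCENT BELOW (combining class 220), C2 = U+0301 COMBINING ACUTE ACCENT (class 230). The choice of characters, the class name, and the use of java.text.Normalizer (Java 6 and later) are my assumptions, not part of the original discussion. A second input shows the contrasting case where the intervening mark has the same class as C2 and therefore blocks the composition.

```java
import java.text.Normalizer;

public class InterveningMark {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Renders a string as space-separated U+XXXX code points.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // a + grave-below (class 220) + acute (class 230):
        // the acute skips over the grave-below and composes with a.
        System.out.println(hex(nfc("a\u0316\u0301"))); // U+00E1 U+0316

        // a + overline (class 230) + acute (class 230):
        // same class, so the overline blocks the acute and nothing composes.
        System.out.println(hex(nfc("a\u0305\u0301"))); // U+0061 U+0305 U+0301
    }
}
```

A pared-down implementation that only ever looks at adjacent pairs would leave the first input untouched, which is exactly the shortcut described above.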

Of course, if you do any of these things, you're not doing NFC anymore,
at least not if you're presented with data that you didn't expect to
see. You're doing your own normalization that kind of looks like NFC.
In a closed system, where the stuff your normalization doesn't handle is
stuff the rest of your application can't handle either, this might be a
reasonable way to go. What you couldn't do is take the output of such a
modified converter and interchange it with the outside world claiming
it's NFC.
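For what it's worth, a closed-system shortcut of this kind might look roughly like the following. Everything here is hypothetical: the class name, the three-entry table, and the adjacent-pairs-only rule are invented for illustration, and java.text.Normalizer (Java 6 and later) appears only as a reference to compare against.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

public class ClosedSystemComposer {
    // Hypothetical, deliberately tiny table: base + mark -> precomposed character.
    private static final Map<String, String> PAIRS = new HashMap<>();
    static {
        PAIRS.put("a\u0301", "\u00E1"); // a + acute
        PAIRS.put("e\u0301", "\u00E9"); // e + acute
        PAIRS.put("e\u0300", "\u00E8"); // e + grave
    }

    // Combines only adjacent BMP pairs found in the table.
    // No decomposition, no reordering, no skipping over intervening marks.
    static String compose(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            String hit = (i + 1 < s.length()) ? PAIRS.get(s.substring(i, i + 2)) : null;
            if (hit != null) {
                out.append(hit);
                i += 2;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // Reference NFC for comparison.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String expected = "cafe\u0301";      // data the table was built for
        String unexpected = "a\u0316\u0301"; // a mark sits between a and the acute

        System.out.println(compose(expected).equals(nfc(expected)));     // true
        System.out.println(compose(unexpected).equals(nfc(unexpected))); // false
    }
}
```

On the expected input the output happens to match NFC; on the unexpected input it silently does not, which is why the result can't be labeled NFC for interchange.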

Hope this helps...

--Rich Gillam
Language Analysis Systems, Inc.