From: Asmus Freytag (firstname.lastname@example.org)
Date: Wed Sep 19 2007 - 17:56:02 CDT
On 9/19/2007 1:40 PM, Jonathan Pool wrote:
> In my work on another of these applications, I'm tentatively planning to
> normalize all input to NFKD. I'm concerned, though, that (1) some valid
> distinctions might thereby be erased,
Anything that's not in a natural language, but in the language of
mathematics, will lose extremely valid distinctions. The same is true
for other technical/scientific notations.
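
To make the point about mathematical notation concrete, here is a minimal sketch using Python's standard unicodedata module. The sample strings and the helper name nfkd are my own illustrations, not taken from any particular application.

```python
import unicodedata

def nfkd(text):
    """Apply compatibility decomposition (NFKD) to text."""
    return unicodedata.normalize("NFKD", text)

# Illustrative samples (assumed, not from a real data set):
samples = {
    "superscript two": "x\u00b2",   # x squared becomes the two letters "x2"
    "script capital H": "\u210b",   # script H becomes a plain "H"
    "double-struck R": "\u211d",    # double-struck R becomes a plain "R"
}

for label, text in samples.items():
    print(label, repr(text), "->", repr(nfkd(text)))
```

In each case the compatibility mapping replaces a character that carries mathematical meaning with its plain counterpart, so the original cannot be recovered from the result.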
> (2) some invalid distinctions may
> survive, and
What is an invalid distinction is defined by your application. If you
case-fold, case is an invalid distinction. If your goal is to be able to
represent text faithfully, then the "K" series of normalizations has no
place in your design (it's too haphazard; for example, 5¼ would be
turned into 51/4, which is decidedly not the same thing).
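
The fraction case can be checked directly. This is a small sketch with Python's unicodedata module; the function name compare_forms is just a label I picked. Note that the slash NFKD produces is U+2044 FRACTION SLASH rather than the ASCII "/", but the digits still run together as "51" followed by "4".

```python
import unicodedata

def compare_forms(text):
    """Return the NFC and NFKD forms of text as a pair."""
    return (unicodedata.normalize("NFC", text),
            unicodedata.normalize("NFKD", text))

nfc, nfkd = compare_forms("5\u00bc")   # "5" followed by VULGAR FRACTION ONE QUARTER
print(repr(nfc))    # canonical form keeps the single fraction character
print(repr(nfkd))   # compatibility form spells it out as digits around U+2044
```

NFC leaves the string alone, while NFKD yields a digit sequence that reads as fifty-one over four.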
But, after applying NFC, or NFD, you may want to additionally apply some
specific foldings (see UTS#30). These you would choose based on the
exact requirements of your implementation.
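
As a rough sketch of that approach (canonical normalization first, then only the foldings you have explicitly chosen), something like the following could work in Python. The function names make_key, fold_case and strip_marks are hypothetical, and the two foldings shown are merely examples of choices an application might make; they are not a rendering of the UTS#30 tables.

```python
import unicodedata

def fold_case(text):
    """Folding chosen on purpose: ignore case distinctions."""
    return text.casefold()

def strip_marks(text):
    """Folding chosen on purpose: ignore nonspacing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", kept)

def make_key(text, foldings):
    """Normalize to NFC, apply only the listed foldings, then re-normalize."""
    key = unicodedata.normalize("NFC", text)
    for fold in foldings:
        key = fold(key)
    return unicodedata.normalize("NFC", key)

sample = "Cafe\u0301 5\u00bc"                        # decomposed e-acute, fraction
print(repr(make_key(sample, [fold_case])))
print(repr(make_key(sample, [fold_case, strip_marks])))
```

Because each folding is listed explicitly, the fraction character survives unless you deliberately add a folding that removes it.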
> (3) some user agents may misrender decomposed strings.
> Any thoughts about the best approach to normalization for PanImages and other
> applications using the same database would be welcome.
Beyond "don't use NFKx" I can only recommend that you read up on
character foldings and decide (positively, not by default) which
distinctions are valid or invalid in your case.