From: Kenneth Whistler (firstname.lastname@example.org)
Date: Wed Sep 19 2007 - 19:35:52 CDT
I'll second Asmus' strong suggestion not to use NFKD or NFKC
as the normalization form for such an application.
> But, after applying NFC, or NFD,
And additionally, I would suggest going with NFC, which
sees the widest usage, is the most compatible with web formats,
and is the most likely to render well in most user agents.
Additionally, if your application has a significant database component,
having text data prenormalized in NFC may make it a no-op
going in and out of a commercial (or public domain)
Unicode-based database, leading to better performance
on queries and updates.
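A minimal sketch of that no-op property, using Python's standard
unicodedata module: once text is prenormalized to NFC, normalizing
it again changes nothing, so a database that normalizes on the way
in or out has no work to do.

```python
import unicodedata

# "e" + U+0301 COMBINING ACUTE ACCENT composes to precomposed U+00E9
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00e9"

# Already-NFC text passes through normalization unchanged -- a no-op.
assert unicodedata.is_normalized("NFC", composed)  # Python 3.8+
assert unicodedata.normalize("NFC", composed) == composed
```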
> you may want to additionally apply some
> specific foldings (see UTS#30). These, you would choose based on the
> exact requirements of your implementation.
> Beyond "don't use NFKx" I can only recommend you read up on character
> foldings and decide which distinctions you (positively, not by default)
> decide are valid/invalid in your case.
Those are good suggestions. Just be careful what foldings
you do, because once you start folding distinctions in
data, it is difficult to recover those distinctions if
you later discover you've folded too much.
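To make the irreversibility concrete, here is a small Python example
of a compatibility folding (NFKC) destroying distinctions: a ligature
and a superscript collapse into plain characters, and nothing in the
folded result tells you they were ever there.

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI and U+00B2 SUPERSCRIPT TWO
s = "\ufb01le x\u00b2"
folded = unicodedata.normalize("NFKC", s)
assert folded == "file x2"
# There is no way to recover the ligature or the superscript
# from "file x2" -- the distinction is gone for good.
```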
The safest bet is to simply use NFC normalization for your
master repository of data, however you implement that,
and then build selected foldings into views on that
data and/or for reports on that data. That gives you
the maximal flexibility for how you can view the data,
without losing possibly interesting distinctions in
the original data.
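That NFC-master-plus-folded-views design can be sketched roughly as
follows. The particular folding shown (case folding plus accent
stripping) is just one illustrative choice, not a recommendation for
any specific application.

```python
import unicodedata

def master_form(s: str) -> str:
    # The repository stores NFC text: normalized but lossless.
    return unicodedata.normalize("NFC", s)

def search_key(s: str) -> str:
    # Illustrative query-time folding: case fold, then strip
    # combining marks. Applied only when comparing; the stored
    # master text is never altered.
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# The folded view matches loosely...
assert search_key("R\u00e9sum\u00e9") == search_key("resume")
# ...while the master repository keeps the accents intact.
assert master_form("R\u00e9sum\u00e9") == "R\u00e9sum\u00e9"
```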
Note for example, that if you are mixing together language
data from different sources, you may have to keep track
of and mark orthographic differences in that data.
To do comparative searching in such a corpus, you will
need to be able to do "orthographic folding" -- i.e. be
able to take one chunk of data in orthography A and
convert it into orthography B before comparing. Unless
you are really, really sure of what you are doing, it
is better to leave the original material as it is,
and build the orthographic conversions into the application.
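An application-side orthographic conversion of that kind might look
like the following sketch. The mapping table is entirely invented for
illustration (digraphs of one romanization convention mapped to
haceked letters of another); a real table would come from the
orthographies actually in the corpus.

```python
# Hypothetical orthography-A-to-B mapping, invented for illustration.
ORTH_A_TO_B = {
    "sh": "\u0161",  # s with caron
    "ch": "\u010d",  # c with caron
    "zh": "\u017e",  # z with caron
}

def to_orthography_b(text: str) -> str:
    # Apply longer keys first so overlapping digraphs can't clash.
    for a, b in sorted(ORTH_A_TO_B.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(a, b)
    return text

# The original material stays as-is; conversion happens only
# at comparison time.
assert to_orthography_b("chak") == "\u010dak"
```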
If you've got 350 bilingual and multilingual dictionaries,
I'm going to bet that they don't all represent the
same languages identically, so this orthography issue
is one you have to face, unless you think you can resolve
the problem of standardizing orthographies for all
languages of the world. ;-)
This archive was generated by hypermail 2.1.5 : Wed Sep 19 2007 - 19:37:34 CDT