Re: Normalization in panlingual application

From: Kenneth Whistler (
Date: Wed Sep 19 2007 - 19:35:52 CDT


    I'll second Asmus' strong suggestion not to use NFKD or NFKC
    as the normalization form for such an application.
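    A quick Python illustration (using the standard `unicodedata` module) of
    what the compatibility forms throw away: NFKC flattens ligatures and
    superscripts into plain characters, while canonical NFC leaves them alone.

    ```python
    import unicodedata

    # NFKC erases compatibility distinctions...
    assert unicodedata.normalize("NFKC", "\ufb01") == "fi"  # LATIN SMALL LIGATURE FI
    assert unicodedata.normalize("NFKC", "\u00b2") == "2"   # SUPERSCRIPT TWO

    # ...that canonical NFC preserves untouched:
    assert unicodedata.normalize("NFC", "\ufb01") == "\ufb01"
    assert unicodedata.normalize("NFC", "\u00b2") == "\u00b2"
    ```

    Once NFKC has been applied there is no way to tell whether the source
    text contained "fi" or the ligature, which is exactly the kind of loss
    being warned against here.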

    > But, after applying NFC, or NFD,

    Additionally, I would suggest going with NFC, which
    sees the widest usage, is the most compatible with web formats,
    and is the most likely to render well in most user agents.

    Additionally, if your application has a significant database component,
    having text data prenormalized in NFC may make it a no-op
    going in and out of a commercial (or public domain)
    Unicode-based database, leading to better performance
    on queries and updates.
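    The no-op property is easy to see in Python: normalizing on the way in
    makes every later NFC pass the identity, so round trips through an
    NFC-based store cost nothing.

    ```python
    import unicodedata

    def to_nfc(s: str) -> str:
        # Normalize once, on the way into the repository.
        return unicodedata.normalize("NFC", s)

    raw = "e\u0301clair"        # 'e' + COMBINING ACUTE ACCENT (decomposed)
    stored = to_nfc(raw)        # composed form: '\u00e9' + "clair"
    assert stored == "\u00e9clair"

    # Re-normalizing already-NFC text changes nothing:
    assert unicodedata.normalize("NFC", stored) == stored
    ```

    (On Python 3.8+, `unicodedata.is_normalized("NFC", stored)` can answer
    the same question without allocating a new string.)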

    > you may want to additionally apply some
    > specific foldings (see UTS#30). These, you would choose based on the
    > exact requirements of your implementation.

    > Beyond "don't use NFKx" I can only recommend you read up on character
    > foldings and decide which distinctions you (positively, not by default)
    > decide are valid/invalid in your case.

    Those are good suggestions. Just be careful what foldings
    you do, because once you start folding distinctions in
    data, it is difficult to recover those distinctions if
    you later discover you've folded too much.

    The safest bet is to simply use NFC normalization for your
    master repository of data, however you implement that,
    and then build selected foldings into views on that
    data and/or for reports on that data. That gives you
    the maximal flexibility for how you can view the data,
    without losing possibly interesting distinctions in
    the original data.
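    A minimal sketch of that architecture in Python: the master repository
    stays in plain NFC, and a folding (here case folding plus accent
    stripping, as an example choice) is applied only when building a view
    for comparison, so nothing is lost in storage.

    ```python
    import unicodedata

    # Toy master repository, stored as NFC -- distinctions preserved.
    MASTER = [unicodedata.normalize("NFC", s)
              for s in ["École", "ecole", "E\u0301cole"]]

    def folded_view(s: str) -> str:
        # Folding applied only at query/report time: case fold, then
        # decompose (NFD), drop combining marks (category Mn), recompose.
        decomposed = unicodedata.normalize("NFD", s.casefold())
        stripped = "".join(c for c in decomposed
                           if unicodedata.category(c) != "Mn")
        return unicodedata.normalize("NFC", stripped)

    # All entries compare equal under the folded view...
    assert len({folded_view(s) for s in MASTER}) == 1
    # ...while the master data keeps its original distinctions:
    assert len(set(MASTER)) > 1
    ```

    Because the folding lives in the view, a mistake in its design can be
    fixed by rebuilding the view; folding the master data would make the
    same mistake unrecoverable.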

    Note, for example, that if you are mixing together language
    data from different sources, you may have to keep track
    of and mark orthographic differences in that data.
    To do comparative searching in such a corpus, you will
    need to be able to do "orthographic folding" -- i.e. be
    able to take one chunk of data in orthography A and
    convert it into orthography B before comparing. Unless
    you are really, really sure of what you are doing, it
    is better to leave the original material as it is,
    and build the orthographic conversions into the application.
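    As an illustration of building the conversion into the application
    rather than the data -- with an entirely invented, toy mapping between
    two hypothetical orthographies; a real conversion would be
    language-specific and far more involved:

    ```python
    # Hypothetical mapping from orthography A (haček digraphs written as
    # single letters) to orthography B (plain-ASCII digraphs). Invented
    # purely for illustration.
    ORTH_A_TO_B = str.maketrans({"\u0161": "sh",   # š -> sh
                                 "\u010d": "ch"})  # č -> ch

    def same_word(a_text: str, b_text: str) -> bool:
        # Convert A-orthography text on the fly; stored data stays untouched.
        return a_text.translate(ORTH_A_TO_B) == b_text

    assert same_word("\u010da\u0161a", "chasha")   # čaša vs. chasha
    ```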

    If you've got 350 bilingual and multilingual dictionaries,
    I'm going to bet that they don't all represent the
    same languages identically, so this orthography issue
    is one you have to face, unless you think you can resolve
    the problem of standardizing orthographies for all
    languages of the world. ;-)


    This archive was generated by hypermail 2.1.5 : Wed Sep 19 2007 - 19:37:34 CDT