Re: Normalization in panlingual application

From: Asmus Freytag (
Date: Wed Sep 19 2007 - 17:56:02 CDT


    On 9/19/2007 1:40 PM, Jonathan Pool wrote:
    > ....
    > In my work on another of these applications, I'm tentatively planning to
    > normalize all input to NFKD. I'm concerned, though, that (1) some valid
    > distinctions might thereby be erased,
    Anything that's not in a natural language, but in the language of
    mathematics, will lose extremely valid distinctions. The same is true
    for other technical/scientific notations.
    > (2) some invalid distinctions may
    > survive, and
    What counts as an invalid distinction is defined by your application.
    If you case-fold, case is an invalid distinction. If your goal is to be
    able to represent text faithfully, then the "K" series of normalizations
    has no place in your design. (It's too haphazard: for example, 5¼ would
    be turned into 51/4, which is decidedly not the same thing.)
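
To make the 5¼ case concrete, here is a minimal sketch using Python's standard unicodedata module. The sample strings and the helper name compare are illustrative assumptions, not part of the original exchange. It shows NFC leaving each string intact while NFKD flattens the fraction, a superscript, and a mathematical letter.

```python
import unicodedata

# Illustrative samples (assumed for this sketch).
samples = {
    "mixed number": "5\u00BC",      # 5 followed by VULGAR FRACTION ONE QUARTER
    "superscript": "x\u00B2",       # x followed by SUPERSCRIPT TWO
    "script capital H": "\u210B",   # SCRIPT CAPITAL H, used in mathematics
}

def compare(text):
    """Return the (NFC, NFKD) forms of text."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKD", text),
    )

for label, text in samples.items():
    nfc, nfkd = compare(text)
    # NFC keeps the compatibility characters; NFKD replaces them.
    print(f"{label}: NFC={nfc!r} NFKD={nfkd!r}")
```

In the NFKD result the fraction becomes the digits 1 and 4 around U+2044 FRACTION SLASH, so "five and a quarter" reads as "fifty-one over four", and x squared becomes indistinguishable from "x2".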

    But after applying NFC or NFD, you may want to additionally apply some
    specific foldings (see UTS#30). These you would choose based on the
    exact requirements of your implementation.
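
One way to read "NFC plus chosen foldings" is sketched below in Python. The function name fold and its two switches are hypothetical, and the two foldings shown (case and combining-mark removal) are only stand-ins for whatever an application positively decides to ignore; they are not the UTS #30 foldings themselves.

```python
import unicodedata

def fold(text, *, case=False, accents=False):
    """Normalize to NFC, then apply only the foldings explicitly requested.

    Hypothetical helper: each keyword names a distinction the caller has
    decided is invalid for their application.
    """
    text = unicodedata.normalize("NFC", text)
    if accents:
        # Decompose canonically, drop combining marks, then recompose.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
        text = unicodedata.normalize("NFC", text)
    if case:
        # Case folding can denormalize some strings, so renormalize after.
        text = unicodedata.normalize("NFC", text.casefold())
    return text

print(fold("Caf\u00E9", case=True))                  # case folded, accent kept
print(fold("Caf\u00E9", accents=True))               # accent removed, case kept
print(fold("5\u00BC", case=True, accents=True))      # fraction left intact
```

Because no compatibility mapping is ever applied, characters such as ¼ pass through unchanged unless a folding is added for them on purpose.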
    > (3) some user agents may misrender decomposed strings.
    > Any thoughts about the best approach to normalization for PanImages and other
    > applications using the same database would be welcome.
    Beyond "don't use NFKx" I can only recommend you read up on character
    foldings and decide which distinctions you (positively, not by default)
    consider valid/invalid in your case.

