Re: Normalization in panlingual application

From: Asmus Freytag (
Date: Thu Sep 20 2007 - 11:41:36 CDT

    On 9/20/2007 6:02 AM, John D. Burger wrote:
    >>> It should at best have been just a non-mandatory recommendation,
    >>> allowing tailoring (even IDN no longer refers to it directly, and
    >>> needed to redefine its own foldings).
    >> That's because IDN is morphing beyond simple identifiers as
    >> traditionally understood for programming languages and the like. IDN
    >> is attempting to be closer to ordinary language, and that's why the
    >> limitations of NFKD/NFKC become apparent.
    > I'm not that familiar with IDN - do the foldings specified by IDN
    > constitute a useful "sweet spot" for normalization/folding, somewhere
    > in between NFD and NFKD? That is, might there be broad classes of
    > applications (such as the original poster's) for which "IDN
    > normalization" is a good solution? I understand that any particular
    > application would ideally pick and choose from the possibilities in
    > UTR 30, but it'd be great if I could say "start with IDN" when people
    > ask me about these issues.
    IDN still operates on a restricted domain of characters; many characters
    that are part of ordinary text are disallowed from the get-go (I haven't
    checked recently where that subset stands, but that's the general idea).
    At a minimum, the transformations that are designed into IDN would
    need to be modified or extended to handle such characters. Because of
    that alone, the normalization and folding aspect of IDN is unlikely to
    be suitable for general text. There are likely additional issues.

    If you suggest that any scheme in which you can't represent the word
    "can't" is suitable for the class of applications that the original
    poster represents, then I fail to follow you.

    Also, in the case of foldings, there's not necessarily a single
    continuum. Yes, if you look at UTR #30 it does point out that the
    compatibility mappings can be separated into several types of foldings -
    but there are other foldings that cut across the spectrum in different
    ways, for example case folding. Finally, compatibility mappings are
    immutable and assigned rather mechanically to new characters added to
    the standard (mostly based on analogy with existing, similar
    characters). However, a well-defined folding may exclude or include a
    slightly different set of characters, or the folding may act on a
    string, not an isolated character. Therefore foldings are not like
    little Lego blocks that you add one by one until you get from NFD to NFKD.
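
    To make the point concrete, here is a minimal sketch in Python, using
    only the standard library's unicodedata module and str methods. The
    helper name `forms` and the sample characters are illustrative choices,
    not anything defined by UTR #30 or IDN. It shows that case folding and
    the compatibility mappings touch different sets of characters, and that
    a case operation can depend on the surrounding string.

```python
import unicodedata


def forms(s: str) -> dict:
    """Return several normalized/folded forms of s for side-by-side comparison."""
    return {
        "NFD": unicodedata.normalize("NFD", s),
        "NFKD": unicodedata.normalize("NFKD", s),
        "casefold": s.casefold(),
    }


# U+FB01 LATIN SMALL LIGATURE FI: untouched by NFD, split by NFKD.
fi = forms("\ufb01")

# U+00DF LATIN SMALL LETTER SHARP S: NFKD leaves it alone,
# but full case folding expands it to "ss".
sharp_s = forms("\u00df")

# U+00B2 SUPERSCRIPT TWO: NFKD maps it to "2", case folding does nothing.
sup_two = forms("\u00b2")

# Context matters: Python's lower() turns a capital sigma into a final
# sigma (U+03C2) at the end of a word, but into U+03C3 when it stands alone.
# This is a case mapping rather than a folding; it is used here only to
# illustrate an operation defined on strings, not isolated characters.
word_final = "\u0391\u03a3".lower()
isolated = "\u03a3".lower()

for name, value in [("fi", fi), ("sharp_s", sharp_s), ("sup_two", sup_two)]:
    print(name, {k: [hex(ord(c)) for c in v] for k, v in value.items()})
print("word_final", [hex(ord(c)) for c in word_final])
print("isolated", [hex(ord(c)) for c in isolated])
```

    Neither operation is a subset of the other here: NFKD changes the
    superscript but not the sharp s, and case folding does the reverse.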

