RE: Normalization in panlingual application

From: Philippe Verdy
Date: Thu Sep 20 2007 - 18:20:37 CDT

    IDN stands for Internationalized Domain Names. They are definitely not
    designed to support text corpora or to preserve all linguistic distinctions.

    IDN relies on a type of folding that includes not only some form of
    normalization, but also compatibility mappings, case mappings and foldings
    (not the same as those defined in Unicode, but a much weaker mapping that is
    even more destructive).
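
    To make the loss concrete, here is a minimal sketch in Python. It is not
    Nameprep: the function name approx_idn_fold is hypothetical, and it only
    approximates this kind of folding with NFKC plus Unicode case folding from
    the standard unicodedata module.

```python
import unicodedata


def approx_idn_fold(text: str) -> str:
    """Hypothetical stand-in for an IDN-style folding (NOT the real Nameprep).

    Applies compatibility normalization (NFKC) and Unicode case folding,
    then NFKC again because case folding can denormalize a string.
    """
    normalized = unicodedata.normalize("NFKC", text)
    return unicodedata.normalize("NFKC", normalized.casefold())


# Spellings that a text corpus would want to keep apart: mixed case,
# upper case, the U+FB01 "fi" ligature, and fullwidth letters.
variants = ["File", "FILE", "\ufb01le", "\uff26\uff49\uff4c\uff45"]

# All of them collapse to one folded string, so the distinction is lost.
folded = {approx_idn_fold(v) for v in variants}
print(folded)
```

    Four different inputs end up as a single string, which is fine for matching
    identifiers but destroys information that a corpus needs to keep.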

    IDN is the result of two parts: the Nameprep algorithm, which performs the
    actual mappings and foldings, and an encoding known as "Punycode", which
    represents the mapped/folded strings within the very limited constraints
    of DNS, using a specially reserved prefix ("xn--").
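
    The two stages can be sketched separately with the Python standard library.
    This is an illustration under assumptions, not a complete ToASCII
    implementation: to_ace_label is a hypothetical name, it handles a single
    label only, it skips the length and STD3 checks, and it relies on
    encodings.idna.nameprep, an undocumented helper of CPython's "idna" codec.

```python
import encodings.idna


def to_ace_label(label: str) -> str:
    """Sketch of the two IDN stages for ONE label (hypothetical helper)."""
    # Stage 1: Nameprep performs the mappings and foldings (lossy).
    prepared = encodings.idna.nameprep(label)
    # Labels that are pure ASCII after Nameprep need no further encoding.
    if prepared.isascii():
        return prepared
    # Stage 2: Punycode makes the folded string DNS-safe, behind the
    # reserved "xn--" prefix.
    return "xn--" + prepared.encode("punycode").decode("ascii")


label = "B\u00fccher"  # "Bücher", with a capital B
print(to_ace_label(label))

# Cross-check against the documented "idna" codec, which does both stages.
print(label.encode("idna"))
```

    Note that the capital "B" is already gone after stage 1; the Punycode stage
    only transcodes whatever Nameprep hands to it.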

    IDN is not made to support all characters: punctuation and spaces are not
    supported, nor are significant typographic differences.

    The Punycode encoding is not the problem here: it is no different from other
    charset encodings, could support any valid Unicode string, and would
    preserve the Unicode differences. We are talking here about Nameprep, as
    specified in its RFC.
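
    A quick way to check that claim is a round trip through Python's built-in
    "punycode" codec with no Nameprep step in front of it. The helper name
    punycode_roundtrip and the sample strings are only illustrative.

```python
def punycode_roundtrip(text: str) -> str:
    """Encode with the bare Punycode codec (no Nameprep), then decode."""
    return text.encode("punycode").decode("punycode")


samples = [
    "B\u00fccher",   # upper-case letter plus a non-ASCII letter
    "\ufb01le",      # U+FB01 "fi" ligature
    "caf\u00e9",     # precomposed e with acute
    "cafe\u0301",    # e followed by combining acute accent
    "two words",     # a space, which IDN would not allow
]

for sample in samples:
    # Every string comes back unchanged, code point for code point.
    print(repr(sample), punycode_roundtrip(sample) == sample)

# Canonically equivalent but distinct inputs stay distinct when encoded.
print("caf\u00e9".encode("punycode") != "cafe\u0301".encode("punycode"))
```

    Everything that gets lost in IDN is therefore lost before Punycode is ever
    applied.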

    Note that the stability rules for IDN are different from those used in
    Unicode, even though the two sets of rules are largely shared.

    For Unicode conformance, what really matters is that distinct but
    canonically equivalent inputs produce canonically equivalent results.
    Nameprep should at least respect this condition. But nothing is really
    defined to guarantee that Nameprep produces compatibility-equivalent
    outputs from distinct but compatibility-equivalent inputs, or that this
    property stays stable over time and across all IDN implementations.
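
    The canonical-equivalence condition can be written down as a small check.
    The sketch below is an assumption-laden illustration: all function names
    are hypothetical, nfkc_fold is only a stand-in for a real folding such as
    Nameprep, and truncate4 is a deliberately broken operation for contrast.

```python
import unicodedata
from typing import Callable


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def preserves_canonical_equivalence(
    fold: Callable[[str], str], a: str, b: str
) -> bool:
    """True if equivalent inputs a and b give equivalent outputs."""
    if not canonically_equivalent(a, b):
        raise ValueError("inputs must be canonically equivalent")
    return canonically_equivalent(fold(a), fold(b))


def nfkc_fold(text: str) -> str:
    """Stand-in folding (NOT Nameprep): NFKC plus case folding."""
    return unicodedata.normalize("NFKC", text.casefold())


def truncate4(text: str) -> str:
    """Deliberately broken: cuts by code point, ignoring normalization."""
    return text[:4]


precomposed = "caf\u00e9"   # e-acute as one code point
decomposed = "cafe\u0301"   # e plus combining acute accent

print(preserves_canonical_equivalence(nfkc_fold, precomposed, decomposed))
print(preserves_canonical_equivalence(truncate4, precomposed, decomposed))
```

    The first check passes because the folding normalizes its input; the second
    fails because a code-point-level operation treats the two spellings
    differently.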

    So disregard Nameprep for your project. It is not for you: it is extremely
    lossy, and it supports only a subset of the UCS.

    > -----Original Message-----
    > From: [] On behalf of
    > John D. Burger
    > Sent: Thursday, September 20, 2007 15:03
    > To: Unicode
    > Subject: Re: Normalization in panlingual application
    > >> It should at best have been just a non-mandatory recommendation,
    > >> allowing tailoring (even IDN no longer refers to it directly, and
    > >> needed to redefine its own foldings).
    > >
    > > That's because IDN is morphing beyond simple identifiers as
    > > traditionally understood for programming languages and the like.
    > > IDN is attempting to be closer to ordinary language, and that's why
    > > the limitations of NFKD/NFKC become apparent.
    > I'm not that familiar with IDN - do the foldings specified by IDN
    > constitute a useful "sweet spot" for normalization/folding, somewhere
    > in between NFD and NFKD? That is, might there be broad classes of
    > applications (such as the original poster's) for which "IDN
    > normalization" is a good solution? I understand that any particular
    > application would ideally pick and choose from the possibilities in
    > UTR 30, but it'd be great if I could say "start with IDN" when people
    > ask me about these issues.
    > Thanks.
    > - John D. Burger
    > MITRE

    This archive was generated by hypermail 2.1.5 : Thu Sep 20 2007 - 21:52:42 CDT