RE: Normalization in panlingual application

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Sep 20 2007 - 18:20:37 CDT

Next message: Mike: "Re: New Public Review Issue: Proposed Update UTS #18"

Previous message: Philippe Verdy: "RE: Normalization in panlingual application"
In reply to: John D. Burger: "Re: Normalization in panlingual application"
Next in thread: Philippe Verdy: "RE: Normalization in panlingual application"

IDN stands for Internationalized Domain Names. They are definitely not
designed to support text corpora or to preserve all linguistic distinctions.

IDN relies on a kind of folding that includes not only a form of
normalization but also compatibility mappings, case mappings and foldings
(not the same as those defined in Unicode, but a much cruder mapping that is
even more destructive).
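
To make the destructiveness concrete, here is a minimal Python sketch. It
assumes CPython's encodings.idna.nameprep helper, which is one implementation
of the Nameprep profile and not something this thread relies on; the sample
strings are my own.

```python
from encodings.idna import nameprep  # assumption: CPython's Nameprep helper

# Inputs that a text corpus would want to keep distinct from their
# plain-ASCII lookalikes.
samples = [
    "Stra\u00dfe",          # sharp s plus an uppercase initial
    "\ufb01nal",            # U+FB01 LATIN SMALL LIGATURE FI
    "\uff21\uff22\uff23",   # fullwidth capitals A, B, C
]

for text in samples:
    # Case, ligature and width distinctions are all folded away.
    print(repr(text), "->", repr(nameprep(text)))
```

All three inputs come out as plain lowercase ASCII ("strasse", "final",
"abc"), so the original spelling cannot be recovered from the result.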

IDN is built from two parts: the Nameprep algorithm, which performs the
actual mappings and foldings, and an encoding known as "Punycode", which
represents the mapped/folded strings within the very limited constraints of
DNS, using a specially reserved prefix.
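
The two stages can be sketched in Python as follows. This is only an
illustration: to_ace is a hypothetical helper name, it handles a single label
and skips the length and hyphen checks a real implementation performs, and it
assumes CPython's encodings.idna.nameprep and "punycode" codec.

```python
from encodings.idna import nameprep  # assumption: CPython's Nameprep helper

ACE_PREFIX = "xn--"  # the prefix reserved for Punycode-encoded labels


def to_ace(label: str) -> str:
    """Hypothetical helper: fold one label, then encode it for DNS."""
    # Stage 1: Nameprep does the mappings and foldings (the lossy part).
    prepared = nameprep(label)
    # Labels that are already pure ASCII need no further encoding.
    if prepared.isascii():
        return prepared
    # Stage 2: Punycode squeezes the folded label into ASCII.
    return ACE_PREFIX + prepared.encode("punycode").decode("ascii")


print(to_ace("B\u00fccher"))  # prints xn--bcher-kva
```

For this input the result matches what Python's built-in "idna" codec
produces for the same label.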

IDN is not meant to support all characters: punctuation and spaces are not
supported, nor are significant typographic distinctions.

The Punycode encoding is not the problem here: it is no different from other
charset encodings, it could support any valid Unicode string, and it would
preserve the Unicode distinctions. We are talking here about Nameprep, as
specified in its RFC (RFC 3491).
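
A quick way to check this is to round-trip strings through Punycode alone,
with no Nameprep step. The sketch below assumes Python's built-in "punycode"
codec; punycode_roundtrip is a hypothetical helper name and the samples are
mine.

```python
def punycode_roundtrip(text: str) -> str:
    """Hypothetical helper: encode to Punycode and decode straight back."""
    return text.encode("punycode").decode("punycode")


# Case, a ligature and a space: distinctions that Nameprep would fold
# away or that hostnames do not allow.
for text in ["B\u00fccher", "\ufb01nal", "two words"]:
    restored = punycode_roundtrip(text)
    print(repr(text), "round-trips unchanged:", restored == text)
```

Each sample comes back unchanged, so whatever is lost in IDN is lost in the
folding step and not in the encoding.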

Note that the stability rules for IDN are different from those used in
Unicode, even though the two sets of rules largely overlap.

For Unicode conformance, what really matters is that distinct but
canonically equivalent inputs produce canonically equivalent results.
Nameprep should at least respect this condition. But nothing is really
defined to guarantee that Nameprep maps distinct but compatibility-equivalent
inputs to compatibility-equivalent outputs, or to keep that assumption stable
over time and across all IDN implementations.
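
The canonical-equivalence condition is easy to spot-check for a given
implementation. Here is a small sketch, again assuming CPython's
encodings.idna.nameprep; same_after_nameprep is a hypothetical helper and the
strings are my own. A passing check says something about this one
implementation and its tables only, not about stability across versions or
other implementations.

```python
import unicodedata
from encodings.idna import nameprep  # assumption: CPython's Nameprep helper


def same_after_nameprep(a: str, b: str) -> bool:
    """Hypothetical helper: do two inputs fold to the same output?"""
    return nameprep(a) == nameprep(b)


composed = "caf\u00e9"      # e-acute as a single precomposed code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

# The two spellings differ as code point sequences but are canonically
# equivalent: they have the same NFD form.
assert composed != decomposed
assert unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)

print(same_after_nameprep(composed, decomposed))  # True with this implementation
```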

So disregard Nameprep for your project. It is not for you: it is extremely
lossy and supports only a subset of the UCS.
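
The "subset of the UCS" point can also be observed directly: some code points
are not folded but rejected outright. A minimal sketch, assuming CPython's
encodings.idna.nameprep raises UnicodeError (or a subclass) for prohibited
characters; survives_nameprep is a hypothetical helper name.

```python
from encodings.idna import nameprep  # assumption: CPython's Nameprep helper


def survives_nameprep(text: str) -> bool:
    """Hypothetical helper: True if Nameprep accepts the string at all."""
    try:
        nameprep(text)
    except UnicodeError:
        # Prohibited code points are rejected rather than mapped.
        return False
    return True


print(survives_nameprep("example"))   # ordinary letters are accepted
print(survives_nameprep("\ue000"))    # a private-use code point
print(survives_nameprep("\ufffd"))    # U+FFFD REPLACEMENT CHARACTER
```

A corpus that contains private-use characters, for example, could not be
pushed through this step at all.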

> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
> Behalf Of John D. Burger
> Sent: Thursday, September 20, 2007 15:03
> To: Unicode
> Subject: Re: Normalization in panlingual application
>
> >> It should at best have been just a non-mandatory recommendation,
> >> allowing tailoring (even IDN no longer refers to it directly, and
> >> needed to redefine its own foldings).
> >
> > That's because IDN is morphing beyond simple identifiers as
> > traditionally understood for programming languages and the like.
> > IDN is attempting to be closer to ordinary language, and that's why
> > the limitations of NFKD/NFKC become apparent.
>
> I'm not that familiar with IDN - do the foldings specified by IDN
> constitute a useful "sweet spot" for normalization/folding, somewhere
> in between NFD and NFKD? That is, might there be broad classes of
> applications (such as the original poster's) for which "IDN
> normalization" is a good solution? I understand that any particular
> application would ideally pick and choose from the possibilities in
> UTR 30, but it'd be great if I could say "start with IDN" when people
> ask me about these issues.
>
> Thanks.
>
> - John D. Burger
> MITRE


This archive was generated by hypermail 2.1.5 : Thu Sep 20 2007 - 21:52:42 CDT