Re: Unicode and RFC 4690

From: Steve Summit (scs@eskimo.com)
Date: Wed Oct 04 2006 - 11:48:27 CST

Next message: Jukka K. Korpela: "Re: "Visually approximate" conversion from unicode to Windows-1251 (or similar code page)"

Previous message: Paul Johnston: ""Visually approximate" conversion from unicode to Windows-1251 (or similar code page)"
In reply to: Stephane Bortzmeyer: "Re: Unicode and RFC 4690"
Next in thread: Jefsey_Morfin: "Re: Unicode and RFC 4690"

Stephane Bortzmeyer wrote:
> On Tue, Oct 03, 2006 at 10:22:33PM +0200,
> Jefsey_Morfin <jefsey@jefsey.com> wrote
>> RFC 4690 documents a certain number of difficulties resulting of the
>> choice of Unicode as the reference table of the punycode process.
>
> Not at all... The RFC is available here:
> http://www.ietf.org/rfc/rfc4690.txt
> and does not discuss the choice of Unicode.

Well, it does discuss some significant difficulties.
They're not surprising or new, and they pretty much all revolve
around normalization. The problem, as I read that RFC, is that
DNS requires perhaps the highest levels of accuracy, reliability,
and security when it comes to normalizing and comparing Unicode
strings, but has a relatively minimal amount of context available
to it to assist in the various mapping and matching tasks.

As we know, normalization is a hard problem. The existing
normalization algorithms go a long way towards solving it, but of
course they do not solve it completely. However, what is "good
enough" for some users of the existing normalization algorithms
is probably not good enough for DNS.

I think what Jefsey was asking about was whether anyone has
done any real work on what we might call the "next phase" of
normalization, namely that which considers all pairs and sets
of likely visually-similar glyphs (what RFC 4690 calls
"confusables"), across all languages and scripts.

To cite the simplest example: everyone knows that U+0041 Latin
Capital Letter A, U+0391 Greek Capital Letter Alpha, and U+0410
Cyrillic Capital Letter A are likely to be visually very similar,
if not identical. But, as far as I know, the existing
normalization algorithms don't touch them. Are there any
established, respected, comprehensive repositories of such
equivalences? (I've got my own ad-hoc attempt, but I couldn't
call it "respected", nor even particularly comprehensive.)
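
To make the idea of an equivalence table concrete, here is a minimal
sketch in Python. The table CONFUSABLE_PROTOTYPES and the functions
skeleton() and looks_alike() are hypothetical names invented for this
illustration, and the table holds only the two entries from the example
above; it is not drawn from any established repository.

```python
import unicodedata

# Hypothetical, deliberately tiny table mapping look-alike capitals to a
# Latin "prototype". A real table would need far more entries.
CONFUSABLE_PROTOTYPES = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
}


def skeleton(s):
    """Apply NFC, then replace each character by its prototype, if any."""
    s = unicodedata.normalize("NFC", s)
    return "".join(CONFUSABLE_PROTOTYPES.get(ch, ch) for ch in s)


def looks_alike(a, b):
    """Two strings are treated as confusable when their skeletons match."""
    return skeleton(a) == skeleton(b)


if __name__ == "__main__":
    # Greek Alpha followed by Latin "BC" versus plain Latin "ABC".
    print(looks_alike("\u0391BC", "ABC"))
```

Comparing skeletons rather than rewriting the stored strings keeps the
original spelling intact, which seems preferable when the mapping is
only a judgment about visual similarity.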

[P.S. I don't know nearly as much about the existing
normalization algorithms as I ought to. Apologies in advance
if I'm utterly wrong in my statement that they don't address
the A/Α/А issue.]
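
The statement is easy to test for these three code points. The sketch
below uses Python's standard unicodedata module; the helper names
normalized_forms() and survives_normalization() are made up for this
illustration. It only checks the three letters named above and says
nothing about other look-alike pairs.

```python
import unicodedata

# The three look-alike capitals: Latin A, Greek Alpha, Cyrillic A.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalized_forms(ch):
    """Return the result of each standard normalization form for ch."""
    return {form: unicodedata.normalize(form, ch) for form in FORMS}


def survives_normalization(ch):
    """True if no normalization form changes the character."""
    return all(result == ch for result in normalized_forms(ch).values())


if __name__ == "__main__":
    for ch in LOOKALIKES:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"unchanged = {survives_normalization(ch)}")
```

If every line reports the character as unchanged, then none of the four
forms maps these letters onto one another, and any folding of them has
to come from a separate table.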

This archive was generated by hypermail 2.1.5 : Wed Oct 04 2006 - 11:49:53 CST