From: CE Whitehead (firstname.lastname@example.org)
Date: Sun Aug 01 2010 - 20:40:22 CDT
> Date: Fri, 30 Jul 2010 06:04:26 +0200
> From: email@example.com
> To: firstname.lastname@example.org; email@example.com
> CC: firstname.lastname@example.org; email@example.com; firstname.lastname@example.org; email@example.com
> For Arabic there are clearly two separate sets of digits, but the
> possibility of mixing them arbitrarily is still a problem for IDNA (if
> both sets are accepted)
Both are accepted (according to online info and Martin Duerst, as I understand things).
> , notably because most digits (except 4 to 6)
> are completely identical. So registries will have to:
> - either accept one set and reject the other one
> - accept both, but only one within the same domain label, reserving
> also the label using the other set (as if they were canonically
The Saudi registry is folding these; that is, the .sa registry plans to have that done somehow. I did not realize a registry could control folding, so perhaps it is only recommending folding.
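As a rough illustration of the second option quoted above (accept both digit sets, but only one within a given label), here is a minimal Python sketch. The function names are hypothetical, and it assumes the two sets in question are the Arabic-Indic digits (U+0660..U+0669) and the Extended Arabic-Indic digits (U+06F0..U+06F9); it is not taken from any registry's actual implementation.

```python
# Assumed code point ranges for the two digit sets under discussion.
ARABIC_INDIC = range(0x0660, 0x066A)           # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0..U+06F9


def digit_sets_used(label):
    """Return the names of the Arabic digit sets that occur in a label."""
    found = set()
    for ch in label:
        cp = ord(ch)
        if cp in ARABIC_INDIC:
            found.add("arabic-indic")
        elif cp in EXTENDED_ARABIC_INDIC:
            found.add("extended-arabic-indic")
    return found


def mixes_digit_sets(label):
    """True if the label uses digits from both sets (hypothetical rejection rule)."""
    return len(digit_sets_used(label)) > 1
```

A registry following this option would presumably refuse any label for which `mixes_digit_sets` returns True.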
> Such equivalences (which are definitely not canonical)
> can be handled
> by tailored collation compares (operating at collation level 2 only,
> when non-IDN registries operate only at level 1),
So you are proposing something like folding these in stringprep? But I am confused about why level 1 would not work (sorry to ask a dumb question).
> where IDN registries
> will use their own tailoring. I just see the IDN "StringPrep" as a
> particular application of the general concept of collation mappings
> (except that it was not designed on linguistic bases, but an IDN
> registry can be viewed as a locale for collation purposes).
The Saudi registry's policy, it seems, is to accept both digit sets and then fold the two varieties into the non-Eastern variety (both varieties are apparently available on the Saudi keyboard, or should be). But there are other registries that will handle Arabic-script domains.
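A minimal sketch of what such folding could look like, in Python. It assumes, following the reading above, that the fold target is the Arabic-Indic set (U+0660..U+0669) and that the folded set is the Extended Arabic-Indic one (U+06F0..U+06F9); the registry's own definition of its target set may differ, and `fold_digits` is a hypothetical name.

```python
# Assumption: fold Extended Arabic-Indic digits (U+06F0..U+06F9)
# onto the Arabic-Indic digits (U+0660..U+0669).
FOLD_TABLE = {0x06F0 + i: 0x0660 + i for i in range(10)}


def fold_digits(label):
    """Map every Extended Arabic-Indic digit to its Arabic-Indic counterpart."""
    return label.translate(FOLD_TABLE)
```

With this table, two labels that differ only in which digit set they use fold to the same string, so only one of them could be registered.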
> All these
> complex rules and mappings of IDN can be written in terms of a set
> collation rules, added on top of the DUCET.
O.k. -- a possibility. One can add these to the DUCET, but collation is always tailorable, according to the whims of the application programmer (the browser developer), as far as I understand things. But it's better than not having a standard, and not specifying what to do (so that each registry and application programmer might very likely handle these differently).
(NOTE: Bank1 in Persian and Bank1 in Arabic will look identical, yet each will use a different code point for the digit 1, unless something can be worked out as a standard.)
According to Saudi Arabia's registry [for the domain .sa] [which recommends something like Philippe has suggested]:
". . . both sets should be supported in the user interface and both are folded to one set (Set I) at the preparation of internationalized strings [e.g., "stringprep"] phase."
[But I am confused: how does Saudi Arabia's registry control stringprep?]
On the other hand, tr36 recommends an alert for such confusables, if I understand things [http://www.unicode.org/reports/tr36/proposed.html#Visual_Spoofing_Recommendation].
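To make the Bank1 case concrete, here is a small Python sketch that flags two labels as alert candidates when they differ only in which script's decimal digits they use. It relies on the standard `unicodedata.decimal` function; the helper names and the sample labels are hypothetical. It is only a coarse stand-in for a real confusables check: it treats all decimal digits of equal value as equivalent, even those (such as 4 to 6) that look different in the two Arabic sets.

```python
import unicodedata


def digit_neutral_key(label):
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in label:
        value = unicodedata.decimal(ch, -1)  # -1 means "not a decimal digit"
        out.append(str(value) if value >= 0 else ch)
    return "".join(out)


def differ_only_in_digit_set(a, b):
    """True if two distinct labels become identical once digits are neutralized."""
    return a != b and digit_neutral_key(a) == digit_neutral_key(b)


# Illustrative labels only: the same text followed by the digit one
# from the Arabic-Indic and the Extended Arabic-Indic sets.
label_arabic = "bank\u0661"
label_persian = "bank\u06F1"
print(differ_only_in_digit_set(label_arabic, label_persian))
```

A browser or registry could use such a check to decide when to raise an alert or to reserve the sibling label, but that policy choice is outside what the sketch shows.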
In any case the only other two countries that will be able to register Arabic-language domains at this point, as far as I can tell, are Egypt and the United Arab Emirates (http://www.idnnews.com/?p=9809). However, I do not know whether all three registries will follow the same policy as Saudi Arabia's, or what other countries (Iran, Pakistan, India) will register Arabic-script domains soon. And I do not know what each browser developer will do about confusables that include digits (I checked a little and found various policies).
(Of course, a smart banker would not register a bank1 in the Mideast/Arabic-Indic digit system)
--C. E. Whitehead