RE: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

From: CE Whitehead
Date: Sun Aug 01 2010 - 20:40:22 CDT

    > Date: Fri, 30 Jul 2010 06:04:26 +0200
    > For Arabic there are clearly two separate sets of digits, but the
    > possibility of mixing them arbitrarily is still a problem for IDNA (if
    > both sets are accepted)
    Both are accepted (according to online information and Martin Duerst, as I understand it).
    > , notably because most digits (except 4 to 6)
    > are completely identical. So registries will have to:
    > - either accept one set and reject the other one
    > - accept both, but only one within the same domain label, reserving
    > also the label using the other set (as if they were canonically
    > equivalent).
    The Saudi registry is folding these; that is, the .sa registry plans to have that done. I did not realize a registry could control folding, though; perhaps it is only recommending folding.
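
    To make the second option above concrete, here is a minimal sketch of a registry check that allows only one digit set per label and reserves the twin label written in the other set. The function names (digit_sets_used, variant_label, register) and the in-memory "taken" set are hypothetical, and only the Arabic-Indic (U+0660..U+0669) and Extended Arabic-Indic (U+06F0..U+06F9) digits are considered; it does not describe what any real registry does.

```python
# Code point ranges for the two digit sets discussed in this thread.
ARABIC_INDIC = range(0x0660, 0x066A)           # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0..U+06F9
OFFSET = 0x06F0 - 0x0660                       # distance between the two blocks


def digit_sets_used(label):
    """Return the names of the digit sets that occur in the label."""
    found = set()
    for ch in label:
        cp = ord(ch)
        if cp in ARABIC_INDIC:
            found.add("arabic-indic")
        elif cp in EXTENDED_ARABIC_INDIC:
            found.add("extended-arabic-indic")
    return found


def variant_label(label):
    """Return the label with every digit swapped into the other set."""
    out = []
    for ch in label:
        cp = ord(ch)
        if cp in ARABIC_INDIC:
            out.append(chr(cp + OFFSET))
        elif cp in EXTENDED_ARABIC_INDIC:
            out.append(chr(cp - OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def register(label, taken):
    """Hypothetical policy: reject mixed-set labels, reserve the twin label."""
    if len(digit_sets_used(label)) > 1:
        return False  # both digit sets appear in one label
    if label in taken:
        return False  # already registered or reserved as a variant
    taken.add(label)
    taken.add(variant_label(label))
    return True
```

    With this sketch, registering "bank" + U+0661 also blocks "bank" + U+06F1, which is the "as if they were canonically equivalent" behaviour described in the quoted text.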
    > Such equivalences (which are definitely not canonical)
    > can be handled
    > by tailored collation compares (operating at collation level 2 only,
    > when non-IDN registries operate only at level 1),
    So you are proposing something like folding these in stringprep? But I am confused about why level 1 would not work (sorry if this is a dumb question).
    > where IDN registries
    > will use their own tailoring. I just see the IDN "StringPrep" as a
    > particular application of the general concept of collation mappings
    > (except that it was not designed on linguistic bases, but an IDN
    > registry can be viewed as a locale for collation purposes).
    The Saudi registry's policy, it seems, is to accept both digit sets and then fold the two varieties into the non-Eastern variety (both varieties are apparently available on the Saudi keyboard, or should be). But there are other registries that will handle Arabic-script domains.
    > All these
    > complex rules and mappings of IDN can be written in terms of a set
    > of collation rules, added on top of the DUCET.

    O.k., that is a possibility. One can add these on top of the DUCET, but collation is always tailorable according to the whims of the application programmer (the browser developer), as far as I understand things. Still, it is better than having no standard and not specifying what to do (in which case each registry and application programmer would very likely handle these differently).

    (NOTE: Bank1 in Persian and Bank1 in Arabic will look identical, except that a different digit 1 will be used in each case, unless something can be worked out as a standard.)
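
    A small illustration of the Bank1 case, using only Python's standard unicodedata module. The two label strings and the helper names are made up for the example. It shows that the two labels are different code point sequences, that NFKC normalization does not merge them (so the equivalence really is not canonical or compatibility equivalence), and that both digits nonetheless carry the same decimal value.

```python
import unicodedata

# Hypothetical labels: same letters, different "1".
arabic_label = "bank\u0661"   # ARABIC-INDIC DIGIT ONE
persian_label = "bank\u06F1"  # EXTENDED ARABIC-INDIC DIGIT ONE


def same_after_nfkc(a, b):
    """True if the two strings become identical under NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


def digit_values(label):
    """Return the decimal value of every decimal digit in the label."""
    return [unicodedata.decimal(ch) for ch in label if ch.isdecimal()]


if __name__ == "__main__":
    print(arabic_label == persian_label)                 # raw comparison
    print(same_after_nfkc(arabic_label, persian_label))  # normalization alone
    print(digit_values(arabic_label), digit_values(persian_label))
```

    So any folding between the two sets has to come from an explicit mapping (stringprep tables, a registry policy, or a tailoring), not from normalization.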

    According to Saudi Arabia's registry [for the domain .sa] [which recommends something like what Philippe has suggested]:
    ". . . both sets should be supported in the user interface and both are folded to one set (Set I) at the preparation of internationalized strings [e.g., "stringprep"] phase."
    [But I am confused: how does Saudi Arabia's registry control stringprep?]
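
    For what "folding both sets to one set" could look like mechanically, here is a rough sketch using str.translate. The quoted passage does not say which code points "Set I" contains, so both mapping tables below are assumptions: one folds the Extended Arabic-Indic digits onto the Arabic-Indic ones, the other folds both onto ASCII digits. The names fold_digits, FOLD_TO_ARABIC_INDIC and FOLD_TO_ASCII are hypothetical, and this is not the actual stringprep profile.

```python
# Assumption: fold U+06F0..U+06F9 onto U+0660..U+0669.
FOLD_TO_ARABIC_INDIC = {0x06F0 + i: 0x0660 + i for i in range(10)}

# Alternative assumption: fold both sets onto ASCII 0-9.
FOLD_TO_ASCII = {
    base + i: 0x30 + i
    for base in (0x0660, 0x06F0)
    for i in range(10)
}


def fold_digits(label, table=FOLD_TO_ARABIC_INDIC):
    """Map every digit listed in the table to its folded counterpart."""
    return label.translate(table)


if __name__ == "__main__":
    # Both spellings of the label end up as the same string after folding.
    a = fold_digits("bank\u0661")
    b = fold_digits("bank\u06F1")
    print(a == b)
```

    Wherever such a step runs (registry, resolver library, or browser), the point is the same: both spellings must end up as one lookup key.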


    On the other hand, tr36 recommends an alert for such confusables, if I understand it correctly.

    In any case, the only other two countries that will be able to register Arabic-language domains at this point, as far as I can tell, are Egypt and the United Arab Emirates.
    However, I do not know whether their policies will be the same as Saudi Arabia's, or which other countries (Iran, Pakistan, India) will register Arabic-script domains soon. Nor do I know what each browser developer will do about confusables that include digits (I checked a little and found various policies).

    (Of course, a smart banker would not register a bank1 in the Mideast/Arabic-Indic digit system)


    --C. E. Whitehead
