RE: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

From: CE Whitehead (cewcathar@hotmail.com)
Date: Sun Aug 01 2010 - 20:40:22 CDT

    > Date: Fri, 30 Jul 2010 06:04:26 +0200
    > From: verdy_p@wanadoo.fr
    > To: public@khwilliamson.com; mark@macchiato.com
    > CC: duerst@it.aoyama.ac.jp; asmusf@ix.netcom.com; kent.karlsson14@telia.com; unicode@unicode.org
    >
    > For Arabic there are clearly two separate sets of digits, but the
    > possibility of mixing them arbitrarily is still a problem for IDNA (if
    > both sets are accepted)
    Both are accepted (according to online information and to Martin Duerst, as I understand it).
    > , notably because most digits (except 4 to 6)
    > are completely identical. So registries will have to:
    > - either accept one set and reject the other one
    > - accept both, but only one within the same domain label, reserving
    > also the label using the other set (as if they were canonically
    > equivalent).
    The Saudi registry is folding these; that is, the .sa registry somehow plans to have that done. I did not realize a registry could control folding, so perhaps it is only recommending folding.
    >
    > Such equivalences (which are definitely not canonical)
    Yes.
    > can be handled
    > by tailored collation compares (operating at collation level 2 only,
    > when non-IDN registries operate only at level 1),
    So you are proposing something like folding these in stringprep? But I am confused about why level 1 would not work (sorry to ask a dumb question).
    > where IDN registries
    > will use their own tailoring. I just see the IDN "StringPrep" as a
    > particular application of the general concept of collation mappings
    > (except that it was not designed on linguistic bases, but an IDN
    > registry can be viewed as a locale for collation purposes).
    The Saudi registry's policy, it seems, is to accept both digit sets and then fold them into the non-Eastern set (both sets are apparently available on the Saudi keyboard -- or should be). But other registries will also handle Arabic-script domains.
    > All these
    > complex rules and mappings of IDN can be written in terms of a set
    > of collation rules, added on top of the DUCET.
    >

    OK -- that is a possibility. One can add these on top of the DUCET, but collation is always tailorable at the whim of the application programmer (here, the browser developer), as far as I understand it. Still, that is better than having no standard and not specifying what to do, in which case each registry and application programmer would very likely handle these differently.

    (NOTE: Bank1 in Persian and Bank1 in Arabic will look identical, even though a different code point for the digit 1 is used in each case -- unless something can be worked out as a standard.

    According to Saudi Arabia's registry for the .sa domain (which recommends something like what Philippe has suggested):
    ". . . both sets should be supported in the user interface and both are folded to one set (Set I) at the preparation of internationalized strings [e.g., "stringprep"] phase."
    [But I am confused: how does Saudi Arabia's registry control stringprep?

    See: http://www.iana.org/domains/idn-tables/tables/sa_ar_1.0.html]

    On the other hand, UTR #36 recommends an alert for such confusables, if I understand it correctly [http://www.unicode.org/reports/tr36/proposed.html#Visual_Spoofing_Recommendation].)
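To make the folding idea concrete, here is a minimal Python sketch of what a registry or client might do. It is only an illustration under stated assumptions, not the .sa registry's actual procedure: it assumes that the target set is the Arabic-Indic digits U+0660..U+0669 (the registry's table defines which set "Set I" really is), and the function names fold_digits, digit_sets, and same_after_folding are made up for this example.

```python
import unicodedata

# The two Arabic-script digit sets discussed in this thread.
ARABIC_INDIC = range(0x0660, 0x066A)           # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0..U+06F9 (Persian, Urdu)


def digit_sets(label):
    """Return the names of the digit sets that occur in a label."""
    found = set()
    for ch in label:
        cp = ord(ch)
        if cp in ARABIC_INDIC:
            found.add("arabic-indic")
        elif cp in EXTENDED_ARABIC_INDIC:
            found.add("extended-arabic-indic")
        elif "0" <= ch <= "9":
            found.add("ascii")
    return found


def fold_digits(label):
    """Fold Extended Arabic-Indic digits to Arabic-Indic digits.

    Assumption: the fold target is U+0660..U+0669; a real registry
    would take the target set from its own published table.
    """
    out = []
    for ch in label:
        if ord(ch) in EXTENDED_ARABIC_INDIC:
            # unicodedata.digit() gives the numeric value 0..9 of the digit.
            out.append(chr(0x0660 + unicodedata.digit(ch)))
        else:
            out.append(ch)
    return "".join(out)


def same_after_folding(a, b):
    """True if two labels are equivalent once digits are folded."""
    return fold_digits(a) == fold_digits(b)
```

With this sketch, "bank" followed by U+0661 and "bank" followed by U+06F1 are different strings but compare equal after folding, so a registry could treat one as reserved when the other is registered. A label for which digit_sets() returns more than one Arabic-script set mixes the two sets and could be rejected or flagged.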

    In any case, the only other two countries that will be able to register Arabic-language domains at this point, as far as I can tell, are Egypt and the United Arab Emirates
    (see: http://www.itp.net/580094-gulf-countries-can-now-register-arabic-domains
    and http://www.idnnews.com/?p=9809). However, I do not know whether the other two registries' policies will match Saudi Arabia's, or which other countries (Iran, Pakistan, India) will register Arabic-script domains soon. Nor do I know what each browser developer will do about confusables that include digits (I checked a little and found various policies).

    (Of course, a smart banker would not register a "bank1" name in the Mideast/Arabic-Indic digit system.)

    Best,

    --C. E. Whitehead
    cewcathar@hotmail.com


