RE: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

From: CE Whitehead
Date: Thu Jul 29 2010 - 15:00:01 CDT

    > Date: Thu, 29 Jul 2010 14:57:17 +0200
    > Subject: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)
    > "Martin J. Dürst" wrote:
    >> On 2010/07/29 13:33, karl williamson wrote:
    >>> Asmus Freytag wrote:
    >>>> On 7/25/2010 6:05 PM, Martin J. Dürst wrote:
    >>>>> Well, there actually is such a script, namely Han. The digits (一、
    >>>>> 二、三、四、五、六、七、八、九、〇) are used both as letters and as
    >>>>> decimal place-value digits, and they are scattered widely, and of
    >>>>> course there is a lot of modern living practice.
    >>>> The situation is worse than you indicate, because the same characters
    >>>> are also used as elements in a system that doesn't use place-value,
    >>>> but uses special characters to show powers of 10.
    >> No. Sequences of numeric Kanji are also used in names and word-plays,
    >> and as sequences of individual small numbers.
    > (1) Existing exception :
    > There's one example of a digit which has a numeric type = decimal, AND
    > is encoded in a "scattered" way:
    > 19DA;6618;᧚;New Tai Lue Tham Digit One;Nd;0;L;...;1;1;1;N
    > The other nine decimal digits for the Tham variant of the New Tai Lue
    > digits are borrowed from another sequence of decimal digits, starting
    > at U+19D0 (for digit zero) with the exception of U+19D1 which is
    > replaced (for digit one). Both sets are assigned in the same
    > "New_Tai_Lue" script property value.
    > So the additional stability proposal will not be enforceable.
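The properties quoted in point (1) can be inspected with a short script. This is only a sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration, and the values reported depend on the Unicode version bundled with the interpreter, so they may differ from the 5.2 data quoted in the message.

```python
import unicodedata

def describe(cp):
    """Collect the numeric-related properties of one code point."""
    ch = chr(cp)
    return {
        "code": f"U+{cp:04X}",
        "name": unicodedata.name(ch, "<unassigned>"),
        "category": unicodedata.category(ch),
        # Each lookup returns None when the character lacks that numeric type.
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }

# U+19D0..U+19D9 is the contiguous run of New Tai Lue digits;
# U+19DA is the separately encoded Tham digit one.
rows = [describe(cp) for cp in list(range(0x19D0, 0x19DA)) + [0x19DA]]

for row in rows:
    print(row)
```

Comparing the `category` and `decimal` fields of the last row with the first ten shows whether the installed data still treats U+19DA as a decimal digit.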
    > (2) Arabic digits :
    > Such a case was avoided for the Eastern/Extended variant of the
    > Arabic-Indic digits in U+06F0..U+06F9, which does not borrow the common
    > forms from the Standard variant in U+0660..U+0669: the digits were
    > encoded again separately to create a complete sequence of 10 digits,
    > even though most of them (all except 4 to 6) look identical and belong
    > to the same unified "script".
    > But what is even more "strange" is that the Standard Arabic digits are
    > assigned to the "Common" script, when the Eastern/Extended variant is
    > assigned to the "Arabic" script (look at the Unicode script property
    > value, from the file "Scripts-5.2.0.txt" in the UCD).
    > If you just look at this property, you may think that the
    > Extended/Eastern digits are the standard ones for the Arabic script:
    > this is a side-effect of unification of Western and Eastern variants
    > of the Arabic script.
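To see the two complete digit sequences from point (2) side by side, a small table can be built from the character database. This is a sketch, not part of the original message; `digit_table` is a hypothetical helper. Python's standard library does not expose the Script property, so the "Common" versus "Arabic" assignment discussed above still has to be checked against Scripts.txt directly.

```python
import unicodedata

STANDARD = range(0x0660, 0x066A)  # ARABIC-INDIC DIGIT ZERO..NINE
EXTENDED = range(0x06F0, 0x06FA)  # EXTENDED ARABIC-INDIC DIGIT ZERO..NINE

def digit_table():
    """Pair each Standard digit with its Extended counterpart."""
    table = []
    for std, ext in zip(STANDARD, EXTENDED):
        table.append((
            unicodedata.decimal(chr(std)),   # decimal value of the Standard digit
            f"U+{std:04X}", unicodedata.name(chr(std)),
            unicodedata.decimal(chr(ext)),   # decimal value of the Extended digit
            f"U+{ext:04X}", unicodedata.name(chr(ext)),
        ))
    return table

table = digit_table()
for row in table:
    print(row)
```

Both columns carry the decimal values 0 through 9 in code point order, which is the contiguity that the New Tai Lue Tham digit in point (1) lacks.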
    > (3) Unification of the Arabic script:
    > Ideally, there should be two additional separate ISO 15924 script
    > codes for the Western and Eastern variants of the Arabic script (possibly
    > [Arbs] for Standard/Western, and [Arbx] for Extended/Eastern), and the
    > Unicode "script" property value alias for the Western and Eastern
    > . . .
    > Most Arabic characters should remain in the common "Arabic" script,
    > and those that are differentiated should be assigned in a
    > "Standard_Arabic" or "Extended_Arabic" script. But this may cause some
    > complication for the script inheritance in spans of texts (because the
    > "Arabic" script property value would behave a bit like what the
    > "Common" does for alphabetic scripts, i.e. like a group of scripts).
    > Such a change to the assigned script property value (if it's not
    > already stabilized) would require documentation, and changes in a few
    > other core or derived datafiles:
    > - PropertyValueAliases.txt (adding two new property values for "sc"):
    > sc ; Arab ; Arabic # (all forms; includes "sc=Arbc", "sc=Arbs" and
    > "sc=Arbx" in regexps)
    > sc ; Arbc ; Common_Arabic
    > sc ; Arbs ; Standard_Arabic # (also includes "sc=Arbc" in regexps)
    > sc ; Arbx ; Extended_Arabic # (also includes "sc=Arbc" in regexps)
    > - Scripts.txt (assigning the two new property values to remap existing "Arabic")
    > - ArabicShaping.txt (possibly adding comments at end of lines where
    > this is not the Common Arabic)
    > - Joining-Groups.txt (same remark)
    > - BidiMirroring.txt (same remark)
    > And in the description of some standard script identification and
    > segmentation algorithms. I don't know if IDNA should continue to use
    > "Arab" (all forms) or if it should segregate "Arbs" and "Arbx" (to
    > avoid mixing digits that are visually confusable), as it uses such
    > segmentation (note that these characters are canonically different,
    > for normalization purposes).
    > Philippe.
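The "includes ... in regexps" comments in the proposed alias lines describe a grouping behaviour that can be modelled in a few lines. This is a sketch of that reading only: the values Arbc, Arbs and Arbx are hypothetical (they come from the proposal, not from the UCD), and `INCLUDES` and `matches` are names invented here.

```python
# Hypothetical grouping, read off the comments in the proposed alias lines.
# Of these values, only "Arab" exists in the UCD.
INCLUDES = {
    "Arab": {"Arab", "Arbc", "Arbs", "Arbx"},  # all forms
    "Arbc": {"Arbc"},                          # shared characters only
    "Arbs": {"Arbs", "Arbc"},                  # Standard plus shared
    "Arbx": {"Arbx", "Arbc"},                  # Extended plus shared
}

def matches(query, assigned):
    """Would a regex test for sc=<query> accept a character assigned sc=<assigned>?"""
    # Values outside the table match only themselves.
    return assigned in INCLUDES.get(query, {query})

print(matches("Arbs", "Arbc"))  # shared character accepted by the Standard test
print(matches("Arbs", "Arbx"))  # Extended character rejected by the Standard test
```

Under this reading, "Arab" behaves as a group of scripts, which is the complication for script inheritance mentioned in the message.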
    Hi. Regarding your proposal, I have a security concern for IDNs:

    In the list of allowed Unicode characters, the Eastern set of digits seems to be allowed;
    Saudi Arabia has the other set in its allowed list,
    so I gather both are allowed in IDNs.
    You would then have mixed scripts in IDNs that combine Arab with either Arbx alone or Arbs alone (if those are the names chosen).
    You do not want to display a mixed-script warning for that.
    (That would be tantamount to my security event viewer displaying a login failure in addition to a login success every time I log in successfully; you start to ignore the failure messages.)
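One way to avoid the spurious warning is to flag a label only when both digit sets occur in it together. The following is a rough sketch of that idea, not the IDNA algorithm or any registry's actual policy; `digit_sets` and `should_warn` are hypothetical names.

```python
STANDARD_DIGITS = range(0x0660, 0x066A)  # ARABIC-INDIC DIGIT ZERO..NINE
EXTENDED_DIGITS = range(0x06F0, 0x06FA)  # EXTENDED ARABIC-INDIC DIGIT ZERO..NINE

def digit_sets(label):
    """Return which of the two Arabic digit sets appear in the label."""
    found = set()
    for ch in label:
        cp = ord(ch)
        if cp in STANDARD_DIGITS:
            found.add("standard")
        elif cp in EXTENDED_DIGITS:
            found.add("extended")
    return found

def should_warn(label):
    """Warn only when Standard and Extended digits are mixed in one label."""
    return len(digit_sets(label)) > 1

# Arabic letter MEEM followed by digits from one set only: no warning.
print(should_warn("\u0645\u0661\u0662"))
print(should_warn("\u0645\u06F1\u06F2"))
# One digit from each set in the same label: warning.
print(should_warn("\u0661\u06F2"))
```

A label that combines Arabic letters with a single digit set passes silently, so the warning keeps its meaning for the case that is actually confusable.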
    (I cannot find these digits in the normalization charts. Sorry. I suppose, however, that they do not normalize to one another, because that would destroy sequential processing of them, which is what Karl is looking for, although sequential processing does not apply to IDNs. Too bad they cannot just be normalized in IDNs, and that there cannot be a different standard for IDNs . . . would that be an option? That's kind of a wild idea too.)
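The supposition about normalization can be checked directly. This sketch uses Python's standard `unicodedata.normalize`; the helper `is_normalization_stable` is a name made up here. It tests each digit of both sets under all four normalization forms.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def is_normalization_stable(ch):
    """True if the character is unchanged by every normalization form."""
    return all(unicodedata.normalize(form, ch) == ch for form in FORMS)

# Pair each Standard digit (U+0660..) with its Extended counterpart (U+06F0..).
pairs = [(chr(0x0660 + i), chr(0x06F0 + i)) for i in range(10)]

# Every digit in both sets should survive normalization unchanged ...
stable = all(is_normalization_stable(a) and is_normalization_stable(b)
             for a, b in pairs)
# ... so no Extended digit is folded onto its Standard counterpart.
unified = any(unicodedata.normalize("NFKC", b) == a for a, b in pairs)

print(stable, unified)
```

Neither set has decomposition mappings, so the digits remain distinct after normalization, and any folding for IDNs would have to come from a separate mapping step.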
    C. E. Whitehead
