Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jul 29 2010 - 07:57:17 CDT

    "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:
    > On 2010/07/29 13:33, karl williamson wrote:
    > > Asmus Freytag wrote:
    > >> On 7/25/2010 6:05 PM, Martin J. Dürst wrote:
    > >>> Well, there actually is such a script, namely Han. The digits (一、
    > >>> 二、三、四、五、六、七、八、九、〇) are used both as letters and as
    > >>> decimal place-value digits, and they are scattered widely, and of
    > >>> course there is a lot of modern living practice.
    > >> The situation is worse than you indicate, because the same characters
    > >> are also used as elements in a system that doesn't use place-value,
    > >> but uses special characters to show powers of 10.
    > No. Sequences of numeric Kanji are also used in names and word-plays,
    > and as sequences of individual small numbers.

      (1) Existing exception:

    There's one example of a digit which has a numeric type = decimal, AND
    is encoded in a "scattered" way:

    19DA;6618;᧚;New Tai Lue Tham Digit One;Nd;0;L;...;1;1;1;N

    The other nine decimal digits of the Tham variant of the New Tai Lue
    digits are borrowed from another sequence of decimal digits, the one
    starting at U+19D0 (digit zero); only U+19D1 (digit one) is replaced.
    Both sets are assigned the same "New_Tai_Lue" script property value.

    So the additional stability proposal will not be enforceable.
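
    The layout described above can be inspected with a short script. This
    is only a sketch using Python's standard unicodedata module; the helper
    name describe_digits is made up for illustration, and the general
    category reported for U+19DA depends on the UCD version bundled with
    the interpreter, so it may differ from the 5.2 data quoted above.

```python
import unicodedata

def describe_digits(first, last):
    """Return (code point, name, general category, digit value) per character.

    The digit value is None when the character has no digit value in the
    UCD version that ships with this Python build.
    """
    rows = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        rows.append((
            f"U+{cp:04X}",
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.category(ch),
            unicodedata.digit(ch, None),
        ))
    return rows

# U+19D0..U+19D9 is the contiguous run of ten digits;
# U+19DA is the isolated Tham digit one that follows it.
rows = describe_digits(0x19D0, 0x19DA)
for row in rows:
    print(*row, sep="  ")
```

    The last row carries the same digit value as the second one (U+19D1),
    which is the "scattered" duplicate discussed here.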

      (2) Arabic digits:

    Such a case was avoided for the Eastern/Extended variant of the
    Arabic-Indic digits in U+06F0..U+06F9: instead of borrowing the common
    forms from the Standard variant in U+0660..U+0669, they were reencoded
    separately to create a complete sequence of 10 digits, even though most
    of them (all except 4 to 6) look exactly the same and belong to the same
    unified script.

    What is even more "strange" is that the Standard Arabic digits are
    assigned to the "Common" script, whereas the Eastern/Extended variant is
    assigned to the "Arabic" script (look at the Unicode script property
    value in the file "Scripts-5.2.0.txt" in the UCD).

    If you just look at this property, you may think that the
    Extended/Eastern digits are the standard ones for the Arabic script:
    this is a side-effect of unification of Western and Eastern variants
    of the Arabic script.
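
    As a rough illustration of the two complete digit sequences, the
    following sketch pairs them up by value and checks that compatibility
    normalization does not fold one set into the other. It uses only the
    standard unicodedata module (which does not expose the Script property,
    so the script assignments themselves are not checked here); the names
    STANDARD, EXTENDED and value_pairs are illustrative.

```python
import unicodedata

# ARABIC-INDIC DIGIT ZERO..NINE and EXTENDED ARABIC-INDIC DIGIT ZERO..NINE
STANDARD = [chr(cp) for cp in range(0x0660, 0x066A)]
EXTENDED = [chr(cp) for cp in range(0x06F0, 0x06FA)]

def value_pairs():
    """Pair each standard digit with the extended digit at the same offset."""
    return [
        (ord(s), unicodedata.decimal(s), ord(e), unicodedata.decimal(e))
        for s, e in zip(STANDARD, EXTENDED)
    ]

pairs = value_pairs()
for s_cp, s_val, e_cp, e_val in pairs:
    print(f"U+{s_cp:04X} = {s_val}    U+{e_cp:04X} = {e_val}")

# Even NFKC keeps the two sets apart: the digits stay distinct code points.
kept_distinct = all(
    unicodedata.normalize("NFKC", s) != unicodedata.normalize("NFKC", e)
    for s, e in zip(STANDARD, EXTENDED)
)
print("distinct after NFKC:", kept_distinct)
```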

      (3) Unification of the Arabic script:

    Ideally, there should be two additional separate ISO 15924 script
    codes for the Western and Eastern variants of the Arabic script
    (possibly [Arbs] for Standard/Western, and [Arbx] for Extended/Eastern),
    and the Western and Eastern digits or letters should be segregated in
    Unicode by giving each variant its own Script property value (splitting
    the Arabic script where it is significant, just as occurred for the
    Georgian and Greek/Coptic alphabets).

    Nothing would change for the existing Arabic script, but the
    "Extended/Eastern Arabic" script (assigned a new ISO 15924 code and
    mapped to a new property alias in Unicode) would still borrow most of
    its letters from the standard script without reencoding them.

    No character or block would be renamed (and I DO NOT propose to
    disunify existing common Arabic letters, or to assign them to the
    "Common" script); it would just be a better sub-classification, where
    the characters are clearly distinguished between the two variants.

    Most Arabic characters should remain in the common "Arabic" script,
    and those that are differentiated should be assigned to a
    "Standard_Arabic" or "Extended_Arabic" script. But this may complicate
    script inheritance in spans of text (because the "Arabic" script
    property value would behave a bit like what "Common" does for
    alphabetic scripts, i.e. like a group of scripts).

    Such a change to the assigned script property value (if it's not
    already stabilized) would require documentation, and changes in a few
    other core or derived data files:

    - PropertyValueAliases.txt (adding two new property values for "sc"):
    sc ; Arab ; Arabic # (all forms; includes "sc=Arbc", "sc=Arbs" and
    "sc=Arbx" in regexps)
    sc ; Arbc ; Common_Arabic
    sc ; Arbs ; Standard_Arabic # (also includes "sc=Arbc" in regexps)
    sc ; Arbx ; Extended_Arabic # (also includes "sc=Arbc" in regexps)

    - Scripts.txt (assigning the two new property values to remap existing "Arabic")
    - ArabicShaping.txt (possibly adding comments at end of lines where
    this is not the Common Arabic)
    - Joining-Groups.txt (same remark)
    - BidiMirroring.txt (same remark)

    The descriptions of some standard script identification and
    segmentation algorithms would need changes as well. I don't know
    whether IDNA should continue to use "Arab" (all forms) or should
    segregate "Arbs" and "Arbx" (to avoid mixing digits that are visually
    confusable), since it uses such segmentation (note that these
    characters are canonically distinct for normalization purposes).
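
    To make the proposed inclusion rules concrete, here is a minimal
    sketch. Everything in it is hypothetical: the values "Arbc", "Arbs" and
    "Arbx" do not exist in the UCD, the function names are invented, and
    the code point ranges are a deliberately tiny sample (the two digit
    runs plus the basic letters U+0621..U+063A treated as shared).

```python
# Hypothetical inclusion table, following the aliases proposed above:
# a query for one value also matches the values listed with it.
INCLUDES = {
    "Arab": {"Arbc", "Arbs", "Arbx"},  # all forms
    "Arbc": {"Arbc"},
    "Arbs": {"Arbs", "Arbc"},
    "Arbx": {"Arbx", "Arbc"},
}

def proposed_script(ch):
    """Return the hypothetical script value of ch, or None if not covered."""
    cp = ord(ch)
    if 0x0660 <= cp <= 0x0669:
        return "Arbs"   # standard digits (assumed assignment)
    if 0x06F0 <= cp <= 0x06F9:
        return "Arbx"   # extended digits (assumed assignment)
    if 0x0621 <= cp <= 0x063A:
        return "Arbc"   # simplification: basic letters treated as shared
    return None

def matches(ch, query):
    """Would ch match a regexp-style query such as sc=Arbs?"""
    return proposed_script(ch) in INCLUDES[query]

def mixes_variants(text):
    """True if text contains both standard-only and extended-only characters."""
    seen = {proposed_script(ch) for ch in text}
    return "Arbs" in seen and "Arbx" in seen

print(matches("\u0628", "Arbx"))         # shared letter BEH against sc=Arbx
print(mixes_variants("\u0663\u06F3"))    # standard three next to extended three
```

    A check like mixes_variants is the kind of test an IDNA-style
    segmentation could apply if the two variants were segregated.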

