# Re: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

From: Mark Davis ☕ (mark@macchiato.com)
Date: Thu Jul 29 2010 - 09:51:59 CDT


Mark

*— Il meglio è l’inimico del bene —*

On Thu, Jul 29, 2010 at 05:57, Philippe Verdy <verdy_p@wanadoo.fr> wrote:

> "Martin J. Dürst" <duerst@it.aoyama.ac.jp> wrote:
> >
> > On 2010/07/29 13:33, karl williamson wrote:
> > > Asmus Freytag wrote:
> > >> On 7/25/2010 6:05 PM, Martin J. Dürst wrote:
> >
> > >>> Well, there actually is such a script, namely Han. The digits (一、
> > >>> 二、三、四、五、六、七、八、九、〇) are used both as letters and as
> > >>> decimal place-value digits, and they are scattered widely, and of
> > >>> course there is a lot of modern living practice.
> >
> > >> The situation is worse than you indicate, because the same characters
> > >> are also used as elements in a system that doesn't use place-value,
> > >> but uses special characters to show powers of 10.
> >
> > No. Sequences of numeric Kanji are also used in names and word-plays,
> > and as sequences of individual small numbers.
>
> (1) Existing exception :
>
> There's one example of a digit which has a numeric type = decimal, AND
> is encoded in a "scattered" way:
>
> 19DA;6618;᧚;New Tai Lue Tham Digit One;Nd;0;L;...;1;1;1;N
>
> The other nine decimal digits for the Tham variant of the New Tai Lue
> digits are borrowed from another sequence of decimal digits, starting
> at U+19D0 (for digit zero) with the exception of U+19D1 which is
> replaced (for digit one). Both sets are assigned in the same
> "New_Tai_Lue" script property value.
>
> So the additional stability proposal will not be enforceable.
>

On the contrary. Were we to want such a policy, the implication would be
to either:
(a) change the type of 19DA from Nd to No (which I think would be the right
thing to do), or
(b) grandfather in the character.
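
For anyone who wants to check the contiguity condition against their own copy of the data, here is a minimal Python sketch. It assumes that "contiguous" means every run of adjacent Nd code points consists of complete 0..9 blocks in ascending order; the function names are made up for illustration. The answer depends on the Unicode version bundled with the interpreter (see `unicodedata.unidata_version`): with data in which 19DA is Nd, the run 19D0..19DA has eleven members and fails the check.

```python
import sys
import unicodedata


def decimal_digit_runs():
    """Return lists of consecutive code points whose General_Category is Nd."""
    runs, current = [], []
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) == "Nd":
            # A gap in code point order closes the current run.
            if current and cp != current[-1] + 1:
                runs.append(current)
                current = []
            current.append(cp)
    if current:
        runs.append(current)
    return runs


def is_well_formed(run):
    """True if the run is made only of complete 0..9 blocks in ascending order."""
    # decimal() returns the default (None) for characters without a decimal value.
    values = [unicodedata.decimal(chr(cp), None) for cp in run]
    return len(run) % 10 == 0 and all(v == i % 10 for i, v in enumerate(values))


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for run in decimal_digit_runs():
        if not is_well_formed(run):
            print("irregular run: U+%04X..U+%04X" % (run[0], run[-1]))
```

The modulo test lets longer runs made of several back-to-back digit sets pass, while any run that carries an extra stray digit is reported.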

>
> (2) Arabic digits :
>
> Such a case was avoided for the Eastern/Extended variant of Arabo-Indic
> digits in U+06F0..U+06F9, without borrowing the common forms for the
> Standard variant in U+0660..U+0669: they were reencoded separately to
> create a complete sequence of 10 digits, even if most of them (all
> except 4 to 6) are exactly similar and belong to the same unified
> "script".
>
> But what is even more "strange" is that the Standard Arabic digits are
> assigned to the "Common" script, when the Eastern/Extended variant is
> assigned to the "Arabic" script (look at the Unicode script property
> value, from the file "Scripts-5.2.0.txt" in the UCD).
>
> If you just look at this property, you may think that the
> Extended/Eastern digits are the standard ones for the Arabic script:
> this is a side-effect of unification of Western and Eastern variants
> of the Arabic script.
>

It is not so strange. Read
http://www.unicode.org/reports/tr24/proposed.html#Multiple_Script_Values,
and other parts of #24 describing Common.
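
To see the assignment for a given character, a small lookup over lines in the Scripts file layout ("range ; value # comment") is enough. The sketch below is only an illustration: the sample lines mirror the assignments described in this thread and are not copied verbatim from Scripts-5.2.0.txt, and `parse_ranges` and `script_of` are hypothetical helper names.

```python
# Illustrative lines in the "range ; value # comment" layout of the UCD
# Scripts file. The values mirror the assignments described in this thread;
# they are an assumption, not a verbatim excerpt of Scripts-5.2.0.txt.
SAMPLE = """\
0030..0039    ; Common # Nd  [10] DIGIT ZERO..DIGIT NINE
0660..0669    ; Common # Nd  [10] ARABIC-INDIC DIGIT ZERO..ARABIC-INDIC DIGIT NINE
06F0..06F9    ; Arabic # Nd  [10] EXTENDED ARABIC-INDIC DIGIT ZERO..EXTENDED ARABIC-INDIC DIGIT NINE
"""


def parse_ranges(text):
    """Parse 'XXXX..YYYY ; Value # comment' lines into (first, last, value) tuples."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        span, value = (field.strip() for field in data.split(";"))
        first, _, last = span.partition("..")
        # A single code point has no "..", so it is its own range end.
        ranges.append((int(first, 16), int(last or first, 16), value))
    return sorted(ranges)


def script_of(ch, ranges, default="Unknown"):
    """Return the script value covering ch, or the default if no range matches."""
    cp = ord(ch)
    for first, last, value in ranges:
        if first <= cp <= last:
            return value
    return default


if __name__ == "__main__":
    ranges = parse_ranges(SAMPLE)
    for ch in "3\u0663\u06f3":
        print("U+%04X" % ord(ch), script_of(ch, ranges))
```

Feeding the function the full file instead of `SAMPLE` should give the real assignments for whichever UCD version you have on hand.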

>
>
> (3) Unification of the Arabic script:
>
> Ideally, there should be two additional separate ISO 15924 script
> codes for the Western and Eastern variants of the Arabic script (possibly
> [Arbs] for Standard/Western, and [Arbx] for Extended/Eastern), and the
> Unicode "script" property value alias for the Western and Eastern
> digits or letters should be segregated, using a separate Script
> property value (splitting the Arabic script, where it is significant,
> just like it occurred for Georgian and Greek/Coptic alphabets).
>

There is no likelihood of that happening, simply for the sake of these
digits.

The original characters were just font variants; they were really split to a
large extent because of the UBA (which I think in retrospect was a mistake,
but c'est la vie, n'est-ce pas?).

> Nothing will be changed for the existing Arabic script, but the
> "Extended/Eastern Arabic" script (assigned with a new ISO 15924 code
> and mapped with a new property alias in Unicode), will still borrow
> most of its letters from the standard script without reencoding them.
>
> No character or block will be renamed (and I DO NOT propose
> disunifying existing common Arabic letters, or assigning them in the
> "Common" script), it should just be a better sub-classification, where
> the characters are clearly distinguished between the two variants.
>
> Most Arabic characters should remain in the common "Arabic" script,
> and those that are differentiated should be assigned in a
> "Standard_Arabic" or "Extended_Arabic" script. But this may cause some
> complication for the script inheritance in spans of texts (because the
> "Arabic" script property value would behave a bit like what the
> "Common" does for alphabetic scripts, i.e. like a group of scripts).
>
> Such a change for the assigned script property value (if it's not
> already stabilized) would require documentation, and changes in a few
> other core or derived datafiles:
>
> - PropertyValueAliases.txt (adding two new property values for "sc"):
> sc ; Arab ; Arabic # (all forms; includes "sc=Arbc", "sc=Arbs" and
>                    # "sc=Arbx" in regexps)
> sc ; Arbc ; Common_Arabic
> sc ; Arbs ; Standard_Arabic # (also includes "sc=Arbc" in regexps)
> sc ; Arbx ; Extended_Arabic # (also includes "sc=Arbc" in regexps)
>
> - Script.txt (assigning the two new property values to remap existing
> "Arabic" where this is not the Common Arabic)
> - Joining-Groups.txt (same remark)
> - Bidi-Mirroring.txt (same remark)
>
> And in the description of some standard script identification and
> segmentation algorithms. I don't know if IDNA should continue to use
> "Arab" (all forms) or if it should segregate "Arbs" and "Arbx" (to
> avoid mixing digits that are visually confusable), as it uses such
> segmentation (note that these characters are canonically different,
> for normalization purposes).
>
> Philippe.
>
>
>
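
On the IDNA point about visually confusable digits: a check for labels that mix the two digit sets needs no script property at all, only the two code point ranges quoted above. The following is a rough Python sketch under that assumption; the function names are invented for illustration and this is not taken from any IDNA specification text.

```python
# The two digit ranges named in the message above.
ARABIC_INDIC = range(0x0660, 0x066A)           # U+0660..U+0669 (standard)
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0..U+06F9 (extended/eastern)


def digit_sets_used(label):
    """Return which of the two Arabic digit sets occur in the label."""
    found = set()
    for ch in label:
        cp = ord(ch)
        if cp in ARABIC_INDIC:
            found.add("standard")
        elif cp in EXTENDED_ARABIC_INDIC:
            found.add("extended")
    return found


def mixes_digit_sets(label):
    """True if the label contains digits from both sets."""
    return len(digit_sets_used(label)) > 1


if __name__ == "__main__":
    for label in ("\u0661\u0662", "\u06f1\u06f2", "\u0661\u06f2"):
        print(ascii(label), "mixed" if mixes_digit_sets(label) else "ok")
```

A check of this kind works whether or not the script property is ever split, since it relies on code point ranges alone.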
