L2/13-039R

To: UTC

From: Mark Davis

Re: Script and Script Extension property principles

Live doc: https://docs.google.com/document/d/1-bufL5J3tTzzlozlWFok_3_S0qzTeJaJXdtTxVPJ77I/pub 


This is in response to action 133-A044 “Provide a proposal for the January 2013 UTC meeting for a principle for how to allow characters to have explicit script values and multiple script extension values; and suggested changes to the text and properties to accord with that.” (referenced doc)

At the bottom of this document is a comparison of the current (U6.2) Script property and Script Extensions property values, where they are not identical, followed by a list of the affected characters.

Textual Changes to UAX #24

Here are suggested changes to the text, following Ken’s suggestion of making this an exception rather than a “principle”.

2.1 Special vs. Explicit Script Property Values

...

OLD

If a character is only regularly used with a single script, then it is given that specific Script property value (as opposed to Common or Inherited). This facilitates the use of the script property for common tasks such as regular expressions, but it also means that some characters that are definite members of a given script, based on their forms and history, nevertheless are assigned one of the generic values. As more data on the usage of individual characters is collected, the Script property value assigned to a character may change. Rarely would a character change from one specific script to another. However, if it becomes established that a character is regularly used with more than one script, it will be assigned the Common or Inherited Script property value. Similarly, if it becomes established that a character is regularly used with only a single, specific script, it will be assigned a specific Script property value. The occasional use of character from one script in the context of another script, as for instance the citation of a Greek letter used as a mathematical constant in the midst of Latin text, or the use of a Latin letter in the midst of Han text, is not considered sufficient evidence of "regular use" requiring a designation of Common Script property value. It is also possible for a character, once given a Common or Inherited Script property value, upon further research, to be changed to a specific script, instead.

NEW

If a character is only regularly used with a single script, then it is given that specific Script property value (as opposed to Common or Inherited). In a few instances, characters known to be used with more than one script, but which are overwhelmingly associated with and used with a single script, also take the Script property value of that script. The assignment of a single script facilitates the use of the script property for common tasks such as regular expressions, but it also means that some characters that are definite members of a given script, based on their forms and history, nevertheless are assigned one of the generic values.

As more data on the usage of individual characters is collected, the Script property value assigned to a character may change. Rarely would a character change from one specific script to another. However, if it becomes established that a character is regularly used with more than one script, it may be assigned the Common or Inherited Script property value. Similarly, if it becomes established that a character is regularly used with only a single, specific script, it may be assigned a specific Script property value.

The occasional use of character from one script in the context of another script, as for instance the citation of a Greek letter used as a mathematical constant in the midst of Latin text, or the use of a Latin letter in the midst of Han text, is not considered sufficient evidence of "regular use" requiring a designation of Common Script property value. It is also possible for a character, once given a Common or Inherited Script property value, upon further research, to be changed to a specific script, instead.

2.9 Script_Extensions Property

(add just before “The Script_Extensions property values are given in the file ScriptExtensions.txt in the Unicode Character Database [UCD].”)

However, there are some invariants that can be depended on:

  1. The Script Extensions property value for a character will never contain Common or Inherited, unless that value is the only item and is identical to the Script property value for that character.
     For example, ScriptExtensions={Common, Arab} will not occur.
  2. If the Script property value is explicit, then the Script Extensions property value will include it.
     For example, Script=Arab & ScriptExtensions={Latn, Deva} will not occur.
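The two invariants above can be expressed as a small check. The following is only a sketch for illustration: the function name `satisfies_invariants` and the representation of property values as plain strings (in the same mixed style as the examples above, such as "Common" and "Arab") are assumptions, not part of UAX #24 or the UCD.

```python
# Generic (non-explicit) Script property values, spelled as in the examples above.
GENERIC = {"Common", "Inherited"}


def satisfies_invariants(script, script_extensions):
    """Return True if the (Script, Script Extensions) pair obeys both invariants.

    Hypothetical helper: `script` is a string, `script_extensions` any
    iterable of strings.
    """
    scx = set(script_extensions)

    # Invariant 1: Common or Inherited may appear in Script Extensions only
    # as the sole item, and only if it equals the Script property value.
    if scx & GENERIC and scx != {script}:
        return False

    # Invariant 2: an explicit Script value must be among the Script Extensions.
    if script not in GENERIC and script not in scx:
        return False

    return True


# The two counterexamples given in the text are rejected.
print(satisfies_invariants("Common", {"Common", "Arab"}))  # False
print(satisfies_invariants("Arab", {"Latn", "Deva"}))      # False
```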

A character could have any of the following combinations of properties:

  1. Script=Arab; ScriptExtensions={Arab}
     Meaning: the character is only regularly used with Arabic.
  2. Script=Arab; ScriptExtensions={Arab, Thaa}
     Meaning: the character is regularly used with Arabic, but is occasionally also used with Thaana. The Script property value is just Arab, because the overwhelming use is with Arabic script characters.
  3. Script=Common; ScriptExtensions={Arab, Deva}
     Meaning: the character is regularly used with Arabic and with Devanagari.
  4. Script=Common; ScriptExtensions={Common}
     Meaning: the character is regularly used with many scripts; it is not primarily used with some single script or subset of scripts.
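As an illustration of how a consumer might read these four combinations, here is a minimal sketch. The function name `describe_usage` and its output wording are hypothetical, and the input is assumed to already satisfy the invariants stated earlier in this section.

```python
# Generic (non-explicit) Script property values, spelled as in the list above.
GENERIC = {"Common", "Inherited"}


def describe_usage(script, script_extensions):
    """Map a (Script, Script Extensions) pair to one of the four readings above.

    Hypothetical helper for illustration only; assumes valid input.
    """
    scx = set(script_extensions)

    if script not in GENERIC:
        # Explicit Script value: any other entries are occasional uses.
        others = sorted(scx - {script})
        if not others:
            return f"only regularly used with {script}"
        return f"regularly used with {script}, occasionally with {', '.join(others)}"

    if scx == {script}:
        # Generic Script value and no narrower set of scripts is known.
        return "regularly used with many scripts"

    # Generic Script value, but usage is limited to a known subset of scripts.
    return "regularly used with " + " and ".join(sorted(scx))


for sc, scx in [("Arab", {"Arab"}), ("Arab", {"Arab", "Thaa"}),
                ("Common", {"Arab", "Deva"}), ("Common", {"Common"})]:
    print(sc, sorted(scx), "->", describe_usage(sc, scx))
```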

Minor editorial fix

I found the following while looking at the text. Although we define “explicit” in the following, we don’t always use it consistently. We should search for “specific” and change if necessary.

“All other Script property values are referred to as explicit script values, because they each refer to one specific script.”

Property Changes

In accordance with the first change above, we’d make the following property changes:

1. Change to Script=Arabic

from: Script=Common        SCX={Arabic Mandaic Syriac}

U+0640 ( ‎ـ‎ ) ARABIC TATWEEL

from: Script=Common        SCX={Arabic Syriac Thaana}

U+060C ( ، ) ARABIC COMMA

U+061B ( ‎؛‎ ) ARABIC SEMICOLON

U+061F ( ‎؟‎ ) ARABIC QUESTION MARK

from: Script=Common        SCX={Arabic Thaana}

U+0660 ( ٠ ) ARABIC-INDIC DIGIT ZERO

...

U+0669 ( ٩ ) ARABIC-INDIC DIGIT NINE

U+FDFD ( ﷽ ) ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM

from: Script=Inherited        SCX={Arabic Syriac}

U+064B ( ً  ) ARABIC FATHATAN

...

U+0655 ( ٕ  ) ARABIC HAMZA BELOW

U+0670 ( ٰ ) ARABIC LETTER SUPERSCRIPT ALEF

2. Change to Script=Latn, SCX={Latn}

from: Script=Inherited, SCX={Inherited}

U+0363 ( ͣ ) COMBINING LATIN SMALL LETTER A

…

U+036F ( ͯ ) COMBINING LATIN SMALL LETTER X

U+1DD3 ( ᷓ ) COMBINING LATIN SMALL LETTER FLATTENED OPEN A ABOVE

...

U+1DE6 ( ᷦ ) COMBINING LATIN SMALL LETTER Z

from: Script=Common, SCX={Common}

U+1DCA ( ᷊ ) COMBINING LATIN SMALL LETTER R BELOW

3. Change to Script=Greek

from: Script=Inherited, SCX={Grek}

U+0342 (  ͂) COMBINING GREEK PERISPOMENI

U+0345 (  ͅ) COMBINING GREEK YPOGEGRAMMENI


Original values in U6.2

SC                SCX                        Chars

Arabic                Arabic Thaana                [ﷲ]

Common        Arabic Mandaic Syriac        [ـ]

Common        Arabic Syriac Thaana        [،؛؟]

Common        Arabic Thaana                [٠-٩﷽]

Inherited        Arabic Syriac                [ً-ٰٕ]

Common        Armenian Georgian        [։]

Inherited        Greek                        [͂᷀᷁ͅ]

Inherited        Latin                        [ͣ-ͯ]

Inherited        Cyrillic Latin                [҅҆]

Inherited        Devanagari Latin                [॒॑]

Common        Devanagari                [᳡ᳲᳳ]

Inherited        Devanagari                [᳐-᳔᳒-᳢᳠-᳨᳭᳴]

Common        Bengali Devanagari Gurmukhi Oriya Takri        [।॥]

Common        Devanagari Gujarati Gurmukhi Kaithi Takri        [꠰-꠹]

Inherited        Bopomofo Han                [〪-〭]

Common        Bopomofo Hangul Han Hiragana Katakana        [〃〓〜-〟〰〷〾〿㇀-㇣㈠-㉃㊀-㊰㋀-㋋㍘-㍰㍻-㍿㏠-㏾﹅﹆]

Common        Bopomofo Hangul Han Hiragana Katakana Yi        [、。〈-】〔-〛・。-・]

Common        Han Hiragana Katakana        [〆〼〽㆐-㆟]

Common        Hiragana Katakana        [〱-〵゛゜゠ーー゙゚]

Inherited        Hiragana Katakana        [゙゚]

Common        Mongolian Phags_Pa        [᠂᠃᠅]

Common        Cypriot Linear_B                [𐄀-𐄂𐄇-𐄳𐄷-𐄿]

Common        Buhid Hanunoo Tagbanwa Tagalog        [᜵᜶]