L2/11-406

Re: Script Extensions as a Unicode Property

To: UTC

From: Mark Davis

Date: 2011-11-25

Script extensions currently exist only as a data file, not as a formal property, which makes them clumsier to cite and use. See, for example, the feedback from Karl Williamson on UTS #18, April 30.

We already have many multivalued Unicode character properties in Unihan, so there is no formal difficulty in adding Script_Extensions as a provisional property. Here is a proposed description.

The Script_Extensions (scx) property has as its value a set of one or more Script property values. The Script_Extensions value for a given character C is defined from the UCD data files as follows:

- If C is in the code point or range given in field 0 of an entry in ScriptExtensions.txt, then the scx value is the set of script codes in field 1 of that entry.

- For example:
- The scx value for U+064B is {Arab, Syrc} because of the line "064B..0655 ; Arab Syrc".

- Otherwise, the scx value for C is a set consisting of a single element, the Script property value of C.

- For example:
- The scx value for U+0600 is {Arab}, and
- The value for U+0710 is {Syrc}.
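
The two rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is limited to the examples in this document, `SCRIPT` is a hypothetical stand-in for a lookup of the Script property, and the function names are invented here. A real implementation would read ScriptExtensions.txt and Scripts.txt from the UCD.

```python
# Sample entry taken from the example above; not the full data file.
SCRIPT_EXTENSIONS_LINES = ["064B..0655 ; Arab Syrc"]

# Hypothetical stand-in for the Script property, limited to the examples.
SCRIPT = {0x0600: "Arab", 0x0710: "Syrc"}


def parse_script_extensions(lines):
    """Return a list of (first, last, frozenset of script codes) entries."""
    entries = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        field0, field1 = (f.strip() for f in line.split(";")[:2])
        first, _, last = field0.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # single code point or range
        entries.append((lo, hi, frozenset(field1.split())))
    return entries


def scx(cp, entries, script=SCRIPT):
    """Rule 1: use field 1 of a matching entry; rule 2: fall back to {Script}."""
    for lo, hi, scripts in entries:
        if lo <= cp <= hi:
            return scripts
    return frozenset({script[cp]})


ENTRIES = parse_script_extensions(SCRIPT_EXTENSIONS_LINES)
```

With this sample data, `scx(0x064B, ENTRIES)` yields the set containing Arab and Syrc, while `scx(0x0600, ENTRIES)` and `scx(0x0710, ENTRIES)` fall back to the single-element sets for Arab and Syrc respectively.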

When used in an expression to denote a set of characters, such as in the regular expression \p{scx=Arab}, the value of that expression is the set of all code points whose Script_Extensions value contains the given script. Thus:

- \p{scx=Arab} includes both U+064B and U+0600, but not U+0710.
- \p{scx=Syrc} includes both U+064B and U+0710, but not U+0600.
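
The containment semantics can be illustrated with a short Python sketch. The `SCX` table below is a hypothetical, hand-written mapping covering only the three code points discussed here, and `scx_set` is an invented helper name; the point is that membership is tested with "contains", not with set equality.

```python
# scx values for the three example code points, as described in this document.
SCX = {
    0x064B: frozenset({"Arab", "Syrc"}),
    0x0600: frozenset({"Arab"}),
    0x0710: frozenset({"Syrc"}),
}


def scx_set(script, scx=SCX):
    """Code points whose scx value contains `script` (containment, not equality)."""
    return {cp for cp, scripts in scx.items() if script in scripts}


arab = scx_set("Arab")  # models \p{scx=Arab} over the sample table
syrc = scx_set("Syrc")  # models \p{scx=Syrc} over the sample table
```

Under these assumptions U+064B lands in both sets, which is the behavior an equality test against a single script value would not give.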

The PropertyAliases.txt line would be:

scx ; Script_Extensions

We would also add the following to the data file:

# @missing: 0000..10FFFF; Script_Extensions; <script>