L2/12-072

To UTC
From: Mark Davis
Re:   Proposed UCD property: Script Identifier Status.
Date: 2012-02-04


In a growing number of places, we reference the data in Tables 4-7 of UAX #31.
However, we force developers to scrape that data, instead of providing
a machine-readable data file in the UCD. That is both clumsy and error-prone.

I propose that we add a new provisional enumerated UCD property, tentatively called Script_Identifier_Status (sis). This property maps script codes to one of a set of values, which we would describe in UAX #31.

The data reflects what is in UAX #31 already.
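
As an illustration only, a consumer of such a file might read it along the following lines. This is a minimal Python sketch, assuming the two-field, semicolon-separated format of the draft data file below. The names ScriptIdentifierStatus and parse_script_identifier_status are hypothetical and not part of the proposal, and accepting the abbreviated value names case-insensitively is an assumption rather than something the draft specifies.

```python
from enum import Enum


class ScriptIdentifierStatus(Enum):
    """Hypothetical enum; long and short names are taken from the draft file header.

    The spelling "Aspirational" follows UAX #31.
    """
    RECOMMENDED = ("Recommended", "rec")
    MIXED = ("Mixed", "mix")
    ASPIRATIONAL = ("Aspirational", "asp")
    LIMITED_USE = ("Limited_Use", "lim")
    EXCLUSION = ("Exclusion", "exc")

    @property
    def long_name(self):
        return self.value[0]

    @property
    def short_name(self):
        return self.value[1]


# Lookup table keyed by lowercased long and short names (an assumed convenience).
_BY_NAME = {}
for _status in ScriptIdentifierStatus:
    _BY_NAME[_status.long_name.lower()] = _status
    _BY_NAME[_status.short_name.lower()] = _status


def parse_script_identifier_status(lines):
    """Return a dict mapping script code (field 0) to status (field 1)."""
    result = {}
    for line_number, raw in enumerate(lines, start=1):
        # Drop the trailing "#" comment, then surrounding whitespace.
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        fields = [field.strip() for field in data.split(";")]
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 fields, got {len(fields)}"
            )
        code, name = fields
        status = _BY_NAME.get(name.lower())
        if status is None:
            raise ValueError(f"line {line_number}: unknown status {name!r}")
        result[code] = status
    return result


# A few lines in the draft format, used here only as sample input.
SAMPLE = """\
# ScriptIdentifierStatus.txt (excerpt)
Latn\t;\tRecommended\t#\tLatin
Zyyy\t;\tMixed\t#\tCommon
Zzzz\t;\tExclusion\t#\tUnknown
"""

table = parse_script_identifier_status(SAMPLE.splitlines())
print(table["Latn"].long_name, table["Zzzz"].short_name)
```

Raising an error on an unrecognized value, rather than skipping the line, is a design choice of this sketch: it makes a misspelled status name in the data visible immediately.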

Issues:

1. In the proposed data file below, I introduce a new value "Mixed" for three scripts that are not covered in UAX #31 (Common, Inherited, and Braille). I also assign "Exclusion" to a fourth script that is not covered there (Unknown).

2. There are a few items in Table 4 that are described by properties other than script. Their inclusion was motivated by IDNA2008; they are more properly covered in UTS #36. They are:

	[[:Extender=True:]&[:Joining_Type=Join_Causing:]]
	[:Default_Ignorable_Code_Point:]
	[:block=Combining_Diacritical_Marks_for_Symbols:]
	[:block=Musical_Symbols:]
	[:block=Ancient_Greek_Musical_Notation:]
	[:block=Phaistos_Disc:]

3. Aside from the data file, related changes would include:
1) modifying UAX #31 to point to the data file
2) as with all new properties:
  a) adding the property and short name to PropertyAliases.txt
  b) adding the enum values (and abbreviated names) to PropertyValueAliases.txt
  c) updating UAX #44

=============================================
DRAFT DATA FILE
=============================================
# ScriptIdentifierStatus.txt
# Date: xxx
#
# Copyright (c) 1991-2011 Unicode, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
#
# This file provides a recommended status for the use of different scripts
# in identifiers. For more information, see 
#   UAX #31: Unicode Identifier and Pattern Syntax
#   http://www.unicode.org/reports/tr31/
#
# Each line contains 2 fields, separated by a semicolon.
#
# Field 0: The UCD script code.
#
# Field 1: The status value associated with the script code:
#
#  Recommended (rec)
#  Mixed (mix)
#  Aspirational (asp)
#  Limited_Use (lim)
#  Exclusion (exc)
#
# For a description of the meaning and usage of these values,
# see UAX #31.

Latn	;	Recommended	#	Latin
Hani	;	Recommended	#	Han
Cyrl	;	Recommended	#	Cyrillic
Hira	;	Recommended	#	Hiragana
Kana	;	Recommended	#	Katakana
Thai	;	Recommended	#	Thai
Arab	;	Recommended	#	Arabic
Hang	;	Recommended	#	Hangul
Deva	;	Recommended	#	Devanagari
Grek	;	Recommended	#	Greek
Hebr	;	Recommended	#	Hebrew
Taml	;	Recommended	#	Tamil
Knda	;	Recommended	#	Kannada
Geor	;	Recommended	#	Georgian
Mlym	;	Recommended	#	Malayalam
Telu	;	Recommended	#	Telugu
Armn	;	Recommended	#	Armenian
Mymr	;	Recommended	#	Myanmar
Gujr	;	Recommended	#	Gujarati
Beng	;	Recommended	#	Bengali
Guru	;	Recommended	#	Gurmukhi
Laoo	;	Recommended	#	Lao
Khmr	;	Recommended	#	Khmer
Tibt	;	Recommended	#	Tibetan
Sinh	;	Recommended	#	Sinhala
Ethi	;	Recommended	#	Ethiopic
Thaa	;	Recommended	#	Thaana
Orya	;	Recommended	#	Oriya
Bopo	;	Recommended	#	Bopomofo
Zyyy	;	Mixed	#	Common
Zinh	;	Mixed	#	Inherited
Brai	;	Mixed	#	Braille
Cans	;	Aspirational	#	Canadian_Aboriginal
Yiii	;	Aspirational	#	Yi
Mong	;	Aspirational	#	Mongolian
Tfng	;	Aspirational	#	Tifinagh
Plrd	;	Aspirational	#	Miao
Syrc	;	Limited_Use	#	Syriac
Nkoo	;	Limited_Use	#	Nko
Cher	;	Limited_Use	#	Cherokee
Vaii	;	Limited_Use	#	Vai
Bali	;	Limited_Use	#	Balinese
Bamu	;	Limited_Use	#	Bamum
Batk	;	Limited_Use	#	Batak
Cham	;	Limited_Use	#	Cham
Java	;	Limited_Use	#	Javanese
Kali	;	Limited_Use	#	Kayah_Li
Lepc	;	Limited_Use	#	Lepcha
Limb	;	Limited_Use	#	Limbu
Lisu	;	Limited_Use	#	Lisu
Mand	;	Limited_Use	#	Mandaic
Mtei	;	Limited_Use	#	Meetei_Mayek
Talu	;	Limited_Use	#	New_Tai_Lue
Olck	;	Limited_Use	#	Ol_Chiki
Saur	;	Limited_Use	#	Saurashtra
Sund	;	Limited_Use	#	Sundanese
Sylo	;	Limited_Use	#	Syloti_Nagri
Tale	;	Limited_Use	#	Tai_Le
Lana	;	Limited_Use	#	Tai_Tham
Tavt	;	Limited_Use	#	Tai_Viet
Cakm	;	Limited_Use	#	Chakma
Zzzz	;	Exclusion	#	Unknown
Samr	;	Exclusion	#	Samaritan
Copt	;	Exclusion	#	Coptic
Glag	;	Exclusion	#	Glagolitic
Avst	;	Exclusion	#	Avestan
Brah	;	Exclusion	#	Brahmi
Bugi	;	Exclusion	#	Buginese
Buhd	;	Exclusion	#	Buhid
Cari	;	Exclusion	#	Carian
Xsux	;	Exclusion	#	Cuneiform
Cprt	;	Exclusion	#	Cypriot
Dsrt	;	Exclusion	#	Deseret
Egyp	;	Exclusion	#	Egyptian_Hieroglyphs
Goth	;	Exclusion	#	Gothic
Hano	;	Exclusion	#	Hanunoo
Armi	;	Exclusion	#	Imperial_Aramaic
Phli	;	Exclusion	#	Inscriptional_Pahlavi
Prti	;	Exclusion	#	Inscriptional_Parthian
Kthi	;	Exclusion	#	Kaithi
Khar	;	Exclusion	#	Kharoshthi
Linb	;	Exclusion	#	Linear_B
Lyci	;	Exclusion	#	Lycian
Lydi	;	Exclusion	#	Lydian
Ogam	;	Exclusion	#	Ogham
Ital	;	Exclusion	#	Old_Italic
Xpeo	;	Exclusion	#	Old_Persian
Sarb	;	Exclusion	#	Old_South_Arabian
Orkh	;	Exclusion	#	Old_Turkic
Osma	;	Exclusion	#	Osmanya
Phag	;	Exclusion	#	Phags_Pa
Phnx	;	Exclusion	#	Phoenician
Rjng	;	Exclusion	#	Rejang
Runr	;	Exclusion	#	Runic
Shaw	;	Exclusion	#	Shavian
Tglg	;	Exclusion	#	Tagalog
Tagb	;	Exclusion	#	Tagbanwa
Ugar	;	Exclusion	#	Ugaritic
Merc	;	Exclusion	#	Meroitic_Cursive
Mero	;	Exclusion	#	Meroitic_Hieroglyphs
Shrd	;	Exclusion	#	Sharada
Sora	;	Exclusion	#	Sora_Sompeng
Takr	;	Exclusion	#	Takri