L2/10-122

Working Draft Unicode Technical Report #48

Unicode Script Edge Cases

Editors	Mark Davis (markdavis@google.com)
Date	2010-04-15 (draft 1)
This Version	http://www.unicode.org/reports/tr48/tr48-1.html
Previous Version	n/a
Latest Version	http://www.unicode.org/reports/tr48/
Latest Proposed Update	http://www.unicode.org/reports/tr48/proposed.html
Revision	1

Summary

This document provides a data file for Script_Specials, which supplies additional information that allows users to enhance their use of the script property.

Status

This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1. Introduction
2. Usage
3. Stability
4. Data File
Acknowledgements
References
Modifications

1. Introduction

The Unicode script property provides a basic categorization of Unicode characters based on script (see UAX #24, Script Names [UAX24]). However, it does not handle well the cases where a character should be categorized as belonging to more than one script (but not all scripts). This is especially important for mixed-script detection in spoofing checks, but it also matters for other applications such as text layout. In addition, there are cases where the Unicode script property follows the appearance or derivation of the character, without regard to its usage. This document provides a data file for Script_Specials, which supplies additional information that allows users to enhance their use of the script property.

[Review Note: There are two options for the additional script information: it could be a formal Unicode property and data file, similar to SpecialCasing.txt; or it could be additional information associated with another specification, such as UTS #39. If it is a new property, it is a bit different than the current ones because logically the value is a set of script tags.]

[Review Note: The title "Script Edge Cases" is just a working title: depending on what we do with this, we should change it to be more appropriate.]

A view of the draft data is provided at unicode utilities. It lists the characters with special script values first by their Unicode script property values, then by the special script value. The data file backing this view is described in Section 4, Data File.

Many of the Script_Specials values map Script=Common characters to the subset of all scripts that the characters are actually used with. For example, U+0660 ( ٠ ) ARABIC-INDIC DIGIT ZERO is used both with Arabic and with Syriac; similarly, U+30FC ( ー ) KATAKANA-HIRAGANA PROLONGED SOUND MARK is shared between Hiragana and Katakana. Other Script_Specials values override the Unicode script property value to reflect the usage of the character (rather than its form or derivation). For example, U+1D5D ( ᵝ ) MODIFIER LETTER SMALL BETA is used in IPA, which is handled in Unicode as an extension of Latin. Encountering that character is not a signal that the text is Greek, but rather that it is Latin (IPA). One part of the data is derived: characters are given values of Hant or Hans based on the properties in Unihan. That allows simplified and traditional Chinese to be distinguished using the Script_Specials values.

Suggestions for additions to or modifications of the values are welcome; please submit them using the online reporting form [Feedback].

2. Usage

The following examples illustrate where the Script property alone is insufficient for common tasks:

Example 1. Mixed script detection for spoofing.

For example, a check that uses the Unicode script property alone will not detect that U+0660 ( ٠ ) ARABIC-INDIC DIGIT ZERO should not be mixed with Latin, nor that U+30FC ( ー ) KATAKANA-HIRAGANA PROLONGED SOUND MARK should not be mixed with Latin, because both characters have Script=Common. See [UTS39] and [UTS46].
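
The following Python fragment is a minimal sketch of how such a check might use the extended values; it is not part of the specification. The table SCRIPT_SPECIALS, the function script_property (a toy stand-in for the real Script property, which would be read from the Unicode Character Database), and the rule that Common characters are compatible with every script are all assumptions made for illustration. The two table entries follow the examples given in the Introduction.

```python
# Hypothetical excerpt of Script_Specials data, keyed by code point.
SCRIPT_SPECIALS = {
    0x0660: {"Arab", "Syrc"},  # ARABIC-INDIC DIGIT ZERO
    0x30FC: {"Hira", "Kana"},  # KATAKANA-HIRAGANA PROLONGED SOUND MARK
}


def script_property(cp):
    """Toy stand-in for the Unicode Script property (ISO 15924 codes)."""
    if 0x0041 <= cp <= 0x005A or 0x0061 <= cp <= 0x007A:
        return "Latn"  # Basic Latin letters
    if 0x0621 <= cp <= 0x063A:
        return "Arab"  # a block of Arabic letters
    if 0x30A1 <= cp <= 0x30FA:
        return "Kana"  # Katakana letters
    return "Zyyy"      # everything else is treated as Common in this toy


def script_set(cp):
    """Return the set of scripts for cp, or None for 'any script'."""
    if cp in SCRIPT_SPECIALS:
        return SCRIPT_SPECIALS[cp]
    script = script_property(cp)
    if script in ("Zyyy", "Zinh"):
        return None  # assumption: Common/Inherited fit every script
    return {script}


def is_mixed_script(text):
    """True if no single script is shared by all restricted characters."""
    common = None  # None means "no restriction seen yet"
    for ch in text:
        scripts = script_set(ord(ch))
        if scripts is None:
            continue
        common = set(scripts) if common is None else common & scripts
        if not common:
            return True
    return False
```

Under these assumptions, is_mixed_script("abc\u0660") reports a mix, because the Latin letters and the digit share no script, while is_mixed_script("\u0628\u0660") does not, because both characters are used with Arabic.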

Example 2. Determination of script runs for text layout.

The Common characters listed in the above example should not continue a Latin script run, but should continue runs of the scripts they are used with: Arabic or Syriac for U+0660, and Hiragana or Katakana for U+30FC.
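
As an illustration only, the run-splitting behavior described in this example might be sketched in Python as follows. The table TOY_SCRIPT_SETS is a hypothetical hand-written excerpt, not real data, and the decision to attach characters without a script restriction (such as a space) to the preceding run is an assumption of this sketch.

```python
# Hypothetical per-character script sets for a handful of characters.
# None marks a character that is treated as compatible with any script.
TOY_SCRIPT_SETS = {
    "a": {"Latn"},
    "b": {"Latn"},
    "\u30AB": {"Kana"},          # KATAKANA LETTER KA
    "\u30FC": {"Hira", "Kana"},  # KATAKANA-HIRAGANA PROLONGED SOUND MARK
    "\u0628": {"Arab"},          # ARABIC LETTER BEH
    "\u0660": {"Arab", "Syrc"},  # ARABIC-INDIC DIGIT ZERO
    " ": None,
}


def script_runs(text):
    """Split text into (substring, script set) runs."""
    runs = []
    start = 0
    current = None  # script set of the current run; None = unrestricted
    for i, ch in enumerate(text):
        scripts = TOY_SCRIPT_SETS.get(ch)  # unknown characters map to None
        if scripts is None:
            continue  # unrestricted character: stays in the current run
        if current is None:
            current = set(scripts)
        elif current & scripts:
            current &= scripts  # still compatible: narrow the run's set
        else:
            runs.append((text[start:i], current))
            start, current = i, set(scripts)
    if start < len(text):
        runs.append((text[start:], current))
    return runs
```

With this table, script_runs("ab\u30FC") yields a Latin run for "ab" followed by a separate run for the prolonged sound mark, whereas in script_runs("\u30AB\u30FC") the mark continues the Katakana run.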

Example 3. Regex property testing.

For many common tasks, the regular expression [:script=Arabic:] is too narrow, because it does not include U+0660 ( ٠ ) ARABIC-INDIC DIGIT ZERO, while the expression [[:script=Arabic:][:script=Common:]] is far too broad, since it also includes U+30FC ( ー ) KATAKANA-HIRAGANA PROLONGED SOUND MARK. An extended property expression such as [:script_specials=Arabic:] allows for much finer-grained matching. See also UTS #18, Unicode Regular Expressions [UTS18].
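
In a regex engine without built-in support for such a property, a comparable effect might be approximated by expanding the data into an explicit character class. The Python sketch below assumes a hypothetical three-entry excerpt, TOY_DATA, in place of the full script and Script_Specials data; the function name script_specials_class is likewise invented for illustration.

```python
import re

# Hypothetical excerpt: (first code point, last code point, script codes).
TOY_DATA = [
    (0x0621, 0x063A, {"Arab"}),          # Arabic letters (Script=Arabic)
    (0x0660, 0x0660, {"Arab", "Syrc"}),  # ARABIC-INDIC DIGIT ZERO
    (0x30FC, 0x30FC, {"Hira", "Kana"}),  # KATAKANA-HIRAGANA PROLONGED SOUND MARK
]


def script_specials_class(script):
    """Build a regex character class for all entries that include script."""
    parts = []
    for first, last, scripts in TOY_DATA:
        if script not in scripts:
            continue
        if first == last:
            parts.append("\\U%08X" % first)
        else:
            parts.append("\\U%08X-\\U%08X" % (first, last))
    if not parts:
        raise ValueError("no characters for script %r" % script)
    return "[" + "".join(parts) + "]"


# Matches runs of characters whose script set includes Arabic.
ARABIC_RUN = re.compile(script_specials_class("Arab") + "+")
```

Here ARABIC_RUN.fullmatch("\u0628\u0660") succeeds, because the digit is included alongside the Arabic letters, while ARABIC_RUN.fullmatch("\u30FC") returns None.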

3. Stability

The Script_Specials data depends on determinations of usage. It can therefore be expected to change more frequently than more established Unicode properties, as more information is gleaned about the usage of given characters. Implementers should be prepared for enhancements and corrections to the data whenever they upgrade to a new version.

4. Data File

The working draft data file is at: ScriptSpecials.txt. [TBD: the location and name of this file may change.]

The file has the following format for each data line.

Field 1: character or character range

Field 2: space-delimited list of ISO 15924 script codes, currently limited to Unicode Script property values, with the addition of Hant and Hans.

Comments: currently list the Unicode script property value (short and long) followed by the character name.

Where there is no data for a character, the value defaults to the same as the Unicode Script property value.

For example:

# scriptSpecials: [Arabic]
   0600  ; Arab                     # Zyyy (Common)             ARABIC NUMBER SIGN
...
# scriptSpecials: [Hans]
3469     ; Hans                     # Hani (Han)                CJK UNIFIED IDEOGRAPH-3469
...
# scriptSpecials: [Bopomofo, Han, Hangul, Hiragana, Katakana, Phags_Pa, Tibetan, Yi]
   3001  ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Zyyy (Common) IDEOGRAPHIC COMMA
...

[Review Note: We may want to separate off the Hant and Hans data into a separate file.]
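
The data line format described in this section could be read with a short parser along the following lines. This is a sketch under stated assumptions rather than a reference implementation: the function name parse_script_specials is invented here, and the ".." range syntax is assumed by analogy with other Unicode data files, since this document does not state how a character range is written.

```python
def parse_script_specials(lines):
    """Map each listed code point to a frozenset of script codes."""
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if ";" not in data:
            continue  # blank lines, comment-only lines, "..." elisions
        field1, field2 = (f.strip() for f in data.split(";", 1))
        # Assumption: ranges are written FIRST..LAST, as in other UCD files.
        first, _, last = field1.partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        scripts = frozenset(field2.split())  # space-delimited script codes
        for cp in range(lo, hi + 1):
            table[cp] = scripts
    return table


# Data lines copied from the example in this section.
SAMPLE = """\
# scriptSpecials: [Arabic]
   0600  ; Arab                     # Zyyy (Common)             ARABIC NUMBER SIGN
...
3469     ; Hans                     # Hani (Han)                CJK UNIFIED IDEOGRAPH-3469
   3001  ; Bopo Hang Hani Hira Kana Phag Tibt Yiii #Zyyy (Common) IDEOGRAPHIC COMMA
"""

table = parse_script_specials(SAMPLE.splitlines())
```

A code point that is absent from the resulting table, such as U+0041, has no Script_Specials data, so a caller would fall back to its Unicode Script property value, as stated above.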

Acknowledgements

Mark Davis authored the bulk of the text, under direction from the Unicode Technical Committee. Thanks also to the following people for their feedback or contributions to this document or earlier versions of it: [TBD].

References

[Feedback]	Reporting Form http://www.unicode.org/reporting.html For reporting errors and requesting information online.
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[UAX24]	UAX #24: Script Names http://www.unicode.org/reports/tr24/
[Unicode]	The Unicode Standard For the latest version, see: http://www.unicode.org/versions/latest/ For the 5.2.0 version, see: http://www.unicode.org/versions/Unicode5.2.0/
[UTR36]	UTR #36: Unicode Security Considerations http://www.unicode.org/reports/tr36/
[UTS18]	UTS #18: Unicode Regular Expressions http://www.unicode.org/reports/tr18/
[UTS39]	UTS #39: Unicode Security Mechanisms http://www.unicode.org/reports/tr39/
[UTS46]	UTS #46: Unicode IDNA Compatibility Processing http://www.unicode.org/reports/tr46/
[Versions]	Versions of the Unicode Standard http://www.unicode.org/standard/versions/ For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.

Modifications

The following summarizes modifications from the previous revision of this document.

Revision 1

First proposed draft version

Copyright © 2010 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.