Draft Unicode Technical Report #24

Script Names

Version	1.2
Authors	Mark Davis (mark.davis@us.ibm.com)
Date	2000-10-27
This Version	http://www.unicode.org/unicode/reports/tr24/tr24-1.2.html
Previous Version	http://www.unicode.org/unicode/reports/tr24/tr24-1.1.html
Latest Version	http://www.unicode.org/unicode/reports/tr24/tr24

Summary

This document provides an assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.

Status

This document has been approved by the Unicode Technical Committee for public review as a Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.

1 Description

Scripts.txt provides a mapping from Unicode characters to script names. This information is useful for mechanisms such as regular expressions, where it produces much better results than simple matches on block names. (See the discussion of the deficiencies of Character Blocks in UTR #18: Unicode Regular Expression Guidelines.)

Script values cannot simply be extracted from the block ranges in Blocks.txt. In some cases a block contains more than one script; in other cases a single script is split over several blocks.

1.1 Common Script

Although script names are generally much more useful than simple block names, one cannot make too many assumptions; in some cases languages may use characters from more than one script. This is especially the case for non-letters. For that reason, generally only characters of General Category Letter are given distinct script names; all others are given the script name Common, indicating an undetermined script.

In many cases, programs will override the script name based upon the context of the surrounding characters, especially for the case of Common. A simple heuristic is to use the script of the preceding character, which works well in many cases. However, this may not always produce optimal results: for example, in the text "... gamma (γ) is ..." this heuristic would cause matching parentheses to be in different scripts. Thus more sophisticated programs may use more complex heuristics.
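The preceding-character heuristic and its weakness can be illustrated with a minimal Python sketch. The range table below is a hypothetical, heavily abbreviated stand-in for the real data file, and the function names are illustrative only.

```python
# Hypothetical, abbreviated script ranges for illustration only;
# real assignments would be read from Scripts.txt.
TOY_RANGES = [
    (0x0041, 0x005A, "LATIN"),
    (0x0061, 0x007A, "LATIN"),
    (0x0391, 0x03A9, "GREEK"),
    (0x03B1, 0x03C9, "GREEK"),
]


def raw_script(ch):
    """Return the toy script name for ch, or COMMON if no range matches."""
    cp = ord(ch)
    for first, last, name in TOY_RANGES:
        if first <= cp <= last:
            return name
    return "COMMON"


def resolve_by_preceding(text):
    """Give each COMMON character the resolved script of the character before it."""
    resolved = []
    previous = "COMMON"  # nothing precedes the first character
    for ch in text:
        script = raw_script(ch)
        if script == "COMMON":
            script = previous
        else:
            previous = script
        resolved.append((ch, script))
    return resolved


for ch, script in resolve_by_preceding("gamma (γ) is"):
    print(repr(ch), script)
```

In this sketch the opening parenthesis inherits LATIN from the preceding text, while the closing parenthesis inherits GREEK from the gamma, which is the mismatch described above.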

1.2 Format

The format of the file is similar to that of Blocks.txt. The fields are separated by semicolons. The first two fields provide the first and last code points in a range. The third field provides the script name for that range. The comment (after a #) provides the names for the first and last characters in the range. On the basis of this file, script values for any character in a string are derived as follows:

If the code point is illegal or is unassigned (General Category = Cn), it is given the script name Unknown.
Otherwise, if a code point falls within a range provided in the data file, it is given the corresponding script name.
Otherwise, if the code point is for a Combining Mark (General Category = Mc, Mn, or Me), and it has a base character in the string, it is given the same script name as the base character.
Otherwise, it is given the script value Common, indicating an undetermined script.
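As a rough sketch of the format and the derivation steps above, the following Python parses range lines and then assigns a script name to each character of a string. The sample lines are hypothetical (the exact ranges and comment wording in the data file may differ), the helper names are illustrative, and the General Category values come from Python's unicodedata module, whose Unicode version depends on the Python build. The "base character" is assumed here to be the nearest preceding character that is not a combining mark, and only unassigned code points (Cn) are handled in the first step.

```python
import unicodedata

# Hypothetical sample lines in the semicolon-separated layout described above.
SAMPLE = """\
0041; 005A; LATIN # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061; 007A; LATIN # LATIN SMALL LETTER A..LATIN SMALL LETTER Z
03B1; 03C9; GREEK # GREEK SMALL LETTER ALPHA..GREEK SMALL LETTER OMEGA
"""

MARK_CATEGORIES = ("Mc", "Mn", "Me")


def parse_ranges(text):
    """Parse 'first; last; name # comment' lines into (first, last, name) tuples."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        first, last, name = (field.strip() for field in data.split(";"))
        ranges.append((int(first, 16), int(last, 16), name.upper()))
    return ranges


def scripts_for(text, ranges):
    """Apply the four derivation steps, in order, to each character of text."""
    result = []
    base_script = None  # script of the most recent non-mark character, if any
    for ch in text:
        category = unicodedata.category(ch)
        cp = ord(ch)
        listed = next((name for lo, hi, name in ranges if lo <= cp <= hi), None)
        if category == "Cn":
            script = "UNKNOWN"
        elif listed is not None:
            script = listed
        elif category in MARK_CATEGORIES and base_script is not None:
            script = base_script
        else:
            script = "COMMON"
        if category not in MARK_CATEGORIES:
            base_script = script
        result.append(script)
    return result


ranges = parse_ranges(SAMPLE)
print(scripts_for("e\u0301 \u03b3\u0301!", ranges))
```

With the sample data, the combining acute accent takes LATIN after "e" and GREEK after the gamma, while the space and the exclamation mark fall through to COMMON.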

The script names form a full partition of the code space: every code point is assigned a single script name. As new scripts are added to the standard, additional script names will be added. In some cases, characters may change script names in the future.

Note: The assignments of scripts in this report are preliminary, and may change at any time.

1.3 Data

Scripts.txt is currently available at Scripts-1d3.txt. The contents are preliminary, and may change in the future. There is an additional set of charts that can be used to see the assignment of scripts. These charts show the entire range of Unicode characters broken down by script name (for letters) and general category (for others). To properly view these charts, you should install a Unicode font for use by your browser.

1.4 Script Names

The following table lists the Script Name values used in the file, and the corresponding DIS 15924 code (where possible). The names are not case-sensitive.

Note: DIS 15924 (http://www.egt.ie/standards/iso15924/) provides an enumeration of four-letter script codes. In some cases the match between these script names and the DIS 15924 codes is not precise, since the goals are somewhat different. DIS 15924 is aimed primarily at the bibliographic identification of scripts; because of that it occasionally identifies varieties of scripts that are of significance for book cataloging, but which are not considered distinct as scripts in the Unicode Standard. For example, DIS 15924 has separate script codes for Fraktur and Gaelic varieties of the Latin script. Where there are no corresponding DIS 15924 codes, the "private use" ones starting with Q are used.

Script Name	Draft ISO 15924 code
UNKNOWN	Zyyy
COMMON	Qaaa
LATIN	Latn (Latf, Latg)
GREEK	Grek
COPTIC	Qaab
CYRILLIC	Cyrl (Cyrs)
ARMENIAN	Armn
HEBREW	Hebr
ARABIC	Arab
SYRIAC	Syrc (Syrj, Syrn, Syre)
THAANA	Thaa
DEVANAGARI	Deva
BENGALI	Beng
GURMUKHI	Guru
GUJARATI	Gujr
ORIYA	Orya
TAMIL	Taml
TELUGU	Telu
KANNADA	Knda
MALAYALAM	Mlym
SINHALA	Sinh
THAI	Thai
LAO	Laoo
TIBETAN	Tibt
MYANMAR	Mymr
GEORGIAN	Geor (Geon, Geoa)
JAMO	Qjam
HANGUL	Hang
ETHIOPIC	Ethi
CHEROKEE	Cher
UCAS	Cans
OGHAM	Ogam
RUNIC	Runr
KHMER	Khmr
MONGOLIAN	Mong
HIRAGANA	Hira
KATAKANA	Kana
BOPOMOFO	Bopo
HAN	Hani
YI	Yiii
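Because the names are not case-sensitive, a program that looks up the code for a script name would normally fold case first. A minimal Python sketch, using a few rows copied from the table above (the dictionary and function names are illustrative, not part of this report):

```python
# A few rows copied from the table above; the remaining rows follow the same pattern.
SCRIPT_CODES = {
    "LATIN": "Latn",
    "GREEK": "Grek",
    "CYRILLIC": "Cyrl",
    "HAN": "Hani",
}


def script_code(name):
    """Return the four-letter code for a script name, ignoring case, or None."""
    return SCRIPT_CODES.get(name.strip().upper())


print(script_code("latin"), script_code("Greek"))
```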

Copyright © 1999-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.