
Draft Unicode Technical Report #24

Script Names

Version 1.2
Authors Mark Davis (mark.davis@us.ibm.com)
Date 2000-10-27
This Version http://www.unicode.org/unicode/reports/tr24/tr24-1.2.html
Previous Version http://www.unicode.org/unicode/reports/tr24/tr24-1.1.html
Latest Version http://www.unicode.org/unicode/reports/tr24/tr24

Summary

This document provides an assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.

Status

This document has been approved by the Unicode Technical Committee for public review as a Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.


1 Description

Scripts.txt provides a mapping from Unicode characters to script names. This information is useful for mechanisms such as regular expressions, where it produces much better results than simple matches on block names. (See the discussion of the deficiencies of Character Blocks in UTR #18: Unicode Regular Expression Guidelines.)

Script values cannot simply be extracted from the block ranges in Blocks.txt. In some cases a block contains characters from more than one script; in other cases a single script is split across several blocks.

1.1 Common Script

Although script names are generally much more useful than simple block names, they do not support strong assumptions about text: a language may use characters from more than one script. This is especially true of non-letters. For that reason, generally only characters of General Category Letter are given distinct script names; all others are given the script name Common, indicating an undetermined script.

In many cases, programs will override the script name based upon the context of the surrounding characters, especially for the case of Common. A simple heuristic is to use the script of the preceding character, which works well in many cases. However, this may not always produce optimal results: for example, in the text "... gamma (γ) is ..." this heuristic would cause matching parentheses to be in different scripts. Thus more sophisticated programs may use more complex heuristics.
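The simple heuristic described above can be sketched in a few lines of Python. This is only an illustration: `toy_script` is a hypothetical stand-in for a real Scripts.txt lookup and covers just the basic Latin and Greek letters needed for the example, and `resolve_common` is a name invented here. Because each Common character takes the already resolved script of the character before it, a run of Common characters inherits the script of the nearest preceding character that has one.

```python
def toy_script(ch):
    """Hypothetical lookup: rough ranges for basic Latin and Greek letters only."""
    cp = ord(ch)
    if 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A:
        return "LATIN"
    if 0x391 <= cp <= 0x3A9 or 0x3B1 <= cp <= 0x3C9:
        return "GREEK"
    return "COMMON"


def resolve_common(text, script_of=toy_script):
    """Give each COMMON character the resolved script of the preceding character."""
    resolved = []
    previous = "COMMON"  # nothing to inherit from at the start of the text
    for ch in text:
        script = script_of(ch)
        if script == "COMMON":
            script = previous  # inherit from the preceding character
        else:
            previous = script
        resolved.append((ch, script))
    return resolved


for ch, script in resolve_common("gamma (γ) is"):
    print(repr(ch), script)
```

Running this on the example text assigns the opening parenthesis to LATIN and the closing parenthesis to GREEK, which is exactly the mismatch noted above.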

1.2 Format

The format of the file is similar to that of Blocks.txt. The fields are separated by semicolons. The first two fields provide the first and last code points in a range. The third field provides the script name for that range. The comment (after a #) provides the names for the first and last characters in the range. On the basis of this file, script values for any character in a string are derived as follows:

The script names form a full partition of the code space: every code point is assigned a single script name. As new scripts are added to the standard, additional script names will be added. In some cases, characters may change script names in the future.

Note: The assignments of scripts in this report are preliminary, and may change at any time.
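As an illustration of the format described in this section, the following Python sketch parses range lines and looks up the script name of a character. The three sample lines are invented for this example to follow the described layout (first code point; last code point; script name; comment after #) and are not copied from the data file. The function names and the fallback to UNKNOWN for code points missing from the sample are assumptions of the sketch, not part of this report.

```python
import bisect

# Invented sample lines in the described layout; not taken from the real file.
SAMPLE = """\
0041; 005A; LATIN # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061; 007A; LATIN # LATIN SMALL LETTER A..LATIN SMALL LETTER Z
03B1; 03C9; GREEK # GREEK SMALL LETTER ALPHA..GREEK SMALL LETTER OMEGA
"""


def parse_ranges(text):
    """Parse 'first; last; name # comment' lines into sorted (first, last, name) tuples."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank or comment-only line
        first, last, name = (field.strip() for field in data.split(";"))
        ranges.append((int(first, 16), int(last, 16), name.upper()))
    ranges.sort()
    return ranges


def script_of(ch, ranges, default="UNKNOWN"):
    """Return the script name for one character, using binary search over range starts."""
    cp = ord(ch)
    starts = [first for first, _, _ in ranges]
    i = bisect.bisect_right(starts, cp) - 1  # last range starting at or before cp
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return default  # assumption: code points not listed in the sample


RANGES = parse_ranges(SAMPLE)
print(script_of("A", RANGES), script_of("γ", RANGES), script_of("(", RANGES))
```

Storing the ranges sorted by first code point keeps each lookup logarithmic in the number of ranges; a production implementation would precompute the list of range starts once rather than on every call.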

1.3 Data

Scripts.txt is currently available as Scripts-1d3.txt. Its contents are preliminary, and may change in the future. An additional set of charts shows the assignment of scripts across the entire range of Unicode characters, broken down by script name (for letters) and General Category (for all other characters). To view these charts properly, install a Unicode font for use by your browser.

1.4 Script Names

The following table lists the Script Name values used in the file, and the corresponding DIS 15924 code (where possible). The names are not case-sensitive.

Note: DIS 15924 (http://www.egt.ie/standards/iso15924/) provides an enumeration of four-letter script codes. In some cases the match between these script names and the DIS 15924 codes is not precise, since the goals are somewhat different. DIS 15924 is aimed primarily at the bibliographic identification of scripts; for that reason it occasionally identifies varieties of scripts that are significant for book cataloging but are not considered distinct scripts in the Unicode Standard. For example, DIS 15924 has separate script codes for the Fraktur and Gaelic varieties of the Latin script. Where there is no corresponding DIS 15924 code, a "private use" code starting with Q is used.

Script Name   Draft ISO 15924 code
UNKNOWN       Zyyy
COMMON        Qaaa
LATIN         Latn (Latf, Latg)
GREEK         Grek
COPTIC        Qaab
CYRILLIC      Cyrl (Cyrs)
ARMENIAN      Armn
HEBREW        Hebr
ARABIC        Arab
SYRIAC        Syrc (Syrj, Syrn, Syre)
THAANA        Thaa
DEVANAGARI    Deva
BENGALI       Beng
GURMUKHI      Guru
GUJARATI      Gujr
ORIYA         Orya
TAMIL         Taml
TELUGU        Telu
KANNADA       Knda
MALAYALAM     Mlym
SINHALA       Sinh
THAI          Thai
LAO           Laoo
TIBETAN       Tibt
MYANMAR       Mymr
GEORGIAN      Geor (Geon, Geoa)
JAMO          Qjam
HANGUL        Hang
ETHIOPIC      Ethi
CHEROKEE      Cher
UCAS          Cans
OGHAM         Ogam
RUNIC         Runr
KHMER         Khmr
MONGOLIAN     Mong
HIRAGANA      Hira
KATAKANA      Kana
BOPOMOFO      Bopo
HAN           Hani
YI            Yiii
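
Because the script names are not case-sensitive, a program that maps names to codes should normalize case before comparing. The following Python sketch shows one way to do so; it uses only a small excerpt of the table above, and the names `SCRIPT_CODES` and `code_for` are hypothetical.

```python
# Excerpt of the table above; a full implementation would list every row.
SCRIPT_CODES = {
    "COMMON": "Qaaa",
    "LATIN": "Latn",
    "GREEK": "Grek",
    "CYRILLIC": "Cyrl",
    "HAN": "Hani",
}


def code_for(name):
    """Return the four-letter code for a script name, ignoring case, or None if absent."""
    return SCRIPT_CODES.get(name.strip().upper())


print(code_for("latin"), code_for("Greek"), code_for("HAN"))
```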

 


Copyright © 1999-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.