
Draft Unicode Technical Report #24

Script Names

Version 1.1
Authors Mark Davis (mark.davis@us.ibm.com)
Date 2000-08-31
This Version http://www.unicode.org/unicode/reports/tr24/tr24-1.1.html
Previous Version http://www.unicode.org/unicode/reports/tr24/tr24-1.html
Latest Version http://www.unicode.org/unicode/reports/tr24/tr24

Summary

This document provides an assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.

Status

This document has been approved by the Unicode Technical Committee for public review as a Proposed Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.


1 Description

Scripts.txt provides a mapping from Unicode characters to script names. This information is useful for mechanisms such as regular expressions, where it produces much better results than simple matches on block names. (See the discussion of Character Blocks in UTR #18: Unicode Regular Expression Guidelines.) The script names form a full partition of the code space: every code point is assigned a single script name.

Note: it is expected that the Scripts.txt file will eventually be part of the Unicode Character Database. Until such time, the assignment of scripts is preliminary.

Script values cannot simply be extracted from the block ranges in Blocks.txt. In some cases, a block contains more than one script; in other cases, a single script is split over several blocks.

1.1 Common Script

Although script names are generally more useful than simple block names, one cannot make too many assumptions: in some cases languages may use characters from more than one script. This is especially the case for non-letters. For that reason, only characters of General Category Letter are given distinct script names; all others are given the script name Common, indicating an undetermined script.
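The rule above can be illustrated with a minimal sketch. It assumes that Python's standard unicodedata module is an acceptable stand-in for the General Category data; that module reflects whatever Unicode version ships with the interpreter, not necessarily the version this report targets. The function name gets_distinct_script is hypothetical.

```python
import unicodedata

def gets_distinct_script(ch):
    """True if ch has a General Category of Letter (Lu, Ll, Lt, Lm, Lo).

    Under the rule described above, only such characters receive a
    distinct script name; everything else is treated as Common.
    """
    return unicodedata.category(ch).startswith("L")

# Walk the Basic Latin block (U+0000..U+007F) and split it by the rule.
basic_latin = [chr(cp) for cp in range(0x0000, 0x0080)]
letters = [ch for ch in basic_latin if gets_distinct_script(ch)]
common = [ch for ch in basic_latin if not gets_distinct_script(ch)]

print(len(letters), "letters;", len(common), "characters treated as Common")
```

Even within a single block, the letters and the remaining characters fall into different script values, which is one reason block names are a poor substitute for script names.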

In many cases, programs will override the Common script based upon the surrounding characters. A simple heuristic is to use the script of the preceding character, which works well in many cases. However, this may not always produce optimal results: for example, in the text "... gamma (γ) is ..." this heuristic would cause matching parentheses to be in different scripts. Thus more sophisticated programs may use more complex heuristics.
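The preceding-character heuristic, and the parenthesis problem it causes, can be sketched as follows. This is only an illustration: the script_of lookup is a hypothetical miniature that knows ASCII letters and the Greek block, whereas a real program would consult the data in Scripts.txt.

```python
def script_of(ch):
    """Hypothetical lookup covering just enough characters for the example."""
    if ch.isascii() and ch.isalpha():
        return "Latin"
    if "\u0370" <= ch <= "\u03ff" and ch.isalpha():
        return "Greek"
    return "Common"

def resolve_common(text):
    """Replace each Common value with the resolved script of the preceding character."""
    resolved = []
    previous = "Common"  # nothing precedes the first character
    for ch in text:
        script = script_of(ch)
        if script == "Common":
            script = previous
        resolved.append(script)
        previous = script
    return resolved

text = "gamma (γ) is"
scripts = resolve_common(text)
for ch, script in zip(text, scripts):
    print(repr(ch), script)
```

In this run the opening parenthesis inherits Latin from the text before it, while the closing parenthesis inherits Greek from the gamma, so the matching pair ends up in different scripts. A more careful heuristic might, for example, resolve paired punctuation together.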

1.2 Format

The format of the file is similar to that of Blocks.txt. The fields are separated by semicolons. The first two fields provide the first and last code points in a range. The third field provides the script name for that range. The comment (after a #) provides the names for the first and last characters in the range. All unassigned or illegal code points in each range must be ignored, and are given the script value Unknown instead.
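A minimal sketch of reading this format is shown below. The sample lines are made up for illustration and are not copied from the actual file; the function names are hypothetical. The sketch does not filter out unassigned or illegal code points inside a range, since that would require additional character data, and it returns None for code points outside every listed range so that the caller can choose the fallback.

```python
from bisect import bisect_right

# Hypothetical sample lines in the format described above.
SAMPLE = """\
# first; last; script name # names of first and last characters
0041; 005A; Latin # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0391; 03A1; Greek # GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LETTER RHO
"""

def parse_scripts(text):
    """Return a sorted list of (first, last, script) tuples."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the comment, if any
        if not data:
            continue  # blank line or comment-only line
        first, last, script = (field.strip() for field in data.split(";"))
        ranges.append((int(first, 16), int(last, 16), script))
    ranges.sort()
    return ranges

def lookup_script(ranges, code_point):
    """Return the script name of the range containing code_point, or None."""
    starts = [first for first, _, _ in ranges]
    i = bisect_right(starts, code_point) - 1  # last range starting at or before
    if i >= 0 and ranges[i][0] <= code_point <= ranges[i][1]:
        return ranges[i][2]
    return None

ranges = parse_scripts(SAMPLE)
print(lookup_script(ranges, ord("Q")))   # inside the first sample range
print(lookup_script(ranges, 0x0020))     # outside both sample ranges
```

Keeping the ranges sorted allows a binary search per lookup; a program that classifies large amounts of text might instead expand the ranges into a direct table.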

Note: Draft ISO 15924 (http://www.egt.ie/standards/iso15924/) provides an enumeration of four-letter script codes. Once this standard is final, these codes can be used to represent the script names.

1.3 Viewing

Scripts.txt is currently available as Scripts-1d3.txt. The contents are preliminary and may change in the future. An additional set of charts can be used to see the assignment of scripts. These charts show the entire range of Unicode characters broken down by script name (for letters) and general category (for others). To view these charts properly, you should install a Unicode font for use by your browser.


Copyright © 1999-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.