[Unicode]  Technical Reports
 

L2/04-289

[Working Draft] Proposed Update Unicode Standard Annex #24

Script Names

Version 4.1.0 proposed

Authors Mark Davis (mark.davis@us.ibm.com), Asmus Freytag (asmus@unicode.org)
Date 2004-07-14
This Version http://www.unicode.org/reports/tr24/tr24-6.html
Previous Version http://www.unicode.org/reports/tr24/tr24-5.html
Latest Version http://www.unicode.org/reports/tr24/tr24
Tracking Number 6


Summary

This document provides an assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions and other text processing tasks.

Status

This document is a Proposed Update to a previously approved Unicode Standard Annex. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

Contents

  1. Introduction
  2. Usage Model
  3. Values
  4. Data File

1 Introduction

Script. A collection of symbols used to represent textual information in one or more writing systems.

The majority of characters encoded in the Unicode Standard [Unicode] are elements of collections called scripts. The exceptions are symbols, punctuation characters intended for use with multiple scripts, and characters that have no standalone script identity because they are intended to be used in combination with another character.

Therefore, a text in a given script is likely to consist of characters from that script, together with shared punctuation and characters whose script identity depends on the characters they are used with.

1.1    Classification of Text by Script Name

The Unicode Character Database [UCD] provides a mapping from Unicode characters to script name values. This information is useful for a variety of tasks that need to analyze a piece of text and determine what parts of it are in which script. Examples include regular expressions or assigning different fonts to parts of a plain text stream based on the prevailing script.

These processes are similar to the task of bibliographers in cataloging documents by their script. However, bibliographers often ignore small inclusions of other scripts in the form of quoted material, while significant differences in the writing style for the same script may be reflected in the classification, for example Fraktur or Gaelic for the Latin script.

Script information is also taken into consideration in collation. The data in the Default Unicode Collation Element Table (DUCET) are grouped by script, so that letters with different script values have different primary weights. However, numbers, symbols, and punctuation are not grouped with the letters. For the purposes of ordering, therefore, script is most significant for the letters. For more information, see UTS #10: Unicode Collation Algorithm [UCA].

These examples demonstrate that the definition of script depends on the intended purposes of the classification. The following table summarizes some of the purposes for which text elements can be classified by script.

Table 1: Classification of Text by Script Name
Granularity | Classification | Purpose | Special Values
document | bibliographical | record in which script a text is printed or published; subdivides some scripts, for example Latin into normal, Fraktur, and Gaelic styles | Unknown
character | graphological / typographical | describe to which script a character belongs, based on its origin |
character | orthographical | describe with which script (or scripts) a character is used | Common, Inherited
character | for collation | group letters by script in collation element table |
run | for font binding or search | determine extent of run of like script in (potentially) mixed-script text |

Bibliographical, graphological, or historical classifications of scripts need different distinctions than the text-processing needs supported by Unicode script values. The requirements of the task affect not only how fine-grained the classification is, but also the kinds of special values needed to make the system work. For example, when bibliographers are unable to determine the script of a document, they may classify it using a special value for Unknown. In text processing the identity of all characters is normally known, but some characters are shared across scripts, or can be attached to any character, thus requiring special values for Common and Inherited.

Despite these differences, the vast majority of Unicode script values correspond more or less directly to the script identifiers used by bibliographers and others. Unicode script values are therefore mapped to their equivalents in the registry of script identifiers defined by ISO 15924 [ISO15924].

1.2 Scripts and Blocks

Unicode characters are also divided into non-overlapping ranges called blocks [Blocks]. Many of these blocks have the same name as one of the scripts, because characters of that script are primarily encoded in that block. However, blocks and scripts differ in the following ways:

As a result, for mechanisms such as regular expressions, using script values produces more meaningful results than simple matches on block names.

For more information, see Character Blocks in UTS #18: Unicode Regular Expressions [UTS18].

2 Usage Model

The script values form a full partition of the code space: every code point is assigned a single script value. This is either the value of a specific script, such as Cyrillic, or one of the two special values Common and Inherited.

In some cases where a character is used with two or more related scripts, a multi-valued script value such as KatakanaOrHiragana may be assigned. As new scripts are added to the standard, additional script values will be added. See Section 3.2 Maintenance.

A character is assigned a specific Unicode script value (as opposed to Common or Inherited) only when it is clearly not used with other scripts. This facilitates the use of these script values for common tasks such as regular expressions, but means that some characters that are definite members of a given script by their graphology are nevertheless assigned one of the generic values. As more data on the usage of individual characters is collected, characters may be moved between the Common group and a more specific script (including Inherited).

2.1 Handling Characters with the Common Script Property

In determining the boundaries of a run of text in a given script, programs must resolve any of the special script values, such as Common, based upon the context of the surrounding characters. A simple heuristic uses the script of the preceding character, which works well in many cases. However, this may not always produce optimal results. For example, in the text "... gamma (γ) is ..." this heuristic would cause matching parentheses to be in different scripts.
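The preceding-character heuristic can be sketched in a few lines of Python. This is only an illustration: script_of is a hypothetical, deliberately crude stand-in for a real Scripts.txt lookup (it guesses from the character name), and resolve_by_preceding is a name invented for this example.

```python
import unicodedata

def script_of(ch):
    """Hypothetical, crude stand-in for a Scripts.txt lookup.

    Guesses the script from the character name; everything else is Common.
    """
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "CYRILLIC"):
        if name.startswith(script + " "):
            return script.capitalize()
    return "Common"

def resolve_by_preceding(text, default="Common"):
    """Give each Common character the resolved script of the character before it."""
    resolved = []
    previous = default
    for ch in text:
        script = script_of(ch)
        if script == "Common":
            script = previous
        resolved.append(script)
        previous = script
    return resolved

text = "gamma (γ) is"
scripts = resolve_by_preceding(text)
open_paren = scripts[text.index("(")]    # resolved from the Latin text before it
close_paren = scripts[text.index(")")]   # resolved from the Greek letter before it
```

Here the opening parenthesis follows Latin text while the closing one follows the Greek letter, so the two resolve differently, which is the weakness described above.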

Generally, paired punctuation, such as brackets or quotation marks, belongs to the enclosing or outer level of the text and should therefore match the script of the enclosing text. In addition, opening and closing elements of a pair must resolve to the same script value, where possible. The use of quotation marks is language dependent, so from the character code alone, it is not possible to tell whether a particular quotation mark is used as an opening or closing punctuation. For more information, see Section 6.2 of [Unicode].

Some characters that are normally used as paired punctuation may also be used singly. An example is U+2019 RIGHT SINGLE QUOTATION MARK, which is also used as an apostrophe and is then no longer an enclosing punctuation. An example from physics would be <ψ| or |ψ>, where the enclosing punctuation characters may not form consistent pairs.
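One possible way to keep bracket pairs consistent is to remember the script assigned to each opening bracket and reuse it for the matching closer. The sketch below assumes a hypothetical, crude script_of lookup based on character names, and handles only brackets whose direction is unambiguous. Quotation marks are left out because their role is language dependent, and unmatched characters simply fall back to the preceding-character heuristic.

```python
import unicodedata

def script_of(ch):
    """Hypothetical, crude stand-in for a Scripts.txt lookup (name based)."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "CYRILLIC"):
        if name.startswith(script + " "):
            return script.capitalize()
    return "Common"

# Only brackets with an unambiguous opening/closing role are paired here.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

def resolve_with_pairs(text, default="Common"):
    """Resolve Common characters, forcing a closer to match its opener."""
    resolved = []
    previous = default
    stack = []  # entries are (expected closer, script given to the opener)
    for ch in text:
        script = script_of(ch)
        if script == "Common":
            if ch in CLOSERS and stack and stack[-1][0] == ch:
                # Matching closer: reuse the script of the enclosing text.
                script = stack.pop()[1]
            else:
                script = previous
                if ch in PAIRS:
                    stack.append((PAIRS[ch], script))
        resolved.append(script)
        previous = script
    return resolved

text = "gamma (γ) is"
scripts = resolve_with_pairs(text)
```

With this approach both parentheses in the example resolve to the script of the enclosing Latin text, while a lone bracket as in |ψ> is resolved from the character before it.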

2.2 Handling Nonspacing Marks

Implementations that determine the boundaries between characters of given scripts should never break between a nonspacing mark (a character with General Category value of Mn or Me) and its base character. Thus for boundary determinations and similar sorts of processing, a nonspacing mark, whatever its script value, should inherit the script value of its base character.

Normally, a nonspacing mark has the Inherited script value to reflect this. However, in cases where the best interpretation of a nonspacing mark in isolation would be a specific script, its script property value may be different from Inherited. For example, the Hebrew marks and accents are used only with Hebrew characters and are therefore assigned the Hebrew script value.
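The following Python sketch illustrates the rule for boundary determination. The General Category test uses the standard unicodedata module; script_of is a hypothetical, name-based approximation of the script property, and resolve_marks is a name chosen for this example.

```python
import unicodedata

def script_of(ch):
    """Hypothetical, crude stand-in for a Scripts.txt lookup (name based)."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "HEBREW"):
        if name.startswith(script + " "):
            return script.capitalize()
    if unicodedata.category(ch) in ("Mn", "Me"):
        return "Inherited"
    return "Common"

def resolve_marks(text):
    """Pair each character with a script; marks take the script of their base."""
    result = []
    base_script = "Common"
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            # Whatever the mark's own value, it follows its base character.
            script = base_script
        else:
            script = script_of(ch)
            base_script = script
        result.append((ch, script))
    return result

latin = resolve_marks("e\u0301")        # e + COMBINING ACUTE ACCENT
greek = resolve_marks("\u03b1\u0301")   # alpha + COMBINING ACUTE ACCENT
```

In this sketch the combining acute accent on its own has the value Inherited, while a Hebrew point such as U+05BC is reported as Hebrew. During resolution, both kinds of mark follow their base character.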

2.3 Using Script Names in Regular Expressions

The script property is useful in regular expression syntax for easy specification of spans of text which consist of a single script or mixture of scripts. In general, regular expressions should only use specific script values in conjunction with both Common and Inherited. For example, to distinguish a sequence of characters appropriate for Greek, one would use:

((Greek | Common) (Inherited | Me | Mn)?)*

That is, characters that are either in Greek or in Common, optionally followed by those in Inherited. Some languages commonly use multiple scripts, so for Japanese one might use:

((Hiragana | Katakana | KatakanaOrHiragana | Han | Latin | Common) (Inherited | Me | Mn)?)*

Note that while including Latin in the above expression is necessary to ensure that it can cover the typical script use found in many Japanese texts, it would make it difficult to isolate a run of Japanese inside an English document, for example.

For more information, see UTS #18: Unicode Regular Expressions [UTS18].
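Python's built-in re module has no script property syntax, so the Greek pattern above is written out here as a small greedy matcher. This is an illustration only: script_of is a hypothetical, name-based approximation of the script property, and greek_run_length is a name chosen for this example.

```python
import unicodedata

def script_of(ch):
    """Hypothetical, crude stand-in for a Scripts.txt lookup (name based)."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "CYRILLIC"):
        if name.startswith(script + " "):
            return script.capitalize()
    if unicodedata.category(ch) in ("Mn", "Me"):
        return "Inherited"
    return "Common"

def is_mark(ch):
    """Matches the (Inherited | Me | Mn) part of the pattern."""
    return script_of(ch) == "Inherited" or unicodedata.category(ch) in ("Mn", "Me")

def greek_run_length(text):
    """Length of the prefix matched greedily by
    ((Greek | Common) (Inherited | Me | Mn)?)*
    """
    i = 0
    while i < len(text) and script_of(text[i]) in ("Greek", "Common"):
        i += 1
        # At most one mark may follow each Greek or Common character.
        if i < len(text) and is_mark(text[i]):
            i += 1
    return i

sample = "αβγ, δ\u0301 abc"
run = sample[:greek_run_length(sample)]
```

The run stops at the first Latin letter; the comma, the spaces, and the combining accent are absorbed because they are Common or Inherited.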

2.4 Limitations

The script values form a full partition of the Unicode code space, but that partition does not exhaust the possibilities for useful and relevant script-like subsets of Unicode characters.

For example, a user might wish to define a regular expression to span typical mathematical expressions, but the subset of Unicode characters used in mathematics does not correspond to any particular script. Instead, it requires use of the Math property, other character properties, and particular subsets of Latin, Greek, and Cyrillic letters. For information on other character properties, see the [UCD].

In texts of an academic, scientific, or engineering nature, the use of isolated Greek characters is common, for example Ω for Ohm, α, β, and γ for types of radioactive decay or in names of chemical compounds, π for 3.1415..., and so on. It is generally undesirable to treat such usage the same as ordinary text in the Greek script. Some commonly used characters, such as µ, already exist twice in the Unicode Standard, but with different script values.

2.5 Spoofing

The script property values may also be useful in providing user feedback to help signal possible spoofing, where visually similar characters (confusable characters) are substituted in an attempt to mislead a user. For example, a domain name such as macchiato.com could be spoofed with macchiatο.com (using GREEK SMALL LETTER OMICRON for the first 'o') or maссhiato.com (using CYRILLIC SMALL LETTER ES for the first two 'c's). The user can be alerted to odd cases by displaying mixed scripts with different colors, highlighting, or boundary marks: macchiatο.com or maссhiato.com, for example.

Possible spoofing is not limited to mixtures of scripts. Even in ASCII, there are confusable characters such as 0 and O, or 1 and l. For a more complete approach, the use of script values needs to be augmented with other information such as General Category values, and lists of individual characters that are not distinguished by other Unicode properties.
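A mixed-script check of the kind described in this section might look like the following Python sketch. The script_of function is a hypothetical, name-based approximation of the script property, and the function names are chosen for this example. As noted above, such a check does not catch confusables within a single script, such as 0 and O.

```python
import unicodedata

def script_of(ch):
    """Hypothetical, crude stand-in for a Scripts.txt lookup (name based)."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "CYRILLIC"):
        if name.startswith(script + " "):
            return script.capitalize()
    if unicodedata.category(ch) in ("Mn", "Me"):
        return "Inherited"
    return "Common"

def scripts_in(label):
    """Specific scripts used in a label, ignoring Common and Inherited."""
    return {script_of(ch) for ch in label} - {"Common", "Inherited"}

def looks_mixed(label):
    """True if the label draws letters from more than one specific script."""
    return len(scripts_in(label)) > 1

genuine = "macchiato.com"
greek_spoof = "macchiat\u03bf.com"      # U+03BF GREEK SMALL LETTER OMICRON
cyrillic_spoof = "ma\u0441\u0441hiato.com"  # U+0441 CYRILLIC SMALL LETTER ES
```

A user interface could use the result of looks_mixed to decide when to highlight a name, and scripts_in to decide which runs to mark.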

3 Values

The following table illustrates some of the script values used in the data file. The short name for the Unicode script value matches the ISO 15924 code. Further subdivisions of scripts into varieties by ISO 15924 are shown in parentheses. For a complete list of values and short names, see the Property Value Aliases [PropValue]. As with all property value aliases, the script values are not case-sensitive, and the presence of a hyphen or underscore is optional. The order in which the scripts are listed here or in the data file is not significant.

Table 2: Unicode Script Values and ISO 15924 Codes
Script Value ISO 15924
Common Zyyy
Inherited Qaai
LATIN Latn (Latf, Latg)
CYRILLIC Cyrl (Cyrs)
ARMENIAN Armn
HEBREW Hebr
ARABIC Arab
SYRIAC Syrc (Syrj, Syrn, Syre)
BRAILLE Brai
... ...

Although Braille is not a script in the same sense that Latin or Greek is, it is given a script value in [Scripts]. This is useful for the kinds of intended applications of these script values, such as matching spans of similar characters in regular expressions.
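The matching rule for script values stated above (not case-sensitive, hyphens and underscores optional) can be sketched as a simple normalization step. The table below holds only the illustrative values from Table 2, and the function names are chosen for this example. A real implementation would load the full list from [PropValue].

```python
def normalize_alias(value):
    """Fold case and drop hyphens and underscores before comparing values."""
    return value.replace("-", "").replace("_", "").lower()

# Illustrative subset only: long names and ISO 15924 short names from Table 2.
SHORT_NAMES = {
    "Common": "Zyyy",
    "Inherited": "Qaai",
    "Latin": "Latn",
    "Cyrillic": "Cyrl",
    "Armenian": "Armn",
    "Hebrew": "Hebr",
    "Arabic": "Arab",
    "Syriac": "Syrc",
    "Braille": "Brai",
}

# Map every normalized long or short name to one canonical spelling.
LOOKUP = {}
for long_name, short_name in SHORT_NAMES.items():
    LOOKUP[normalize_alias(long_name)] = long_name
    LOOKUP[normalize_alias(short_name)] = long_name

def canonical_script(value):
    """Return the canonical long name, or None if the value is not in the subset."""
    return LOOKUP.get(normalize_alias(value))
```

For instance, LATIN, latin, and Latn all resolve to the same value, while a name outside the subset yields None.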

3.1 Relation to ISO 15924 Codes

ISO 15924: Code for the Representation of Names of Scripts [ISO15924] provides an enumeration of four-letter script codes. In the [UCD] file [PropValue], corresponding codes from [ISO15924] are provided as short names for the scripts.

In some cases the match between these script values and the ISO 15924 codes is not precise, because the goals are somewhat different. ISO 15924 is aimed primarily at the bibliographic identification of scripts; consequently it occasionally identifies varieties of scripts that may be useful for book cataloging, but which are not considered distinct as scripts in the Unicode Standard. For example, ISO 15924 has separate script codes for the Fraktur and Gaelic varieties of the Latin script.

Where there are no corresponding ISO 15924 codes, private-use codes starting with Q are used. Such values are likely to change in the future. When they do, the Q-names will be retained as aliases in [PropValue] for backwards compatibility.

3.2 Maintenance

New characters and scripts are continually added to the Unicode Standard in an ongoing process. The following methodology is used to assign script values when new characters are added to the Unicode Standard:

  1. If a character is only used in one script, assign it to that script.

  2. Otherwise, nonspacing marks (Mn, Me) are Inherited.

  3. Otherwise, letters are in a "joint" script (such as KatakanaOrHiragana).

  4. Otherwise, use Common.


As more data on the usage of individual characters is collected, script values may be reassigned using the above methodology.
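The four steps above can be read as a simple decision procedure. The Python sketch below is one interpretation, not the procedure actually used in maintaining the data: the inputs (the set of scripts a character is used with, and its General Category), the JOINT_VALUES table, and the fallback to Common for letters without a joint value are all assumptions made for illustration.

```python
# Hypothetical table of "joint" script values, keyed by the scripts they combine.
JOINT_VALUES = {
    frozenset({"Hiragana", "Katakana"}): "KatakanaOrHiragana",
}

def assign_script(used_with, general_category):
    """Apply the four-step methodology to a single character.

    used_with: set of script names the character is used with.
    general_category: the General Category value, such as "Lo" or "Mn".
    """
    # Step 1: a character used in only one script gets that script.
    if len(used_with) == 1:
        return next(iter(used_with))
    # Step 2: nonspacing and enclosing marks are Inherited.
    if general_category in ("Mn", "Me"):
        return "Inherited"
    # Step 3: letters get a joint value (assumed here: only when one is defined).
    if general_category.startswith("L"):
        joint = JOINT_VALUES.get(frozenset(used_with))
        if joint is not None:
            return joint
    # Step 4: everything else is Common.
    return "Common"
```

Note that the order of the steps matters: a nonspacing mark used with only one script, such as the Oriya sign in the data file example, is assigned that script in step 1 and never reaches step 2.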
 

4 Data File

The Scripts.txt data file is available at [Scripts]. The format of the file is similar to that of Blocks.txt [Blocks]. The fields are separated by semicolons. The first field contains either a single code point, or the first and last code points in a range separated by "..". The second field provides the script value for that range. The comment (after a #) indicates the General Category, and the character name. For each range, it adds the character count in square brackets and uses the names for the first and last characters in the range. For example:

0B01;       ORIYA # Mn ORIYA SIGN CANDRABINDU
0B02..0B03; ORIYA # Mc [2] ORIYA SIGN ANUSVARA..ORIYA SIGN VISARGA

The value Common is the default value, given to all code points that are not explicitly mentioned in the data file.
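A reader for this format can be sketched in Python as follows. The function names are chosen for this example, and the linear lookup is only meant to show the semantics, including the Common default. The sample lines are the two shown above.

```python
def parse_scripts(lines):
    """Parse Scripts.txt-style lines into (first, last, script) tuples."""
    ranges = []
    for line in lines:
        # Everything after '#' is a comment; skip blank or comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        code_field, script = fields[0], fields[1]
        # The first field is one code point or a "first..last" range.
        first, _, last = code_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        ranges.append((start, end, script))
    return ranges

def lookup(ranges, code_point):
    """Return the script value of a code point; unlisted code points are Common."""
    for start, end, script in ranges:
        if start <= code_point <= end:
            return script
    return "Common"

sample = [
    "0B01;       ORIYA # Mn ORIYA SIGN CANDRABINDU",
    "0B02..0B03; ORIYA # Mc [2] ORIYA SIGN ANUSVARA..ORIYA SIGN VISARGA",
]
ranges = parse_scripts(sample)
```

For the full data file, a sorted list with binary search or a precomputed table would be a more practical choice than the linear scan used here.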

References

[Blocks] Blocks.txt
For the latest version, see:
http://www.unicode.org/Public/UNIDATA/Blocks.txt
For other versions, see:
http://www.unicode.org/standard/versions/
[Charts] Script Charts
http://www.unicode.org/reports/tr24/charts/
[Feedback] Reporting Errors and Requesting Information Online
http://www.unicode.org/reporting.html
[FAQ] Unicode Frequently Asked Questions
http://www.unicode.org/faq/
For answers to common questions on technical issues.
[Glossary] Unicode Glossary
http://www.unicode.org/glossary/
For explanations of terminology used in this and other documents.
[ISO15924] ISO 15924: Code for the Representation of Names of Scripts
http://www.unicode.org/iso15924/
[PropValue] Property Value Aliases data file
For the latest version, see:
http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt
For other versions, see:
http://www.unicode.org/standard/versions/
[Reports] Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[Scripts] Scripts data file
For the latest version, see:
http://www.unicode.org/Public/UNIDATA/Scripts.txt
For other versions, see:
http://www.unicode.org/standard/versions/
[UCA] Unicode Technical Standard #10: Unicode Collation Algorithm
http://www.unicode.org/reports/tr10/
[UCD] Unicode Character Database
http://www.unicode.org/ucd
For an overview of the Unicode Character Database and a list of its associated files.
[Unicode] The Unicode Standard
For the latest version see:
http://www.unicode.org/versions/latest/.
For the current version see: http://www.unicode.org/versions/Unicode4.1.0/.
For the last major version see: The Unicode Consortium. The Unicode Standard, Version 4.0. (Boston, MA, Addison-Wesley, 2003. 0-321-18578-1).
[UTS18] Unicode Technical Standard #18: Unicode Regular Expressions
http://www.unicode.org/reports/tr18/
[Versions] Versions of the Unicode Standard
http://www.unicode.org/standard/versions
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.

Modifications

Each of the following entries summarizes modifications from the previous version of this document.

6
  • Major rewrite of Introduction and Usage Model. Added section on Maintenance and table of classification types
5
  • Changed to Proposed Update UAX
  • Added note on the stability of Q names
  • Abbreviated the list of values, so that people would not get the mistaken impression that it was complete
  • Added note on Braille
  • Added note on Mn, Me characters
  • Added note on use of scripts with regard to spoofing
  • Minor edits
4
  • Updated references, including reference to Property Value Aliases
  • Clarified that the list is for illustration only; the definitive values are in the UCD
  • Minor edits
3
  • Minor link editing only