Re: Proposed Informative Data File: Scripts
By: Mark Davis
On: 2000.04.21

I propose that the following file be considered for inclusion as an informative data file.


Scripts

Scripts.txt provides a mapping from Unicode characters to script codes. This information is useful, for example, in regular expressions (see UTR #18: Unicode Regular Expression Guidelines).

The script codes provided here generally follow the draft ISO 15924 (http://www.egt.ie/standards/iso15924/). They deviate slightly in semantics for two reasons. First, ISO 15924 applies only to characters, whereas software must also deal with illegal or unassigned codepoints. Second, ISO 15924 treats typeface variants as distinct scripts: Latin letters in Fraktur are considered to have a different script than Latin letters in a Roman typeface.

Script values cannot simply be extracted from the block ranges in Blocks.txt: in some cases a block contains more than one script; in other cases a single script is split over several blocks. The script codes form a full partition: every codepoint is assigned exactly one script code. Note, however, that although script codes are often more useful than simple block codes, they do not identify a language: a language may use characters from more than one script. This is especially the case for non-letters. For that reason, only characters of general category Letter are given distinct script codes; all others are given the script code Zy, for undetermined script.

The format of the file is similar to Blocks.txt. The fields are separated by semicolons. The first two fields provide the first and last characters in a range. The third field provides the script code for that range. The comment (after a #) provides the names for the first and last characters in the range.
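As a rough illustration of the format just described, the following Python sketch parses such lines into ranges. It assumes, by analogy with Blocks.txt, that the first two fields are hexadecimal codepoints. The sample lines, the script codes in them, and the names `ScriptRange` and `parse_scripts` are hypothetical and are not taken from the proposed file.

```python
from typing import List, NamedTuple


class ScriptRange(NamedTuple):
    first: int   # first codepoint in the range (inclusive)
    last: int    # last codepoint in the range (inclusive)
    script: str  # script code for the range


def parse_scripts(text: str) -> List[ScriptRange]:
    """Parse semicolon-separated 'first; last; script # comment' lines."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue  # blank or comment-only line
        first, last, script = (field.strip() for field in line.split(";"))
        # Assumption: codepoints are written in hexadecimal, as in Blocks.txt.
        ranges.append(ScriptRange(int(first, 16), int(last, 16), script))
    return ranges


# Hypothetical sample lines; the codes and ranges are illustrative only.
SAMPLE = """\
0041; 005A; La # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0391; 03A1; Gr # GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LETTER RHO
"""

ranges = parse_scripts(SAMPLE)
```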

Codepoints are given special codes if they are not assigned to characters, or are illegal, or do not fall in one of the ranges in the Scripts.txt file:

Script Code   Codepoints                                             15924 Description
Zi            Illegal (< 0, > 10FFFF, D800..DFFF, **FFFE, **FFFF)    Not in 15924
Zy            Unassigned in the current version of Unicode           Undetermined script
              (UniData.txt)
Zn            Not a letter (Lu, Ll, Lt, Lo in UniData.txt);          Not in 15924
              also not in a range in Scripts.txt

Thus any software that uses this table must first separate out the above codepoints and give them the appropriate script values according to the above table. For the remaining codepoints, the script table is used to supply a value according to the range the codepoint falls in.
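The two-step procedure above might be sketched in Python as follows. This is only one reading of the table: illegal codepoints are tested first, then unassigned ones, then non-letters, and a letter that falls in no range is also given Zn. The `RANGES` table and the names `is_illegal` and `script_of` are hypothetical, and Python's `unicodedata` module stands in for UniData.txt, so it reflects whichever Unicode version ships with the interpreter rather than the version current at the time of this proposal.

```python
import bisect
import unicodedata

# Hypothetical (first, last, script) ranges, sorted by first codepoint.
RANGES = [(0x0041, 0x005A, "La"), (0x0061, 0x007A, "La"), (0x0391, 0x03A1, "Gr")]
STARTS = [first for first, _, _ in RANGES]


def is_illegal(cp: int) -> bool:
    """Out of range, a surrogate, or a **FFFE / **FFFF codepoint."""
    return (cp < 0 or cp > 0x10FFFF
            or 0xD800 <= cp <= 0xDFFF
            or (cp & 0xFFFF) in (0xFFFE, 0xFFFF))


def script_of(cp: int) -> str:
    # Step 1: separate out the special codepoints.
    if is_illegal(cp):
        return "Zi"
    category = unicodedata.category(chr(cp))
    if category == "Cn":  # unassigned in the interpreter's Unicode data
        return "Zy"
    if category not in ("Lu", "Ll", "Lt", "Lo"):
        return "Zn"  # not a letter, per the categories listed in the table
    # Step 2: binary search for the range containing the codepoint.
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return "Zn"  # a letter outside every listed range
```

With the hypothetical ranges above, `script_of(0x41)` yields the range's code, while `script_of(0xD800)` and `script_of(0x1FFFE)` are caught by the illegal-codepoint test before any table lookup takes place.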