L2/09-120

Source: Mark Davis
Date: April 10, 2009
Subject: Allocating Unicode Characters

We need to make sure people understand that certain ranges of Unicode code points have restrictions on their contents, for stability. While some of these restrictions are captured in the WG2 Principles and Procedures (http://www.dkuug.dk/JTC1/SC2/WG2/docs/n3452.pdf), others are missing or unclear. I suggest adding new sections:

D.2.6. Reserved code points for BIDI
D.2.7. Reserved code points for Default Ignorable Code Points

In addition, if people have programs for checking other consistency issues for new code points, like name collisions, we should encourage them to add tests for these as well, to at least flag those cases.

Details

1. There is a range allocated for non-identifier characters -- nothing suitable for use in identifiers is encoded there.

Properties: Pattern_Syntax

http://unicode.org/reports/tr31/#Default_Identifier_Syntax

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Pattern_Syntax:]

There is already a section in N3452 on this: D.2.5 Reserved code points for stability of identifiers. Despite that, there is currently one character that would otherwise qualify as an identifier character: U+2E2F ( ⸯ ) VERTICAL TILDE, allocated in U5.1. This particular instance is not a problem, but this should not be repeated.
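
A check of this kind could be sketched as follows. This is only an illustration, not an existing tool: the function name, the set of general categories treated as identifier-suitable, and the caller-supplied list of Pattern_Syntax ranges are all assumptions made for the example.

```python
# Hypothetical helper; none of these names come from N3452 or UAX #31.
# Assumption: these general categories are the ones "suitable for identifiers"
# (letters, letter numbers, marks, digits, connector punctuation).
IDENTIFIER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd", "Pc"}


def flags_identifier_conflict(code_point, general_category, pattern_syntax_ranges):
    """Return True if a proposed character lies in a Pattern_Syntax range
    but has a general category that would suit identifiers."""
    in_reserved = any(lo <= code_point <= hi for lo, hi in pattern_syntax_ranges)
    return in_reserved and general_category in IDENTIFIER_CATEGORIES
```

For example, flags_identifier_conflict(0x2E2F, "Lm", [(0x2E00, 0x2E7F)]) flags the VERTICAL TILDE case described above; the single range passed here is illustrative, and a real check would load the full Pattern_Syntax set.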

2. There are ranges allocated for right-to-left (Bidi) characters - no left-to-right characters (Bidi_Class=L, [:bc=L:]) are encoded there.

Ranges:
U+0590…U+08FF,
U+FB1D…U+FB4F,
U+00010800…U+00010FFF,
plus [:blk=Arabic Presentation Forms A:][:blk=Arabic Presentation Forms B:]

http://unicode.org/reports/tr9/#Directional_Formatting_Codes
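
A consistency test for this restriction might look like the sketch below. The function and constant names are hypothetical. The first three ranges are copied from the list above; the two Arabic Presentation Forms blocks are filled in with their block ranges (U+FB50…U+FDFF and U+FE70…U+FEFF), which the list above gives only by name.

```python
# Hypothetical sketch: flag a proposed character with Bidi_Class=L that
# falls inside one of the ranges reserved for right-to-left characters.
BIDI_RESERVED_RANGES = [
    (0x0590, 0x08FF),
    (0xFB1D, 0xFB4F),
    (0x10800, 0x10FFF),
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A (block range filled in here)
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B (block range filled in here)
]


def flags_bidi_conflict(code_point, bidi_class):
    """Return True if a proposed bc=L character lies in a reserved Bidi range."""
    in_reserved = any(lo <= code_point <= hi for lo, hi in BIDI_RESERVED_RANGES)
    return in_reserved and bidi_class == "L"
```

A proposal tool would call this once per proposed code point with the proposed Bidi_Class value and report any True result for review.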

3. There are ranges allocated for default ignorable (DI) characters - no other characters are allocated there, and all new DI characters should be allocated there.

Ranges:
U+2065…U+2069
U+FFF0…U+FFF8
U+E0000…U+E0FFF

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[:di:]&[:cn:]]
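
Both halves of this rule can be tested in one place. The following is a minimal sketch under the assumption that the checker is told whether the proposed character is meant to be default ignorable; the names and the message strings are hypothetical, and the ranges are the three listed above.

```python
# Hypothetical sketch: check a proposed character against the ranges
# reserved for Default_Ignorable_Code_Point characters.
DI_RESERVED_RANGES = [
    (0x2065, 0x2069),
    (0xFFF0, 0xFFF8),
    (0xE0000, 0xE0FFF),
]


def check_default_ignorable(code_point, is_default_ignorable):
    """Return a warning string if the proposal breaks the rule, else None."""
    in_reserved = any(lo <= code_point <= hi for lo, hi in DI_RESERVED_RANGES)
    if in_reserved and not is_default_ignorable:
        return "non-DI character proposed in a reserved DI range"
    if is_default_ignorable and not in_reserved:
        return "new DI character proposed outside the reserved DI ranges"
    return None
```

Returning a message rather than raising an error fits the goal of flagging such cases for review rather than rejecting them outright.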

Mark