ISO/IEC JTC 1/SC 2/WG 2
Universal Multiple-Octet Coded Character Set
(UCS)
ISO/IEC JTC1/SC2/WG2 N1838
Date: 1998-09-02
Title: |
Proposal to add the letters LATIN SMALL / CAPITAL LETTER A WITH DOT ABOVE to the BMP |
Source: |
Mark Davis |
Status: |
Expert Contribution |
Action: |
For consideration by JTC1/SC2/WG2 |
This document contains the proposal summary (ISO/IEC JTC1/SC2/WG2 form N1352) and a full proposal for the encoding of two new characters in the BMP of ISO/IEC 10646.
1. |
Title |
Proposal to add LATIN SMALL/CAPITAL LETTER A WITH DOT ABOVE to the BMP |
2. |
Requester's name |
|
3. |
Requester type |
Expert contribution |
4. |
Submission date |
1998-09-02 |
5. |
Requester's reference |
|
6a. |
Completion |
This is a complete proposal. |
6b. |
More information to be provided? |
No |
1a. |
New script? Name? |
No |
1b. |
Addition of characters to existing block? Name? |
Yes, to Latin. Suggested locations are U+1E9C/U+1E9D. However, the characters could be added at any reasonable place in the BMP. |
2. |
Number of characters |
2 |
3. |
Proposed category |
Category A |
4. |
Proposed level of implementation and rationale |
Level 1 |
5a. |
Character names included in proposal? |
Yes |
5b. |
Character names in accordance with guidelines? |
Yes |
5c. |
Character shapes reviewable? |
Yes |
6a. |
Who will provide computerized font? |
Mark Davis |
6b. |
Font currently available? |
No, but it can be generated quickly |
6c. |
Font format? |
TrueType |
7a. |
Are references (to other character sets, dictionaries, descriptive texts, etc.) provided? |
N/A--See below |
7b. |
Are published examples (such as samples from newspapers, magazines, or other sources) of use of proposed characters attached? |
N/A--See below |
8. |
Does the proposal address other aspects of character data processing? |
Yes |
1. |
Has this proposal been submitted before? |
No |
2. |
Contact with the user community? |
N/A--See below |
3. |
Information on the user community? |
N/A--See below |
4a. |
The context of use for the proposed characters? |
N/A--See below |
4b. |
Reference |
N/A--See below |
5a. |
Proposed characters in current use? |
N/A--See below |
5b. |
Where? |
N/A--See below |
6a. |
Characters should be encoded entirely in BMP? |
Yes |
6b. |
Rationale |
Required for efficient normalization of Unicode/10646, as described below. |
7. |
Should characters be kept in a continuous range? |
It would be useful, but not absolutely necessary |
8a. |
Can the characters be considered a presentation form of an existing character or character sequence? |
To the same degree as U+01E0 LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON |
8b. |
Where? |
N/A--See below |
8c. |
Reference |
N/A--See below |
9a. |
Can any of the characters be considered to be similar (in appearance or function) to an existing character? |
No |
9b. |
Where? |
|
9c. |
Reference |
|
10a. |
Combining characters or use of composite sequences included? |
No |
10b. |
List of composite sequences and their corresponding glyph images provided? |
No |
11. |
Characters with any special properties such as control function, etc. included? |
No |
To be completed by SC2/WG2
1. |
Relevant SC 2/WG 2 document numbers: |
|
2. |
Status (list of meeting number and corresponding action or disposition) |
|
3. |
Additional contact to user communities, liaison organizations etc. |
|
4. |
Assigned category and assigned priority/time frame |
|
5. |
Other Comments |
|
While the character A WITH DOT ABOVE may indeed occur in natural languages or academic use, the principal reason for this proposal has to do with the nature of normalization. There has been a great deal of interest in providing complete specifications for different normalized forms of Unicode/10646. (Cf. http://www.unicode.org/unicode/reports/techreports.html)
One of the normalization forms of particular interest is one that normalizes to precomposed forms--for example, that always uses U+00C0 LATIN CAPITAL LETTER A WITH GRAVE instead of the sequence of A with a separate combining grave accent <U+0041, U+0300>.
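As a small illustration, Python's standard `unicodedata` module can show the two spellings converging on one form. The assumption here is that the module's "NFC" form, which was standardized after this proposal was written, corresponds to the precomposed normalization form described above.

```python
import unicodedata

decomposed = "\u0041\u0300"   # A followed by COMBINING GRAVE ACCENT
precomposed = "\u00C0"        # LATIN CAPITAL LETTER A WITH GRAVE

# Normalizing to the composed form replaces the two-element
# sequence with the single precomposed character.
normalized = unicodedata.normalize("NFC", decomposed)
is_same = (normalized == precomposed)
```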
Implementations can be particularly efficient if Unicode and 10646 are coded such that whenever a single composed character X is canonically equivalent to a composed character sequence <B, C1, C2, ..., Cn>, there is another composed character Y that is equivalent to the same sequence without its final combining mark, <B, C1, C2, ..., Cn-1>. For the purposes of this discussion, Y is called the completion character for X. If X does not have a completion character, X is called incomplete. Notice that only characters whose sequences contain two or more combining marks need to be checked for completeness.
There are only two incomplete characters in 10646:
U+01E0 LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON
U+01E1 LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
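The completeness check defined above can be sketched as follows. This is a minimal illustration and not a survey of the full repertoire: `DECOMP` is a hypothetical hand-built table holding only a few full decompositions, and the function name is an assumption made for this example. The code points U+1E9C and U+1E9D are the suggested locations from the proposal summary, not assigned values.

```python
# Hypothetical excerpt: composed character -> full canonical decomposition.
DECOMP = {
    0x00C0: (0x0041, 0x0300),          # A WITH GRAVE
    0x00C2: (0x0041, 0x0302),          # A WITH CIRCUMFLEX
    0x1EA6: (0x0041, 0x0302, 0x0300),  # A WITH CIRCUMFLEX AND GRAVE
    0x01E0: (0x0041, 0x0307, 0x0304),  # A WITH DOT ABOVE AND MACRON
    0x01E1: (0x0061, 0x0307, 0x0304),  # a WITH DOT ABOVE AND MACRON
}

def incomplete_characters(table):
    """Return the composed characters that lack a completion character."""
    known = set(table.values())
    result = []
    for cp, seq in table.items():
        # Only sequences with two or more combining marks need checking.
        if len(seq) >= 3 and seq[:-1] not in known:
            result.append(cp)
    return sorted(result)

before = incomplete_characters(DECOMP)

# Add the two proposed characters at the suggested (unassigned) locations.
extended = dict(DECOMP)
extended[0x1E9C] = (0x0041, 0x0307)    # A WITH DOT ABOVE
extended[0x1E9D] = (0x0061, 0x0307)    # a WITH DOT ABOVE
after = incomplete_characters(extended)
```

On this toy table, `before` holds U+01E0 and U+01E1, and `after` is empty once the two completion characters are present.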
By adding the completion characters for these two, LATIN CAPITAL LETTER A WITH DOT ABOVE and LATIN SMALL LETTER A WITH DOT ABOVE, we can ensure that implementations of normalization can uniformly apply the best algorithms to all text. By not having to check for special cases, the inner loops of the transformations can be as fast as possible.
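To make the efficiency argument concrete, here is a simplified sketch of a pairwise composition loop. It is an assumption about how an implementation might be structured, not a description of any particular one: it merges each code point only with the immediately preceding output element and ignores combining-class ordering. The names `compose`, `PAIRS_WITHOUT`, and `PAIRS_WITH` are hypothetical, and U+1E9C is the suggested location from the proposal summary.

```python
def compose(seq, pairs):
    """Greedy left-to-right composition using a (previous, mark) -> composite table."""
    out = []
    for cp in seq:
        merged = pairs.get((out[-1], cp)) if out else None
        if merged is not None:
            out[-1] = merged      # fold the mark into the preceding character
        else:
            out.append(cp)        # no pairwise composite exists
    return out

A, DOT_ABOVE, MACRON = 0x0041, 0x0307, 0x0304

# Without a completion character there is no pair that leads to U+01E0,
# so a purely pairwise loop cannot reach it without a special case.
PAIRS_WITHOUT = {}
unchanged = compose([A, DOT_ABOVE, MACRON], PAIRS_WITHOUT)

# With A WITH DOT ABOVE available, two ordinary pairwise steps suffice.
PAIRS_WITH = {
    (A, DOT_ABOVE): 0x1E9C,       # suggested location, not an assigned value
    (0x1E9C, MACRON): 0x01E0,
}
composed = compose([A, DOT_ABOVE, MACRON], PAIRS_WITH)
```

With the completion character in the table, the loop needs no lookup keyed on three-element sequences, which is the special case the proposal seeks to avoid.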
The value of composed characters is fundamentally a product of their usefulness in implementations, since each could instead be expressed as a composed character sequence. This is a special case where the addition of these characters is of particular value.
|
LATIN CAPITAL LETTER A WITH DOT ABOVE |
|
LATIN SMALL LETTER A WITH DOT ABOVE |
XXXX;LATIN CAPITAL LETTER A WITH DOT ABOVE;Lu;0;L;0041 0307;;;;N;;;;YYYY;
YYYY;LATIN SMALL LETTER A WITH DOT ABOVE;Ll;0;L;0061 0307;;;;N;;;XXXX;;XXXX
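The two records above follow the semicolon-separated layout of the Unicode character database file, with XXXX and YYYY standing in for the code points still to be assigned. A minimal parsing sketch, assuming the 15-field layout of that file and using field names chosen for this example, shows how the case mappings pair the two characters:

```python
# Field names are labels chosen for this sketch; the order follows the records.
FIELDS = [
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal", "digit", "numeric", "mirrored",
    "old_name", "comment", "uppercase", "lowercase", "titlecase",
]

RECORDS = [
    "XXXX;LATIN CAPITAL LETTER A WITH DOT ABOVE;Lu;0;L;0041 0307;;;;N;;;;YYYY;",
    "YYYY;LATIN SMALL LETTER A WITH DOT ABOVE;Ll;0;L;0061 0307;;;;N;;;XXXX;;XXXX",
]

def parse_record(line):
    """Split one record into a dict keyed by field name."""
    values = line.strip().split(";")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))

capital, small = (parse_record(r) for r in RECORDS)

# The capital letter maps to the small one as its lowercase, and the
# small letter maps back as both its uppercase and its titlecase.
round_trip = (capital["lowercase"] == small["code"]
              and small["uppercase"] == capital["code"]
              and small["titlecase"] == capital["code"])
```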