Universal Multiple-Octet Coded Character Set
(U C S)


Date: 2000-08-09



Proposal for addition of COMBINING GRAPHEME JOINER


Unicode Technical Committee


Liaison Communication


For consideration by JTC1/SC2/WG2

Graphemes are sequences of one or more encoded characters that correspond to what users think of as characters. They include, but are not limited to, combining character sequences such as (g + °), digraphs such as Slovak “ch”, or sequences with letter modifiers such as kw. Grapheme boundaries are important for collation, regular expressions, and counting “character” positions within text. The Unicode Standard provides a determination of where the default grapheme boundaries fall in a string of characters. This algorithm can be overridden for specific locales, as is done when contracting characters are provided in collation tailoring tables. For more information, see [Boundaries].
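As a rough illustration of the difference between code points and graphemes, the following Python sketch counts both. It is a deliberately simplified stand-in for the default boundary determination described in [Boundaries], not that algorithm itself. The function names `default_clusters` and `tailored_clusters`, and the one-entry digraph list, are hypothetical and introduced only for this example.

```python
import unicodedata


def default_clusters(text):
    """Toy default segmentation: attach each combining mark to the preceding character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def tailored_clusters(text, digraphs=("ch",)):
    """Toy locale override: merge adjacent clusters that spell a listed digraph."""
    merged = []
    for cluster in default_clusters(text):
        if merged and (merged[-1] + cluster).lower() in digraphs:
            merged[-1] += cluster
        else:
            merged.append(cluster)
    return merged


ring = "g\u030A"  # g followed by COMBINING RING ABOVE
print(len(ring), len(default_clusters(ring)))          # two code points, one cluster
print(len("chata"), len(tailored_clusters("chata")))   # five code points, four clusters
```

The override is layered on top of the default segmentation, which mirrors the way a locale tailoring adds contractions without replacing the default rules.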

There are circumstances where even the locale-specific determination of grapheme boundaries may need to be overridden on a local basis. These include:

  1. Determining the placement of combining accents that should apply to a sequence of base characters, rather than a single base character.
  2. Distinguishing in collation between sequences of characters that are normally considered a grapheme in a particular language, and that same sequence in foreign words.

To address this issue, the UTC has approved the addition of a new character at U+0363, the COMBINING GRAPHEME JOINER. The properties of this character are tuned to current software, so that processes such as grapheme determination, line-break, and collation will work well with it. In terms of grapheme determination it functions like the virama. As with a virama, the grapheme joiner is only useful if immediately followed by a base character, so it should always be placed at the end of a combining character sequence. Thus a sequence of <base, grapheme joiner, base> will function as a single grapheme.
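The joining behavior described above can be sketched in Python as follows. This is a toy segmentation under stated assumptions, not the boundary algorithm of the standard: `JOINER` is set to the code point proposed in this document and is treated purely as an illustrative constant, and `toy_graphemes` is a hypothetical helper name.

```python
import unicodedata

# Assumption: the code point proposed in this document, used only as an
# illustrative constant for the joiner.
JOINER = "\u0363"


def toy_graphemes(text, joiner=JOINER):
    """Split text into clusters, keeping <base, joiner, base> together."""
    clusters = []
    glue_next = False  # True when the previous character was the joiner
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if clusters and (glue_next or ch == joiner or is_mark):
            clusters[-1] += ch      # extend the current cluster
        else:
            clusters.append(ch)     # start a new cluster
        glue_next = (ch == joiner)
    return clusters


print(len(toy_graphemes("th")))                          # 2
print(len(toy_graphemes("t" + JOINER + "h")))            # 1
print(len(toy_graphemes("t" + JOINER + "h" + "\u0332"))) # 1, underbar attaches to the joined pair
```

Because the joiner only glues the character that immediately follows it, a joiner at the end of the text has no effect beyond extending the final cluster, which matches the statement that it is only useful when followed by a base character.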

In terms of line-break, the character is in the category GLUE (the same as a zero-width no-break space; see [LineBreak]). In collation, the grapheme joiner should be ignored unless it specifically occurs within a tailored collation element mapping. Thus it is given a completely ignorable collation element in the default collation table, like NULL (see [Collation]). However, it can be entered into the tailoring rules for any given language, using the UCA/14651 tailoring capabilities.
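A minimal Python sketch of the collation behavior just described: the joiner contributes nothing to the sort key by default, but a tailoring may list a sequence that contains it. The single-level weights, the `sort_key` function, the `JOINER` constant, and the sample tailoring entry are all assumptions made for illustration. They are not taken from [Collation], and the choice of which sequence a real tailoring would list is not specified here.

```python
# Assumption: the code point proposed in this document, used only as an
# illustrative constant for the joiner.
JOINER = "\u0363"


def sort_key(text, tailoring=None):
    """Toy single-level sort key; the joiner is ignored unless a tailored entry contains it."""
    tailoring = tailoring or {}
    entries = sorted(tailoring, key=len, reverse=True)  # try longest sequences first
    key = []
    i = 0
    while i < len(text):
        for seq in entries:
            if text.startswith(seq, i):
                key.append(tailoring[seq])  # tailored collation element
                i += len(seq)
                break
        else:
            ch = text[i]
            if ch != JOINER:                # completely ignorable by default
                key.append((ord(ch), 0))
            i += 1
    return key


# Hypothetical tailoring: the joined sequence sorts as one unit just after "h".
tailoring = {"c" + JOINER + "h": (ord("h"), 1)}

print(sort_key("c" + JOINER + "h") == sort_key("ch"))   # True: ignored by default
print(sort_key("c" + JOINER + "h", tailoring))          # [(104, 1)]
print(sort_key("ch", tailoring))                        # [(99, 0), (104, 0)]
```

With this tailoring, the plain sequence "ch" keeps its default ordering while the joined sequence sorts as a single unit, so the two can be distinguished within one language.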

In terms of display, the grapheme joiner is an invisible combining character with canonical class of zero. It binds adjacent characters into a single grapheme as the base for combining marks, such as an underbar in "th". For any specified repertoire, implementation support for this capability can be provided by means of ligature tables in the font, or by means of special placement rules (see [OpenType]). Some display engines may be able to supply runtime generative support. As with other combining marks, there is considerable latitude for display depending on the environment (such as the choice of font). Some possibilities are:

The UTC urges WG2 to also approve this character for addition to ISO 10646. The character should be encoded in the BMP, since it is similar to other characters there.


ISO/IEC JTC 1/SC 2/WG 2 - N2236 Attachment

Please fill Sections A, B and C below. Section D will be filled by SC 2/WG 2.

For instructions and guidance for filling in the form please see the document "Principles and Procedures for Allocation of New Characters and Scripts" (http://www.dkuug.dk/JTC1/SC2/WG2/prot)

A. Administrative


2. Requester's name: Unicode Technical Committee

3. Requester type (Member body/Liaison/Individual contribution):    Liaison

4. Submission date: 2000-08-10

5. Requester's reference (if applicable):

6. (Choose one of the following:)
This is a complete proposal: Yes; or,
More information will be provided later:

B. Technical - General

1. (Choose one of the following:)

a. This proposal is for a new script (set of characters): No

Proposed name of script:

b. The proposal is for addition of character(s) to an existing block: Yes

Name of the existing block: 0300; 036F; Combining Diacritical Marks

2. Number of characters in proposal: One

3. Proposed category (see section II, Character Categories): Combining Mark

4. Proposed Level of Implementation (see clause 15, ISO/IEC 10646-1): Any level is acceptable

Is a rationale provided for the choice? N/A

If Yes, reference:

5. Is a repertoire including character names provided?: Yes

a. If YES, are the names in accordance with the 'character naming guidelines' in Annex K of ISO/IEC 10646-1? Yes

b. Are the character shapes attached in a reviewable form? N/A

6. Who will provide the appropriate computerized font (ordered preference: True Type, PostScript or 96x96 bit-mapped format) for publishing the standard? The Unicode Technical Committee

If available now, identify source(s) for the font (include address, e-mail, ftp-site, etc.) and indicate the tools used:

7. References:

a. Are references (to other character sets, dictionaries, descriptive texts etc.) provided? N/A

b. Are published examples (such as samples from newspapers, magazines, or other sources) of use of proposed characters attached? N/A

8. Special encoding issues:

Does the proposal address other aspects of character data processing (if applicable) such as input, presentation, sorting, searching, indexing, transliteration etc. (if yes please enclose information): Yes, see ISO/IEC JTC1/SC2/WG2 N2236

C. Technical - Justification

1. Has this proposal for addition of character(s) been submitted before? No

If YES explain

2. Has contact been made to members of the user community (for example: National Body, user groups of the script or characters, other experts, etc.)? Yes

If YES, with whom? Unicode member companies (see http://www.unicode.org/unicode/consortium/memblogo.html)

If YES, available relevant documents?

3. Information on the user community for the proposed characters (for example: size, demographics, information technology use, or publishing use) is included? major IT industry leaders


4. The context of use for the proposed characters (type of use; common or rare) YES

Reference: see

5. Are the proposed characters in current use by the user community? N/A

If YES, where? Reference:

6. After giving due considerations to the principles in N 1352 must the proposed characters be entirely in the BMP? Yes

If YES, is a rationale provided? Yes

If YES, reference: Yes, see

7. Should the proposed characters be kept together in a contiguous range (rather than being scattered)? N/A

8. Can any of the proposed characters be considered a presentation form of an existing character or character sequence? No

If YES, is a rationale for its inclusion provided?

If YES, reference:

9. Can any of the proposed character(s) be considered to be similar (in appearance or function) to an existing character? No

If YES, is a rationale for its inclusion provided?

If YES, reference:

10. Does the proposal include use of combining characters and/or use of composite sequences (see clause 4.11 and 4.13 in ISO/IEC 10646-1)? Yes

If YES, is a rationale for such use provided? Yes

If YES, reference: see

Is a list of composite sequences and their corresponding glyph images (graphic symbols) provided? No

If YES, reference:

11. Does the proposal contain characters with any special properties such as control function or similar semantics? Yes
If YES, describe in detail (include attachment if necessary) see

D. SC 2/WG 2 Administrative (To be completed by SC 2/WG 2)

1. Relevant SC 2/WG 2 document numbers:

2. Status (list of meeting number and corresponding action or disposition):

3. Additional contact to user communities, liaison organizations etc:

4. Assigned category and assigned priority/time frame: