No: L2/00-156-R1
Re: Zero Width Grapheme Joiner
By: Mark Davis
On: 2000-04-30

Problem: There appears to be an unending stream of proposals for decorative variants of existing characters, in particular characters presented in a circle, square, diamond, or some other adornment. Unicode already provides a generative capability for this with single characters, using the combining characters, but it provides no such capability for sequences of multiple characters. This proposal addresses that gap with a mechanism that leverages the current architecture for VIRAMA.

The UTC discussed this approach at UTC#83. It decided to put a document out for discussion on the unicore@unicode.org mailing list, then reconsider the issue at UTC#84 in August. This document is somewhat revised from the one handed out in the meeting; as a result of the second discussion it adds GB for use in collation, and discusses the interaction with other format characters in more detail.

Proposal

This proposal is for two characters, with suggested codepoints.

Proposed   Encoded Character                  Function
U+2069*    ZERO-WIDTH GRAPHEME JOINER (GJ)    Requests that any grapheme boundary between adjacent graphic characters be ignored.
U+206A*    ZERO-WIDTH GRAPHEME BREAK (GB)     Requests a grapheme boundary between the adjacent graphic characters.

Background: Graphemes are sequences of one or more encoded characters that correspond to what users think of as characters. They include, but are not limited to, combining character sequences such as (g + °). Grapheme boundaries are important for collation, regular expressions, and counting “character” positions within text. The Unicode Standard specifies where the default grapheme boundaries fall in a string of characters. This default can be overridden for specific locales; contracting characters in collation tailoring tables are one example of such an override. For more information, see [Boundaries].

There are two circumstances where even the locale-specific determination of grapheme boundaries may need to be overridden on a local basis:

  1. Determining the placement of combining accents that should apply to a sequence of base characters, rather than a single base character.
  2. Distinguishing in collation between a sequence of characters that is normally considered a grapheme in a particular language, and that same sequence in foreign words. For example, in Slovak a ch may normally be considered a single grapheme, yet a user may want an English word containing ch to be sorted simply as a c followed by an h.

The GJ prevents adjacent characters from being considered separate graphemes. These characters then act as a unit for display, grapheme analysis, and (if included in a tailored collation element table) collation. In mapping from other character sets, the GJ can be used in the representation of such entities as circled numbers. Thus a circled "1a" could be represented as <0031, 2069, 0061, 20DD>, or "MED" in a diamond as <004D, 2069, 0045, 2069, 0044, 20DF>.
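As an illustration of the encoding just described, the following minimal Python sketch builds the circled "1a" sequence. It assumes the suggested (not assigned) codepoint U+2069 for GJ; the helper names `enclose` and `code_points` are hypothetical and exist only for this example.

```python
# U+2069 is only the codepoint suggested in this proposal, not an assigned value.
GJ = "\u2069"
ENCLOSING_CIRCLE = "\u20DD"  # COMBINING ENCLOSING CIRCLE


def enclose(text, enclosing_mark):
    """Join the characters of `text` with GJ, then append one enclosing mark."""
    return GJ.join(text) + enclosing_mark


def code_points(s):
    """Format a string as a list of four-digit hexadecimal code points."""
    return [f"{ord(ch):04X}" for ch in s]


print(code_points(enclose("1a", ENCLOSING_CIRCLE)))
# ['0031', '2069', '0061', '20DD']
```

The same helper would produce the longer "MED" sequence when given three characters and a different enclosing mark.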

 

Implementation: In grapheme analysis, the GJ functions like the VIRAMA in Rules 5 and 10 of Table 5-3 Grapheme Boundaries. As a grapheme is being determined, the GJ prevents the following base character from causing a boundary. Linebreak does not permit graphemes to be broken across lines, but is otherwise not sensitive to grapheme boundaries. Thus the GJ is in class ZW and the GB is in class CM. See [LineBreak].
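The effect on grapheme analysis can be sketched as follows. This is a deliberately simplified segmenter, not the rules of Table 5-3: it assumes the suggested codepoints U+2069 and U+206A, treats every character with a mark category as attaching to the preceding cluster, and models locale tailoring with a hypothetical `tailored` tuple of multi-character graphemes.

```python
import unicodedata

# Suggested codepoints from this proposal (assumptions, not assigned values).
GJ = "\u2069"
GB = "\u206A"


def graphemes(text, tailored=()):
    """Split text into clusters; GJ suppresses the boundary before the next base."""
    clusters = []
    join_next = False  # True immediately after a GJ
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in (GJ, GB) or unicodedata.category(ch).startswith("M"):
            # Marks and the two format characters never start a cluster.
            unit, attach = ch, True
        else:
            # Prefer a locale-specific grapheme (e.g. Slovak "ch") if one starts here.
            unit = next((t for t in tailored if text.startswith(t, i)), ch)
            # A base character attaches only when a GJ directly precedes it.
            attach = join_next
        if attach and clusters:
            clusters[-1] += unit
        else:
            clusters.append(unit)
        join_next = ch == GJ
        i += len(unit)
    return clusters


# "1" GJ "a" plus an enclosing circle forms one cluster; "b" starts another.
print(len(graphemes("1" + GJ + "a" + "\u20DD" + "b")))  # 2
# With a Slovak-style tailoring, GB keeps "c" and "h" apart.
print(len(graphemes("c" + GB + "h", tailored=("ch",))))  # 2
print(len(graphemes("ch", tailored=("ch",))))            # 1
```

In this sketch the GB needs no special rule: because it sits between c and h, the tailored sequence simply fails to match and the default boundary applies.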

In display, a single grapheme acts as the base for combining marks, including the surrounding combining marks, such as the circle. For any specified repertoire, such as the characters in JIS X 0213, support can be provided by means of ligature tables in the font (for example, see [OpenType]). Some display engines may be able to supply runtime generative support. In such cases there is considerable latitude for display, depending on the environment (such as the choice of font, whether the text is ideographic, etc.). Some examples are:

While the primary application of the GJ in display is for surrounding combining marks, it can also be used with other combining marks, such as a breve. In cursive connection or ligatures, both GJ and GB should be ignored.

In collation, the grapheme joiner should be ignored unless it specifically occurs within a tailored collation element mapping. Thus it is given a completely ignorable collation element, like NULL. See [Collation]. In regular expressions, the GJ and GB may be used to affect the grapheme analysis used at either level 2 or level 3, based on their effect on normal grapheme boundary mechanisms used at those levels.
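The collation behavior can be illustrated with a toy, single-level collation element table. This is only a sketch under stated assumptions: the table `SLOVAK`, the function `sort_key`, and the weights are hypothetical, real collation uses multi-level keys, and the codepoints are the ones suggested in this proposal.

```python
import string

# Suggested codepoints from this proposal (assumptions, not assigned values).
GJ = "\u2069"
GB = "\u206A"

# Toy tailoring: one weight per entry, with the contraction "ch" sorted after "h".
letters = list(string.ascii_lowercase)
letters.insert(letters.index("h") + 1, "ch")
SLOVAK = {s: [w] for w, s in enumerate(letters, start=1)}


def sort_key(text, table):
    """Build a key by longest-match lookup; unmapped GJ and GB contribute nothing."""
    key = []
    max_len = max(map(len, table))
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                key.extend(table[chunk])
                i += n
                break
        else:
            # No mapping found: GJ and GB are completely ignorable. Any other
            # character falls back to a high weight (an assumption for brevity).
            if text[i] not in (GJ, GB):
                key.append(0x10000 + ord(text[i]))
            i += 1
    return tuple(key)


words = ["chata", "hora", "c" + GB + "hat", "cena"]
ordered = sorted(words, key=lambda w: sort_key(w, SLOVAK))
# "chata" uses the contraction and sorts after "hora";
# the word containing GB sorts as c followed by h.
print([w.replace(GB, "<GB>") for w in ordered])
# ['cena', 'c<GB>hat', 'hora', 'chata']
```

Because the lookup checks the table before discarding anything, a tailoring that contains an explicit entry with GJ in it would still match, which is the one case where the joiner is not ignored.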

Like the other format characters (and unlike the VIRAMA), the GJ and GB have combining classes of ZERO; thus they do not reorder in combination with combining marks.
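The consequence of a combining class of zero can be checked with Python's standard `unicodedata` module. Since GJ and GB are only proposed here, the sketch below uses the existing ZERO WIDTH JOINER (U+200D) as a stand-in for "a format character with combining class zero"; that substitution is an assumption made for illustration.

```python
import unicodedata

ZWJ = "\u200D"        # existing format character, used here as a stand-in
VIRAMA = "\u094D"     # DEVANAGARI SIGN VIRAMA
ACUTE = "\u0301"      # combining class 230 (above)
DOT_BELOW = "\u0323"  # combining class 220 (below)

print(unicodedata.combining(ZWJ))     # 0
print(unicodedata.combining(VIRAMA))  # 9

# Without an intervening class-zero character, canonical ordering
# moves the dot below in front of the acute.
plain = "a" + ACUTE + DOT_BELOW
reordered = unicodedata.normalize("NFD", plain)

# A class-zero character between the marks blocks that reordering.
blocked = "a" + ACUTE + ZWJ + DOT_BELOW
unchanged = unicodedata.normalize("NFD", blocked)

print(reordered == "a" + DOT_BELOW + ACUTE)  # True
print(unchanged == blocked)                  # True
```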

Application: GJ and GB request certain behavior, if the particular process can support it, and only in certain circumstances.

Fallbacks: As with the other format characters, if an implementation cannot support the GJ or the GB, then they should be ignored completely in all processing; fonts should have no visible glyph for these characters. Some implementations may only be able to support graphemes of a limited length; in such implementations, subsequent GJs can be ignored. A sequence such as M <GJ> E <SQUARE> <GJ> D <SQUARE>, while permitted, may not render correctly unless the rendering engine is fairly sophisticated.
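For a process that cannot support either character, the simplest reading of this fallback is to drop both before further processing. A minimal sketch, assuming the suggested codepoints and a hypothetical helper name:

```python
# Suggested codepoints from this proposal (assumptions, not assigned values).
GJ = "\u2069"
GB = "\u206A"

# Translation table that deletes both format characters.
_IGNORE = {ord(GJ): None, ord(GB): None}


def strip_grapheme_controls(text):
    """Return text with every GJ and GB removed."""
    return text.translate(_IGNORE)


joined = "M" + GJ + "E" + GJ + "D" + "\u20DD"
print(strip_grapheme_controls(joined) == "MED\u20DD")  # True
```

After stripping, the enclosing mark applies only to the final base character, which is the degraded but still legible rendering the fallback implies.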

Interactions: As with other format characters, these should be ignored in processing that is irrelevant to them. Other than as specified above, they should have no effect on (and no interaction with) the line break format characters (ZWSP, ZWNBSP), nor with the cursive format characters (ZWJ, ZWNJ).

[The Mongolian variation selector and the proposed general purpose variation selectors are quite different. They are used to extend coding, and only have meaning if they occur immediately after a distinguished codepoint. Any other codepoint between the distinguished codepoint and the variation selector will cause the variation selector to be ignored.]

Use with markup: The GJ may be unsuitable for some markup languages. In those circumstances it should be replaced by appropriate style information.

References