L2/04-234

Using SPACE as a base character

Eric Muller, Adobe Systems Inc.
June 7, 2004

TUS 4.0, page 46, states:

By convention, diacritical marks used in the Unicode Standard may be exhibited in (apparent) isolation by applying them to U+0020 SPACE or to U+00A0 NO-BREAK SPACE. This tactic might be employed, for example, when talking about the diacritical mark itself as a mark, rather than using it in its normal way in text.

While this paragraph is under the heading “Spacing Clones of European Diacritical Marks” and uses the term “diacritical mark”, there is a general understanding that the convention applies to all combining characters, regardless of their script (if any).

Supporting this behaviour is not terribly onerous: essentially, a process that interprets SPACE only needs to check the next character; if it is a combining character, then the space “loses” its normal behaviour and instead acts like a letter. This exceptional behaviour is not terribly pleasant, but it is tolerable.
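The one-character lookahead described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the sketch assumes that “combining character” means a character whose General_Category is Mn, Mc or Me, as reported by Python's `unicodedata` module.

```python
import unicodedata

# The two base characters named by the convention in TUS 4.0.
SPACES = {"\u0020", "\u00A0"}  # SPACE, NO-BREAK SPACE


def is_combining(ch: str) -> bool:
    # Assumption: a combining character is one with General_Category M*.
    return unicodedata.category(ch).startswith("M")


def space_acts_as_base(text: str, i: int) -> bool:
    """Return True if text[i] is a space that carries a combining character."""
    if text[i] not in SPACES:
        return False
    # Only the next character needs to be inspected.
    return i + 1 < len(text) and is_combining(text[i + 1])
```

For example, in `"a \u0302b"` the space at index 1 is followed by U+0302 COMBINING CIRCUMFLEX ACCENT and so acts as a base, while the space in `"a b"` keeps its normal behaviour.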

Things start to get interesting if we want to speak about other objects that behave typographically like diacritics but are not combining characters. Check for example Rule R6 on page 226 in TUS 4.0, which deals with the “subscript nonspacing mark RAsub” (in the words of TUS) and uses the dotted circle notation in the example line. Clearly, the subscript form of RA is something we may want to talk about when describing Devanagari, just like we may want to talk about the circumflex in Latin. So it is natural (although not blessed by the standard) to extend the convention: <SPACE, VIRAMA, RA> exhibits the subscript form of RA in (apparent) isolation. However, this becomes more cumbersome for text processing: fundamentally, a space now behaves like a Devanagari consonant, and the discovery that such spaces do not have their normal behaviour is a bit more complicated.

Things get ugly when we continue down that path: let’s say we want to speak about the superscript form of RA, which is typographically similar to the subscript form and to diacritic marks (e.g. in Rule R2 on page 225 of TUS 4.0). We establish the convention that <RA, VIRAMA, SPACE> exhibits the superscript form of RA in (apparent) isolation. Again, our processing of spaces becomes more complicated; it must additionally look backwards for a VIRAMA and a RA to discover how the space functions.
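To make the extra work concrete, here is a minimal sketch of what a process would have to do under the two extended conventions discussed so far. Neither convention is blessed by the standard; the function name and the returned labels are hypothetical, and only the Devanagari RA and VIRAMA are handled.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
RA = "\u0930"      # DEVANAGARI LETTER RA
SPACE = "\u0020"


def space_role(text: str, i: int) -> str:
    """Classify text[i] under the hypothetical extended conventions."""
    if text[i] != SPACE:
        return "not a space"
    # <SPACE, VIRAMA, RA>: look two characters ahead.
    if text[i + 1:i + 3] == VIRAMA + RA:
        return "base for subscript RA"
    # <RA, VIRAMA, SPACE>: look two characters back.
    if i >= 2 and text[i - 2:i] == RA + VIRAMA:
        return "base for superscript RA"
    return "ordinary space"
```

Even this toy version needs context on both sides of the space and knowledge of specific Devanagari characters; a realistic version would have to cover every script with comparable forms.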

To fully apprehend the burden that those extensions to the convention would place on text processing systems, consider an XML parser that normalizes spaces, e.g. collapses runs of spaces to a single one. Is it reasonable to force it to understand Indic scripts?
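As an illustration, the sketch below collapses runs of U+0020 while honouring only the existing convention, under which a space followed by a combining character is a base. It is not taken from any real XML parser; the function name is hypothetical, and “combining character” is again assumed to mean General_Category M*.

```python
import unicodedata


def collapse_spaces(text: str) -> str:
    """Collapse runs of U+0020, keeping any space that carries a combining mark."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # A space followed by a combining character acts as a base letter.
        is_base = (
            ch == " "
            and nxt != ""
            and unicodedata.category(nxt).startswith("M")
        )
        # Drop an ordinary space that directly follows another space.
        if ch == " " and not is_base and out and out[-1] == " ":
            continue
        out.append(ch)
    return "".join(out)
```

With this sketch, `"a   b"` becomes `"a b"`, while `"a   \u0302b"` becomes `"a  \u0302b"`: one ordinary space, then the space that carries the circumflex. Supporting the extended conventions would mean replacing the single `is_base` test with script-specific checks in both directions.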

Furthermore, the sequence <TA, RA, VIRAMA, SPACE> is ambiguous. Is it a word ending in RA with an explicit virama, followed by a space, or is it a word ending in TA, followed by a superscript form of RA?

So far, we have ignored these possible extensions of the convention. However, they are perfectly natural, and in fact the draft for the Sri Lanka standard SLS 1134:2004 (see L2/04-131 = WG2-N2737) is trying to institute such extensions. [In the published version of the draft, the base character is actually a joiner, which is even more problematic; after discussion, the proposal is to use a space as described here.]

In other words, it is now time to look carefully at the situation. I believe that we have only two ways forward:


Document History

Author: Eric Muller

Revision    Date            Comments
1           June 7, 2004    Initial version