Some thoughts on character decomposition

From: Jonathan Coxhead (jonathan@doves.demon.co.uk)
Date: Sat Jun 05 1999 - 02:31:18 EDT


   Unicode's compatibility decompositions use a language almost like a
markup language. It would be nice to have this part of the standard
available as a tool for implementors. I propose an extension that
exposes more of the internals of the logic behind the repertoire. It
seems that this removes a possible source of inconsistency, and may
make use of Unicode less error prone. The proposal also gives more
tools to users, though that was not the motivation.

   Also, the rules about the way in which line breaks can happen are
not specified clearly (to me, anyway) by the Unicode standard. For
some characters, "breaking" or "non-breaking" properties are
specified, but it is not always clear what others do. (E g, there is
no NON-BREAKING SOLIDUS, so if my software breaks a line at a SOLIDUS,
how can I stop it? There is no BREAKING FULL STOP, so if I wish to
write a URL with optional line breaks between the components, how
can I do that?)

   I'm putting this forward to see if there is any interest in these
ideas. I have already used some of them in software I have written.

   13 new characters are proposed:

      PRESENTATION SUGGESTION ALTERNATIVE
      PRESENTATION SUGGESTION BLACK LETTER
      PRESENTATION SUGGESTION DOUBLE-STRUCK
      PRESENTATION SUGGESTION ITALIC
      PRESENTATION SUGGESTION LIGATURE
      PRESENTATION SUGGESTION NARROW
      PRESENTATION SUGGESTION SCRIPT
      PRESENTATION SUGGESTION SMALL
      PRESENTATION SUGGESTION SUBSCRIPT
      PRESENTATION SUGGESTION SUPERSCRIPT
      PRESENTATION SUGGESTION VERTICAL
      PRESENTATION SUGGESTION WIDE

These express the intent behind the various compatibility formatting
directives in a form that software can act on if it wishes, and they
allow the compatibility decompositions to be expressed more formally.
They are modifiers just like other Unicode modifiers, so they follow
the character (or group) modified.

   The 13th is

      START GROUP

This just starts a group for purposes of subsequent modification. It
is like LRE, RLE, LRO, RLO but without any directional implication. It
is terminated in the same way (by POP DIRECTIONAL FORMATTING). START
GROUP is used, e g, with combining characters that can span more than
one letter, as in START GROUP, LATIN CAPITAL LETTER D, LATIN CAPITAL
LETTER Z, POP DIRECTIONAL FORMATTING, COMBINING CARON; or for
applications involving the TeX \widehat or \widetilde control
sequences which can span more than 1 character. But it is mostly
provided for use with the presentation suggestions proposed above.
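
   As a rough illustration of how software might read such groups,
here is a minimal Python sketch. START GROUP has no code point, so the
private-use value U+E000 is purely an assumption made for this example;
POP DIRECTIONAL FORMATTING is the real U+202C. Nested groups are not
handled.

```python
import unicodedata

START_GROUP = "\uE000"  # assumption: private-use stand-in for the proposed START GROUP
POP = "\u202C"          # POP DIRECTIONAL FORMATTING (real code point)

def split_units(text):
    """Split text into (base, marks) pairs.

    base is a single character or the contents of one START GROUP ... POP
    span; marks holds the combining marks (general category M*) that
    directly follow it.
    """
    units = []
    i = 0
    while i < len(text):
        if text[i] == START_GROUP:
            end = text.find(POP, i + 1)
            if end == -1:
                raise ValueError("unterminated START GROUP")
            base = text[i + 1:end]
            i = end + 1
        else:
            base = text[i]
            i += 1
        marks = ""
        while i < len(text) and unicodedata.category(text[i]).startswith("M"):
            marks += text[i]
            i += 1
        units.append((base, marks))
    return units
```

With this, START GROUP, D, Z, POP DIRECTIONAL FORMATTING, COMBINING
CARON comes out as one unit whose base is "DZ" and whose only mark is
the caron.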

   The character ZERO-WIDTH NO-BREAK SPACE and the <nobreak>
formatting directive are superseded by this proposal. (I e, they are
no longer needed.)

   New decompositions are proposed:

      EM SPACE: EM QUAD, ZERO WIDTH SPACE
      EN SPACE: EN QUAD, ZERO WIDTH SPACE
      FOUR-PER-EM SPACE: <compat>, SPACE
      HAIR SPACE: <compat>, SPACE
      HYPHEN: NON-BREAKING HYPHEN, ZERO WIDTH SPACE
      NO-BREAK SPACE (decomposition deleted)
      NON-BREAKING HYPHEN (decomposition deleted)
      PUNCTUATION SPACE: <compat>, SPACE
      SIX-PER-EM SPACE: <compat>, SPACE
      SPACE: NO-BREAK SPACE, ZERO WIDTH SPACE
      THIN SPACE: <compat>, SPACE
      THREE-PER-EM SPACE: <compat>, SPACE

Decompositions not marked with <compat> are canonical, so software
that treats HYPHEN in any way differently from NON-BREAKING HYPHEN,
ZERO WIDTH SPACE is in error. The only characters that permit a
line-break are ZERO WIDTH SPACE and those with a decomposition that
includes it. All uses of SPACE in decompositions not listed above are
changed to NO-BREAK SPACE, e g,

      CIRCUMFLEX ACCENT: NO-BREAK SPACE, COMBINING CIRCUMFLEX ACCENT

This ensures that decomposition does not introduce an inadvertent
break-point into a word.
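
   The break rule can be sketched in a few lines of Python. The table
holds only the decompositions proposed above (it is not taken from the
real Unicode data), and counting the <compat> spaces as break-points
because they expand through SPACE is my reading of the rule, not
something stated outright.

```python
ZWSP = "\u200B"  # ZERO WIDTH SPACE

# Proposed decompositions from the list above (hypothetical, not UnicodeData).
PROPOSED = {
    "\u2003": "\u2001" + ZWSP,  # EM SPACE -> EM QUAD, ZERO WIDTH SPACE
    "\u2002": "\u2000" + ZWSP,  # EN SPACE -> EN QUAD, ZERO WIDTH SPACE
    "\u2010": "\u2011" + ZWSP,  # HYPHEN -> NON-BREAKING HYPHEN, ZERO WIDTH SPACE
    " ": "\u00A0" + ZWSP,       # SPACE -> NO-BREAK SPACE, ZERO WIDTH SPACE
}
# THREE-PER-EM, FOUR-PER-EM, SIX-PER-EM, PUNCTUATION, THIN, HAIR SPACE -> SPACE
for cp in "\u2004\u2005\u2006\u2008\u2009\u200A":
    PROPOSED[cp] = " "

def expand(ch):
    """Apply the proposed decompositions recursively to one character."""
    if ch in PROPOSED:
        return "".join(expand(c) for c in PROPOSED[ch])
    return ch

def break_opportunities(text):
    """Return each index i such that a line break is allowed after text[i]."""
    return [i for i, ch in enumerate(text) if ZWSP in expand(ch)]
```

Under these assumptions a SOLIDUS or FULL STOP never breaks on its own,
and a URL gets an optional break simply by inserting ZERO WIDTH SPACE
after the chosen component.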

   The various compatibility formatting directives are replaced in the
following ways.

   <Compat> is unchanged, except that PRESENTATION SUGGESTION LIGATURE
is used in the decompositions of all characters with LIGATURE in the
name: e g,

      LATIN CAPITAL LIGATURE IJ: <compat>, START GROUP, LATIN CAPITAL
            LETTER I, LATIN CAPITAL LETTER J, POP DIRECTIONAL
            FORMATTING, PRESENTATION SUGGESTION LIGATURE

and similarly for the others. (PRESENTATION SUGGESTION LIGATURE is
only useful with a group of 2 or more characters: it suggests a
ligature be used for the whole group. This means it works properly for
ligatures like "ffl".)

   Other uses of <compat> appear to have no extra formatting content,
but just flag the decomposition as non-canonical. I wouldn't
necessarily propose to change this, but see below.

   "<circle>, ..." is replaced by "<compat>, START GROUP, ..., POP
DIRECTIONAL FORMATTING, COMBINING ENCLOSING CIRCLE". If there is only
1 character in the circle, this is the same as "<compat>, ...,
COMBINING ENCLOSING CIRCLE". (The same remark applies to all similar
constructs below, and won't be repeated.)

   "<final>, ..." is replaced by "<compat>, ZERO WIDTH JOINER, ...,
ZERO WIDTH NON-JOINER".

   "<font>, ..." is replaced by "<compat>, START GROUP, ..., POP
DIRECTIONAL FORMATTING, f" where f is one of PRESENTATION SUGGESTION
ALTERNATIVE, PRESENTATION SUGGESTION BLACK LETTER, PRESENTATION
SUGGESTION DOUBLE-STRUCK, PRESENTATION SUGGESTION ITALIC (for the
Planck constant), PRESENTATION SUGGESTION SCRIPT or PRESENTATION
SUGGESTION WIDE, depending on the name of the character. (PRESENTATION
SUGGESTION WIDE is also used for the <wide> directive, below.)
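
   "Depending on the name of the character" could be made mechanical
along the following lines. This Python sketch uses the real
unicodedata module to read names and <font> tags; the keyword table is
my own guess at the mapping and returns the proposed suggestion by
name, since none of these characters has a code point.

```python
import unicodedata

# Assumed mapping from a word in the character name to the proposed suggestion.
KEYWORDS = [
    ("BLACK-LETTER", "BLACK LETTER"),
    ("DOUBLE-STRUCK", "DOUBLE-STRUCK"),
    ("PLANCK", "ITALIC"),          # the Planck constant case mentioned above
    ("ALTERNATIVE", "ALTERNATIVE"),
    ("WIDE", "WIDE"),
    ("SCRIPT", "SCRIPT"),
]

def suggestion_for(ch):
    """Name the presentation suggestion for a <font> character, or None."""
    if not unicodedata.decomposition(ch).startswith("<font>"):
        return None
    words = unicodedata.name(ch, "").split()
    for keyword, suggestion in KEYWORDS:
        if keyword in words:
            return "PRESENTATION SUGGESTION " + suggestion
    return None
```

For DOUBLE-STRUCK CAPITAL C this names PRESENTATION SUGGESTION
DOUBLE-STRUCK. Names that carry other style words are not handled and
would need more entries.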

   "<fraction>, ..., FRACTION SLASH, ..." is replaced by "<compat>,
START GROUP, ..., POP DIRECTIONAL FORMATTING, PRESENTATION SUGGESTION
SUPERSCRIPT, FRACTION SLASH, START GROUP, ..., POP DIRECTIONAL
FORMATTING, PRESENTATION SUGGESTION SUBSCRIPT". Again, if there is
only 1 character in each group, the group is not needed, so

      VULGAR FRACTION ONE QUARTER: <fraction>, DIGIT ONE, FRACTION
            SLASH, DIGIT FOUR

becomes just

      VULGAR FRACTION ONE QUARTER: <compat>, DIGIT ONE, PRESENTATION
            SUGGESTION SUPERSCRIPT, FRACTION SLASH, DIGIT FOUR,
            PRESENTATION SUGGESTION SUBSCRIPT
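
   The <fraction> rewrite can be derived mechanically from the existing
data. The Python sketch below reads the current decomposition with the
real unicodedata.decomposition call; the three private-use code points
are assumptions standing in for the proposed characters, and the
leading <compat> tag is left implicit.

```python
import unicodedata

START_GROUP = "\uE000"  # assumption: stand-in for the proposed START GROUP
SUPERSCRIPT = "\uE001"  # assumption: PRESENTATION SUGGESTION SUPERSCRIPT
SUBSCRIPT = "\uE002"    # assumption: PRESENTATION SUGGESTION SUBSCRIPT
POP = "\u202C"          # POP DIRECTIONAL FORMATTING (real)
FRACTION_SLASH = "\u2044"

def grouped(chars):
    """Wrap 2 or more characters in a group; a single character needs none."""
    return chars if len(chars) == 1 else START_GROUP + chars + POP

def rewrite_fraction(ch):
    """Return the proposed decomposition body for a <fraction> character."""
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0] != "<fraction>":
        return None
    body = "".join(chr(int(f, 16)) for f in fields[1:])
    numerator, slash, denominator = body.partition(FRACTION_SLASH)
    if not slash:
        return None
    out = ""
    if numerator:
        out += grouped(numerator) + SUPERSCRIPT
    out += FRACTION_SLASH
    if denominator:
        out += grouped(denominator) + SUBSCRIPT
    return out
```

For VULGAR FRACTION ONE QUARTER this gives DIGIT ONE, the superscript
suggestion, FRACTION SLASH, DIGIT FOUR, the subscript suggestion, as in
the example above.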

   "<initial>, ..." is replaced by "<compat>, ZERO WIDTH NON-JOINER,
..., ZERO WIDTH JOINER".

   "<isolated>, ..." is replaced by "<compat>, ZERO WIDTH NON-JOINER,
..., ZERO WIDTH NON-JOINER".

   "<medial>, ..." is replaced by "<compat>, ZERO WIDTH JOINER, ...,
ZERO WIDTH JOINER".
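
   The four positional directives (<initial>, <medial>, <final> and
<isolated>) differ only in which joiner goes on each side, so a small
table covers them. This Python sketch uses only real code points and
the real unicodedata.decomposition call; the function name is made up
for the example.

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

# directive -> (character placed before, character placed after)
POSITIONAL = {
    "<initial>": (ZWNJ, ZWJ),
    "<medial>": (ZWJ, ZWJ),
    "<final>": (ZWJ, ZWNJ),
    "<isolated>": (ZWNJ, ZWNJ),
}

def rewrite_positional(ch):
    """Return the proposed decomposition body for a positional form."""
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0] not in POSITIONAL:
        return None
    before, after = POSITIONAL[fields[0]]
    body = "".join(chr(int(f, 16)) for f in fields[1:])
    return before + body + after
```

So ARABIC LETTER BEH INITIAL FORM becomes ZERO WIDTH NON-JOINER, ARABIC
LETTER BEH, ZERO WIDTH JOINER.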

   "<narrow>, ..." is replaced by "<compat>, START GROUP, ..., POP
DIRECTIONAL FORMATTING, PRESENTATION SUGGESTION NARROW".

   <nobreak> is superseded. Breaks are allowed only at ZERO WIDTH
SPACE, which is also part of the (canonical) decomposition of SPACE,
HYPHEN, EN SPACE and EM SPACE. The user can make any character a
possible break-point just like these by following it with ZERO WIDTH
SPACE. (Disclaimer: I know nothing about ideographic spacing or line-
breaking.)

   "<small>, ..." is replaced by "<compat>, START GROUP, ..., POP
DIRECTIONAL FORMATTING, PRESENTATION SUGGESTION SMALL". (This is
unrelated to the SMALL in character names like LATIN SMALL LETTER A.)

   "<square>, ..." is replaced by "<compat>, START GROUP, ..., POP
DIRECTIONAL FORMATTING, COMBINING ENCLOSING SQUARE".

   "<sub>, ..." is replaced by "<compat>, START GROUP, ..., POP
DIRECTIONAL FORMATTING, PRESENTATION SUGGESTION SUBSCRIPT".

   "<super>, ..." is replaced by "<compat>, START GROUP, ..., POP
DIRECTIONAL FORMATTING, PRESENTATION SUGGESTION SUPERSCRIPT".

   "<vertical>, ..." is replaced by "<compat>, START GROUP, ..., POP
DIRECTIONAL FORMATTING, PRESENTATION SUGGESTION VERTICAL".
(PRESENTATION SUGGESTION VERTICAL applied to a group rotates the
characters individually; it does not stack them up.)

   "<wide>, ..." is replaced by "<compat>, START GROUP, ..., POP
DIRECTIONAL FORMATTING, PRESENTATION SUGGESTION WIDE". (PRESENTATION
SUGGESTION WIDE can also be used for the <font> directive.)
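
   The directives that just wrap a group and append one character
(<circle>, <narrow>, <small>, <square>, <sub>, <super>, <vertical>,
<wide>) can share one rewriting routine. In this Python sketch the two
enclosing marks are real code points; START GROUP and the presentation
suggestions are given private-use values purely as assumptions.

```python
import unicodedata

START_GROUP = "\uE000"  # assumption: stand-in for the proposed START GROUP
POP = "\u202C"          # POP DIRECTIONAL FORMATTING (real)

# directive -> character appended after the (possibly grouped) body
SUFFIX = {
    "<circle>": "\u20DD",    # COMBINING ENCLOSING CIRCLE (real)
    "<square>": "\u20DE",    # COMBINING ENCLOSING SQUARE (real)
    "<super>": "\uE001",     # assumption: PRESENTATION SUGGESTION SUPERSCRIPT
    "<sub>": "\uE002",       # assumption: PRESENTATION SUGGESTION SUBSCRIPT
    "<narrow>": "\uE003",    # assumption: PRESENTATION SUGGESTION NARROW
    "<small>": "\uE004",     # assumption: PRESENTATION SUGGESTION SMALL
    "<vertical>": "\uE005",  # assumption: PRESENTATION SUGGESTION VERTICAL
    "<wide>": "\uE006",      # assumption: PRESENTATION SUGGESTION WIDE
}

def rewrite_tagged(ch):
    """Return the proposed decomposition body for one of the tags above."""
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0] not in SUFFIX:
        return None
    body = "".join(chr(int(f, 16)) for f in fields[1:])
    if len(body) > 1:
        # 2 or more characters: the modifier must apply to the whole group
        body = START_GROUP + body + POP
    return body + SUFFIX[fields[0]]
```

CIRCLED DIGIT ONE needs no group, while CIRCLED NUMBER TEN comes out as
START GROUP, DIGIT ONE, DIGIT ZERO, POP DIRECTIONAL FORMATTING,
COMBINING ENCLOSING CIRCLE.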

   All decompositions (canonical or compatibility) involving 2 or more
spacing characters should be enclosed in START GROUP, ..., POP
DIRECTIONAL FORMATTING so that any subsequent modifier in the text
stream has the same effect after decomposition. E g,

      LATIN CAPITAL LETTER LJ: <compat>, START GROUP, LATIN CAPITAL
            LETTER L, LATIN CAPITAL LETTER J, POP DIRECTIONAL
            FORMATTING.

   With these changes, it would also be possible to define every
decomposition that involves a presentation suggestion as a canonical
decomposition---the presence of the suggestion character itself lets
any application know that something has happened. To put
it another way, DOUBLE-STRUCK CAPITAL C may be only compatibility-
equivalent to LATIN CAPITAL LETTER C; but it seems very reasonable---
indeed, clear---that it is *canonically* equivalent to LATIN CAPITAL
LETTER C, PRESENTATION SUGGESTION DOUBLE-STRUCK.

   Sophisticated software could also allow users to take explicit
control over the presentation suggestions. This shouldn't be confused
with a markup language, even though it gives control over some kinds
of font change (including italic, subscript and superscript): it should
be seen only as a way of allowing orthogonal access to features that
the software provides anyway.

   Still more structure could be exposed by adding new presentation
suggestion characters and new decompositions: in particular, TURNED [or
INVERTED], REVERSED, SANS-SERIF, NEGATIVE [in the photographic sense],
DIGRAPH [same as LIGATURE?], LIGHT and HEAVY seem productive. Then we
could have, e g, a canonical decomposition for DINGBAT NEGATIVE CIRCLED
SANS-SERIF NUMBER TEN as START GROUP, DIGIT ONE, DIGIT ZERO, POP
DIRECTIONAL FORMATTING, PRESENTATION SUGGESTION SANS-SERIF, COMBINING
ENCLOSING CIRCLE, PRESENTATION SUGGESTION NEGATIVE, and software that
had no glyph there could still render a reasonably respectable '10',
with various other possibilities for various levels of capability in
dynamic glyph-shaping.
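
   The lowest level of capability, plain fallback, might look like the
following Python sketch. The private-use code points for START GROUP
and the two suggestions are assumptions made for the example, and
dropping enclosing marks (general category Me) is one possible fallback
policy, not the only one.

```python
import unicodedata

START_GROUP = "\uE000"  # assumption: stand-in for the proposed START GROUP
POP = "\u202C"          # POP DIRECTIONAL FORMATTING (real)
SANS_SERIF = "\uE010"   # assumption: PRESENTATION SUGGESTION SANS-SERIF
NEGATIVE = "\uE011"     # assumption: PRESENTATION SUGGESTION NEGATIVE
IGNORED = {START_GROUP, POP, SANS_SERIF, NEGATIVE}

def plain_fallback(text):
    """Drop grouping, presentation suggestions and enclosing marks."""
    kept = []
    for ch in text:
        if ch in IGNORED:
            continue
        if unicodedata.category(ch) == "Me":  # e g COMBINING ENCLOSING CIRCLE
            continue
        kept.append(ch)
    return "".join(kept)

# The decomposition proposed above for DINGBAT NEGATIVE CIRCLED SANS-SERIF NUMBER TEN
NUMBER_TEN = START_GROUP + "10" + POP + SANS_SERIF + "\u20DD" + NEGATIVE
```

Applied to NUMBER_TEN this leaves just DIGIT ONE, DIGIT ZERO. More
capable software would keep the circle, then the sans-serif shapes,
then the negative rendering.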

   Well, that's it. Does anyone like it? Would anyone like to see it
proposed formally?

  /| Jonathan Coxhead Philips Semiconductor Research Lab
 (_|/ 660 Gail Ave #A3 811 E Arques Ave
  /| Sunnyvale CA 94086-8160 Sunnyvale CA 94088
 (_/ tel:+1 408 245 5285 +1 408 991 3725 (voicemail)
      fax: +1 408 991 3300


