The Atomic Theory of Unicode

From: Jonathan Coxhead (jonathan@doves.demon.co.uk)
Date: Tue Jul 06 1999 - 20:23:15 EDT


Introduction
============

   I posted "Some thoughts on character decomposition" on 4th June to
this list. Since then, there has been some discussion, and I have made a
more thorough examination of the ideas I considered there. The main
motivation is to simplify Unicode for developers by providing more
structure within the standard: this allows a lot of characters to be
implemented by following a few clearly-stated rules and, at the same
time, makes the character set more extensible, thereby making it more
universal.

   It has the side effect of giving more control to the users of the
standard by "opening it up" so that people in special fields (e g,
mathematics, phonetics), or those who just want strange effects in text,
can have them without needing to petition the standardising body. This,
I think, is what makes it more than just an exercise in classification.

   Since this has been an exercise in trying to understand the internal
structure of the U C S, I have called it the Atomic Theory of Unicode.
Maybe the analogy with chemistry would be closer, as single characters
are like atoms, able to interact in their own right, or to join in
various ways to make molecules.

   My aim has been to identify the largest possible set of
decompositions, by using (or abusing) the "markup" tags present in the
decomposition fields of UNIDATA.TXT as explicit presentation suggestion
characters, and by making explicit some of the information that is only
represented in the name or visual appearance of the character. This is
done with a mixture of existing combining characters, some new combining
characters, and a new type of character called a PRESENTATION
SUGGESTION.

   This resolves another question, as well: there is a script alphabet
in the U C S, consisting of the characters B E F H I L M P R V a e g l o
v. Similar remarks apply to other alphabets. On one hand, it doesn't
make sense to have such an arbitrary set of characters; on the other
hand, there is no obvious requirement for the others. The resolution is
to give them all decompositions, putting them all on an equal footing.

   The character START GROUP is needed to make this work. It is an open
bracket, like LEFT-TO-RIGHT OVERRIDE but without any directional
implication, terminated in the same way: by POP DIRECTIONAL FORMATTING.

   Some of these concepts are clearly more desirable than others. Under
each paragraph, I have put a brief discussion of the issues that might
arise if it were adopted. I see the value of a decomposition as lying in
2 places: first, it provides new structure to existing characters,
which can let rendering software make substitutions in an intelligent
way, and thereby increase the readability of text to everyone (in other
words, 'R' is better than '?' as a rendering of DOUBLE-STRUCK CAPITAL
R); and second, it may be productive as a means for characters to be
generated without having to get new characters encoded (in other words,
it gives access to DOUBLE-STRUCK CAPITAL F, should anyone need it).

   The second point is important, because it allows us to recapitulate
the way many characters entered common use in the first place. The
character LATIN SMALL LETTER TURNED Y did not just appear: it was
adopted because a new symbol was needed, and the typographic technology
made it convenient. So it seems sensible to acknowledge that LATIN SMALL
LETTER TURNED Y is a LATIN SMALL LETTER Y that has had some sort of
process applied to it.

   I have also listed the characters whose definition would be affected.
I have tried to make it complete in most cases. (The main failings are
in the areas of arrow characters and modifier letters, and I don't think
there are any new principles there.)

   I have confined myself to the "Western" section of the U C S, loosely
defined as the Latin, Greek and Cyrillic alphabets together with the
mathematical symbols, because I lack knowledge outside that sphere. I
may have made mistakes in analysing I P A characters, because my
knowledge in that area is sketchy.

Summary
=======

   This note considers 3134 characters, of which 900 have canonical
decompositions already, and are not considered further. Of the 2234
characters left, over 1300 of them---well over half---are given new
canonical decompositions, some of which involve one or more of 34 new
characters, which are defined here. These characters are intended to be
productive parts of the U C S.

   I hope that some consideration can be given to these ideas. I even
hope that they might forestall the encoding of large numbers of copies
of the Latin alphabet into the U C S in the guise of mathematical
symbols and phonetic characters, etc, while restoring the freedom of
expression to these groups of people, and keeping the U C S down to a
small and productive core.

New "presentation suggestions"
=== ============= ============

PRESENTATION SUGGESTION BLACK-LETTER
------------ ---------- ------------

   This requests that a black-letter, or fraktur, font be used. Certain
mathematical symbols are conventionally written this way, and German
publishing sometimes uses Fraktur rather than heavy (or bold) for
vectors.

   There are 5 black-letter characters in the U C S.

      BLACK-LETTER CAPITAL C (LATIN CAPITAL LETTER C)
      BLACK-LETTER CAPITAL H (LATIN CAPITAL LETTER H)
      BLACK-LETTER CAPITAL I (LATIN CAPITAL LETTER I)
      BLACK-LETTER CAPITAL R (LATIN CAPITAL LETTER R)
      BLACK-LETTER CAPITAL Z (LATIN CAPITAL LETTER Z)

   (The first line of this table should be interpreted as meaning that
we have a canonical decomposition

      BLACK-LETTER CAPITAL C = LATIN CAPITAL LETTER C + PRESENTATION
            SUGGESTION BLACK-LETTER,

and the rest similarly.)

   The lower-case alphabet is available as LATIN SMALL LETTER (whatever)
+ PRESENTATION SUGGESTION BLACK-LETTER, and because these are canonical
decompositions, the resulting output would be completely compatible,
visually and for all processing purposes, with the 5 precomposed forms
already encoded.

   Cannot be done algorithmically: either you have the right font, or
you don't. Falling back to the base glyph is likely to give good results
though.
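
   The table reading and the fallback described above can be sketched
as follows, in Python. This is only an illustration under assumptions:
the token string standing in for PRESENTATION SUGGESTION BLACK-LETTER
and the function names are made up here, since no such character is
encoded.

```python
import unicodedata

# Hypothetical stand-in for the proposed character; it is not encoded.
BLACK_LETTER = "<PRESENTATION SUGGESTION BLACK-LETTER>"

# Rows of the table above: precomposed name -> name of the base letter.
BLACK_LETTER_TABLE = {
    "BLACK-LETTER CAPITAL C": "LATIN CAPITAL LETTER C",
    "BLACK-LETTER CAPITAL H": "LATIN CAPITAL LETTER H",
    "BLACK-LETTER CAPITAL I": "LATIN CAPITAL LETTER I",
    "BLACK-LETTER CAPITAL R": "LATIN CAPITAL LETTER R",
    "BLACK-LETTER CAPITAL Z": "LATIN CAPITAL LETTER Z",
}


def decompose(name):
    """Return [base character, suggestion token] for a table entry."""
    base = unicodedata.lookup(BLACK_LETTER_TABLE[name])
    return [base, BLACK_LETTER]


def render_without_font(sequence):
    """Fallback for a renderer with no black-letter font: keep the bases."""
    return "".join(item for item in sequence if item != BLACK_LETTER)
```

   A renderer that lacks the font would then show a plain R for
BLACK-LETTER CAPITAL R, which is the "good results" case.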

PRESENTATION SUGGESTION CAPITAL LETTER TONE
------------ ---------- ------- ------ ----

   This suggests, of a digit, that a variant glyph be used with a style
suitable for marking Zhuang tone. It is for the following:

      LATIN CAPITAL LETTER TONE TWO = LATIN DIGIT TWO + PRESENTATION
            SUGGESTION CAPITAL LETTER TONE
      LATIN CAPITAL LETTER TONE FIVE = LATIN DIGIT FIVE + PRESENTATION
            SUGGESTION CAPITAL LETTER TONE
      LATIN CAPITAL LETTER TONE SIX = LATIN DIGIT SIX + PRESENTATION
            SUGGESTION CAPITAL LETTER TONE

   In addition, compatibility decompositions could be given for

      CYRILLIC CAPITAL LETTER ZE = <compat> + LATIN DIGIT THREE +
            PRESENTATION SUGGESTION CAPITAL LETTER TONE
      CYRILLIC CAPITAL LETTER CHE = <compat> + LATIN DIGIT FOUR +
            PRESENTATION SUGGESTION CAPITAL LETTER TONE

   By encoding this character, it becomes possible for sophisticated
software to render suitable glyphs for all the tone letters, without
needing separate encodings for latin capital letter tones 3, 4.

PRESENTATION SUGGESTION COMPOSE
------------ ---------- -------

   Requests that characters be overstruck. Applies to the 2 characters
on either side of it (one before, one after), like a "binary operator".

   Although seemingly simple, this introduces a whole set of problems.
What is the difference between following a character with a COMBINING
ENCLOSING CIRCLE and COMPOSING it with a WHITE CIRCLE? Can you accent a
character by composing it with a spacing accent character?

   To avoid such problems, the COMPOSE character is only used in cases
where the derivation of the character is clearly understood, and known
to be overstruck. This is a historical judgement.

   It applies mostly to the A P L block, and there are 64 symbols,
represented here by a sample of 2:

      APL FUNCTIONAL SYMBOL CIRCLE BACKSLASH = WHITE CIRCLE +
            PRESENTATION SUGGESTION COMPOSE + BACKSLASH
      APL FUNCTIONAL SYMBOL DOWN TACK UNDERBAR = DOWN TACK +
            PRESENTATION SUGGESTION COMPOSE + LOW LINE
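
   One way a renderer might read such a sequence is to collect the
operands of each COMPOSE into a group to be overstruck. The Python
below is a rough sketch, not a definition: the token string and the
function name are invented for illustration, characters are given by
name, combining marks are ignored, and allowing a chain "a COMPOSE b
COMPOSE c" to form one group is my assumption.

```python
# Hypothetical stand-in for the proposed infix character.
COMPOSE = "<PRESENTATION SUGGESTION COMPOSE>"


def overstrike_groups(sequence):
    """Group the items joined by COMPOSE; each group is drawn in one cell."""
    groups = []
    pending = False  # True when the previous item was COMPOSE
    for item in sequence:
        if item == COMPOSE:
            if not groups or pending:
                raise ValueError("COMPOSE needs a character on each side")
            pending = True
        elif pending:
            groups[-1].append(item)  # right-hand operand joins the group
            pending = False
        else:
            groups.append([item])    # ordinary character starts a new group
    if pending:
        raise ValueError("COMPOSE needs a character on each side")
    return groups
```

   For the first sample row this gives a single group containing WHITE
CIRCLE and BACKSLASH; a character that is not joined by COMPOSE forms a
group of its own.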

   Composition could also be used to shrink the number of box-drawing
characters down to a very reasonable 20 or so, from which the rest can
be built.

   If widely deployed, the COMPOSE operation could cause no end of havoc
by encouraging the creation of new symbols in a very uncontrolled way.
(On the other hand, maybe that's a good thing.)

PRESENTATION SUGGESTION DOUBLE
------------ ---------- ------

   Simply a repeated glyph, though usually kerned closer together than
an actual repetition.

   Could be used in

      QUOTATION MARK (APOSTROPHE)
      DOUBLE ACUTE ACCENT (ACUTE ACCENT)
      DOUBLE EXCLAMATION MARK (EXCLAMATION MARK)
      DOUBLE HIGH-REVERSED-9 QUOTATION MARK (SINGLE HIGH-REVERSED-9
            QUOTATION MARK)
      DOUBLE INTEGRAL (INTEGRAL)
      DOUBLE LOW-9 QUOTATION MARK (SINGLE LOW-9 QUOTATION MARK)
      DOUBLE PRIME (PRIME)
      DOUBLE SUBSET (SUBSET OF)
      DOUBLE SUPERSET (SUPERSET OF)
      DOUBLE VERTICAL LINE (VERTICAL LINE)
      HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT (HEAVY SINGLE COMMA
            QUOTATION MARK ORNAMENT)
      HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT (HEAVY SINGLE
            TURNED COMMA QUOTATION MARK ORNAMENT)
      LATIN LETTER LATERAL CLICK (LATIN LETTER DENTAL CLICK)
      LEFT DOUBLE QUOTATION MARK (LEFT SINGLE QUOTATION MARK)
      LEFT-POINTING DOUBLE ANGLE QUOTATION MARK (SINGLE LEFT-POINTING
            ANGLE QUOTATION MARK)
      MODIFIER LETTER DOUBLE PRIME (MODIFIER LETTER PRIME)
      MUCH GREATER-THAN (GREATER-THAN SIGN)
      MUCH LESS-THAN (LESS-THAN SIGN)
      PROPORTION (RATIO)
      REVERSED DOUBLE PRIME (REVERSED PRIME)
      RIGHT DOUBLE QUOTATION MARK (RIGHT SINGLE QUOTATION MARK)
      RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK (SINGLE RIGHT-
            POINTING ANGLE QUOTATION MARK)
      SURFACE INTEGRAL (CONTOUR INTEGRAL)

but not COMBINING DOUBLE ACUTE ACCENT or other doubled accent
characters. (To avoid ambiguity a presentation suggestion always applies
to the base form plus all preceding combining characters.) Also, not
COMBINING DOUBLE TILDE LEFT HALF where the DOUBLE refers to total width
rather than repetition, or DOUBLE DAGGER, DOUBLE UNION, etc, where the
semantic is not "doubled".

   Easy to do algorithmically.
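
   A minimal sketch of the algorithmic fallback, in Python, assuming
the rule stated above that a suggestion applies to the base form plus
its combining characters. The token string and the function name are
hypothetical, and the closer kerning is left to the renderer.

```python
import unicodedata

# Hypothetical stand-in for the proposed character.
DOUBLE = "<PRESENTATION SUGGESTION DOUBLE>"


def expand_double(sequence):
    """Replace each DOUBLE by a second copy of the preceding cluster."""
    out = []
    for item in sequence:
        if item != DOUBLE:
            out.append(item)
            continue
        if not out:
            raise ValueError("DOUBLE has nothing to apply to")
        # Walk back over combining marks to find the base character.
        start = len(out) - 1
        while start > 0 and unicodedata.combining(out[start]) != 0:
            start -= 1
        out.extend(out[start:])  # repeat base plus its combining marks
    return out
```

   With this, PRIME followed by DOUBLE renders as two primes, and a
letter carrying an accent is repeated together with its accent.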

   Unlikely to be very productive in forming new characters, as it's
easier to just write a character twice, and there is visually little
difference.

   If used but not recognised, quite likely to cause the resulting text
to be misinterpreted. Not very desirable.

PRESENTATION SUGGESTION DOUBLE-STRUCK
------------ ---------- -------------

   Requests that a double-struck, "open-face", "blackboard bold" font be
used.

   Used in

      DOUBLE-STRUCK CAPITAL C (LATIN CAPITAL LETTER C)
      DOUBLE-STRUCK CAPITAL H (LATIN CAPITAL LETTER H)
      DOUBLE-STRUCK CAPITAL N (LATIN CAPITAL LETTER N)
      DOUBLE-STRUCK CAPITAL P (LATIN CAPITAL LETTER P)
      DOUBLE-STRUCK CAPITAL Q (LATIN CAPITAL LETTER Q)
      DOUBLE-STRUCK CAPITAL R (LATIN CAPITAL LETTER R)
      DOUBLE-STRUCK CAPITAL Z (LATIN CAPITAL LETTER Z)

   Hard to do algorithmically.

   May well be productive---in particular, F (as in "Let F be a field
...") is missing, but often seen in the literature.

   If used but not rendered, confusion is likely to be minimal, so
highly desirable.

PRESENTATION SUGGESTION FULLWIDTH
------------ ---------- ---------

   This is simply used for characters whose decompositions include
<wide>. It indicates that, if there is a choice between 2 glyphs (the
single-cell one or the double-cell one), the double-cell one should be
chosen. It enables software to use decomposition and get good results
without needing to understand anything else about fullwidth/halfwidth
characters.

PRESENTATION SUGGESTION HALFWIDTH
------------ ---------- ---------

   This is simply used for characters whose decompositions include
<small> or <narrow>. It indicates that, if there is a choice between 2
glyphs (the single-cell one or the double-cell one), the single-cell one
should be chosen. It enables software to use decomposition and get good
results without needing to understand anything else about fullwidth/
halfwidth characters.
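
   The replacement of tags by suggestion characters can be sketched in
Python using the decomposition field that the standard unicodedata
module exposes. The token strings and the function name are invented
for illustration, and the module reflects whatever version of the
character database ships with the interpreter, not necessarily the one
discussed here.

```python
import unicodedata

# Hypothetical stand-ins for the two proposed characters.
FULLWIDTH = "<PRESENTATION SUGGESTION FULLWIDTH>"
HALFWIDTH = "<PRESENTATION SUGGESTION HALFWIDTH>"

# Markup tag in the decomposition field -> suggestion that replaces it.
TAG_TO_SUGGESTION = {
    "<wide>": FULLWIDTH,
    "<narrow>": HALFWIDTH,
    "<small>": HALFWIDTH,
}


def retag(ch):
    """Return base character(s) followed by the suggestion, if any."""
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0] not in TAG_TO_SUGGESTION:
        return [ch]  # no width tag: leave the character alone
    bases = [chr(int(code_point, 16)) for code_point in fields[1:]]
    return bases + [TAG_TO_SUGGESTION[fields[0]]]
```

   Software that knows nothing about widths can then drop the final
token and still show the ordinary letter.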

PRESENTATION SUGGESTION HEAVY
------------ ---------- -----

   Requests the character be rendered in a heavy, "bold", or "black"
font. The style is frequently used with important semantic content in
mathematics, where it is used to represent a vector, and the magnitude
of the vector is represented by the corresponding non-heavy character.

   There are 16 heavy characters already in the U C S:

      HEAVY ASTERISK (ASTERISK)
      HEAVY BALLOT X (BALLOT X)
      HEAVY CHECK MARK (CHECK MARK)
      HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT (RIGHT DOUBLE
            QUOTATION MARK)
      HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT (LEFT DOUBLE
            QUOTATION MARK)
      HEAVY EIGHT POINTED RECTILINEAR BLACK STAR (EIGHT POINTED
            RECTILINEAR BLACK STAR)
      HEAVY EIGHT TEARDROP-SPOKED PROPELLER ASTERISK (EIGHT TEARDROP-
            SPOKED PROPELLER ASTERISK)
      HEAVY FOUR BALLOON-SPOKED ASTERISK (FOUR BALLOON-SPOKED ASTERISK)
      HEAVY MULTIPLICATION X (MULTIPLICATION X)
      HEAVY OPEN CENTRE CROSS (OPEN CENTRE CROSS)
      HEAVY OUTLINED BLACK STAR (OUTLINED BLACK STAR)
      HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT (RIGHT SINGLE
            QUOTATION MARK)
      HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT (LEFT SINGLE
            QUOTATION MARK)
      HEAVY SPARKLE (SPARKLE)
      HEAVY TEARDROP-SPOKED ASTERISK (TEARDROP-SPOKED ASTERISK)
      HEAVY VERTICAL BAR (MEDIUM VERTICAL BAR)

HEAVY TEARDROP-SPOKED PINWHEEL ASTERISK and HEAVY CHEVRON SNOWFLAKE both
appear to be heavy, but the base form is not encoded. (This reminds me
of the situation with proto-Indo-European words whose existence we can
deduce without direct evidence.) Maybe they should be added.

   Hard to do well algorithmically, but easy to do to some legible
standard.

   If used but not recognised, unlikely to cause the resulting text to
be misinterpreted (except in the mathematical use), so highly desirable.

PRESENTATION SUGGESTION INVERTED
------------ ---------- --------

   Rotates the character (out of the paper) through a half-turn about a
horizontal axis; equivalently, reflects the character about the
horizontal axis. For characters where "inverted" and "turned" are
equivalent, we describe the character as "turned", out of deference to
metal typography.

   There are 5 characters which are inverted copies of other characters:

      LATIN LETTER INVERTED GLOTTAL STOP (LATIN LETTER GLOTTAL STOP)
      LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE (LATIN LETTER
            GLOTTAL STOP WITH STROKE)
      LATIN LETTER SMALL CAPITAL INVERTED R (LATIN LETTER SMALL
            CAPITAL R)
      LOWER BLADE SCISSORS (UPPER BLADE SCISSORS)
      UPPER RIGHT PENCIL (LOWER RIGHT PENCIL)

   There is also INVERTED LAZY S, but no LAZY S. However, LAZY S can be
seen as a rotated version of LATIN SMALL LETTER S, which we can obtain
if we use a PRESENTATION SUGGESTION ROTATED decomposition, described
later.

   Some arrows can be derived from others by inverting:

      SOUTHEAST ARROW = NORTHEAST ARROW + PRESENTATION SUGGESTION
            INVERTED
      RIGHTWARDS HARPOON WITH BARB DOWNWARDS = RIGHTWARDS HARPOON WITH
            BARB UPWARDS + PRESENTATION SUGGESTION INVERTED

In each case, the one pointing up (positive) is the one we define as
"the right way up", and its image is the "inverted" glyph.

   This is very easy to do in software, and the consequences of ignoring
it are likely to be severe if arrows are important. (This only affects
people who try to make up new characters, as existing characters are
already encoded and should be well understood.)

PRESENTATION SUGGESTION ITALIC
------------ ---------- ------

   Requests the glyph be rendered in an italic, oblique, or slanted
font. This may be just slanted, or it may have additional ornamentation
at the ends of strokes, but in this case should still be distinguishable
from SCRIPT (q v).

   Only needed for 2 characters currently encoded.

      PLANCK CONSTANT (LATIN SMALL LETTER H)
      PLANCK CONSTANT OVER TWO PI (LATIN SMALL LETTER H WITH STROKE)

   In mathematical text, there is usually a font difference between the
characters used in the running text and the characters used for ordinary
mathematical symbols. The recommended way to mark the distinction is
with PRESENTATION SUGGESTION ITALIC. Symbols represented as Greek
characters are sometimes printed in a recognisably italic font, and
sometimes an upright one: if there is only an italic Greek font
available, it should be used for Greek characters with or without a
PRESENTATION SUGGESTION ITALIC.

   Slanting, at least, can be done algorithmically with little
difficulty for both outline and bit-mapped fonts.
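
   For an outline font, slanting amounts to a horizontal shear of the
control points. A minimal sketch in Python follows; the function name
and the default slope of 0.25 are arbitrary choices for illustration,
not values taken from any font technology.

```python
def slant(points, slope=0.25):
    """Shear outline points to the right: x' = x + slope * y, y' = y."""
    return [(x + slope * y, y) for x, y in points]
```

   Points on the baseline (y = 0) stay where they are, so the glyph
leans without moving along the line.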

   If used but not recognised, unlikely to cause the resulting text to
be misinterpreted (even in a mathematical application), so this is very
desirable even though it's only used for 2 existing characters.

PRESENTATION SUGGESTION LARGE
------------ ---------- -----

   A larger version of the same character. Used in

      LIGHT VERTICAL BAR (VERTICAL BAR)
      MULTIPLICATION X (MULTIPLICATION SIGN)
      N-ARY COPRODUCT (GREEK CAPITAL LETTER PI)
      N-ARY INTERSECTION (INTERSECTION)
      N-ARY LOGICAL AND (WEDGE)
      N-ARY LOGICAL OR (VEE)
      N-ARY PRODUCT (GREEK CAPITAL LETTER PI)
      N-ARY SUMMATION (GREEK CAPITAL LETTER SIGMA)
      N-ARY UNION (UNION)

   It's odd that although there's an N-ARY COPRODUCT, there's no
COPRODUCT. It could be represented as GREEK CAPITAL LETTER PI +
PRESENTATION SUGGESTION TURNED. PRESENTATION SUGGESTION LARGE would also
be the right one to use with HEBREW LETTER WIDE ALEPH/DALET/HE/KAF/
LAMED/FINAL MEM/RESH/TAV.

PRESENTATION SUGGESTION LIGATURE
------------ ---------- --------

   Requests some kind of "artistic combination" of the 2 characters into
a single glyph. This is another "binary operation": the PRESENTATION
SUGGESTION LIGATURE stands between two characters to be ligated. Either
or both may have their own combining marks: to give a combining mark to
the whole ligature, it would have to be first enclosed in START GROUP
... POP DIRECTIONAL FORMATTING.

   Could be used for

      CYRILLIC CAPITAL LETTER IOTIFIED BIG YUS (CYRILLIC LETTER
            PALOCHKA, CYRILLIC CAPITAL LETTER BIG YUS)
      CYRILLIC CAPITAL LETTER IOTIFIED E (CYRILLIC LETTER PALOCHKA,
            CYRILLIC CAPITAL LETTER E)
      CYRILLIC CAPITAL LETTER IOTIFIED LITTLE YUS (CYRILLIC LETTER
            PALOCHKA, CYRILLIC CAPITAL LETTER LITTLE YUS)
      CYRILLIC CAPITAL LIGATURE A IE (CYRILLIC CAPITAL LETTER A,
            CYRILLIC CAPITAL LETTER IE)
      CYRILLIC CAPITAL LIGATURE EN GHE (CYRILLIC CAPITAL LETTER EN,
            CYRILLIC CAPITAL LETTER GHE)
      CYRILLIC CAPITAL LIGATURE TE TSE (CYRILLIC CAPITAL LETTER TE,
            CYRILLIC CAPITAL LETTER TSE)
      CYRILLIC SMALL LETTER IOTIFIED BIG YUS (CYRILLIC LETTER
            PALOCHKA, CYRILLIC SMALL LETTER BIG YUS)
      CYRILLIC SMALL LETTER IOTIFIED E (CYRILLIC LETTER PALOCHKA,
            CYRILLIC SMALL LETTER E)
      CYRILLIC SMALL LETTER IOTIFIED LITTLE YUS (CYRILLIC LETTER
            PALOCHKA, CYRILLIC SMALL LETTER LITTLE YUS)
      CYRILLIC SMALL LIGATURE A IE (CYRILLIC SMALL LETTER A,
            CYRILLIC SMALL LETTER IE)
      CYRILLIC SMALL LIGATURE EN GHE (CYRILLIC SMALL LETTER EN,
            CYRILLIC SMALL LETTER GHE)
      CYRILLIC SMALL LIGATURE TE TSE (CYRILLIC SMALL LETTER TE,
            CYRILLIC SMALL LETTER TSE)
      L B BAR SYMBOL (LATIN SMALL LETTER L, LATIN SMALL LETTER B)
      LATIN CAPITAL LETTER AE WITH MACRON (LATIN CAPITAL LETTER A,
            LATIN CAPITAL LETTER E)
      LATIN CAPITAL LETTER AE (LATIN CAPITAL LETTER A, LATIN CAPITAL
            LETTER E)
      LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON (LATIN
            CAPITAL LETTER D, LATIN SMALL LETTER Z)
      LATIN CAPITAL LETTER DZ WITH CARON (LATIN CAPITAL LETTER D,
            LATIN CAPITAL LETTER Z)
      LATIN CAPITAL LETTER L WITH SMALL LETTER J (LATIN CAPITAL LETTER
            L, LATIN SMALL LETTER J)
      LATIN CAPITAL LETTER LJ (LATIN CAPITAL LETTER L, LATIN CAPITAL
            LETTER J)
      LATIN CAPITAL LETTER N WITH SMALL LETTER J (LATIN CAPITAL LETTER
            N, LATIN SMALL LETTER J)
      LATIN CAPITAL LETTER NJ (LATIN CAPITAL LETTER N, LATIN CAPITAL
            LETTER J)
      LATIN CAPITAL LIGATURE IJ (LATIN CAPITAL LETTER I, LATIN CAPITAL
            LETTER J)
      LATIN CAPITAL LIGATURE OE (LATIN CAPITAL LETTER O, LATIN CAPITAL
            LETTER E)
      LATIN LETTER SMALL CAPITAL OE (LATIN CAPITAL LETTER O, LATIN
            CAPITAL LETTER E)
      LATIN SMALL LETTER AE WITH MACRON (LATIN SMALL LETTER A, LATIN
            SMALL LETTER E)
      LATIN SMALL LETTER DEZH DIGRAPH (LATIN SMALL LETTER D, LATIN
            SMALL LETTER EZH)
      LATIN SMALL LETTER DZ DIGRAPH (LATIN SMALL LETTER D, LATIN
            SMALL LETTER Z)
      LATIN SMALL LETTER DZ DIGRAPH WITH CURL (LATIN SMALL LETTER D,
            LATIN SMALL LETTER Z WITH CURL)
      LATIN SMALL LETTER DZ WITH CARON (LATIN SMALL LETTER D, LATIN
            SMALL LETTER Z)
      LATIN SMALL LETTER HV (LATIN SMALL LETTER H, LATIN SMALL LETTER V)
      LATIN SMALL LETTER LEZH (LATIN SMALL LETTER L, LATIN
            SMALL LETTER EZH)
      LATIN SMALL LETTER LJ (LATIN SMALL LETTER L, LATIN
            SMALL LETTER J)
      LATIN SMALL LETTER NJ (LATIN SMALL LETTER N, LATIN
            SMALL LETTER J)
      LATIN SMALL LETTER REVERSED OPEN E WITH HOOK (LATIN SMALL LETTER
            E + PRESENTATION SUGGESTION VARIANT + PRESENTATION
            SUGGESTION REVERSED, MODIFIER LETTER RHOTIC HOOK)
      LATIN SMALL LETTER SHARP S (LATIN SMALL LETTER S, LATIN
            SMALL LETTER S)
      LATIN SMALL LETTER SCHWA WITH HOOK (LATIN SMALL LETTER SCHWA,
            MODIFIER LETTER RHOTIC HOOK)
      LATIN SMALL LETTER TC DIGRAPH WITH CURL (LATIN SMALL LETTER T,
            LATIN SMALL LETTER C WITH CURL)
      LATIN SMALL LETTER TESH DIGRAPH (LATIN SMALL LETTER T, LATIN
            SMALL LETTER ESH)
      LATIN SMALL LETTER TS DIGRAPH (LATIN SMALL LETTER T, LATIN
            SMALL LETTER S)
      LATIN SMALL LIGATURE FF (LATIN SMALL LETTER F, LATIN
            SMALL LETTER F)
      LATIN SMALL LIGATURE FFI (LATIN SMALL LETTER F, LATIN
            SMALL LETTER F, LATIN SMALL LETTER I)
      LATIN SMALL LIGATURE FFL (LATIN SMALL LETTER F, LATIN
            SMALL LETTER F, LATIN SMALL LETTER L)
      LATIN SMALL LIGATURE FI (LATIN SMALL LETTER F, LATIN
            SMALL LETTER I)
      LATIN SMALL LIGATURE FL (LATIN SMALL LETTER F, LATIN
            SMALL LETTER L)
      LATIN SMALL LIGATURE IJ (LATIN SMALL LETTER I, LATIN
            SMALL LETTER J)
      LATIN SMALL LIGATURE LONG S T (LATIN SMALL LETTER LONG S, LATIN
            SMALL LETTER T)
      LATIN SMALL LIGATURE OE (LATIN SMALL LETTER O, LATIN
            SMALL LETTER E)
      LATIN SMALL LIGATURE ST (LATIN SMALL LETTER S, LATIN
            SMALL LETTER T)
      NUMERO SIGN (LATIN CAPITAL LETTER N, LATIN SMALL LETTER O)
      PRESCRIPTION TAKE (LATIN CAPITAL LETTER P, LATIN SMALL LETTER X)

   (The first line of this table should be read as implying the
existence of a decomposition

      CYRILLIC CAPITAL LETTER IOTIFIED BIG YUS = CYRILLIC LETTER
            PALOCHKA + PRESENTATION SUGGESTION LIGATURE + CYRILLIC
            CAPITAL LETTER BIG YUS,

and similarly for the rest. If there are more than 2 characters, it
means there is more than 1 ligature, as in

      LATIN SMALL LIGATURE FFI = LATIN SMALL LETTER F + PRESENTATION
            SUGGESTION LIGATURE + LATIN SMALL LETTER F + PRESENTATION
            SUGGESTION LIGATURE + LATIN SMALL LETTER I.)

   It could be argued that CYRILLIC CAPITAL LETTER YU is historically
CYRILLIC CAPITAL LETTER IOTIFIED O and so should have a decomposition as
CYRILLIC LETTER PALOCHKA, CYRILLIC CAPITAL LETTER O. (If this horrifies
you, see the remarks on O WITH STROKE, below.)

   There is an argument for some of these---e g, LATIN CAPITAL LETTER AE
---that they are not ligatures, and should not be decomposed as such.
However, even when LATIN CAPITAL LETTER AE is being used as the letter
ash, it is still appropriate to render it as AE if its glyph is not
available: you'd see text like "AElfred the Great". (The comments for "O
WITH STROKE" also apply here.)

   Strangely, the decomposition with ligature is most useful for
renderers that can't do ligatures, e g, cell-based character terminals.
They can just look up the decomposition and render the two glyphs---
ignoring the PRESENTATION SUGGESTION LIGATURE completely---and get good,
legible results.
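
   Both the table reading and the cell-terminal fallback can be
sketched in a few lines of Python. The token string and the function
names are hypothetical, chosen only for this illustration.

```python
# Hypothetical stand-in for the proposed infix character.
LIGATURE = "<PRESENTATION SUGGESTION LIGATURE>"


def ligature_decomposition(components):
    """Interleave the components with LIGATURE: n parts, n - 1 tokens."""
    if len(components) < 2:
        raise ValueError("a ligature needs at least two components")
    sequence = [components[0]]
    for component in components[1:]:
        sequence += [LIGATURE, component]
    return sequence


def cell_terminal_fallback(sequence):
    """Renderer that cannot ligate: drop the tokens, keep the glyphs."""
    return "".join(item for item in sequence if item != LIGATURE)
```

   Applied to the components A and E and followed by "lfred", the
fallback yields "AElfred", as in the example above.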

PRESENTATION SUGGESTION OUTLINED
------------ ---------- --------

   Surrounds the character with a narrow line. 5 characters are
described as "outlined":

      HEAVY OUTLINED BLACK STAR (HEAVY BLACK STAR)
      OUTLINED BLACK STAR (BLACK STAR)
      OUTLINED GREEK CROSS (GREEK CROSS)
      OUTLINED LATIN CROSS (LATIN CROSS)
      STRESS OUTLINED WHITE STAR (WHITE STAR)

and EIGHT PETALLED OUTLINED BLACK FLORETTE lacks a base form.

   Possible to do algorithmically, but seems like a very specialised
thing to do for such little gain.

   Substitution of the non-outlined glyph is unlikely to cause
legibility problems though, so this would be a good decomposition to
have.

PRESENTATION SUGGESTION PLINTHED
------------ ---------- --------

   Requests that a "plinth" be drawn, with the glyph as the top surface.
This is called SHADOWED in character names. Conventionally, the glyph is
above and slightly to the left of the observer. This can be changed by
using TURNED, INVERTED or REVERSED.

   Could be used for

      SHADOWED WHITE LATIN CROSS (WHITE LATIN CROSS)
      SHADOWED WHITE STAR (WHITE STAR)
      SHADOWED WHITE CIRCLE (WHITE CIRCLE)
      LOWER RIGHT SHADOWED WHITE SQUARE (WHITE SQUARE)
      UPPER RIGHT SHADOWED WHITE SQUARE (WHITE SQUARE)
      HEAVY LOWER RIGHT-SHADOWED WHITE RIGHTWARDS ARROW (RIGHTWARDS
            ARROW)
      HEAVY UPPER RIGHT-SHADOWED WHITE RIGHTWARDS ARROW (RIGHTWARDS
            ARROW)

and there are some plinthed characters that have no base form:
BACK-TILTED SHADOWED WHITE RIGHTWARDS ARROW, FRONT-TILTED SHADOWED WHITE
RIGHTWARDS ARROW, NOTCHED LOWER RIGHT-SHADOWED WHITE RIGHTWARDS ARROW,
and NOTCHED UPPER RIGHT-SHADOWED WHITE RIGHTWARDS ARROW.

   Hard to do well in software, but a missing plinth is not going to
make the difference between comprehension and confusion, so the
decompositions would be useful.

PRESENTATION SUGGESTION QUADRUPLE
------------ ---------- ---------

   Four copies of a glyph, kerned closer together than if it were just
written four times.

   The only such character is COMBINING FOUR DOTS ABOVE, and it has no
spacing form; so this character is useless, and we don't need to
consider 5 and above.

PRESENTATION SUGGESTION REVERSED
------------ ---------- --------

   Rotates the character (out of the paper) through a half-turn about a
vertical axis; equivalently, reflects the character about the vertical
axis. For characters where "reversed" and "turned" are equivalent, we
describe the character as "turned", out of deference to metal
typography.

      LATIN CAPITAL LETTER EZH REVERSED (LATIN CAPITAL LETTER EZH)
      LATIN LETTER PHARYNGEAL VOICED FRICATIVE (LATIN LETTER GLOTTAL
            STOP)
      LATIN LETTER REVERSED ESH LOOP (LATIN LETTER ESH)
      LATIN LETTER REVERSED GLOTTAL STOP WITH STROKE (LATIN LETTER
            GLOTTAL STOP WITH STROKE)
      LATIN SMALL LETTER CLOSED REVERSED OPEN E (LATIN SMALL LETTER
            CLOSED OPEN E)
      LATIN SMALL LETTER EZH REVERSED (LATIN SMALL LETTER EZH)
      LATIN SMALL LETTER REVERSED E (LATIN SMALL LETTER E)
      LATIN SMALL LETTER REVERSED OPEN E (LATIN SMALL LETTER OPEN E)
      LATIN SMALL LETTER REVERSED R WITH FISHHOOK (LATIN SMALL LETTER
            R WITH FISHHOOK)
      LATIN SMALL LETTER SQUAT REVERSED ESH (LATIN SMALL LETTER ESH)
      REVERSE SOLIDUS (SOLIDUS)
      REVERSED DOUBLE PRIME (DOUBLE PRIME)
      REVERSED NOT SIGN (NOT SIGN)
      REVERSED PRIME (PRIME)
      REVERSED TILDE (TILDE)
      REVERSED TRIPLE PRIME (TRIPLE PRIME)

   Some arrows can be derived from others by reversal:

      NORTHWEST ARROW = NORTHEAST ARROW + PRESENTATION SUGGESTION
            REVERSED
      LEFTWARDS ARROW WITH HOOK = RIGHTWARDS ARROW WITH HOOK +
            PRESENTATION SUGGESTION REVERSED

In each case, the one pointing to the right (positive) is the one we
define as "forwards", and its image is the "reversed" glyph.

   Combining characters cannot take advantage of the presentation
suggestions, as discussed elsewhere, so there is no way to decompose a
character like COMBINING REVERSED COMMA ABOVE by means of PRESENTATION
SUGGESTION REVERSED. (... Unless you wish to reverse the character, put
an ordinary comma above on it, and turn it back: COMBINING REVERSED
COMMA ABOVE = PRESENTATION SUGGESTION REVERSED + COMBINING COMMA ABOVE +
PRESENTATION SUGGESTION REVERSED. What sick mind would try to do such a
thing??)

   It seems possible that there should be some relationship between
PRESENTATION SUGGESTION REVERSED and the "symmetric swapping" that
happens when text is flowing from right to left. It may be appropriate
to decompose all "right" characters as reversed "left" characters, to
make this explicit:

      RIGHT PARENTHESIS = LEFT PARENTHESIS + PRESENTATION SUGGESTION
            REVERSED
      RIGHT SQUARE BRACKET = LEFT SQUARE BRACKET + PRESENTATION
            SUGGESTION REVERSED

etc. I'm not convinced this is so desirable though.

   Reversing is very easy to do in software, and the consequences of
ignoring it are likely to be severe if arrows are important. (This only
affects people who try to make up new symbols, as existing characters
are already encoded and should be well understood.)

PRESENTATION SUGGESTION ROTATED
------------ ---------- -------

   This is a rotation of a quarter-turn anticlockwise (+90 degrees),
staying in the plane of the paper. Typographically, it is unusual to use
rotated characters, because traditional type is designed to fit in a
constant height, but with varying widths. A rotated character would fall
out too easily. There are only 2 characters in the U C S that are
rotated:

      INVERTED LAZY S (LATIN SMALL LETTER S)
      ROTATED HEAVY BLACK HEART BULLET (FLORAL HEART)

   Lots of arrows could be described as rotated versions of other
arrows:

      UPWARDS ARROW = RIGHTWARDS ARROW + PRESENTATION SUGGESTION
            ROTATED
      NORTHWEST ARROW = NORTHEAST ARROW + PRESENTATION SUGGESTION
            ROTATED
      UP DOWN ARROW = LEFT RIGHT ARROW + PRESENTATION SUGGESTION
            ROTATED

(This list is not exhaustive.)

   It can be done algorithmically, but is harder than INVERTED, REVERSED
or TURNED because the resulting character has a different bounding box.
This means it is not just a question of moving the ink around, but has
wider implications for line-length etc. (This is related to the
typographic point.)

   If widely deployed, could be a very useful source of new symbols in
many different disciplines.

   The decompositions involving the <vertical> tag (21 of them) should
presumably be replaced by ones involving PRESENTATION SUGGESTION
ROTATED, but I don't know enough about eastern writing systems to say.

   A clockwise rotation could be represented by 3 copies of this one, or
by rotating a turned character, or, if both rotations command equal
importance, by a character PRESENTATION SUGGESTION ROTATED CLOCKWISE.

PRESENTATION SUGGESTION SANS SERIF
------------ ---------- ---- -----

   This requests a sans-serif font to be used. Since there is no
requirement in Unicode for a font to have serifs in the first place,
this could easily be a null operation. However, the concept is in
the U C S in the names of the dingbats DINGBAT CIRCLED SANS-SERIF DIGIT
ONE--NUMBER TEN and DINGBAT NEGATIVE CIRCLED SANS-SERIF DIGIT ONE--
NUMBER TEN. Even with that, it seems unlikely that anyone would support
it as a character. If Dingbats are to be allowed decompositions (and
there are good reasons to do so), maybe the sans serif numbers could be
decomposed using PRESENTATION SUGGESTION VARIANT, together with
PRESENTATION SUGGESTION WHITE and COMBINING ENCLOSING CIRCLE.

PRESENTATION SUGGESTION SCRIPT
------------ ---------- ------

   Requests a script font be used. Many script characters are already
present:

      LATIN CAPITAL LETTER V WITH HOOK (LATIN CAPITAL LETTER V)
      LATIN SMALL LETTER ALPHA (LATIN SMALL LETTER A)
      LATIN SMALL LETTER SCRIPT G (LATIN SMALL LETTER G)
      LATIN SMALL LETTER V WITH HOOK (LATIN SMALL LETTER V)
      SCRIPT CAPITAL B (LATIN CAPITAL LETTER B)
      SCRIPT CAPITAL E (LATIN CAPITAL LETTER E)
      SCRIPT CAPITAL F (LATIN CAPITAL LETTER F)
      SCRIPT CAPITAL H (LATIN CAPITAL LETTER H)
      SCRIPT CAPITAL I (LATIN CAPITAL LETTER I)
      SCRIPT CAPITAL L (LATIN CAPITAL LETTER L)
      SCRIPT CAPITAL M (LATIN CAPITAL LETTER M)
      SCRIPT CAPITAL P (LATIN CAPITAL LETTER P)
      SCRIPT CAPITAL R (LATIN CAPITAL LETTER R)
      SCRIPT SMALL E (LATIN SMALL LETTER E)
      SCRIPT SMALL G (LATIN SMALL LETTER G)
      SCRIPT SMALL L (LATIN SMALL LETTER L)
      SCRIPT SMALL O (LATIN SMALL LETTER O)

   (The v's "with hook" are really script letters. There's more on hooks
later.)

   Here seems a good time to mention decompositions of currency symbols.
Since we know that the currency symbols were invented as typographic
variants of existing characters, it seems a good idea to encode this.
Then (a) software with no glyph can generate an acceptable alternative
and (b) when a new currency is invented, a symbol can be given to it
without needing to go through a standardisation process. I suggest that

      POUND SIGN = LATIN CAPITAL LETTER L + PRESENTATION SUGGESTION
            SCRIPT + COMBINING SHORT STROKE OVERLAY

is historically right, right by current usage, and gives a result that
will be understandable to an English national if there is no better
glyph available ('L'). Other currency symbols should be treated the same
way, including DOLLAR SIGN (= LATIN CAPITAL LETTER S + COMBINING LONG
VERTICAL LINE OVERLAY) and CENT SIGN (= LATIN SMALL LETTER C + COMBINING
LONG VERTICAL LINE OVERLAY).

   It seems futile to deny history, and claim that these are fully-
formed characters in their own right. This doesn't deny anyone the right
to design specialised glyphs, if they wish. There is no reason to change
current practice, just to systematise it.

   (The symbol for Euro looks to me like LATIN SMALL LETTER C +
COMBINING SHORT SOLIDUS OVERLAY. No doubt, any such suggestion would
give the European Commission a collective fit ...)

   If the characters LATIN SMALL LETTER SCRIPT G and SCRIPT SMALL G are
supposed to be different from each other, then there is a mistake here,
and one of them should be given a different decomposition (maybe "with
curl").

PRESENTATION SUGGESTION SHADOWED
------------ ---------- --------

   Requests that a shadow be drawn behind the glyph. Conventionally, the
light source is behind the left shoulder of the observer, as if the
observer was right handed and working at a desk. (This can be changed by
using TURNED, REVERSED or INVERTED.) The shadow is cast on a flat
surface behind the glyph (a "drop-shadow").

   Could be used for

      LOWER RIGHT DROP-SHADOWED WHITE SQUARE = BLACK SQUARE +
            PRESENTATION SUGGESTION WHITE + PRESENTATION SUGGESTION
            SHADOWED
      UPPER RIGHT DROP-SHADOWED WHITE SQUARE = BLACK SQUARE +
            PRESENTATION SUGGESTION WHITE + PRESENTATION SUGGESTION
            SHADOWED + PRESENTATION SUGGESTION INVERTED

   Not much gain there, and unlikely to be useful for anything very
much.

PRESENTATION SUGGESTION SMALL
------------ ---------- -----

   This is not the <small> marked in decompositions---we use PRESEN-
TATION SUGGESTION HALFWIDTH for that, since it means the same thing for
western characters that halfwidth does for ideographic ones. It is also
not the SMALL in LATIN SMALL LETTER A, which means lower-case. This
SMALL just asks for a smaller version of the same character.

   It is used in

      BLACK DOWN-POINTING SMALL TRIANGLE (BLACK DOWN-POINTING TRIANGLE)
      BLACK LEFT-POINTING SMALL TRIANGLE (BLACK LEFT-POINTING TRIANGLE)
      BLACK RIGHT-POINTING SMALL TRIANGLE (BLACK RIGHT-POINTING
           TRIANGLE)
      BLACK SMALL SQUARE (BLACK SQUARE)
      BLACK UP-POINTING SMALL TRIANGLE (BLACK UP-POINTING TRIANGLE)
      LATIN LETTER SMALL CAPITAL B (LATIN CAPITAL LETTER B)
      LATIN LETTER SMALL CAPITAL G (LATIN CAPITAL LETTER G)
      LATIN LETTER SMALL CAPITAL G WITH HOOK (LATIN CAPITAL LETTER G
           WITH HOOK)
      LATIN LETTER SMALL CAPITAL H (LATIN CAPITAL LETTER H)
      LATIN LETTER SMALL CAPITAL I (LATIN CAPITAL LETTER I)
      LATIN LETTER SMALL CAPITAL INVERTED R (LATIN LETTER CAPITAL
           INVERTED R)
      LATIN LETTER SMALL CAPITAL L (LATIN CAPITAL LETTER L)
      LATIN LETTER SMALL CAPITAL N (LATIN CAPITAL LETTER N)
      LATIN LETTER SMALL CAPITAL OE (LATIN CAPITAL LIGATURE OE)
      LATIN LETTER SMALL CAPITAL R (LATIN CAPITAL LETTER R)
      LATIN LETTER SMALL CAPITAL Y (LATIN CAPITAL LETTER Y)
      WHITE DOWN-POINTING SMALL TRIANGLE (WHITE DOWN-POINTING TRIANGLE)
      WHITE LEFT-POINTING SMALL TRIANGLE (WHITE LEFT-POINTING TRIANGLE)
      WHITE RIGHT-POINTING SMALL TRIANGLE (WHITE RIGHT-POINTING
           TRIANGLE)
      WHITE SMALL SQUARE (WHITE SQUARE)
      WHITE UP-POINTING SMALL TRIANGLE (WHITE UP-POINTING TRIANGLE)

   SMALL ELEMENT OF and SMALL CONTAINS AS MEMBER are really presentation
variants of GREEK SMALL LETTER EPSILON, treated under PRESENTATION
SUGGESTION VARIANT.

PRESENTATION SUGGESTION SMALL LETTER TONE
------------ ---------- ----- ------ ----

   This suggests, of a digit, that a variant glyph be used of a style
suitable for marking Zhuang tone. It is for the following:

      LATIN SMALL LETTER TONE TWO = LATIN DIGIT TWO + PRESENTATION
            SUGGESTION SMALL LETTER TONE
      LATIN SMALL LETTER TONE FIVE = LATIN DIGIT FIVE + PRESENTATION
            SUGGESTION SMALL LETTER TONE
      LATIN SMALL LETTER TONE SIX = LATIN DIGIT SIX + PRESENTATION
            SUGGESTION SMALL LETTER TONE

   In addition, compatibility decompositions should be given for

      CYRILLIC SMALL LETTER ZE = <compat> + LATIN DIGIT THREE +
            PRESENTATION SUGGESTION SMALL LETTER TONE
      CYRILLIC SMALL LETTER CHE = <compat> + LATIN DIGIT FOUR +
            PRESENTATION SUGGESTION SMALL LETTER TONE

   By encoding this character, it becomes possible for sophisticated
software to render suitable glyphs for all the tone letters, without
needing separate encodings for latin small letter tones 3, 4.

PRESENTATION SUGGESTION STACK UP, PRESENTATION SUGGESTION STACK DOWN
------------ ---------- ----- --- ------------ ---------- ----- ----

   Requests that characters be stacked vertically up or down the page.
The first character in the stack is placed at its normal position. The
second is moved up or down to appear above or below the first.

   This is another idea, like COMPOSE, that could cause a lot of
problems, as it is not obvious where a sensible place to stop might be.

   Is a LESS-THAN OR EQUAL TO sign a stack of a LESS-THAN SIGN and a
MINUS SIGN? Although it looks like it in many fonts, we would really
prefer it to be a stack of LESS-THAN SIGN and EQUALS SIGN, because that
will look better if the renderer can't "do" stacks. (You'd see '<=',
which would probably be very helpful.) And it's very often given its own
glyph, with the underline parallel to the bottom part of the LESS-THAN
SIGN.

   Is an underlined character a character formed from COMBINING LOW
LINE, or is it a down-stack with MINUS SIGN?

   Can you make accented characters by stacking spacing accents above
letters?

   Despite these problems, the idea of a stack seems necessary. Consider
the character EQUAL TO BY DEFINITION. This character is an equals sign
with the small word 'def' on top of it. It seems ridiculous that this
should be an atomic character, when the reason for its existence is the
fact that d, e, f are the first 3 letters of the English word
'definition'. Whichever mathematician invented that symbol was clearly
"sticking things together", and not just coming up with an arbitrary
symbol from nowhere. Another mathematician will do a similar thing very
soon, and it seems wrong that (in a perfectly logical world) they would
have to get "approval" from the Unicode Consortium (in the form of a
character registration) before they could publish their book.

   The "stack" concept has an antecedent in TeX, where it is called
\buildrel (\stackrel in LaTeX).

   We need 2 different STACK characters to ensure visual harmony between
the different presentation forms that can be generated. Both are binary.

   The following are clearly compositions using PRESENTATION SUGGESTION
STACK UP. In some cases (e g, MEASURED BY), there would be a
PRESENTATION SUGGESTION SMALL for the second character.

      MINUS-OR-PLUS SIGN (PLUS SIGN, MINUS SIGN)
      APPROACHES THE LIMIT (EQUALS SIGN, DOT OPERATOR)
      RING EQUAL TO (EQUALS SIGN, RING OPERATOR)
      CORRESPONDS TO (EQUALS SIGN, FROWN)
      ESTIMATES (EQUALS SIGN, WEDGE)
      EQUIANGULAR TO (EQUALS SIGN, VEE)
      STAR EQUALS (EQUALS SIGN, STAR OPERATOR)
      DELTA EQUAL TO (EQUALS SIGN, INCREMENT)
      MEASURED BY (EQUALS SIGN, LATIN SMALL LETTER M)
      QUESTIONED EQUAL TO (EQUALS SIGN, QUESTION MARK)
 
The following use PRESENTATION SUGGESTION STACK DOWN:

      PLUS-MINUS SIGN (PLUS SIGN, MINUS SIGN)
      LESS-THAN OVER EQUAL TO (LESS-THAN SIGN, EQUALS SIGN)
      GREATER-THAN OVER EQUAL TO (GREATER-THAN SIGN, EQUALS SIGN)
      LESS-THAN BUT NOT EQUAL TO (LESS-THAN SIGN, NOT EQUAL TO)
      GREATER-THAN BUT NOT EQUAL TO (GREATER-THAN SIGN, NOT EQUAL TO)
      LESS-THAN OR EQUIVALENT TO (LESS-THAN SIGN, TILDE OPERATOR)
      GREATER-THAN OR EQUIVALENT TO (GREATER-THAN SIGN, TILDE OPERATOR)
      LESS-THAN OR GREATER-THAN (LESS-THAN SIGN, GREATER-THAN SIGN)
      GREATER-THAN OR LESS-THAN (GREATER-THAN SIGN, LESS-THAN SIGN)
      PRECEDES OR EQUIVALENT TO (PRECEDES, TILDE OPERATOR)
      SUCCEEDS OR EQUIVALENT TO (SUCCEEDS, TILDE OPERATOR)

   One character needs both:

      GEOMETRICALLY EQUAL TO = EQUALS SIGN + PRESENTATION SUGGESTION
            STACK UP + DOT OPERATOR + PRESENTATION SUGGESTION STACK
            DOWN + DOT OPERATOR.

   We also have

      NEITHER LESS-THAN NOR EQUIVALENT TO = START GROUP + LESS-THAN
            SIGN + PRESENTATION SUGGESTION STACK DOWN + TILDE OPERATOR
            + END GROUP + COMBINING LONG SOLIDUS OVERLAY,
      NEITHER GREATER-THAN NOR EQUIVALENT TO = START GROUP +
            GREATER-THAN SIGN + PRESENTATION SUGGESTION STACK DOWN +
            TILDE OPERATOR + END GROUP + COMBINING LONG SOLIDUS
            OVERLAY

and of course the character that caused all these problems:

      EQUAL TO BY DEFINITION = EQUALS SIGN + PRESENTATION SUGGESTION
            STACK UP + START GROUP + LATIN SMALL LETTER D + LATIN SMALL
            LETTER E + LATIN SMALL LETTER F + END GROUP + PRESENTATION
            SUGGESTION SMALL

(it seems horrible, but I see no real alternative). An unsophisticated
rendering engine will be able to make a shot at this as '=def', which
seems like as good a result as one might hope for.
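
   The unsophisticated rendering mentioned here can be sketched in a
few lines of Python. The sketch assumes a decomposition is held as a
list of character names; START GROUP, END GROUP and the PRESENTATION
SUGGESTION names are this proposal's and are not encoded characters, so
they are simply skipped. The function name plain_fallback is
hypothetical; unicodedata.lookup is the standard library call that maps
a character name to the character.

```python
import unicodedata

# Names used by this proposal that have no code point (assumption: an
# unsophisticated renderer treats them all as invisible).
CONTROL_NAMES = {"START GROUP", "END GROUP"}
CONTROL_PREFIX = "PRESENTATION SUGGESTION"

def plain_fallback(names):
    """Render a decomposition by printing only its ordinary characters."""
    out = []
    for name in names:
        if name in CONTROL_NAMES or name.startswith(CONTROL_PREFIX):
            continue  # ignore grouping and presentation suggestions
        out.append(unicodedata.lookup(name))
    return "".join(out)

EQUAL_TO_BY_DEFINITION = [
    "EQUALS SIGN", "PRESENTATION SUGGESTION STACK UP", "START GROUP",
    "LATIN SMALL LETTER D", "LATIN SMALL LETTER E", "LATIN SMALL LETTER F",
    "END GROUP", "PRESENTATION SUGGESTION SMALL",
]
LESS_THAN_OVER_EQUAL_TO = [
    "LESS-THAN SIGN", "PRESENTATION SUGGESTION STACK DOWN", "EQUALS SIGN",
]

print(plain_fallback(EQUAL_TO_BY_DEFINITION))
print(plain_fallback(LESS_THAN_OVER_EQUAL_TO))
```

   The second example gives the '<=' fallback discussed earlier for a
stack of LESS-THAN SIGN and EQUALS SIGN.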

   There are a few more similar compositions that could also be
entertained. (Somehow 2 straightforward ones in this area have slipped
through the net:

      COLON EQUALS = COLON + EQUALS SIGN,
      EQUALS COLON = EQUALS SIGN + COLON.)

   If widely deployed, the STACK operations could cause no end of havoc
by encouraging the creation of new "symbols" in a very uncontrolled way.
(On the other hand, maybe that's a good thing.)

PRESENTATION SUGGESTION SUBSCRIPT
------------ ---------- ---------

   Requests that a character be rendered at a smaller size, and with a
lower baseline.

   Would be used in all characters whose decomposition includes <sub>
(there are 15 of these).

   Easy to do algorithmically.
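
   As a rough sketch of what "algorithmically" might mean here: the
renderer only has to pick a smaller size and a baseline shift. The
percentages below are assumptions chosen for illustration (real fonts
carry their own values), and Placement and subscript are hypothetical
names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Placement:
    size: int            # glyph size in font units
    baseline_shift: int  # positive is up, negative is down

SCALE_PERCENT = 60  # assumed: subscripts drawn at 60% of normal size
SHIFT_PERCENT = 20  # assumed: baseline lowered by 20% of normal size

def subscript(normal_size):
    """Return the size and baseline shift for a subscripted character."""
    return Placement(
        size=normal_size * SCALE_PERCENT // 100,
        baseline_shift=-(normal_size * SHIFT_PERCENT // 100),
    )

print(subscript(1000))
```

   A superscript would be the same calculation with a positive shift.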

   If used but not recognised, the resulting text will be wrong, but
still better than if a substitute character was used.

PRESENTATION SUGGESTION SUPERSCRIPT
------------ ---------- -----------

   Requests that a character be rendered at a smaller size, and above
the baseline.

   Would be used in all characters whose decomposition includes <super>
(there are about 50 of these), as well as:

      ASTERISK = ASTERISK OPERATOR + PRESENTATION SUGGESTION
            SUPERSCRIPT
      DEGREE SIGN = RING OPERATOR + PRESENTATION SUGGESTION
            SUPERSCRIPT

   Easy to do algorithmically.

   If used but not recognised, the resulting text will be wrong, but
still better than if a substitute character was used.

PRESENTATION SUGGESTION TRIPLE
------------ ---------- ------

   Simply a glyph repeated to give 3 copies, though usually kerned
closer together than an actual repetition.

   Could be used in

      HORIZONTAL ELLIPSIS (FULL STOP)
      MODIFIER LETTER TRIPLE PRIME (MODIFIER LETTER PRIME)
      REVERSED TRIPLE PRIME (REVERSED PRIME)
      TRIPLE INTEGRAL (INTEGRAL)
      TRIPLE PRIME (PRIME)
      VERY MUCH GREATER-THAN (GREATER-THAN SIGN)
      VERY MUCH LESS-THAN (LESS-THAN SIGN)
      VOLUME INTEGRAL (CONTOUR INTEGRAL)

but not COMBINING THREE DOTS ABOVE or other tripled accent characters,
because to avoid ambiguity the suggestion must apply to the base form.

   Easy to do algorithmically.
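
   A minimal sketch of the layout arithmetic, assuming each copy has
the same advance width and that the kerning is a fixed amount (the
parameter tighten, a name of my own) taken off each gap:

```python
def triple_offsets(advance, tighten):
    """Return the x offsets of the 3 copies, in font units."""
    step = advance - tighten
    return [0, step, 2 * step]

def triple_advance(advance, tighten):
    """Return the total advance width of the tripled glyph."""
    # Three copies, with two gaps each closed up by `tighten`.
    return 3 * advance - 2 * tighten

print(triple_offsets(600, 100))
print(triple_advance(600, 100))
```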

   Unlikely to be very productive, as it's easier to just write a
character three times.

   If used but not recognised, quite likely to cause the resulting text
to be misinterpreted.

PRESENTATION SUGGESTION TURNED
------------ ---------- ------

   Rotates the character through half a turn in its own plane. Equi-
valent to REVERSED followed by INVERTED, or to ROTATED twice.
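
   These equivalences can be checked with 2x2 matrices. The sketch below
assumes the usual mathematical axes (x to the right, y up), and assumes
that INVERTED means a top-to-bottom mirror, which is what the
equivalence above requires. The names compose, ROTATED and so on are
just labels for this example.

```python
def compose(second, first):
    """Return the matrix for 'apply first, then second'."""
    return tuple(
        tuple(sum(second[i][k] * first[k][j] for k in range(2))
              for j in range(2))
        for i in range(2)
    )

ROTATED = ((0, -1), (1, 0))    # quarter-turn anticlockwise
REVERSED = ((-1, 0), (0, 1))   # mirror left-to-right
INVERTED = ((1, 0), (0, -1))   # mirror top-to-bottom (assumed meaning)
TURNED = ((-1, 0), (0, -1))    # half a turn
CLOCKWISE = ((0, 1), (-1, 0))  # quarter-turn clockwise

# TURNED is REVERSED followed by INVERTED, or ROTATED twice.
assert compose(INVERTED, REVERSED) == TURNED
assert compose(ROTATED, ROTATED) == TURNED

# A clockwise rotation is 3 copies of ROTATED, or ROTATED applied to a
# turned character.
assert compose(ROTATED, compose(ROTATED, ROTATED)) == CLOCKWISE
assert compose(ROTATED, TURNED) == CLOCKWISE
```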

   Could be used in:

      BECAUSE (THEREFORE)
      CONTAINS AS MEMBER (ELEMENT OF)
      FOR ALL (LATIN CAPITAL LETTER A)
      FROWN (SMILE)
      INVERTED EXCLAMATION MARK (EXCLAMATION MARK)
      INVERTED OHM SIGN (OHM SIGN)
      INVERTED QUESTION MARK (QUESTION MARK)
      LATIN CAPITAL LETTER OPEN O (LATIN CAPITAL LETTER C)
      LATIN CAPITAL LETTER REVERSED E (LATIN CAPITAL LETTER E)
      LATIN LETTER INVERTED GLOTTAL STOP (LATIN LETTER GLOTTAL STOP)
      LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE (LATIN LETTER
            GLOTTAL STOP WITH STROKE)
      LATIN LETTER SMALL CAPITAL INVERTED R (LATIN LETTER SMALL CAPITAL
            R)
      LATIN SMALL LETTER DOTLESS J WITH STROKE (LATIN SMALL LETTER F)
      LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK (LATIN SMALL
            LETTER F WITH HOOK)
      LATIN SMALL LETTER OPEN O (LATIN SMALL LETTER C)
      LATIN SMALL LETTER SCHWA (LATIN SMALL LETTER E)
      LATIN SMALL LETTER TURNED A (LATIN SMALL LETTER A)
      LATIN SMALL LETTER TURNED ALPHA (LATIN SMALL LETTER ALPHA)
      LATIN SMALL LETTER TURNED DELTA (GREEK SMALL LETTER DELTA)
      LATIN SMALL LETTER TURNED E (LATIN SMALL LETTER E)
      LATIN SMALL LETTER TURNED H (LATIN SMALL LETTER H)
      LATIN SMALL LETTER TURNED K (LATIN SMALL LETTER K)
      LATIN SMALL LETTER TURNED M (LATIN SMALL LETTER M)
      LATIN SMALL LETTER TURNED M WITH LONG LEG (LATIN SMALL LETTER M
            WITH LONG LEG)
      LATIN SMALL LETTER TURNED R (LATIN SMALL LETTER R)
      LATIN SMALL LETTER TURNED R WITH HOOK (LATIN SMALL LETTER R WITH
            HOOK)
      LATIN SMALL LETTER TURNED R WITH LONG LEG (LATIN SMALL LETTER R
            WITH LONG LEG)
      LATIN SMALL LETTER TURNED T (LATIN SMALL LETTER T)
      LATIN SMALL LETTER TURNED V (LATIN SMALL LETTER V)
      LATIN SMALL LETTER TURNED W (LATIN SMALL LETTER W)
      LATIN SMALL LETTER TURNED Y (LATIN SMALL LETTER Y)
      MINUS-OR-PLUS SIGN (PLUS-MINUS SIGN)
      N-ARY COPRODUCT (N-ARY PRODUCT)
      NABLA (INCREMENT)
      OCR INVERTED FORK (OCR FORK)
      THERE EXISTS (LATIN CAPITAL LETTER E)
      TURNED CAPITAL F (LATIN CAPITAL LETTER F)
      TURNED GREEK SMALL LETTER IOTA (GREEK SMALL LETTER IOTA)
      TURNED NOT SIGN (NOT SIGN)
      WEDGE (VEE)

   There are 2 turned characters that would not be coded directly with
PRESENTATION SUGGESTION TURNED: LATIN CAPITAL LETTER TURNED M and LATIN
CAPITAL LETTER SCHWA. These are large turned versions of a lower-case
character. Maybe they should be decomposed as

      LATIN CAPITAL LETTER SCHWA = LATIN SMALL LETTER E +
            PRESENTATION SUGGESTION LARGE + PRESENTATION SUGGESTION
            TURNED
      LATIN CAPITAL LETTER TURNED M = LATIN SMALL LETTER M +
            PRESENTATION SUGGESTION LARGE + PRESENTATION SUGGESTION
            TURNED

---or maybe these characters have a completely different origin?

   PRESENTATION SUGGESTION TURNED also allows lots of arrows to be
decomposed, e g

      RIGHTWARDS ARROW = LEFTWARDS ARROW + PRESENTATION SUGGESTION
            TURNED
      DOWNWARDS ARROW = UPWARDS ARROW + PRESENTATION SUGGESTION
            TURNED

and the same for many others.

   The 2 characters based on "dotless j with stroke" are really turned
f's; and the capital and small "open o" characters are really turned
c's.

PRESENTATION SUGGESTION VARIANT
------------ ---------- -------

   This is glyph modification. It's a bit of a miscellany, but it has
sound antecedents (e g, in TeX), and provided it is not overused, seems
to be useful. 24 characters can be understood as variant presentation
forms of other characters. They are listed here, followed by the base
character varied.

      COMPLEMENT (LATIN CAPITAL LETTER C)
      CURVED STEM PARAGRAPH SIGN ORNAMENT (PILCROW SIGN)
      CYRILLIC CAPITAL LETTER GHE WITH UPTURN (CYRILLIC CAPITAL LETTER
            GHE)
      CYRILLIC CAPITAL LETTER STRAIGHT U (CYRILLIC CAPITAL LETTER U)
      CYRILLIC SMALL LETTER GHE WITH UPTURN (CYRILLIC SMALL LETTER GHE)
      CYRILLIC SMALL LETTER STRAIGHT U (CYRILLIC SMALL LETTER U)
      EULER CONSTANT (LATIN CAPITAL LETTER E)
      GREEK BETA SYMBOL (GREEK SMALL LETTER BETA)
      GREEK KAPPA SYMBOL (GREEK SMALL LETTER KAPPA)
      GREEK LUNATE SIGMA SYMBOL (GREEK SMALL LETTER SIGMA)
      GREEK PHI SYMBOL (GREEK SMALL LETTER PHI)
      GREEK PI SYMBOL (GREEK SMALL LETTER PI)
      GREEK RHO SYMBOL (GREEK SMALL LETTER RHO)
      GREEK THETA SYMBOL (GREEK SMALL LETTER THETA)
      GREEK UPSILON WITH HOOK SYMBOL (GREEK CAPITAL LETTER UPSILON)
      LATIN CAPITAL LETTER OPEN E (LATIN CAPITAL LETTER E)
      LATIN LETTER STRETCHED C (LATIN CAPITAL LETTER C)
      LATIN SMALL LETTER LONG S (LATIN SMALL LETTER S)
      LATIN SMALL LETTER OPEN E (LATIN SMALL LETTER E)
      LATIN SMALL LETTER R WITH FISHHOOK (LATIN SMALL LETTER R)
      LATIN SMALL LETTER SQUAT REVERSED ESH (LATIN SMALL LETTER ESH,
            reversed)
      PARTIAL DIFFERENTIAL (LATIN SMALL LETTER D)
      SMALL CONTAINS AS MEMBER (GREEK SMALL LETTER EPSILON, turned)
      SMALL ELEMENT OF (GREEK SMALL LETTER EPSILON)

   Cannot be done algorithmically: either you have a variant glyph, or
you don't.

   Falling back to the base form is likely to give good results, except
in specialised fields, so this is a desirable decomposition to encode.

PRESENTATION SUGGESTION WHITE
------------ ---------- -----

   We assume that the ordinary state for a character is to be "black",
as this is the colour of ink. Some characters---normally those with
large solid regions---also exist in "white" variants. This is a request
for those characters to be used. Many characters have the word "black"
in their name. We just ignore this, claiming that it carries no semantic
value apart from emphasis.

   The following characters are white variants of others:

      BLACK CENTRE WHITE STAR (OPEN CENTRE BLACK STAR)
      CIRCLED HEAVY WHITE RIGHTWARDS ARROW (RIGHTWARDS ARROW)
      CIRCLED WHITE STAR (BLACK STAR)
      DOWNWARDS WHITE ARROW (DOWNWARDS ARROW)
      HEAVY LOWER RIGHT-SHADOWED WHITE RIGHTWARDS ARROW (RIGHTWARDS
            ARROW)
      HEAVY UPPER RIGHT-SHADOWED WHITE RIGHTWARDS ARROW (RIGHTWARDS
            ARROW)
      LEFT-SHADED WHITE RIGHTWARDS ARROW (RIGHTWARDS ARROW)
      LEFTWARDS WHITE ARROW (LEFTWARDS ARROW)
      LOWER RIGHT DROP-SHADOWED WHITE SQUARE (BLACK SQUARE)
      LOWER RIGHT SHADOWED WHITE SQUARE (BLACK SQUARE)
      RIGHT-SHADED WHITE RIGHTWARDS ARROW (RIGHTWARDS ARROW)
      RIGHTWARDS WHITE ARROW (RIGHTWARDS ARROW)
      SHADOWED WHITE CIRCLE (BLACK CIRCLE)
      SHADOWED WHITE STAR (BLACK STAR)
      STRESS OUTLINED WHITE STAR (BLACK STAR)
      UPPER RIGHT DROP-SHADOWED WHITE SQUARE (BLACK SQUARE)
      UPPER RIGHT SHADOWED WHITE SQUARE (BLACK SQUARE)
      UPWARDS WHITE ARROW FROM BAR (UPWARDS ARROW FROM BAR)
      UPWARDS WHITE ARROW (UPWARDS ARROW)
      WHITE BULLET (BULLET)
      WHITE CHESS BISHOP (BLACK CHESS BISHOP)
      WHITE CHESS KING (BLACK CHESS KING)
      WHITE CHESS KNIGHT (BLACK CHESS KNIGHT)
      WHITE CHESS PAWN (BLACK CHESS PAWN)
      WHITE CHESS QUEEN (BLACK CHESS QUEEN)
      WHITE CHESS ROOK (BLACK CHESS ROOK)
      WHITE CIRCLE (BLACK CIRCLE)
      WHITE CLUB SUIT (BLACK CLUB SUIT)
      WHITE DIAMOND SUIT (BLACK DIAMOND SUIT)
      WHITE DIAMOND (BLACK DIAMOND)
      WHITE DOWN-POINTING SMALL TRIANGLE (BLACK DOWN-POINTING SMALL
            TRIANGLE)
      WHITE DOWN-POINTING TRIANGLE (BLACK DOWN-POINTING TRIANGLE)
      WHITE FLORETTE (BLACK FLORETTE)
      WHITE FOUR POINTED STAR (BLACK FOUR POINTED STAR)
      WHITE HEART SUIT (BLACK HEART SUIT)
      WHITE LEFT POINTING INDEX (BLACK LEFT POINTING INDEX)
      WHITE LEFT-POINTING POINTER (BLACK LEFT-POINTING POINTER)
      WHITE LEFT-POINTING SMALL TRIANGLE (BLACK LEFT-POINTING SMALL
            TRIANGLE)
      WHITE LEFT-POINTING TRIANGLE (BLACK LEFT-POINTING TRIANGLE)
      WHITE NIB (BLACK NIB)
      WHITE PARALLELOGRAM (BLACK PARALLELOGRAM)
      WHITE RECTANGLE (BLACK RECTANGLE)
      WHITE RIGHT POINTING INDEX (BLACK RIGHT POINTING INDEX)
      WHITE RIGHT-POINTING POINTER (BLACK RIGHT-POINTING POINTER)
      WHITE RIGHT-POINTING SMALL TRIANGLE (BLACK RIGHT-POINTING SMALL
            TRIANGLE)
      WHITE RIGHT-POINTING TRIANGLE (BLACK RIGHT-POINTING TRIANGLE)
      WHITE SCISSORS (BLACK SCISSORS)
      WHITE SMALL SQUARE (BLACK SMALL SQUARE)
      WHITE SMILING FACE (BLACK SMILING FACE)
      WHITE SPADE SUIT (BLACK SPADE SUIT)
      WHITE SQUARE (BLACK SQUARE)
      WHITE STAR (BLACK STAR)
      WHITE SUN WITH RAYS (BLACK SUN WITH RAYS)
      WHITE TELEPHONE (BLACK TELEPHONE)
      WHITE UP-POINTING SMALL TRIANGLE (BLACK UP-POINTING SMALL
            TRIANGLE)
      WHITE UP-POINTING TRIANGLE (BLACK UP-POINTING TRIANGLE)
      WHITE VERTICAL RECTANGLE (BLACK VERTICAL RECTANGLE)

but not BACK-TILTED SHADOWED WHITE RIGHTWARDS ARROW, FRONT-TILTED
SHADOWED WHITE RIGHTWARDS ARROW, NOTCHED LOWER RIGHT-SHADOWED WHITE
RIGHTWARDS ARROW, NOTCHED UPPER RIGHT-SHADOWED WHITE RIGHTWARDS ARROW,
WHITE DOWN POINTING INDEX or WHITE UP POINTING INDEX, because there are
no black forms; nor WHITE FROWNING FACE, as there is no BLACK FROWNING
FACE either.

   This could be done algorithmically, but it requires clever image
processing capability: software could do something like surrounding the
character with a thin black line, and then inverting the interior of the
region so delineated. It might be sufficient to convey the concept just
to exchange black and white in a character cell, though this wouldn't
work if an attempt was made to use extended runs of white text.
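
   A crude version of the first idea can be sketched on a toy bitmap
('#' for ink, '.' for paper, an assumption for illustration). Instead of
drawing a new line around the character, this simplification keeps the
outermost ink pixels as the outline and clears everything inside them.
The name white_glyph is hypothetical.

```python
def white_glyph(rows):
    """Keep only the boundary ink of a bitmap glyph, clearing its interior."""
    height, width = len(rows), len(rows[0])

    def is_ink(x, y):
        # Anything outside the cell counts as paper.
        return 0 <= x < width and 0 <= y < height and rows[y][x] == "#"

    out = []
    for y in range(height):
        line = ""
        for x in range(width):
            # An ink pixel is on the boundary if any of its 4 neighbours
            # (right, left, below, above) is paper.
            on_boundary = is_ink(x, y) and not all(
                is_ink(x + dx, y + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            )
            line += "#" if on_boundary else "."
        out.append(line)
    return out

BLACK_SQUARE = ["#####"] * 5

for row in white_glyph(BLACK_SQUARE):
    print(row)
```

   Thin strokes have no interior at this resolution and come out
unchanged, which fits the remark that white variants exist mainly for
characters with large solid regions.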

   It could be argued that there is no need for a "double-struck"
presentation suggestion, because the double-struck characters are just
white versions of heavy ones: "PRESENTATION SUGGESTION DOUBLE-STRUCK =
PRESENTATION SUGGESTION HEAVY + PRESENTATION SUGGESTION WHITE"? I am not
going to argue that here though.

   If used but not interpreted, unlikely to result in misinterpretation:
a black symbol is likely to be a good stand-in for a white one.

New uses for existing characters
=== ==== === ======== ==========

COMBINING ENCLOSING CIRCLE
--------- --------- ------

   Everything marked as <circle> (there are 197 of these!) should be
modified to a canonical decomposition involving COMBINING ENCLOSING
CIRCLE. Also, we have

      CIRCLED ASTERISK OPERATOR (ASTERISK OPERATOR)
      CIRCLED DASH (EN DASH)
      CIRCLED DIVISION SLASH (DIVISION SLASH)
      CIRCLED DOT OPERATOR (DOT OPERATOR)
      CIRCLED EQUALS (EQUALS SIGN)
      CIRCLED MINUS (MINUS SIGN)
      CIRCLED PLUS (PLUS SIGN)
      CIRCLED RING OPERATOR (RING OPERATOR)
      CIRCLED TIMES (MULTIPLICATION SIGN)
      COPYRIGHT SIGN (LATIN CAPITAL LETTER C)
      REGISTERED SIGN (LATIN CAPITAL LETTER R)
      SOUND RECORDING COPYRIGHT (LATIN CAPITAL LETTER P)
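
   For anyone who wants to inspect the <circle> decompositions, Python's
standard unicodedata module exposes the decomposition field. This is
only a sketch: the helper name tagged is mine, and the count it prints
depends on the Unicode version bundled with the interpreter, so it need
not match the figure quoted above.

```python
import sys
import unicodedata

def tagged(tag):
    """Yield (character, base text) for decompositions carrying `tag`."""
    for cp in range(sys.maxunicode + 1):
        fields = unicodedata.decomposition(chr(cp)).split()
        if fields and fields[0] == tag:
            base = "".join(chr(int(f, 16)) for f in fields[1:])
            yield chr(cp), base

circled = dict(tagged("<circle>"))
print(len(circled))                 # varies with the Unicode version
print(circled["\u2460"])            # base of CIRCLED DIGIT ONE

# COPYRIGHT SIGN carries no decomposition at all in the data.
print(repr(unicodedata.decomposition("\u00a9")))
```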

COMBINING ENCLOSING SQUARE
--------- --------- ------

   Just used for 4 characters, unless also used as the A P L quad
character.

      SQUARED DOT OPERATOR (DOT OPERATOR)
      SQUARED MINUS (MINUS SIGN)
      SQUARED PLUS (PLUS SIGN)
      SQUARED TIMES (MULTIPLICATION SIGN)

   Characters marked as <square> are not enclosed in a square, they are
just rendered as is. I suppose that if <square> was replaced by <compat>
throughout, or just deleted (thereby making the composition canonical),
no-one would notice. This would add 194 canonical decompositions.

COMBINING PALATALIZED HOOK BELOW
--------- ----------- ---- -----

   The following decomposition is missing. I imagine this is an error.

      LATIN SMALL LETTER T WITH PALATAL HOOK = LATIN SMALL LETTER T +
            COMBINING PALATALIZED HOOK BELOW

The absence of the following may also be an error---I don't know enough
to be sure.

      LATIN CAPITAL LETTER N WITH LEFT HOOK = LATIN CAPITAL LETTER N +
            COMBINING PALATALIZED HOOK BELOW
      LATIN SMALL LETTER N WITH LEFT HOOK = LATIN SMALL LETTER N +
            COMBINING PALATALIZED HOOK BELOW

COMBINING RETROFLEX HOOK
--------- --------- ----

   Some decompositions involving this character are also missing:

      LATIN CAPITAL LETTER T WITH RETROFLEX HOOK (LATIN CAPITAL LETTER
            T)
      LATIN SMALL LETTER D WITH TAIL (LATIN SMALL LETTER D)
      LATIN SMALL LETTER EZH WITH TAIL (LATIN SMALL LETTER EZH)
      LATIN SMALL LETTER L WITH RETROFLEX HOOK (LATIN SMALL LETTER L)
      LATIN SMALL LETTER N WITH RETROFLEX HOOK (LATIN SMALL LETTER N)
      LATIN SMALL LETTER R WITH TAIL (LATIN SMALL LETTER R)
      LATIN SMALL LETTER T WITH RETROFLEX HOOK (LATIN SMALL LETTER T)
      LATIN SMALL LETTER Z WITH RETROFLEX HOOK (LATIN SMALL LETTER Z)

   (The forms "with tail" are speculation on my part, but the visual
appearances match. This seems to be enough for combining marks, as in
the case of umlaut vs diaeresis.)

COMBINING RING OVERLAY
--------- ---- -------

   This should be used to decompose

      CONTOUR INTEGRAL (INTEGRAL)
      SURFACE INTEGRAL (DOUBLE INTEGRAL)
      VOLUME INTEGRAL (TRIPLE INTEGRAL)

Also related are

      ANTICLOCKWISE CONTOUR INTEGRAL = INTEGRAL + COMBINING
            ANTICLOCKWISE RING OVERLAY
      CLOCKWISE CONTOUR INTEGRAL = INTEGRAL + COMBINING CLOCKWISE RING
            OVERLAY
      CLOCKWISE INTEGRAL = INTEGRAL + COMBINING CLOCKWISE ARROW ABOVE

COMBINING SHORT SOLIDUS OVERLAY
--------- ----- ------- -------

   Many of the characters described as "with stroke" could be provided
with decompositions using this character. The list is

      LATIN CAPITAL LETTER L WITH STROKE (LATIN CAPITAL LETTER L)
      LATIN CAPITAL LETTER LAMBDA WITH STROKE (LATIN CAPITAL LETTER
            LAMBDA)
      LATIN CAPITAL LETTER O WITH STROKE (LATIN CAPITAL LETTER O)
      LATIN CAPITAL LETTER O WITH STROKE AND ACUTE (LATIN CAPITAL
            LETTER O WITH ACUTE)
      LATIN SMALL LETTER L WITH STROKE (LATIN SMALL LETTER L)
      LATIN SMALL LETTER LAMBDA WITH STROKE (LATIN SMALL LETTER LAMBDA)
      LATIN SMALL LETTER O WITH STROKE (LATIN SMALL LETTER O)
      LATIN SMALL LETTER O WITH STROKE AND ACUTE (LATIN SMALL LETTER O
            WITH ACUTE)
      PLANCK CONSTANT OVER TWO PI (PLANCK CONSTANT)

   I suppose making this suggestion would result in howls of outrage
from people whose alphabets contain these characters, as, e g, "O WITH
STROKE" is a letter in its own right, not a composed character, in these
alphabets. There are 3 points in favour of making it a composite
character though:

         (a) historically, this is how it came about;

         (b) the situation is no worse than for alphabets that contain
      other similar characters, e g, LATIN SMALL LETTER A WITH RING
      ABOVE, that are already decomposed;

         (c) having the decomposition may help legibility, as it would be
      better to see an 'o' than a '?' if the glyph is not available.
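
   Points (b) and (c) can be illustrated with Python's standard
unicodedata module. The function legible_fallback is a hypothetical
name, and treating "o" followed by COMBINING SHORT SOLIDUS OVERLAY as
the decomposition of O WITH STROKE is this proposal's suggestion, not
something the current data provides.

```python
import unicodedata

def legible_fallback(text):
    """Drop combining marks so that at least the base letters remain."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)

# (b) A WITH RING ABOVE is already decomposed in the existing data ...
print(unicodedata.decomposition("\u00e5"))        # base letter plus ring
# ... while O WITH STROKE has no decomposition.
print(repr(unicodedata.decomposition("\u00f8")))

# (c) With the proposed decomposition, a renderer lacking the glyph
# could still show the base letter.
proposed = "o\u0337"  # LATIN SMALL LETTER O + COMBINING SHORT SOLIDUS OVERLAY
print(legible_fallback(proposed))
```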

   The few mathematical characters which can be composed from this
character should be added: examples include

      LESS-THAN BUT NOT EQUAL TO = LESS-THAN SIGN + PRESENTATION
            SUGGESTION STACK DOWN + EQUALS SIGN + COMBINING LONG
            SOLIDUS OVERLAY

   Most other characters "with stroke" are encoded with COMBINING SHORT
STROKE OVERLAY.

COMBINING SHORT STROKE OVERLAY
--------- ----- ------ -------

   The situation here is similar to the one for COMBINING SHORT SOLIDUS
OVERLAY: a lot of characters described as "with stroke", "with bar",
"barred", "bar" or "with middle tilde" could be provided with
decompositions using this character. The list includes:

      CYRILLIC CAPITAL LETTER BARRED O (CYRILLIC CAPITAL LETTER O)
      CYRILLIC CAPITAL LETTER BARRED O WITH DIAERESIS (CYRILLIC CAPITAL
            LETTER O WITH DIAERESIS)
      CYRILLIC CAPITAL LETTER GHE WITH STROKE (CYRILLIC CAPITAL LETTER
            GHE)
      CYRILLIC CAPITAL LETTER STRAIGHT U WITH STROKE (CYRILLIC CAPITAL
            LETTER STRAIGHT U)
      CYRILLIC SMALL LETTER BARRED O (CYRILLIC SMALL LETTER O)
      CYRILLIC SMALL LETTER BARRED O WITH DIAERESIS (CYRILLIC SMALL
            LETTER O WITH DIAERESIS)
      CYRILLIC SMALL LETTER GHE WITH STROKE (CYRILLIC SMALL LETTER GHE)
      CYRILLIC SMALL LETTER STRAIGHT U WITH STROKE (CYRILLIC SMALL
            LETTER STRAIGHT U)
      LATIN CAPITAL LETTER D WITH STROKE (LATIN CAPITAL LETTER D)
      LATIN CAPITAL LETTER G WITH STROKE (LATIN CAPITAL LETTER G)
      LATIN CAPITAL LETTER H WITH STROKE (LATIN CAPITAL LETTER H)
      LATIN CAPITAL LETTER I WITH STROKE (LATIN CAPITAL LETTER I)
      LATIN CAPITAL LETTER O WITH MIDDLE TILDE (LATIN CAPITAL LETTER O)
      LATIN CAPITAL LETTER T WITH STROKE (LATIN CAPITAL LETTER T)
      LATIN CAPITAL LETTER Z WITH STROKE (LATIN CAPITAL LETTER Z)
      LATIN LETTER GLOTTAL STOP WITH STROKE (LATIN LETTER GLOTTAL STOP)
      LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE (LATIN LETTER
            INVERTED GLOTTAL STOP)
      LATIN LETTER REVERSED GLOTTAL STOP WITH STROKE (LATIN LETTER
            REVERSED GLOTTAL STOP)
      LATIN LETTER TWO WITH STROKE (LATIN DIGIT TWO)
      LATIN SMALL LETTER B WITH STROKE (LATIN SMALL LETTER B)
      LATIN SMALL LETTER BARRED O (LATIN SMALL LETTER O)
      LATIN SMALL LETTER D WITH STROKE (LATIN SMALL LETTER D)
      LATIN SMALL LETTER G WITH STROKE (LATIN SMALL LETTER G)
      LATIN SMALL LETTER H WITH STROKE (LATIN SMALL LETTER H)
      LATIN SMALL LETTER I WITH STROKE (LATIN SMALL LETTER I)
      LATIN SMALL LETTER L WITH BAR (LATIN SMALL LETTER L)
      LATIN SMALL LETTER T WITH STROKE (LATIN SMALL LETTER T)
      LATIN SMALL LETTER U BAR (LATIN SMALL LETTER U)
      LATIN SMALL LETTER Z WITH STROKE (LATIN SMALL LETTER Z)

   (The dotless j with stroke is really a turned f.)

   Although some algorithmic sophistication would be required to get the
bar in exactly the right place, in practice it might be well enough to
just go ahead and overprint it, with maybe a few special cases for
instances where it is in a very unusual position (e g, LATIN SMALL
LETTER G WITH STROKE).

COMBINING SHORT VERTICAL LINE OVERLAY
--------- ----- -------- ---- -------

   There are 4 characters "with vertical stroke" which could be composed
from this character:

      CYRILLIC CAPITAL LETTER KA WITH VERTICAL STROKE (CYRILLIC
            CAPITAL LETTER KA)
      CYRILLIC SMALL LETTER KA WITH VERTICAL STROKE (CYRILLIC SMALL
            LETTER KA)
      CYRILLIC CAPITAL LETTER CHE WITH VERTICAL STROKE (CYRILLIC
            CAPITAL LETTER CHE)
      CYRILLIC SMALL LETTER CHE WITH VERTICAL STROKE (CYRILLIC SMALL
            LETTER CHE)
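
   As a rough illustration, the following Python sketch builds the
proposed sequence for one of these characters and checks what the
character database shipped with Python says today. The helper name
proposed_sequence is made up for this example, and the overlay is
assumed to be U+20D3; nothing here is a statement of how the standard
actually treats these letters beyond what the two library calls report.

```python
import unicodedata

# Assumption: the overlay meant here is U+20D3
# COMBINING SHORT VERTICAL LINE OVERLAY.
OVERLAY = "\u20d3"

def proposed_sequence(precomposed: str) -> str:
    """Return base letter + overlay for a '... WITH VERTICAL STROKE' character.

    Hypothetical helper: the base is found purely from the character name.
    """
    name = unicodedata.name(precomposed)
    suffix = " WITH VERTICAL STROKE"
    if not name.endswith(suffix):
        raise ValueError(f"{name} is not a 'with vertical stroke' character")
    base = unicodedata.lookup(name[: -len(suffix)])
    return base + OVERLAY

ka = "\u049c"  # CYRILLIC CAPITAL LETTER KA WITH VERTICAL STROKE
seq = proposed_sequence(ka)

# The database records no decomposition for this letter, so
# normalization does not relate the two spellings.
print(unicodedata.decomposition(ka) == "")
print(unicodedata.normalize("NFC", seq) == ka)
```

   The two spellings stay distinct under normalization, which is the
gap that a decomposition of the kind listed above would close.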

COMBINING VERTICAL LINE BELOW
--------- -------- ---- -----

   This seems to be another case where some decompositions have been
accidentally omitted. The candidates are:

      LATIN SMALL LETTER N WITH LONG RIGHT LEG (LATIN SMALL LETTER N)
      LATIN SMALL LETTER TURNED M WITH LONG LEG (LATIN SMALL LETTER
            TURNED M)
      LATIN SMALL LETTER TURNED R WITH LONG LEG (LATIN SMALL LETTER
            TURNED R)
      LATIN SMALL LETTER R WITH LONG LEG (LATIN SMALL LETTER R)

FRACTION SLASH
-------- -----

   This is an existing character, but we give it more precise semantics
by specifying that it lies between 1 character or group on the left, and
1 on the right. In other words, it is "binary", just like PRESENTATION
SUGGESTIONS LIGATURE, COMPOSE, STACK UP and STACK DOWN. It is used in
all the decompositions marked with <fraction> (there are 16 of these),
and the following:

      ACCOUNT OF (LATIN SMALL LETTER A, LATIN SMALL LETTER C)
      ADDRESSED TO THE SUBJECT (LATIN SMALL LETTER A, LATIN SMALL
            LETTER S)
      ARABIC PERCENT SIGN (DOT OPERATOR twice)
      CADA UNA (LATIN SMALL LETTER C, LATIN SMALL LETTER U)
      CARE OF (LATIN SMALL LETTER C, LATIN SMALL LETTER O)
      PER MILLE SIGN (LATIN DIGIT ZERO and a group of 2 of the same)
      PER TEN THOUSAND SIGN (LATIN DIGIT ZERO and a group of 3 more)
      PERCENT SIGN (LATIN DIGIT ZERO, LATIN DIGIT ZERO)

   A sophisticated rendering agent is explicitly allowed to stack the
top and bottom of a fraction over each other (maybe varying their size
as well), and use a horizontal rule to represent the division. This is
because a decomposition like

      VULGAR FRACTION ONE QUARTER = LATIN DIGIT ONE + FRACTION SLASH +
            LATIN DIGIT FOUR

is canonical, so it is permitted (but not required) to use a special
glyph, such as would be present in a Latin-1 font.
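
   To make the shape of such a decomposition concrete, here is a small
Python sketch that reads the <fraction> entry from the character
database bundled with Python and splits it at the FRACTION SLASH. The
function name fraction_parts is invented for this example. Note that
the database tags these entries with <fraction>; treating them as
canonical is the proposal of this note, not something the code
demonstrates.

```python
import unicodedata

FRACTION_SLASH = "\u2044"

def fraction_parts(ch: str):
    """Split a <fraction>-tagged decomposition into (numerator, denominator).

    Returns None when the character has no <fraction> decomposition.
    """
    fields = unicodedata.decomposition(ch).split()
    if not fields or fields[0] != "<fraction>":
        return None
    # Remaining fields are hexadecimal code points.
    text = "".join(chr(int(cp, 16)) for cp in fields[1:])
    numerator, _, denominator = text.partition(FRACTION_SLASH)
    return numerator, denominator

print(fraction_parts("\u00bc"))  # VULGAR FRACTION ONE QUARTER
```

   A rendering agent could use the two parts either to pick a special
glyph or to stack them over a horizontal rule, as described above.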

LEFT-TO-RIGHT OVERRIDE
------------- --------

   Some Hebrew characters are used in mathematical text. These have
obvious decompositions which should be encoded. Doing this will enable
mathematicians to use any other Hebrew characters as symbols (by using
the decomposition) without needing to get them encoded in the U C S
first.

      ALEF SYMBOL: LEFT-TO-RIGHT OVERRIDE, HEBREW LETTER ALEF, POP
           DIRECTIONAL FORMATTING
      BET SYMBOL: LEFT-TO-RIGHT OVERRIDE, HEBREW LETTER BET, POP
           DIRECTIONAL FORMATTING
      GIMEL SYMBOL: LEFT-TO-RIGHT OVERRIDE, HEBREW LETTER GIMEL, POP
           DIRECTIONAL FORMATTING
      DALET SYMBOL: LEFT-TO-RIGHT OVERRIDE, HEBREW LETTER DALET, POP
           DIRECTIONAL FORMATTING
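
   A minimal Python sketch of the proposed sequences follows. The
function hebrew_symbol is a name made up for this example; it simply
wraps a right-to-left letter in the two existing formatting characters.
The last line prints what the bundled character database currently
records for U+2135, for comparison. (The database spells the letter
name ALEF.)

```python
import unicodedata

LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def hebrew_symbol(letter: str) -> str:
    """Wrap a Hebrew letter so it is laid out left-to-right, as proposed.

    Hypothetical helper; it only checks that the letter is right-to-left.
    """
    if unicodedata.bidirectional(letter) != "R":
        raise ValueError("expected a right-to-left letter")
    return LRO + letter + PDF

alef = unicodedata.lookup("HEBREW LETTER ALEF")
seq = hebrew_symbol(alef)
print([unicodedata.name(c) for c in seq])

# Existing record for the symbol at U+2135, for comparison.
print(unicodedata.decomposition("\u2135"))
```

   The same helper works for any other Hebrew letter, which is the
point of the proposal: no new symbol needs to be encoded first.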

New combining characters
=== ========= ==========

COMBINING HOOK
--------- ----

   Many characters are described as "with hook" or "with middle hook",
but no combining form of this mark is encoded. This is probably because
the position of the hook moves around a lot depending on which character
is to receive it, and because there are a few different forms of hook, 3
of which are encoded separately and were considered above. The fact that
the hook moves around should be seen as a rendering problem, easily
solved by a repository of precomposed glyphs for the cases that are
actually used.

   If there were to be a COMBINING HOOK character, the characters that
use it would be

      CYRILLIC CAPITAL LETTER EN WITH HOOK (CYRILLIC CAPITAL LETTER EN)
      CYRILLIC CAPITAL LETTER GHE WITH MIDDLE HOOK (CYRILLIC CAPITAL
            LETTER GHE)
      CYRILLIC CAPITAL LETTER KA WITH HOOK (CYRILLIC CAPITAL LETTER KA)
      CYRILLIC CAPITAL LETTER PE WITH MIDDLE HOOK (CYRILLIC CAPITAL
            LETTER PE)
      CYRILLIC SMALL LETTER EN WITH HOOK (CYRILLIC SMALL LETTER EN)
      CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK (CYRILLIC SMALL LETTER
            GHE)
      CYRILLIC SMALL LETTER KA WITH HOOK (CYRILLIC SMALL LETTER KA)
      CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK (CYRILLIC SMALL LETTER
            PE)
      LATIN CAPITAL LETTER B WITH HOOK (LATIN CAPITAL LETTER B)
      LATIN CAPITAL LETTER C WITH HOOK (LATIN CAPITAL LETTER C)
      LATIN CAPITAL LETTER D WITH HOOK (LATIN CAPITAL LETTER D)
      LATIN CAPITAL LETTER F WITH HOOK (LATIN CAPITAL LETTER F)
      LATIN CAPITAL LETTER G WITH HOOK (LATIN CAPITAL LETTER G)
      LATIN CAPITAL LETTER K WITH HOOK (LATIN CAPITAL LETTER K)
      LATIN CAPITAL LETTER P WITH HOOK (LATIN CAPITAL LETTER P)
      LATIN CAPITAL LETTER T WITH HOOK (LATIN CAPITAL LETTER T)
      LATIN CAPITAL LETTER Y WITH HOOK (LATIN CAPITAL LETTER Y)
      LATIN LETTER SMALL CAPITAL G WITH HOOK (LATIN LETTER SMALL
            CAPITAL G)
      LATIN SMALL LETTER B WITH HOOK (LATIN SMALL LETTER B)
      LATIN SMALL LETTER C WITH HOOK (LATIN SMALL LETTER C)
      LATIN SMALL LETTER D WITH HOOK (LATIN SMALL LETTER D)
      LATIN SMALL LETTER DOTLESS J WITH STROKE AND HOOK (LATIN SMALL
            LETTER F)
      LATIN SMALL LETTER F WITH HOOK (LATIN SMALL LETTER F)
      LATIN SMALL LETTER G WITH HOOK (LATIN SMALL LETTER G)
      LATIN SMALL LETTER H WITH HOOK (LATIN SMALL LETTER H)
      LATIN SMALL LETTER K WITH HOOK (LATIN SMALL LETTER K)
      LATIN SMALL LETTER M WITH HOOK (LATIN SMALL LETTER M)
      LATIN SMALL LETTER P WITH HOOK (LATIN SMALL LETTER P)
      LATIN SMALL LETTER Q WITH HOOK (LATIN SMALL LETTER Q)
      LATIN SMALL LETTER REVERSED OPEN E WITH HOOK (LATIN SMALL LETTER
            REVERSED OPEN E)
      LATIN SMALL LETTER S WITH HOOK (LATIN SMALL LETTER S)
      LATIN SMALL LETTER T WITH HOOK (LATIN SMALL LETTER T)
      LATIN SMALL LETTER TURNED R WITH HOOK (LATIN SMALL LETTER TURNED
            R)
      LATIN SMALL LETTER Y WITH HOOK (LATIN SMALL LETTER Y)

   It's odd that although there is a LATIN SMALL LETTER HENG WITH HOOK,
there is no LATIN SMALL LETTER HENG. It should be represented as a
ligature of h and eng, and that gives us

      LATIN SMALL LETTER HENG WITH HOOK = START GROUP + LATIN SMALL
            LETTER H + PRESENTATION SUGGESTION LIGATURE + LATIN SMALL
            LETTER ENG + END GROUP + COMBINING HOOK.

   GREEK CAPITAL LETTER UPSILON WITH HOOK is not here---it is really a
GREEK CAPITAL LETTER UPSILON + PRESENTATION SUGGESTION VARIANT.

   LATIN CAPITAL/SMALL LETTER V WITH HOOK are not here---they are really
LATIN CAPITAL/SMALL LETTER V + PRESENTATION SUGGESTION SCRIPT.

   SMALL LETTER SCHWA WITH HOOK and SMALL LETTER REVERSED OPEN E WITH
HOOK are not here---they are really ligatures with MODIFIER LETTER
RHOTIC HOOK.

COMBINING CURL
--------- ----

   The case of characters "with curl" is similar to those "with hook",
in that the curl moves around depending on the character being modified.
But if there were a combining curl, it would be used for 12 characters,
if we also follow the principle of Occam's Razor and include
crossed-tail, belted, looped and closed characters in this set, as is
justified by their visual appearance.

      LATIN LETTER REVERSED ESH LOOP (LATIN LETTER REVERSED ESH)
      LATIN SMALL LETTER C WITH CURL (LATIN SMALL LETTER C)
      LATIN SMALL LETTER CLOSED OMEGA (LATIN SMALL LETTER OMEGA)
      LATIN SMALL LETTER CLOSED OPEN E (LATIN SMALL LETTER OPEN E)
      LATIN SMALL LETTER CLOSED REVERSED OPEN E (LATIN SMALL LETTER
            REVERSED OPEN E)
      LATIN SMALL LETTER DZ DIGRAPH WITH CURL (LATIN SMALL LETTER DZ
            DIGRAPH)
      LATIN SMALL LETTER ESH WITH CURL (LATIN SMALL LETTER ESH)
      LATIN SMALL LETTER EZH WITH CURL (LATIN SMALL LETTER EZH)
      LATIN SMALL LETTER J WITH CROSSED-TAIL (LATIN SMALL LETTER J)
      LATIN SMALL LETTER L WITH BELT (LATIN SMALL LETTER L)
      LATIN SMALL LETTER TC DIGRAPH WITH CURL (LATIN SMALL LETTER TC
            DIGRAPH)
      LATIN SMALL LETTER Z WITH CURL (LATIN SMALL LETTER Z)

COMBINING DESCENDER
--------- ---------

   16 characters are described as "with descender". The visual
appearance of the descender is variable.

      CYRILLIC CAPITAL LETTER CHE WITH DESCENDER (CYRILLIC CAPITAL
            LETTER CHE)
      CYRILLIC CAPITAL LETTER EN WITH DESCENDER (CYRILLIC CAPITAL LETTER
            EN)
      CYRILLIC CAPITAL LETTER ES WITH DESCENDER (CYRILLIC CAPITAL LETTER
            ES)
      CYRILLIC CAPITAL LETTER HA WITH DESCENDER (CYRILLIC CAPITAL LETTER
            HA)
      CYRILLIC CAPITAL LETTER KA WITH DESCENDER (CYRILLIC CAPITAL LETTER
            KA)
      CYRILLIC CAPITAL LETTER TE WITH DESCENDER (CYRILLIC CAPITAL LETTER
            TE)
      CYRILLIC CAPITAL LETTER ZE WITH DESCENDER (CYRILLIC CAPITAL LETTER
            ZE)
      CYRILLIC CAPITAL LETTER ZHE WITH DESCENDER (CYRILLIC CAPITAL
            LETTER ZHE)
      CYRILLIC SMALL LETTER CHE WITH DESCENDER (CYRILLIC SMALL LETTER
            CHE)
      CYRILLIC SMALL LETTER EN WITH DESCENDER (CYRILLIC SMALL LETTER EN)
      CYRILLIC SMALL LETTER ES WITH DESCENDER (CYRILLIC SMALL LETTER ES)
      CYRILLIC SMALL LETTER HA WITH DESCENDER (CYRILLIC SMALL LETTER HA)
      CYRILLIC SMALL LETTER KA WITH DESCENDER (CYRILLIC SMALL LETTER KA)
      CYRILLIC SMALL LETTER TE WITH DESCENDER (CYRILLIC SMALL LETTER TE)
      CYRILLIC SMALL LETTER ZE WITH DESCENDER (CYRILLIC SMALL LETTER ZE)
      CYRILLIC SMALL LETTER ZHE WITH DESCENDER (CYRILLIC SMALL LETTER
            ZHE)
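
   Lists like the one above can be gathered mechanically from the
character names. The following Python sketch is one way to do it,
assuming the base character is whatever remains when the suffix is
stripped from the name; the function name candidates is invented for
this example. The character database bundled with a current Python is
newer than the one this note was written against, so it may well find
more than the 16 characters listed here.

```python
import sys
import unicodedata

def candidates(suffix: str):
    """Yield (precomposed, base) pairs for character names ending in `suffix`.

    Hypothetical helper: the base is looked up by stripping the suffix
    from the name, and characters with no such base are skipped.
    """
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")
        if not name.endswith(suffix):
            continue
        try:
            base = unicodedata.lookup(name[: -len(suffix)])
        except KeyError:
            continue  # no separately encoded base character
        yield ch, base

pairs = dict(candidates(" WITH DESCENDER"))
for ch, base in sorted(pairs.items()):
    print(unicodedata.name(ch), "->", unicodedata.name(base))
```

   Passing " WITH HOOK" or " WITH CURL" instead gives a first draft of
the earlier lists, though names alone cannot find cases such as the
closed, belted and looped letters, which were chosen by appearance.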

Summary
=======

   This note considers 3134 characters, of which 900 have canonical
decompositions already, and are not considered further. Of the 2234
characters left, over 1300 of them---well over half---are given new
canonical decompositions, some of which involve one or more of 34 new
characters, which are defined here. These characters are intended to be
productive parts of the U C S.

   I hope that some consideration can be given to these ideas. I even
hope that they might forestall the encoding of large numbers of copies
of the Latin alphabet into the U C S in the guise of mathematical
symbols and phonetic characters, etc, while restoring the freedom of
expression to these groups of people, and keeping the U C S down to a
small and productive core.

                                            Jonathan Coxhead, 6 Jul 1999


