L2/07-260

Date: 05 Aug 2007
From: Asmus Freytag
Subject: Non-blank spaces


This contribution is in response to document L2/07-258 "Middle Dots and
Don'ts" by Ken Whistler and document L2/07-231, "Preliminary proposal
to add the Samaritan alphabet to the BMP".

Background
----------

In L2/07-258, Ken has done one of his excellent jobs of pulling
together the background on the various middle dots in the standard. Some
such discussion should be added to the book, or to a technical note -
it's clearly needed to help orient implementers and users in the face of
apparently duplicate encodings that are nevertheless distinct.

For the most part, I have no issue with his findings, and I am in
general support of his underlying position, which is to keep the number
of middle dots with *generic* properties to a minimum.

Many scripts, particularly ancient scripts, are (or were) written
without the modern convention of blank spaces as primary word separators
and marks as punctuation. In some of these scripts, a middle dot is used
as a separator between words. In some, middle dots are used for other
(punctuation) purposes.

I firmly agree with the general sense of Ken's a priori position that
Unicode should avoid encoding a script-specific middle dot every time a
new script comes along that uses one as punctuation. Encoding any
additional middle dots should be avoided if any of the generic middle
dots can be utilized instead.

After a long argument that I won't repeat here, Ken arrives at three
candidates for generic middle dots:

10101 AEGEAN WORD SEPARATOR DOT <== bc=ON
   = neutral word separator middle dot, non-terminal
   gc=Po, ccc=0, bc=ON, lb=BA

1091F PHOENICIAN WORD SEPARATOR
   = neutral word separator middle dot, terminal
   gc=Po, ccc=0, bc=ON, lb=BA, Terminal_Punctuation=True

16EB RUNIC SINGLE PUNCTUATION
   = left-to-right word separator middle dot, terminal
   gc=Po, ccc=0, bc=L, lb=BA, Terminal_Punctuation=True

Here, I've restored the correct character names and shown the *current*
bidi class for 1091F, which is ON, not R.
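
Those properties in the list above that a standard library exposes can be
checked directly. The following is a minimal sketch, assuming Python's
`unicodedata` module, which reports gc, ccc and bc but not lb or
Terminal_Punctuation. The helper name `describe` is made up for this
illustration, and the values returned reflect whatever Unicode version
the Python build bundles.

```python
import unicodedata

# The three candidate generic middle dots discussed above.
CANDIDATES = ["\U00010101", "\U0001091F", "\u16EB"]

def describe(ch):
    """Return (code point, name, gc, ccc, bc) for a single character."""
    return (
        f"{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),       # General_Category
        unicodedata.combining(ch),      # Canonical_Combining_Class
        unicodedata.bidirectional(ch),  # Bidi_Class
    )

for ch in CANDIDATES:
    print(describe(ch))
```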


Analysis
--------

I disagree with several important details of his conclusion, for the
following reasons:

1a) I see no need, and a lot of harm, in trying to change the bidi class
of a character. A class of ON will work just fine, except possibly at
the boundaries of directional runs, and that case can be handled by
adding an RLM (right-to-left mark) where needed. That is not a terrible
burden: N'Ko punctuation is also ON, for example, even though N'Ko is RTL.

An ON character that is part of an RTL run will, of course, resolve to
RTL directionality itself.
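
The behavior described in 1a) can be illustrated with a much-simplified
sketch of how neutrals are resolved, loosely modeled on rules N1 and N2
of the Unicode Bidirectional Algorithm. It assumes a single embedding
level and text containing only characters of bidi class L, R and ON;
numbers, explicit embeddings and all other classes are ignored. The
function name `resolve_neutrals` is invented for this illustration.

```python
import unicodedata

ALF = "\U00010900"   # PHOENICIAN LETTER ALF, bc=R
DOT = "\U0001091F"   # PHOENICIAN WORD SEPARATOR, bc=ON
RLM = "\u200F"       # RIGHT-TO-LEFT MARK, bc=R

def resolve_neutrals(text, base="L"):
    """Resolve each run of ON characters to L or R.

    A run takes the direction of its neighbours when both sides agree;
    otherwise it takes the paragraph (base) direction. The start and
    end of the text count as having the base direction.
    """
    classes = [unicodedata.bidirectional(ch) for ch in text]
    resolved = list(classes)
    i, n = 0, len(classes)
    while i < n:
        if classes[i] != "ON":
            i += 1
            continue
        j = i
        while j < n and classes[j] == "ON":
            j += 1                      # j is one past the end of the ON run
        before = classes[i - 1] if i > 0 else base
        after = classes[j] if j < n else base
        direction = before if before == after else base
        for k in range(i, j):
            resolved[k] = direction
        i = j
    return resolved

print(resolve_neutrals(ALF + DOT + ALF))        # dot inside an RTL run
print(resolve_neutrals(ALF + DOT + "a"))        # dot at a run boundary
print(resolve_neutrals(ALF + DOT + RLM + "a"))  # boundary case with RLM added
```

Under these assumptions, the dot between two Phoenician letters resolves
to R. At the boundary with Latin text in an LTR paragraph it falls back
to the paragraph direction, and adding an RLM after it pulls it back
into the RTL run.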

1b) Not making this change would allow the generic use of the AEGEAN
word separator for non-sentence-terminal cases and the Phoenician one
for sentence-terminal cases; the Runic one would not be needed. So,
unlike Ken, I'd conclude that only two generic middle dot separators
exist so far, and that they are differentiated not by bidi class but by
their use: separating words, or separating other types of constructs.

1c) Everson noted recently that 1091F is itself a unification of a middle
dot and a vertical line, based on indistinguishable function in
Phoenician texts. If so, that would put the use of 1091F as a generic
_middle dot_ in question. However, if the bidi class is ON and the only
property difference is Terminal_Punctuation, that may not matter much.

2) The question then is whether 10101 AEGEAN WORD SEPARATOR DOT could
fill the role of the Samaritan middle dot proposed in document
L2/07-231. It is indeed clear from the evidence presented there that the
way that character is used can be described as a "non-blank space": it
occurs between all words and, interestingly, seems to be adjusted in
width during layout, much like a space character. It bears some
similarity in usage, albeit not in shape, to 1680 OGHAM SPACE MARK.

In other words: is U+10101 intended to be used in this manner as a
non-blank space? The properties for the OGHAM SPACE MARK are

1680 OGHAM SPACE MARK
   gc=Zs, ccc=0, bc=WS, lb=BA, White_Space=True, SB=Sp

and the relevant properties for the AEGEAN word separator are

10101 AEGEAN WORD SEPARATOR DOT
   gc=Po, ccc=0, bc=ON, lb=BA

These are significant differences in properties. A text processor that
has been generalized to recognize Unicode space characters other than
just the ASCII space will treat 1680 like a space character, but will
treat 10101 like a punctuation character, albeit one that allows line
breaks and is not part of a word.
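
The practical effect of that property difference can be seen in any
library that splits on Unicode white space. The sketch below assumes
Python, whose `str.split()` with no arguments and `str.isspace()` both
follow the Unicode notion of white space; the helper name `words` is
made up for this illustration and stands in for a generic space-aware
text processor.

```python
import unicodedata

OGHAM_SPACE = "\u1680"       # OGHAM SPACE MARK, gc=Zs
AEGEAN_DOT = "\U00010101"    # AEGEAN WORD SEPARATOR DOT, gc=Po

def words(text):
    """Split text on runs of Unicode white space."""
    return text.split()

for sep in (OGHAM_SPACE, AEGEAN_DOT):
    sample = "alpha" + sep + "beta"
    # Show the category, whether it counts as space, and the word count.
    print(unicodedata.category(sep), sep.isspace(), len(words(sample)))
```

With the OGHAM SPACE MARK the sample splits into two words; with the
AEGEAN dot it stays a single token, because the separator is classified
as punctuation rather than as a space.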

3) Document L2/07-231 states that Samaritan is in continued modern usage
(citing a newspaper). If that's the case, making sure that modern text
processing software does the correct thing out of the box is an
important factor to consider - independent of whether the *layout*
software does the same thing.

4) On the layout side, stretching the character during justification
needs special support - because of that, the layout engine might as well
have a Samaritan mode, in which case it's not necessary to have a
distinct character to carry the stretchiness property. This argument
should be reviewed and explicitly endorsed by manufacturers of layout
engines, possibly as part of a Public Review Item (PRI).

More about layout engines in the appendix.

Conclusion
----------

In conclusion, I would urge the UTC and the authors of documents
L2/07-231 and L2/07-258 to review whether supporting (modern) Samaritan
needs the use of a character that is explicitly White_Space in its design.

If the answer is yes, then the thing to do would be to code a third
_generic_ middle dot, but one with properties matching those of the
OGHAM SPACE MARK. A script-specific encoding is not favored.

If, on the other hand, the AEGEAN dot was intended to fully function as
a non-blank space mark, then it is questionable whether it should remain
classified as Po, with all that entails. Retaining it as is would
seem to make it ill-suited for modern text usage, but wouldn't impact its
use in scholarly publications.

-------------------------------------------------------------------

Appendix: Why Samaritan will need a special layout engine anyway

It's perhaps surprising to some, but no matter *how* the Samaritan middle
dot is encoded, it *will* need a specialized layout engine to support it.

Stretchable spaces are supported by layout engines, but often only
U+0020 SPACE is actually adjusted. SPACE of course does not have a glyph,
so fonts are not necessarily involved in layout other than providing a
suggested width of the space character in their metrics. The layout
engine can just adjust the offsets of the starting positions of the
words to achieve the effect of stretching or compressing the spaces.
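
The offset adjustment described above can be sketched as follows. This
is only an illustration under simple assumptions: one line, word widths
already measured in arbitrary units, and the extra width shared equally
between all gaps. The function name `justify_word_offsets` is invented
here and does not correspond to any particular layout engine.

```python
def justify_word_offsets(word_widths, line_width):
    """Return the starting offset of each word on a fully justified line.

    No space glyph is drawn: the gaps between words simply grow or
    shrink so that the last word ends exactly at line_width.
    """
    gaps = len(word_widths) - 1
    if gaps < 1:
        # A single word (or an empty line) has nothing to stretch.
        return [0.0] * len(word_widths)
    gap = (line_width - sum(word_widths)) / gaps
    offsets, x = [], 0.0
    for width in word_widths:
        offsets.append(x)
        x += width + gap    # advance past the word and one stretched gap
    return offsets

print(justify_word_offsets([30, 50, 20], 140))
```

The offsets are measured along the line from its starting edge, so the
same arithmetic applies whether the line runs left-to-right or
right-to-left.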

This technique is obviously not going to work if the layout engine needs
to provide a "dot" in the middle between the words. It now has to
calculate the middle position and place a glyph there. That logic doesn't
magically appear; it needs to be added explicitly. The font is not going
to help, since it can't provide the width calculation.
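
The extra step can be sketched in the same style: after the gaps have
been stretched, the engine must also compute where the dot glyph goes in
each gap. This is a minimal sketch under the same simplifying
assumptions (one line, measured widths, equal gaps); the function name
`place_separator_dots` and its parameters are hypothetical.

```python
def place_separator_dots(word_widths, line_width, dot_width):
    """Return (word offsets, dot offsets) for a fully justified line.

    Each dot glyph is centered in the stretched gap that follows a word.
    """
    gaps = len(word_widths) - 1
    if gaps < 1:
        return [0.0] * len(word_widths), []
    gap = (line_width - sum(word_widths)) / gaps
    if gap < dot_width:
        raise ValueError("line too tight to fit the separator glyph")
    word_x, dot_x, x = [], [], 0.0
    for i, width in enumerate(word_widths):
        word_x.append(x)
        if i < gaps:
            gap_start = x + width
            # Center the glyph: equal padding on both sides of the dot.
            dot_x.append(gap_start + (gap - dot_width) / 2)
        x += width + gap
    return word_x, dot_x

print(place_separator_dots([30, 50, 20], 140, 4))
```

Nothing in the font's metrics supplies the `gap` value; it only exists
once the line has been justified, which is why the placement logic has
to live in the layout engine.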

Therefore, because the need for a special layout engine cannot be avoided,
that need doesn't affect the decision on coding characters.

Well, not completely: should there be more than one script with the *same*
stretchy non-blank space rendered as a dot, then a generic character code
for that would allow a generic rendering engine extension to be keyed off
the character code, instead of the script context - that's useful if
several scripts with that feature are expected to come online at different
times, but _only_ if none of them need other script specializations, so
that allowing an existing engine to key off the generic character would
support the new script(s) out of the box.

However, as long as it's only Samaritan, keying off the script seems fine.
A separate character code would be required only to support the _text
processing_ properties, which appear to be very different between a "non-blank
space" and a punctuation character.