L2/07-260

Date: 05 Aug 2007
From: Asmus Freytag
Subject: Non-blank spaces

This contribution is in response to document L2/07-258, "Middle Dots and Don'ts" by Ken Whistler, and document L2/07-231, "Preliminary proposal to add the Samaritan alphabet to the BMP".

Background
----------

In L2/07-258, Ken has done one of his excellent jobs in pulling together the background on the various middle dots in the standard. Some such discussion should be added to the book, or to a technical note - it's clearly needed to help orient implementers and users in the face of apparently duplicate encodings that are nevertheless distinct.

For the most part, I have no issue with his findings, and I am in general support of his underlying position, which is to keep the number of middle dots with *generic* properties to a minimum.

Many scripts, particularly ancient scripts, are (or were) written without the modern convention of blank spaces as primary word separators and marks as punctuation. In some of these scripts, a middle dot is used as a separator between words. In some, middle dots are used for other (punctuation) purposes.

I firmly agree with the general sense of Ken's a priori position that Unicode should avoid encoding a script-specific middle dot every time a new script comes along that uses one as punctuation. Encoding any additional middle dots should be avoided if any of the generic middle dots can be utilized instead.

After a long argument that I won't repeat here, Ken arrives at three candidates for generic middle dots.

10101 AEGEAN WORD SEPARATOR DOT    <== bc=ON
   = neutral word separator middle dot, non-terminal
   gc=Po, ccc=0, bc=ON, lb=BA

1091F PHOENICIAN WORD SEPARATOR
   = neutral word separator middle dot, terminal
   gc=Po, ccc=0, bc=ON, lb=BA, Terminal_Punctuation=True

16EB RUNIC SINGLE PUNCTUATION
   = left-to-right word separator middle dot, terminal
   gc=Po, ccc=0, bc=L, lb=BA, Terminal_Punctuation=True

Here, I've restored the correct character names and shown the *current* bidi class for 1091F, which is ON, not R.

Analysis
--------

I disagree with several important details of his conclusion, for the following reasons:

1a) I see no need, and a lot of harm, in trying to change the bidi class of a character. A class of ON will work just fine, except possibly at the boundary of directional runs, but that can be handled by adding an RLM (right-to-left mark) where needed - not a terrible burden (N'Ko punctuation is also ON, for example, even though N'Ko is RTL). An ON character that's part of an RTL run will, of course, resolve to RTL directionality itself.

1b) Not making this change would allow the generic use of the Aegean word separator for non-sentence-terminal cases and the Phoenician one for sentence-terminal cases; the Runic one would not be needed. So, unlike Ken, I'd conclude that only two generic middle dot separators exist so far, and that their differentiation is not by bidi class, but by differences in their use - separating words, or separating other types of constructs.

1c) Everson noted recently that 1091F is itself a unification of a middle dot and a vertical line, based on indistinguishable function in Phoenician texts. If so, that would put the use of 1091F as a generic _middle dot_ in question. However, if the bidi class is ON and the only property difference is Terminal_Punctuation, that may not matter much.

2) The question then is whether 10101 AEGEAN WORD SEPARATOR DOT could fill the role of the Samaritan middle dot proposed in document L2/07-231. It is indeed clear from the evidence presented there that the way that character is used can be described as "non-blank space": it occurs between all words, and, interestingly, seems to be adjusted in width during layout, much like a space character.
It bears some similarity in usage, albeit not in shape, to 1680 OGHAM SPACE MARK. In other words: is U+10101 intended to be used in this manner as a non-blank space?

The properties for the OGHAM SPACE MARK are

1680 OGHAM SPACE MARK
   gc=Zs; ccc=0; bc=WS; lb=BA; White_Space=true; SB=Sp

and the relevant properties for the Aegean word separator are

10101 AEGEAN WORD SEPARATOR DOT
   gc=Po, ccc=0, bc=ON, lb=BA

These are significant differences in properties. A text processor that's generalized to recognize Unicode space characters other than just the ASCII space will treat 1680 like a space character, but would treat 10101 like a punctuation character, albeit one that allows line breaks and is not part of a word.

3) Document L2/07-231 states that Samaritan is in continued modern usage (citing a newspaper). If that's the case, making sure that modern text processing software does the correct thing out of the box is an important factor to consider - independent of whether the *layout* software does the same thing.

4) On the layout side, stretching the character during justification needs special support - because of that, the layout engine might as well have a Samaritan mode, in which case it's not necessary to have a distinct character to carry the stretchiness property. This argument should be reviewed and explicitly endorsed by manufacturers of layout engines, possibly as part of a Public Review Item (PRI). More about layout engines in the appendix.

Conclusion
==========

In conclusion, I would urge the UTC and the authors of documents L2/07-231 and L2/07-258 to review whether supporting (modern) Samaritan needs the use of a character that is explicitly White_Space in its design. If the answer is yes, then the thing to do would be to code a third _generic_ middle dot, but one with properties matching those of the OGHAM SPACE MARK. A script-specific encoding is not favored.

If, on the other hand, the AEGEAN dot was intended to fully function as a non-blank space mark, then it is questionable whether it should remain classified as Po, with all that entails. Retaining it as is would seem to make it ill-suited for modern text usage, but wouldn't impact its use in scholarly publications.

-------------------------------------------------------------------

Appendix: Why Samaritan will need a special layout engine anyway

It's perhaps surprising to some, but no matter *how* the Samaritan middle dot is encoded, it *will* need a specialized layout engine to support it.

Stretchable spaces are supported by layout engines, but often only U+0020 SPACE is actually adjusted. SPACE of course does not have a glyph, so fonts are not necessarily involved in layout other than providing a suggested width of the space character in their metrics. The layout engine can just adjust the offsets of the starting positions of the words to achieve the effect of stretching or compressing the spaces.

This technique is obviously not going to work if the layout engine needs to provide a "dot" in the middle between the words. It now has to calculate the middle position and place a glyph there. That logic doesn't magically appear; it needs to be added explicitly. The font is not going to help, since it can't provide the width calculation.

Therefore, because the need for a special layout engine cannot be avoided, it doesn't affect the decision on coding characters.

Well, not completely: should there be more than one script with the *same* stretchy non-blank space rendered as a dot, then a generic character code for that would allow a generic rendering engine extension to be keyed off the character code instead of the script context. That's useful if several scripts with that feature are expected to come online at different times, but _only_ if none of them need other script specializations, so that allowing an existing engine to key off the generic character would support the new script(s) out of the box.

However, as long as it's only Samaritan, keying off the script seems fine. A separate character code would be required only to support the _text processing_ properties, which appear to be very different between a "non-blank space" and a punctuation character.
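
To make the difference in text-processing behavior concrete, here is a minimal sketch in Python. It is not part of either proposal. It assumes that the runtime's bundled `unicodedata` module and the default behavior of `str.split()` are a fair stand-in for "a text processor that recognizes Unicode space characters"; the helper names `describe` and `split_words` are hypothetical, and the values reported come from whatever version of the Unicode Character Database ships with the interpreter, which may differ from the values quoted in this document.

```python
import unicodedata

# Code points discussed in this document.
CANDIDATES = {
    0x1680: "OGHAM SPACE MARK",
    0x10101: "AEGEAN WORD SEPARATOR DOT",
    0x1091F: "PHOENICIAN WORD SEPARATOR",
    0x16EB: "RUNIC SINGLE PUNCTUATION",
}


def describe(cp):
    """Return (general category, bidi class, is-whitespace) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.bidirectional(ch), ch.isspace()


def split_words(text):
    """Split on whitespace only, as a space-aware generic processor might."""
    return text.split()


if __name__ == "__main__":
    # Print the properties as seen by this runtime's Unicode data.
    for cp, name in CANDIDATES.items():
        gc, bc, is_space = describe(cp)
        print(f"U+{cp:04X} {name}: gc={gc} bc={bc} whitespace={is_space}")

    # A gc=Zs character separates words; a gc=Po character does not.
    print(split_words("ab" + chr(0x1680) + "cd"))
    print(split_words("ab" + chr(0x10101) + "cd"))
```

With the OGHAM SPACE MARK between them, the two letter runs come back as separate words; with the AEGEAN dot they stay fused into a single token. That is the out-of-the-box behavior a gc=Po separator would get from software that has not been taught about it.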