L2/07-258

Title: Middle Dots and Don'ts
Source: Ken Whistler
Date: August 2, 2007
Action: For Consideration by UTC
References: L2/07-231, Preliminary proposal to add the Samaritan alphabet to the BMP

Background

The proposal to encode the Samaritan script has resulted in considerable controversy over what to do about encoding the middle dot used in Samaritan text as a word separator. See L2/07-231 for details of the use of that middle dot in Samaritan, along with citations in context.

The upshot is that L2/07-231 is requesting the encoding of:

083F;SAMARITAN WORD SEPARATOR POINT;Po;0;R;;;;;N;;;;;

Glyphically, this is just a small middle dot. So the case to be made for encoding it for Samaritan hinges on the claim that this particular middle dot cannot be represented by any of the other middle dots already encoded in the standard, and that it has different properties and behavior than any existing character.

L2/07-231 explicitly claims that this character in Samaritan cannot be represented by U+2027 HYPHENATION POINT, because that "is different".

In an effort to provide the context about middle dot properties that L2/07-231 doesn't, and which would otherwise be difficult to pull together during a meeting, I am here summarizing all the relevant information about *existing* middle dots in Unicode, so that the UTC at least has it in hand when making an informed decision about whether to encode yet another one.

===========================================================
Classification of Existing Encoded (and Proposed) Middle Dots in Unicode

There are 22 characters currently encoded in Unicode which can arguably be called "middle dots" of one sort or another, plus two more still in contention (for Avestan and Samaritan). This doesn't include the various baseline dots or raised dots, but simply the dots that occur more or less midline in text.
I am also not counting larger size dots, such as U+2022 BULLET, which are intentionally distinguished by their larger size.

Here is the list of 24:

00B7;MIDDLE DOT;Po;0;ON;;;;;N;;;;;
0387;GREEK ANO TELEIA;Po;0;ON;00B7;;;;N;;;;;
02D1;MODIFIER LETTER HALF TRIANGULAR COLON;Lm;0;L;;;;;N;;;;;
05BC;HEBREW POINT DAGESH OR MAPIQ;Mn;21;NSM;;;;;N;;;;;
0701;SYRIAC SUPRALINEAR FULL STOP;Po;0;AL;;;;;N;;;;;
0F0B;TIBETAN MARK INTERSYLLABIC TSHEG;Po;0;L;;;;;N;;;;;
0F0C;TIBETAN MARK DELIMITER TSHEG BSTAR;Po;0;L;<noBreak> 0F0B;;;;N;;;;;
1427;CANADIAN SYLLABICS FINAL MIDDLE DOT;Lo;0;L;;;;;N;;;;;
16EB;RUNIC SINGLE PUNCTUATION;Po;0;L;;;;;N;;;;;
1802;MONGOLIAN COMMA;Po;0;ON;;;;;N;;;;;
1B7C;BALINESE MUSICAL SYMBOL LEFT-HAND OPEN PING;So;0;L;;;;;N;;;;;
2027;HYPHENATION POINT;Po;0;ON;;;;;N;;;;;
22C5;DOT OPERATOR;Sm;0;ON;;;;;N;;;;;
302E;HANGUL SINGLE DOT TONE MARK;Mn;224;NSM;;;;;N;;;;;
30FB;KATAKANA MIDDLE DOT;Po;0;ON;;;;;N;;;;;
FF65;HALFWIDTH KATAKANA MIDDLE DOT;Po;0;ON;<narrow> 30FB;;;;N;;;;;
318D;HANGUL LETTER ARAEA;Lo;0;L;<compat> 119E;;;;N;;;;;
A947;REJANG VOWEL SIGN I;Mn;0;NSM;;;;;N;;;;;
10101;AEGEAN WORD SEPARATOR DOT;Po;0;ON;;;;;N;;;;;
1091F;PHOENICIAN WORD SEPARATOR;Po;0;ON;;;;;N;;;;;
10A50;KHAROSHTHI PUNCTUATION DOT;Po;0;R;;;;;N;;;;;
1D16D;MUSICAL SYMBOL COMBINING AUGMENTATION DOT;Mc;226;L;;;;;N;;;;;
*10B38;AVESTAN SEPARATION POINT;Po;0;R;;;;;N;;;;;
*083F;SAMARITAN WORD SEPARATOR POINT;Po;0;R;;;;;N;;;;;

Now not all of those middle dots are punctuation. Some of them are actual letters in a script, are used as word extenders, or are combining marks that depend on a base character, or are otherwise distinct in ways that would make them inappropriate for use as generic punctuation marks indicating separation of elements. Others are conceptually middle dot separators, but have script-specific shape differences that also make them inappropriate for use as generic punctuation marks.
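Because the records in the list use the UnicodeData.txt field order, a first rough cut by General_Category can be made mechanically. What follows is only a minimal sketch in Python under stated assumptions: the records are retyped from the list (with the decomposition field written as in UnicodeData.txt), and the names `parse_record`, `group_by_category`, `DOTS`, `GROUPS`, and `PUNCTUATION` are hypothetical helpers for illustration, not part of any Unicode tool.

```python
from collections import defaultdict

# Records retyped from the list of 24 middle dots. Field order follows
# UnicodeData.txt: code;name;gc;ccc;bc;decomposition;...
# A leading "*" marks a character that is only proposed, not yet encoded.
RECORDS = """\
00B7;MIDDLE DOT;Po;0;ON;;;;;N;;;;;
0387;GREEK ANO TELEIA;Po;0;ON;00B7;;;;N;;;;;
02D1;MODIFIER LETTER HALF TRIANGULAR COLON;Lm;0;L;;;;;N;;;;;
05BC;HEBREW POINT DAGESH OR MAPIQ;Mn;21;NSM;;;;;N;;;;;
0701;SYRIAC SUPRALINEAR FULL STOP;Po;0;AL;;;;;N;;;;;
0F0B;TIBETAN MARK INTERSYLLABIC TSHEG;Po;0;L;;;;;N;;;;;
0F0C;TIBETAN MARK DELIMITER TSHEG BSTAR;Po;0;L;<noBreak> 0F0B;;;;N;;;;;
1427;CANADIAN SYLLABICS FINAL MIDDLE DOT;Lo;0;L;;;;;N;;;;;
16EB;RUNIC SINGLE PUNCTUATION;Po;0;L;;;;;N;;;;;
1802;MONGOLIAN COMMA;Po;0;ON;;;;;N;;;;;
1B7C;BALINESE MUSICAL SYMBOL LEFT-HAND OPEN PING;So;0;L;;;;;N;;;;;
2027;HYPHENATION POINT;Po;0;ON;;;;;N;;;;;
22C5;DOT OPERATOR;Sm;0;ON;;;;;N;;;;;
302E;HANGUL SINGLE DOT TONE MARK;Mn;224;NSM;;;;;N;;;;;
30FB;KATAKANA MIDDLE DOT;Po;0;ON;;;;;N;;;;;
FF65;HALFWIDTH KATAKANA MIDDLE DOT;Po;0;ON;<narrow> 30FB;;;;N;;;;;
318D;HANGUL LETTER ARAEA;Lo;0;L;<compat> 119E;;;;N;;;;;
A947;REJANG VOWEL SIGN I;Mn;0;NSM;;;;;N;;;;;
10101;AEGEAN WORD SEPARATOR DOT;Po;0;ON;;;;;N;;;;;
1091F;PHOENICIAN WORD SEPARATOR;Po;0;ON;;;;;N;;;;;
10A50;KHAROSHTHI PUNCTUATION DOT;Po;0;R;;;;;N;;;;;
1D16D;MUSICAL SYMBOL COMBINING AUGMENTATION DOT;Mc;226;L;;;;;N;;;;;
*10B38;AVESTAN SEPARATION POINT;Po;0;R;;;;;N;;;;;
*083F;SAMARITAN WORD SEPARATOR POINT;Po;0;R;;;;;N;;;;;
"""


def parse_record(line):
    """Split one semicolon-delimited record into the fields used here."""
    proposed = line.startswith("*")
    fields = line.lstrip("*").split(";")
    return {
        "code": fields[0],
        "name": fields[1],
        "gc": fields[2],            # General_Category
        "ccc": int(fields[3]),      # Canonical_Combining_Class
        "bc": fields[4],            # Bidi_Class
        "decomp": fields[5].strip(),
        "proposed": proposed,
    }


def group_by_category(records):
    """Map each General_Category value to the code points that carry it."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["gc"]].append(rec["code"])
    return dict(groups)


DOTS = [parse_record(line) for line in RECORDS.splitlines() if line.strip()]
GROUPS = group_by_category(DOTS)
# gc=Po is only a first filter; shape and script still need human judgment.
PUNCTUATION = [d for d in DOTS if d["gc"] == "Po"]

if __name__ == "__main__":
    for gc, codes in sorted(GROUPS.items()):
        print(gc, " ".join(codes))
```

A split on General_Category only separates the punctuation candidates from letters, marks, and symbols; it says nothing about glyph shape or script-specific behavior, which still have to be weighed by hand.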
So to get down to the heart of the matter, I need to classify these 24 middle dots further, based on properties, to identify ones that must be distinguished from middle dots serving as generic separators.

===========================================================
A. Middle dots that function as letters and extenders

1427 CANADIAN SYLLABICS FINAL MIDDLE DOT
gc=Lo, ccc=0, bc=L

318D HANGUL LETTER ARAEA
gc=Lo, ccc=0, bc=L, decomp=<compat> 119E
Note: the araea is shown as a middle dot, but various fonts show it in different shapes, including slanted strokes, presumably so as to visually distinguish it from an actual punctuation middle dot in Korean. And this particular character is part of the compatibility jamo; the combining jamo U+119E HANGUL JUNGSEONG ARAEA is not imaged with a middle dot.

A947 REJANG VOWEL SIGN I
gc=Mn, ccc=0, bc=NSM, Other_Alphabetic=True
Note: This is a vowel mark for Rejang. It happens to have the shape of a midline dot.

02D1 MODIFIER LETTER HALF TRIANGULAR COLON
gc=Lm, ccc=0, bc=L, Diacritic=True, Extender=True
Note: This is the IPA half-length mark and nominally has a small midline downpointing triangular shape. This shape was deliberately chosen by the IPA to distinguish it in form from the middle dot used to indicate length, to avoid confusion with it. In other words, it and its sister length mark, 02D0 MODIFIER LETTER TRIANGULAR COLON, were deliberate disambiguations of the preexisting orthographic conventions of using middle dot (and colon, respectively) to indicate length, intended for precise use in IPA.

Whatever their shape, middle dots that function as letters and extenders clearly are inappropriate for use as separating punctuation, based merely on their General_Category values and what that implies for other processes.

===========================================================
B. Diacritic middle dots as combining marks

05BC HEBREW POINT DAGESH OR MAPIQ
gc=Mn, ccc=21, bc=NSM
Note: The dagesh is a middle dot, but it is a combining, diacritic middle dot that occurs inside Hebrew letters (and in somewhat different positions), rather than punctuation occurring between letters.

302E HANGUL SINGLE DOT TONE MARK
gc=Mn, ccc=224, bc=NSM
Note: This is one of the few left-side non-spacing marks, and is a tone mark for a Hangul syllable, rather than punctuation occurring between letters.

1D16D MUSICAL SYMBOL COMBINING AUGMENTATION DOT
gc=Mc, ccc=226, bc=L
Note: This middle dot occurs to the right of a base notehead, but cannot be used properly in non-musical text or applied other than to a notehead, and in any case is not punctuation occurring between letters.

===========================================================
C. Middle dots that are part of symbol sets, not functioning as punctuation

1B7C BALINESE MUSICAL SYMBOL LEFT-HAND OPEN PING
gc=So, ccc=0, bc=L
Note: This is a dot used as a symbol in Balinese musical notation, occurring as part of a set of various symbols. It doesn't function as separator punctuation, in any case.

22C5 DOT OPERATOR
gc=Sm, ccc=0, bc=ON
Note: This is a math operator. It is encoded as a functional disambiguation of the formally identical U+00B7 MIDDLE DOT. It is used in math expression syntax, and does not function as separator punctuation.

O.k., so far that has pared away 9 middle dots in Unicode that don't function as punctuation dots. But that leaves 15 that *are* punctuation middle dots. Analyzing further...

===========================================================
D. Middle dots that are canonical or (shape-based) compatibility equivalents of other punctuation middle dots

0387 GREEK ANO TELEIA
gc=Po, ccc=0, bc=ON, decomp=00B7
Note: This is a singleton canonical equivalent to U+00B7 MIDDLE DOT; it can't really be distinguished from U+00B7 itself in most contexts and shouldn't be recommended for anything.
FF65 HALFWIDTH KATAKANA MIDDLE DOT
gc=Po, ccc=0, bc=ON, decomp=<narrow> 30FB
Note: This is a halfwidth compatibility version of the Katakana middle dot, and is only recommended for roundtrip mapping of legacy East Asian character sets.

===========================================================
E. Punctuation middle dots that have notable, script-specific shapes making them inappropriate for generic middle dot punctuation

0F0B TIBETAN MARK INTERSYLLABIC TSHEG
gc=Po, ccc=0, bc=L, lb=BA, Script=Tibetan

0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
gc=Po, ccc=0, bc=L, lb=GL, Script=Tibetan, decomp=<noBreak> 0F0B
Note: The Tibetan tseks are middle dots used to separate Tibetan orthographic stacks in text, so constitute syllable- (or word-) separating punctuation, even though the things they separate are neither exactly syllables nor words, but are closer to being morphemes. 0F0B and 0F0C differ in their line-breaking behavior. In either case, the *shape* of the tsek is unique to Tibetan, looking again like a down-pointing midline triangle, rather than a small round circle. So based on shape alone, a tsek is not a candidate for a generic punctuation middle dot.

1802 MONGOLIAN COMMA
gc=Po, ccc=0, bc=ON, lb=EX, Script=Common
Note: The Mongolian comma is a middle dot that functions as a "comma" text separator, i.e. as a first level phrase separator, as opposed to the double dot for a "full stop". However, the Mongolian comma tends to have a distinctive diamond shape and would be found only in Mongolian fonts, which are themselves unusual in being designed primarily for vertical rendering. The Script value is Common, because the 'Phags-pa script shares Mongolian punctuation.

===========================================================
That leaves us with 10 punctuation middle dots that are all shaped like little midline circles. I list them next, together with all the relevant properties they have related to breaking and separation behavior.

F. Punctuation middle dots that are shaped like dots and which function as separators (exceptional cases)

00B7 MIDDLE DOT
gc=Po, ccc=0, bc=ON, lb=AI, Diacritic=True, Extender=True, XID_Continue=True, MidLetter=True
Note: This is *the* middle dot, but it is hampered by its own history of ambiguous usage. It is both a punctuation mark and *also* has been widely used as a diacritic, mostly for indicating length. Its use in Catalan (as well as many other orthographies that don't get the same press) requires special handling for it, and accounts for its identifier status. This is basically the only punctuation mark also recommended for use in identifiers. Its linebreaking behavior is also ambiguous. And it is MidLetter=True so as to prevent wordbreaking around it.

0701 SYRIAC SUPRALINEAR FULL STOP
gc=Po, ccc=0, bc=AL, lb=AL, Terminal_Punctuation=True, STerm=True, Script=Syriac
Note: This is the only middle dot which has the STerm property (meaning that it provides a sentence break boundary according to UAX #29), presumably based on its identification as a "FULL STOP" for Syriac. However, it and a number of other Syriac punctuation marks are lb=AL, presumably simply for lack of information about any other particular required linebreaking behavior for Syriac punctuation. This middle dot is also bc=AL, with strong Arabic right-to-left directionality, which would make it only really appropriate for use with the bc=AL scripts: Arabic, Syriac, and Thaana. And it is also given the Syriac script property explicitly.

30FB KATAKANA MIDDLE DOT
gc=Po, ccc=0, bc=ON, lb=NS, eaw=W, Hyphen=True
Note: This is the only East Asian ("wide") middle dot. It is also exceptional in that it functions more to join things together (functionally like a displayed hyphen), rather than to separate them.
Its linebreaking behavior reflects this, as Japanese linebreaking rules prohibit starting a line with the Katakana middle dot, and unlike most middle dots, it isn't a particularly good candidate for break after.

10A50 KHAROSHTHI PUNCTUATION DOT
gc=Po, ccc=0, bc=R, lb=BA, Script=Kharoshthi
Note: This middle dot is strong Right-to-Left and is given the Kharoshthi script property. It is unlikely to be supported in any font except one specifically for Kharoshthi, which itself is a historic complex script.

===========================================================
So that leaves us with 6 middle dots that I would consider "non-exceptional". 4 of these are currently encoded, and 2 more, one for Avestan and one for Samaritan, are in current proposals.

G. Punctuation middle dots that are shaped like dots and which function as separators (non-exceptional cases)

2027 HYPHENATION POINT
gc=Po, ccc=0, bc=ON, lb=BA, Pattern_Syntax=True, MidLetter=True
Note: The Pattern_Syntax=True is a guarantee that U+2027 HYPHENATION POINT will never be a part of identifiers, but otherwise does not impact the breaking or layout behavior of the character. More significant is the fact that the character has the property MidLetter=True, which under the UAX #29 algorithm means that it is not a word break opportunity.

10101 AEGEAN WORD SEPARATOR DOT
gc=Po, ccc=0, bc=ON, lb=BA

1091F PHOENICIAN WORD SEPARATOR
gc=Po, ccc=0, bc=ON, lb=BA, Terminal_Punctuation=True, Script=Phoenician
Note: This middle dot is currently identified as Script=Phoenician, but that assessment was a default one. It could just as well be that this middle dot might be considered appropriate for use in more than one Semitic script, in which case it would end up as Script=Common, like other shared middle dots.

16EB RUNIC SINGLE PUNCTUATION
gc=Po, ccc=0, bc=L, lb=BA, Terminal_Punctuation=True

*10B38 AVESTAN SEPARATION POINT
gc=Po, ccc=0, bc=R, lb=BA

*083F SAMARITAN WORD SEPARATOR POINT
gc=Po, ccc=0, bc=R, lb=BA

Further notes:

The status of the Terminal_Punctuation property is inconsistent, in part because it has been somewhat indifferently maintained, and is not rigidly verified by checking any algorithmic implications for it. It is difficult to know for historic scripts, in particular, just what its value should be. It is particularly problematical for *separator* punctuation that takes the place of spaces in scripts that predate modern conventions for punctuation. In any case, I consider any current assignments for Terminal_Punctuation to be essentially irrelevant to determination of which of these middle dots should be used for what.

All 6 of these middle dots are line break (after) opportunities. All but the HYPHENATION POINT also are word break boundaries.

Note that HYPHENATION POINT itself is a misnomer, since it doesn't hyphenate, nor is it a hyphen. In fact, with all the onesey-twosey accumulation of these middle dots over a couple of decades, and inconsistencies of naming that have set in, folks have tended to get too invested in the character names themselves as distinguishing properties, when they really aren't. These last 6 characters could just as well (or better) be termed:

2027 (COMMON) SYLLABLE SEPARATOR MIDDLE DOT
10101 AEGEAN WORD SEPARATOR MIDDLE DOT
1091F PHOENICIAN WORD SEPARATOR MIDDLE DOT
16EB RUNIC WORD SEPARATOR MIDDLE DOT
*10B38 AVESTAN WORD SEPARATOR MIDDLE DOT
*083F SAMARITAN WORD SEPARATOR MIDDLE DOT

And at that point, the apparent justification for continuing to encode more of these word separator punctuation middle dots for more historic scripts really starts to collapse.

===========================================================
What to do now?

At this point, I think the best way to systematize what we currently have, in a way comprehensible to everybody and in a way that lets us reasonably conclude that we don't need to encode more dots, is as follows:

1. Change the script for U+1091F from Script=Phoenician to Script=Common, and change its bidirectional class to bc=R (the same as Phoenician itself).

This would leave the standard with 3 generic word separator middle dots, distinguished essentially by their bidirectional class:

10101 (AEGEAN) WORD SEPARATOR MIDDLE DOT <== bc=ON
1091F (PHOENICIAN) WORD SEPARATOR MIDDLE DOT <== bc=R
16EB (RUNIC) WORD SEPARATOR MIDDLE DOT <== bc=L

You would use 1091F for Samaritan or any other historic Semitic right-to-left script that needed a middle dot word separator. You would use 16EB for Runic or any other historic alphabetic left-to-right script that needed a middle dot word separator. And you would use 10101 for any other middle dot word separator where neither of the strongly directional word separator middle dots, nor one of the more specialized middle dots (nor U+00B7 MIDDLE DOT itself), would do.

U+2027 would then remain as another specialized middle dot, and its chief distinction would be that (by default) it would linebreak like the other middle dots, but would not word break.

As for script-specific layout properties of middle dots, I think that ends up being something that should be handled by fonts and rendering engines to get niceties of particular orthographies to work right. I am totally unconvinced by Michael Everson's arguments about the Samaritan word separator middle dot being actually a kind of SPACE in disguise, a faux blank displayed with a dot in it. What we are seeing instead are simply Samaritan line layout conventions involving punctuation, and rules for spacing left and right of a middle dot -- no different in concept than special French rules for spacing around punctuation marks. And the reason why people can convince themselves that this middle dot is actually a variable width space character is that the *function* of the middle dot in Samaritan (as for similar dots in many premodern orthographic traditions) is·to·separate·words·like·this·visually·to·assist·in·reading.
Finally, if the UTC decides to take this general direction, we would need to thoroughly document the decision about middle dots prominently as a distinct class of punctuation marks (the way the standard currently does for spaces, quotation marks, and so on), so that implementers would understand the intent, and so that people writing proposals for encoding of more historic scripts (and there are many still out there) won't keep bringing in more script-specific middle dots for separate encoding.
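
To make the selection rule from step 1 concrete, here is a minimal sketch in Python of choosing among the three generic word separator middle dots by text direction. It assumes the proposed property changes for U+1091F were adopted, which is the proposal made here and not a statement about current Unicode data. The names `PROPOSED_SEPARATORS`, `pick_word_separator`, and `join_words`, and the direction labels "rtl", "ltr", and "neutral", are hypothetical and chosen only for illustration.

```python
# Proposed allocation from step 1: one generic word separator middle dot
# per bidirectional class. The bc=R entry reflects the *proposed* change
# to U+1091F, not necessarily its value in any published Unicode version.
PROPOSED_SEPARATORS = {
    "rtl": "\U0001091F",      # (PHOENICIAN) WORD SEPARATOR, proposed bc=R
    "ltr": "\u16EB",          # (RUNIC) SINGLE PUNCTUATION, bc=L
    "neutral": "\U00010101",  # (AEGEAN) WORD SEPARATOR DOT, bc=ON
}


def pick_word_separator(direction):
    """Return the generic separator for a direction label.

    Any label other than "rtl" or "ltr" falls back to the bc=ON dot,
    mirroring the "any other middle dot word separator" case.
    """
    return PROPOSED_SEPARATORS.get(direction, PROPOSED_SEPARATORS["neutral"])


def join_words(words, direction):
    """Join words with the separator chosen for the given direction."""
    return pick_word_separator(direction).join(words)


if __name__ == "__main__":
    sample = join_words(["ab", "cd", "ef"], "ltr")
    # Show the code points so the separators are visible in any terminal.
    print([f"U+{ord(ch):04X}" for ch in sample])
```

Spacing around the dot is deliberately left out of the sketch, since the position taken above is that such layout niceties belong to fonts and rendering engines and not to the encoded character.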