L2/07-340

Date: Sat, 6 Oct 2007
Source: Mark Davis
Subject: OGHAM SPACE MARK shouldn't be whitespace
=================

(1) It has become clear that the U+1680 OGHAM SPACE MARK character is not really a whitespace character. Users of the UCD expect that whitespace characters are, well, white space -- that is, that they do not have a visible glyph in normal usage. (Of course, they may have a visible glyph in special circumstances, such as in a "Show Hidden" mode.)

According to the standard, the conventional representation of U+1680 OGHAM SPACE MARK has a visible bar. Thus it should not be categorized as a whitespace character. The character may well be used to separate words, but that is orthogonal to whether it is whitespace or not. There are many characters in Unicode that are used to separate words but are not whitespace, such as U+1361 ETHIOPIC WORDSPACE.

This character should be removed from the Whitespace property, and have its general category changed from Zs to So.

(2) There is a separate but related issue that was discussed at the last UTC, having to do with how visible word separator characters behave in terms of word-wrap. That is, suppose that we have the text "The:quick:brown:fox:jumped.", where : represents a visible word separator, and we break between "brown" and "fox". Then the desired visual appearance could be:

(A) suppress the visible word separator

    The:quick:brown
    fox:jumped.

(B) break before the visible word separator

    The:quick:brown
    :fox:jumped.

(C) break after the visible word separator

    The:quick:brown:
    fox:jumped.

Both (B) and (C) can be expressed with the current Unicode Line Breaking Algorithm. However, if there are characters that behave like (A), that behavior cannot be expressed.

If U+1680 OGHAM SPACE MARK or other visible word separator characters behave like (A), then it may be worth having a property for them, or at least a documented list of them somewhere. (The latter might be appropriate if the (A) behavior is only exhibited with complex scripts anyway, or only with archaic scripts.)

-- Mark
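
The three wrapping outcomes in (2) can be illustrated with a minimal Python sketch. The function name break_line and its parameters are hypothetical, introduced only for this illustration. It is not an implementation of the Unicode Line Breaking Algorithm; it simply produces the (A), (B), and (C) appearances for a single break, with ":" standing in for the visible word separator.

```python
def break_line(words, break_after, sep=":", style="A"):
    """Join words with a visible separator and break once after words[break_after].

    style "A": suppress the separator at the break
    style "B": break before the separator (it starts the next line)
    style "C": break after the separator (it ends the current line)
    """
    head = sep.join(words[: break_after + 1])
    tail = sep.join(words[break_after + 1 :])
    if style == "A":
        return [head, tail]
    if style == "B":
        return [head, sep + tail]
    if style == "C":
        return [head + sep, tail]
    raise ValueError(f"unknown style: {style!r}")


words = ["The", "quick", "brown", "fox", "jumped."]
for style in "ABC":
    print(f"({style})")
    # Break between "brown" (index 2) and "fox".
    for line in break_line(words, 2, style=style):
        print("    " + line)
```

In (B) and (C) the separator survives on one side of the break, so each can be modeled by allowing a break on only one side of the character. In (A) the separator disappears from the rendered text, which is why it needs separate treatment.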