L2/06-115

Title: Correction of Word_Break Property Value for U+00A0 NBSP
Date: 2006-04-07
Source: Ken Whistler

Executive Summary:

I propose that for Unicode 5.0, the Word_Break property value for U+00A0 be corrected to WB=XX.

Discussion:

>> I still believe the INVISIBLE LETTER does what we want it to do,
>> leaving NBSP to serve the function the gods intended for it.

Here is the way I believe the gods intended it to be:

code    name   advance  gc  lb  lb-class  WB       Base
U+0020  SPACE  +        Zs  SP  A         XX       + (not preferred)
U+00A0  NBSP   +        Zs  GL  XB/XA     XX       + (preferred)
U+200B  ZWSP   -        Cf  ZW  A         XX       -
U+2060  WJ     -        Cf  WJ  XB/XA     XX       -
U+????  NGL    +        Lo  AL  XP        ALetter  +

Summary:

SPACE and NBSP are glyphless characters with non-zero advance width. They are "spaces" (gc=Zs) and are WB=XX for the purposes of word boundary determination. The distinction between them is that in linebreaking, SPACE provides a break-after opportunity (lb-class=A), whereas NBSP prevents breaks before and after (lb-class=XB/XA). Both are formally base characters in Unicode, but NBSP is the preferred base for the display of isolated combining marks, because of problems in HTML and XML with the collapse of sequences of SPACEs (among other things).

ZWSP and WJ (word joiner) are glyphless characters with zero advance width. They are format controls (gc=Cf) and are WB=XX for the purposes of word boundary determination. The distinction between them is that in linebreaking, ZWSP provides a break-after opportunity (lb-class=A), whereas WJ prevents breaks before and after (lb-class=XB/XA). Neither is a base character.

What is missing is the NGL (no glyph letter = invisible letter), which would be a glyphless character with a non-zero advance width, but which would *otherwise* have typical letter properties for the purposes of parsing, linebreaking, word breaking, and so on. The lb=AL property would assure that the NGL would linebreak just like any other generic letter.

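The table above can be transcribed into data and cross-checked against a current Unicode Character Database. What follows is only a minimal sketch, not part of the proposal: the `Row` type, the `TABLE` list, and the helper names are invented for illustration, the NGL row has no code point because it is not encoded, and only the gc column is checked because Python's standard `unicodedata` module does not expose the Line_Break or Word_Break properties.

```python
import unicodedata
from typing import List, NamedTuple, Optional


class Row(NamedTuple):
    """One row of the table above (hypothetical helper type)."""
    code: Optional[int]  # None for the not-yet-encoded NGL
    name: str
    advance: bool        # True means non-zero advance width ("+")
    gc: str
    lb: str
    lb_class: str
    wb: str
    base: bool           # True means usable as a base character ("+")


# Values transcribed from the table in this document.
TABLE = [
    Row(0x0020, "SPACE", True, "Zs", "SP", "A", "XX", True),
    Row(0x00A0, "NBSP", True, "Zs", "GL", "XB/XA", "XX", True),
    Row(0x200B, "ZWSP", False, "Cf", "ZW", "A", "XX", False),
    Row(0x2060, "WJ", False, "Cf", "WJ", "XB/XA", "XX", False),
    Row(None, "NGL", True, "Lo", "AL", "XP", "ALetter", True),
]


def gc_mismatches(table: List[Row]) -> List[str]:
    """Names of encoded rows whose gc differs from the local UCD."""
    return [
        row.name
        for row in table
        if row.code is not None
        and unicodedata.category(chr(row.code)) != row.gc
    ]


def letter_like(table: List[Row]) -> List[str]:
    """Names of rows that behave as letters for word breaking."""
    return [row.name for row in table if row.wb == "ALetter"]


if __name__ == "__main__":
    print("gc mismatches:", gc_mismatches(TABLE))
    print("WB=ALetter rows:", letter_like(TABLE))
```

Under this transcription, the NGL is the only row with letter-like word break behavior, which is the gap the summary describes.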
The immediate problem for Safari and Apple is that we actually have WB=ALetter for U+00A0 NBSP. That ought to be fixed for Unicode 5.0.

Then, having that out of the way, we should again look at the rationale for encoding NGL as indicated above.

I think the word breaking behavior of combining marks "displayed in isolation" on NBSP would be fine with WB=XX, as shown above. The fact that the word breaking is not identical to a modifier letter of similar appearance occurring in the middle of a word is o.k. For example:

U+02CA MODIFIER LETTER ACUTE ACCENT  gc=Lm, WB=ALetter
U+00A0 NO-BREAK SPACE                gc=Zs, WB=XX
U+0301 COMBINING ACUTE ACCENT        gc=Mn, WB=XX, Grapheme_Extend=True

So --> --> and you'd get a word break between "a" and "a" in the second case. But that is o.k., because this is an *aberrant* use of NBSP to display a nonspacing combining mark in isolation, rather than using a modifier letter encoded explicitly to have that character as part of an orthography.

If NGL were encoded with the properties as shown above, then in those paleographic cases where a letterform is actually missing, you could end up with all the appropriate behavior by using NGL instead of NBSP: -->

This does not constitute a proposal to actually encode the NGL right now -- we already have such a proposal on record. However, I think the argument for NGL makes it clearer that it is not a defect for NBSP used as a base to retain its normal word break property, just as it is not a defect for it to retain its normal linebreaking property.

Further Discussion in Followup:

I don't think we would want to change the recommendation that to *display* a nonspacing mark in isolation you just apply it to a NBSP. That would work for any nonspacing mark, and it doesn't matter what it is used for.

The NGL, at least as I see it, would simply be available for those instances where people are really representing words in paleographic (or possibly some didactic) contexts and happen to have a diacritic where the visible form of the base is missing. To prevent inappropriate word breaks, they could use NGL *instead* of NBSP under those circumstances.

Furthermore, there is no guarantee that applying combining marks for symbols formally turns text units into "symbols" anyway. The editors have been carefully drafting text for Unicode 5.0 to clarify this about the combining enclosing marks, for example.
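
The word breaking behavior discussed above can be illustrated with a deliberately reduced model. This is a sketch under stated assumptions, not the full word boundary algorithm: it keeps only two rules (a nonspacing or enclosing mark attaches to the preceding character, and there is no break between two ALetter characters), it approximates Word_Break from the General_Category, and it uses the private-use code point U+E000 as a hypothetical stand-in for the unencoded NGL. The test sequences are assumed readings of the example: "a", U+02CA, "a" for the modifier letter case and "a", U+00A0, U+0301, "a" for the NBSP case.

```python
import unicodedata
from typing import Dict, List

NBSP = "\u00A0"
NGL = "\uE000"  # hypothetical stand-in (private use), NOT an encoded NGL


def wb_class(ch: str, overrides: Dict[str, str]) -> str:
    """Rough Word_Break approximation; overrides win over the guess."""
    if ch in overrides:
        return overrides[ch]
    cat = unicodedata.category(ch)
    if cat in ("Mn", "Me"):
        return "Extend"   # treated as attaching to the preceding character
    if cat.startswith("L"):
        return "ALetter"  # simplification: every letter counts as ALetter
    return "XX"


def segments(text: str, overrides: Dict[str, str]) -> List[str]:
    """Split text using only the two simplified rules described above."""
    out: List[str] = []
    prev = None  # class of the last character that was not a mark
    for ch in text:
        cls = wb_class(ch, overrides)
        if cls == "Extend" and out:
            out[-1] += ch  # keep the mark with its base
            continue
        if prev == "ALetter" and cls == "ALetter":
            out[-1] += ch  # no break between letters
        else:
            out.append(ch)
        prev = cls
    return out


if __name__ == "__main__":
    proposed = {NBSP: "XX", NGL: "ALetter"}
    print(segments("a\u02CAa", proposed))         # modifier letter case
    print(segments("a\u00A0\u0301a", proposed))   # NBSP as base
    print(segments("a" + NGL + "\u0301a", proposed))  # NGL as base
```

In this model the NBSP case splits into three pieces while the modifier letter and NGL cases stay whole, matching the behavior argued for above. Passing `{NBSP: "ALetter"}` instead models the value being corrected, where `"foo\u00A0bar"` comes out as a single word.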