L2/06-115

Title: Correction of Word_Break Property Value for U+00A0 NBSP

Date: 2006-04-07

Source: Ken Whistler


Executive Summary:

I propose that for Unicode 5.0, the Word_Break property
value for U+00A0 be corrected to WB=XX.

Discussion:


>> I still believe the INVISIBLE LETTER does what we want it to do, 
>> leaving NBSP to serve the function the gods intended for it.


Here is the way I believe the gods intended it to be:

code   name   advance  gc   lb   lb-class  WB       Base

U+0020 SPACE  +        Zs   SP   A         XX       + (not preferred)
U+00A0 NBSP   +        Zs   GL   XB/XA     XX       + (preferred)
U+200B ZWSP   -        Cf   ZW   A         XX       -
U+2060 WJ     -        Cf   WJ   XB/XA     XX       -

U+???? NGL    +        Lo   AL   XP        ALetter  +

Summary:

SPACE and NBSP are glyphless characters with non-zero
advance width. They are "spaces" (gc=Zs) and are WB=XX
for the purposes of word boundary determination. The
distinction between them is that in linebreaking, SPACE
provides a break-after opportunity (lb-class=A), whereas
NBSP prevents breaks before and after (lb-class=XB/XA).
Both are formally base characters in Unicode, but NBSP
is the preferred base for the display of isolated combining
marks, because of problems in HTML and XML with the
collapse of sequences of SPACEs (among other things).

ZWSP and WJ (word joiner) are glyphless characters with
zero advance width. They are format controls (gc=Cf) and
are WB=XX for the purposes of word boundary determination.
The distinction between them is that in linebreaking, ZWSP
provides a break-after opportunity (lb-class=A), whereas
WJ prevents breaks before and after (lb-class=XB/XA).
Neither is a base character.
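
The four existing characters in the table can also be written down as data. The following is a minimal Python sketch, not a normative property table. The `CharProps` record, its field names, and both helper functions are invented here for illustration. The lb, lb-class, and WB values are copied from the table in this document (WB as proposed), and only the gc values are checked against Python's `unicodedata` module.

```python
import unicodedata
from typing import List, NamedTuple

class CharProps(NamedTuple):
    """Illustrative record; field names are assumptions of this sketch."""
    advance: bool      # non-zero advance width?
    gc: str            # General_Category
    lb: str            # Line_Break value
    lb_behavior: str   # "A" = break after, "XB/XA" = no break before or after
    wb: str            # Word_Break value as proposed in this document
    base: bool         # formally a base character?

# Rows copied from the table above (NGL omitted: it has no code point).
TABLE = {
    "\u0020": CharProps(True,  "Zs", "SP", "A",     "XX", True),   # SPACE
    "\u00A0": CharProps(True,  "Zs", "GL", "XB/XA", "XX", True),   # NBSP
    "\u200B": CharProps(False, "Cf", "ZW", "A",     "XX", False),  # ZWSP
    "\u2060": CharProps(False, "Cf", "WJ", "XB/XA", "XX", False),  # WJ
}

def gc_matches_unicodedata() -> bool:
    """Check the gc column against the Unicode data shipped with Python."""
    return all(unicodedata.category(ch) == p.gc for ch, p in TABLE.items())

def differing_fields(a: str, b: str) -> List[str]:
    """Names of the fields in which two table rows differ."""
    pa, pb = TABLE[a], TABLE[b]
    return [f for f in CharProps._fields if getattr(pa, f) != getattr(pb, f)]
```

Under these assumptions, `differing_fields("\u0020", "\u00A0")` and `differing_fields("\u200B", "\u2060")` both report only the two linebreaking fields, which is the distinction drawn in the summary above.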

What is missing is the NGL (no glyph letter = invisible letter),
which would be a glyphless character with a non-zero
advance width, but which would *otherwise* have typical
letter properties for the purposes of parsing, linebreaking,
word breaking, and so on. The lb=AL property would assure that
the NGL would linebreak just like any other generic letter.

The immediate problem for Safari and Apple is that we actually
have WB=ALetter for U+00A0 NBSP. That ought to be fixed
for Unicode 5.0. Then, with that out of the way, we should
again look at the rationale for encoding NGL as indicated
above.

I think the word breaking behavior of combining marks
"displayed in isolation" on NBSP would be fine with
WB=XX, as shown above. The fact that the word breaking is
not identical to a modifier letter of similar appearance
occurring in the middle of a word is o.k. For example:

U+02CA MODIFIER LETTER ACUTE ACCENT gc=Lm, WB=ALetter

U+00A0 NO-BREAK SPACE gc=Zs, WB=XX
U+0301 COMBINING ACUTE ACCENT gc=Mn, WB=XX, Grapheme_Extend=True

So 

  <a, 02CA, a>       --> <ALetter, ALetter, ALetter>
  <a, 00A0, 0301, a> --> <ALetter, XX, ALetter>
  
and you'd get a word break between "a" and "a" in the second
case. But that is o.k., because this is an *aberrant* use of
NBSP to display a nonspacing combining mark in isolation,
rather than using a modifier letter encoded explicitly to
have that character as part of an orthography.
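
The two mappings above can be reproduced with a deliberately simplified segmenter. This is a sketch under stated assumptions, not an implementation of UAX #29. The hand-written `WB` table covers only the characters in this example and uses the WB=XX value proposed here for U+00A0. A Grapheme_Extend character is absorbed into the preceding unit, and the only non-break rule modeled is that there is no break between two ALetter units. All function names are invented for illustration.

```python
# Hand-written Word_Break values for this example only (assumption:
# U+00A0 carries the proposed WB=XX, not the current ALetter).
WB = {
    "a": "ALetter",
    "\u02CA": "ALetter",   # MODIFIER LETTER ACUTE ACCENT
    "\u00A0": "XX",        # NO-BREAK SPACE (proposed value)
}
EXTEND = {"\u0301"}        # COMBINING ACUTE ACCENT, Grapheme_Extend=True

def units(text):
    """Group each base character with any following Extend characters."""
    out = []
    for ch in text:
        if ch in EXTEND and out:
            out[-1] += ch
        else:
            out.append(ch)
    return out

def classes(text):
    """Word_Break class of each unit, taken from its first character."""
    return [WB.get(u[0], "XX") for u in units(text)]

def words(text):
    """Break between units unless both sides are ALetter."""
    result = []
    prev = None
    for u in units(text):
        cls = WB.get(u[0], "XX")
        if result and prev == "ALetter" and cls == "ALetter":
            result[-1] += u      # no boundary: extend the current word
        else:
            result.append(u)     # boundary: start a new segment
        prev = cls
    return result
```

With these assumptions, `words("a\u02CAa")` yields a single segment, while `words("a\u00A0\u0301a")` yields three, with the NBSP and its combining mark forming the middle one.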

If NGL were encoded with the properties as shown above, then
in those paleographic cases where a letterform is actually
missing, you could end up with all the appropriate behavior
by using NGL instead of NBSP:

   <a, NGL, 0301, a> --> <ALetter, ALetter, ALetter>
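
Because NGL has no code point, the same reasoning can only be sketched over symbolic names and not over characters. The class assignments below restate the hypothetical properties from the table in this document. The helper name and the simplified rule (an Extend mark joins its base, and no break occurs between two ALetter units) are assumptions of this sketch and not UAX #29 itself.

```python
# Symbolic Word_Break classes; "NGL" is hypothetical and unencoded.
WB_CLASS = {"a": "ALetter", "NGL": "ALetter", "NBSP": "XX", "0301": "Extend"}

def word_count(seq):
    """Count word segments in a sequence of symbolic character names."""
    count = 0
    prev = None
    for name in seq:
        cls = WB_CLASS[name]
        if cls == "Extend" and prev is not None:
            continue                      # absorbed into the preceding unit
        if not (prev == "ALetter" and cls == "ALetter"):
            count += 1                    # a boundary precedes this unit
        prev = cls
    return count
```

Here `word_count(["a", "NGL", "0301", "a"])` gives one segment, whereas `word_count(["a", "NBSP", "0301", "a"])` gives three.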
   
This does not constitute a proposal to actually encode the
NGL right now -- we already have such a proposal on record.
However, I think the argument for NGL makes it clearer that
it is not a defect for NBSP used as a base to retain its
normal word break property, just as it is not a defect for it to
retain its normal linebreaking property.

Further Discussion in Followup:

I don't think we would want to change the recommendation that
to *display* a nonspacing mark in isolation you just apply
it to an NBSP. That would work for any nonspacing mark, and it
doesn't matter what it is used for. The NGL, as I see it,
at least, would simply be available for those instances
where people are really representing words in paleographic
(or possibly some didactic) contexts and happen to have
a diacritic where the visible form of the base is missing.
To prevent inappropriate word breaks, they could use NGL
*instead* of NBSP under those circumstances.

Furthermore, there is no guarantee that applying a combining
mark for symbols formally turns a text unit into a
"symbol" anyway. The editors have been carefully drafting text for
Unicode 5.0 to clarify this point for the combining enclosing
marks, for example.