Re: BOM as WJ?

From: Philippe Verdy
Date: Wed Nov 19 2003 - 08:44:10 EST


    From: "Pim Blokland"
    > However, a couple of paragraphs up, the definition for No-Break
    > Space says:
    > > U+00A0 [No-Break Space] behaves like the following coded
    > > character sequence: U+FEFF [Zero Width No-Break Space] +
    > > U+0020 [Space] + U+FEFF [Zero Width No-Break Space].
    > Is this something that has slipped by the editors? Or am I missing
    > something?

    The key phrase in that sentence is "behaves like". That is different from
    saying the two are equivalent: the statement does not say that NBSP
    decomposes into that sequence. It only illustrates that NBSP prevents a
    line break on both sides and is otherwise rendered as a normal space.
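
    The difference between "behaves like" and "is equivalent to" can be
    checked against the character database. The following is only a minimal
    sketch, assuming Python 3 and its standard unicodedata module (neither is
    part of the original discussion). It shows that the only decomposition
    recorded for NBSP is a compatibility mapping to a plain SPACE, and that
    no normalization form turns the three-character sequence into NBSP or
    the reverse.

```python
import unicodedata

NBSP = "\u00a0"    # NO-BREAK SPACE
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE (also used as the BOM)
SPACE = "\u0020"   # SPACE

# The sequence quoted from the standard's description of NBSP.
sequence = ZWNBSP + SPACE + ZWNBSP

# NBSP has a compatibility (<noBreak>) mapping to SPACE, nothing more.
mapping = unicodedata.decomposition(NBSP)

# Canonical normalization leaves NBSP untouched.
nfd = unicodedata.normalize("NFD", NBSP)

# Compatibility normalization yields a single SPACE, not the sequence.
nfkd = unicodedata.normalize("NFKD", NBSP)

# Composing the sequence does not produce NBSP either.
nfkc_of_sequence = unicodedata.normalize("NFKC", sequence)

print(repr(mapping), repr(nfd), repr(nfkd), repr(nfkc_of_sequence))
```

    So the quoted sentence describes line-breaking behavior only; it does
    not define a mapping that a normalizer would apply.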

    But it is true that NBSP is used to join words, so a better analogy would
    be to say:

    > U+00A0 [No-Break Space] behaves like the following coded
    > character sequence: U+2060 [Word Joiner] +
    > U+0020 [Space] + U+2060 [Word Joiner].

    I think that the wording of this sentence was not updated as it should have
    been. But this does not constitute a breach in the standard, as the
    sentence is mostly informative.

    Of course, encoding a text with <ZWNBSP,SP,ZWNBSP> instead of just <NBSP>
    could collide with the current use of U+FEFF as a BOM. But it is not
    invalid to use the three-character sequence in the middle of a text. For
    UTF encoding schemes that forbid the use of a BOM, ZWNBSP (U+FEFF) must
    still be interpreted exactly like the newer WORD JOINER.
    There will be no problem with BOM interpretation if a text instead uses
    <WJ,SP,WJ>, even at the beginning of the text, which is equally valid (even
    if a WJ in the first position of a text looks strange).
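
    The collision at the start of a text can be illustrated with a short
    sketch. It assumes Python 3 and its "utf-8-sig" codec, which strips one
    leading U+FEFF as a signature when decoding; the roundtrip helper and
    the sample strings are hypothetical and serve only as an illustration.

```python
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, doubles as the BOM
WJ = "\u2060"      # WORD JOINER
SPACE = " "


def roundtrip(text):
    """Encode as plain UTF-8, then decode with a BOM-stripping decoder."""
    return text.encode("utf-8").decode("utf-8-sig")


# A text that starts with the older three-character sequence.
old_style = ZWNBSP + SPACE + ZWNBSP + "word"
# The same text written with WORD JOINER instead.
new_style = WJ + SPACE + WJ + "word"

old_result = roundtrip(old_style)
new_result = roundtrip(new_style)

# The leading ZWNBSP is taken for a BOM and dropped; the WJ text survives.
print(repr(old_result))
print(repr(new_result))
```

    Only the first U+FEFF is lost, so the same sequence in the middle of a
    text is unaffected, while the <WJ,SP,WJ> form is safe in any position.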

    But there is now an opportunity to use indenting spaces at the beginning of
    lines, which may be rendered in paragraphs with their spacing preserved, if
    the first WJ is removed from the sequence and successive WJs are collapsed
    into a single one:
    <SP,WJ,SP,WJ,SP,WJ> would then encode _roughly_ (not equivalently...)
    the same rendered text as:
