Re: Zero Width Word Boundary

From: Javier SOLA (
Date: Fri Jan 30 2009 - 02:02:06 CST

    It is a little more complex.

    - Until last May, ZWSP had always been defined as a word boundary.
    Using it to separate syllables is incorrect because it breaks up the
    words, which prevents the correct functioning of spell-checkers,
    search engines or

    - In 2003 its classification in Unicode was changed from a spacing
    character to a format character. Because the character was not
    explicitly added to the word-boundary algorithm, all software that used
    the Unicode tables directly stopped working for functions that relied
    on this character as a word boundary (such as spell-checking).

    - On 22 May 2008 an erroneous erratum was added to Unicode 5.0 and 5.1,
    eliminating the word-boundary property of the character and leaving only
    the line-breaking property. Unfortunately, this erratum has NOT YET BEEN

    - In August 2008 I presented a proposal to the UTC to revert to the
    original word-boundary property of the character.

    - At the last meeting of the UTC, a public review issue was published.
    It reads as follows:


    The Unicode Technical Committee is considering changing the Word_Break
    property value for ZWSP from the value WB=Format to the value WB=Other
    (WB=XX). The effect of this would be to have the ZWSP act as a
    word-separator in the default word break algorithm, and is consistent
    with its usage in Thai, Lao, and other scripts that don't use spaces for
    separating words.


    The issue should be resolved at the next UTC meeting, which should also
    ensure that the erratum is withdrawn.
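
    As a rough illustration of the practical effect described above, the
    following Python sketch shows how software that relies directly on the
    Unicode tables no longer sees ZWSP as a space, and how splitting on
    U+200B explicitly recovers the words. The sample Thai text and the
    variable names are my own assumptions for illustration. This is not an
    implementation of the default word-break algorithm, only a minimal
    demonstration of the symptom.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# Two Thai words joined by an invisible ZWSP (illustrative sample text).
text = "ภาษา" + ZWSP + "ไทย"

# The general category is Cf (format), not Zs (space separator).
category = unicodedata.category(ZWSP)

# Whitespace-based splitting does not treat ZWSP as a separator,
# so the whole string comes back as a single token.
whitespace_tokens = text.split()

# Treating ZWSP explicitly as a word separator recovers both words.
zwsp_tokens = text.split(ZWSP)

print(category, len(whitespace_tokens), len(zwsp_tokens))
```

    A spell-checker or search indexer built on the whitespace-style split
    would treat the two words as one unknown word, which is the failure
    mode reported for Thai and Lao text.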



    Doug Ewell wrote:
    > ɹɐzlnƃ ɟıʇɐ <atif dot gulzar at gmail dot com> wrote:
    >> I have checked and could not find any Unicode character for word
    >> separator (zero width space as WORD separator). This character/code
    >> is needed for languages where space is not used as word separator.
    >> The available zero width characters are incapable to address this
    >> issue. e.g.
    >> U+200B Zero Width Space: This character is intended for line break
    >> control (In Lao language lines can be broken at syllable levels, Lao
    >> uses U+200B to mark syllable boundaries).
    >> ...
    > According to Section 11.1 on Thai in TUS 5.0 (p. 376), and Section
    > 16.2 on layout controls (p. 535), U+200B ZERO WIDTH SPACE is the right
    > character for marking word boundaries in languages like Thai which
    > don't use visible spaces between words. I don't see why this would be
    > different for Lao.
    > --
    > Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
