Re: Zero Width Word Boundary

From: Javier SOLA (lists@khmeros.info)
Date: Fri Jan 30 2009 - 02:02:06 CST


    It is a little more complex.

    - Until last May, ZWSP had always been defined as a word boundary.
    Using it to separate syllables is incorrect because it breaks words
    apart, which prevents spell-checkers, search engines and
    word selection from working correctly.

    - In 2003 its classification in Unicode was changed from a spacing
    character to a format character. As the character was not explicitly
    added to the word-boundary algorithm, all software that used the Unicode
    tables directly stopped working for functions that relied on this
    character as a word boundary (such as spell-checking).

    - On 22nd May 2008 an erroneous erratum was added to Unicode 5.0 and 5.1,
    eliminating the word-boundary property of the character and leaving only
    the line-breaking property. Unfortunately, this erratum has NOT YET BEEN
    REMOVED.

    - In August 2008 I presented a proposal to the UTC to revert to the
    original word boundary property of the character.
    http://www.unicode.org/L2/L2008/08344-zwsp-myanmar.pdf

    - At the last UTC meeting, a public review issue was published.
    It reads as follows:

    --------------------

    The Unicode Technical Committee is considering changing the Word_Break
    property value for ZWSP from the value WB=Format to the value WB=Other
    (WB=XX). The effect of this would be to have the ZWSP act as a
    word-separator in the default word break algorithm, and is consistent
    with its usage in Thai, Lao, and other scripts that don't use spaces for
    separating words.

    ------------------------

    The issue should be resolved at the next UTC meeting, which should also
    ensure that the erratum is removed.
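
To make the distinction above concrete, here is a minimal Python sketch. It assumes only the standard `unicodedata` module; the helper `split_words` is hypothetical and is not the Unicode default word-break algorithm, just an illustration of what "ZWSP acts as a word separator" means for software such as a spell-checker.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def general_category(ch: str) -> str:
    """Return the Unicode general category of a single character."""
    return unicodedata.category(ch)


def split_words(text: str) -> list[str]:
    """Hypothetical helper: treat ZWSP, like ordinary whitespace,
    as a word separator. This is a simplification, not UAX #29."""
    words = []
    for chunk in text.split(ZWSP):
        # str.split() with no arguments splits on whitespace only;
        # ZWSP is a format character, so it has to be handled separately.
        words.extend(chunk.split())
    return words


if __name__ == "__main__":
    # ZWSP is classified as a format character (Cf), not a space (Zs).
    print(general_category(ZWSP))
    # Placeholder tokens stand in for words of a script written without spaces.
    sample = "word1" + ZWSP + "word2 word3"
    print(sample.split())       # whitespace-only splitting keeps ZWSP inside a token
    print(split_words(sample))  # ZWSP-aware splitting separates all three
```

Software that relies only on whitespace or on the general category would not see a boundary at ZWSP, which is the kind of breakage described for spell-checking above.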

    Regards,

    Javier

    Doug Ewell wrote:
    > ɹɐzlnƃ ɟıʇɐ <atif dot gulzar at gmail dot com> wrote:
    >
    >> I have checked and could not find any Unicode character for word
    >> separator (zero width space as WORD separator). This character/code
    >> is needed for languages where space is not used as word separator.
    >> The available zero width characters are unable to address this
    >> issue, e.g.
    >>
    >> U+200B Zero Width Space: This character is intended for line break
    >> control (In Lao language lines can be broken at syllable levels, Lao
    >> uses U+200B to mark syllable boundaries).
    >> ...
    >
    > According to Section 11.1 on Thai in TUS 5.0 (p. 376), and Section
    > 16.2 on layout controls (p. 535), U+200B ZERO WIDTH SPACE is the right
    > character for marking word boundaries in languages like Thai which
    > don't use visible spaces between words. I don't see why this would be
    > different for Lao.
    >
    > --
    > Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
    > http://www.ewellic.org
    > http://www1.ietf.org/html.charters/ltru-charter.html
    > http://www.alvestrand.no/mailman/listinfo/ietf-languages
    >
    >
    >


