Re: Zero Width Word Boundary

From: Javier SOLA (lists@khmeros.info)
Date: Fri Jan 30 2009 - 02:02:06 CST

Next message: William J Poser: "Re: Urgent call for clarification of Armenian numbering rules"

Previous message: Atif Gulzar: "Re: Zero Width Word Boundary"
In reply to: Doug Ewell: "Re: Zero Width Word Boundary"

It is a little more complex.

- ZWSP was always defined as a word boundary until last May. Using it
to separate syllables is incorrect, because it breaks up words and
prevents spell-checkers, search engines and word selection from
working correctly.

- In 2003 its classification in Unicode was changed from a spacing
character to a format character. As the character was not explicitly
added to the word-boundary algorithm, all software that used the Unicode
tables directly stopped working for functions that relied on this
character as a word boundary (such as spell-checking).

- On 22nd May 2008 an erroneous erratum was added to Unicode 5.0 and 5.1,
eliminating the word-boundary property of the character and leaving only
the line-breaking property. Unfortunately, this erratum has NOT YET BEEN
REMOVED.

- In August 2008 I presented a proposal to the UTC to revert to the
original word-boundary property of the character:
http://www.unicode.org/L2/L2008/08344-zwsp-myanmar.pdf

- At the last meeting of the UTC, a public review issue was published.
It reads as follows:

--------------------

The Unicode Technical Committee is considering changing the Word_Break
property value for ZWSP from the value WB=Format to the value WB=Other
(WB=XX). The effect of this would be to have the ZWSP act as a
word-separator in the default word break algorithm, and is consistent
with its usage in Thai, Lao, and other scripts that don't use spaces for
separating words.

------------------------

The issue should be resolved at the next UTC meeting, which should also
ensure that the erroneous erratum is removed.
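
To make the practical effect concrete, here is a minimal Python sketch (my
own illustration, not taken from the proposal or from any Unicode reference
code). It looks up the general category that Python's bundled Unicode data
reports for U+200B, and then splits text on ZWSP the way a spell-checker
would need to when ZWSP acts as a word separator. The helper name
split_words is hypothetical, and it is a simplification rather than the
default word-break algorithm.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def split_words(text):
    """Split text into words, treating whitespace and ZWSP as separators.

    Hypothetical helper for illustration only; it does not implement the
    Unicode default word-break algorithm.
    """
    words = []
    current = []
    for ch in text:
        if ch == ZWSP or ch.isspace():
            # A separator ends the current word, if there is one.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


# ZWSP is a format character (Cf), not a space separator (Zs), so code
# that only looks for space characters will not see it as a boundary.
print(unicodedata.name(ZWSP), unicodedata.category(ZWSP))

# Thai text with an explicit ZWSP between two words.
print(split_words("ภาษา" + ZWSP + "ไทย"))
```

Software that relies only on the character tables needs this kind of
explicit handling for ZWSP until the property value is corrected.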

Regards,

Javier

Doug Ewell wrote:
> ɹɐzlnƃ ɟıʇɐ <atif dot gulzar at gmail dot com> wrote:
>
>> I have checked and could not find any Unicode character for word
>> separator (zero width space as WORD separator). This character/code
>> is needed for languages where space is not used as word separator.
>> The available zero width characters are incapable to address this
>> issue. e.g.
>>
>> U+200B Zero Width Space: This character is intended for line break
>> control (In Lao language lines can be broken at syllable levels, Lao
>> uses U+200B to mark syllable boundaries).
>> ...
>
> According to Section 11.1 on Thai in TUS 5.0 (p. 376), and Section
> 16.2 on layout controls (p. 535), U+200B ZERO WIDTH SPACE is the right
> character for marking word boundaries in languages like Thai which
> don't use visible spaces between words. I don't see why this would be
> different for Lao.
>
> --
> Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
> http://www.ewellic.org
> http://www1.ietf.org/html.charters/ltru-charter.html
> http://www.alvestrand.no/mailman/listinfo/ietf-languages


This archive was generated by hypermail 2.1.5 : Fri Jan 30 2009 - 02:04:21 CST