Re: Zero Width Word Boundary

From: Javier SOLA (lists@khmeros.info)
Date: Fri Jan 30 2009 - 02:17:27 CST

    Unfortunately, computer treatment of Lao, Thai and Myanmar has in many
    cases tended to separate syllables instead of full words, without
    considering the preference that exists in all these scripts for keeping
    words together at the end of a line. It is always better to break at the
    end of a word. This practice has been eroded by European-style
    newspapers set in narrow columns, where layout becomes very complicated
    if long words are kept together, so hyphenation has started (with or
    without a hyphen, depending on the case), but this is not the preferred
    usage of the language. In a book you would tend to break at the end of
    words. Syllable separation makes modern treatment of text impossible.

    We are now moving towards automatic dictionary-based word separation and
    line breaking for these scripts, and this would always have to be
    word-based.
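
    A minimal sketch of what such dictionary-based word separation could
    look like, assuming a greedy longest-match strategy and a toy lexicon.
    The names segment, mark_boundaries and LEXICON are hypothetical, the
    ASCII entries only stand in for a real Lao, Thai or Myanmar word list,
    and a production segmenter would have to resolve ambiguities that
    longest-match does not. Word boundaries are marked with U+200B ZERO
    WIDTH SPACE, so a line breaker that honours that character would only
    break at the end of a word:

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE, used here as the word-boundary mark

# Hypothetical stand-in lexicon; a real one would hold words of the target language.
LEXICON = {"break", "at", "word", "end", "line"}


def segment(text, lexicon):
    """Split text into words by greedy longest match against the lexicon.

    Characters that start no known word are emitted as one-character tokens.
    Assumes the lexicon is non-empty.
    """
    max_len = max(map(len, lexicon))
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon:
                break
        else:
            n = 1  # no lexicon entry starts here: fall back to one character
        words.append(text[i:i + n])
        i += n
    return words


def mark_boundaries(text, lexicon):
    """Return text with U+200B inserted between the segmented words."""
    return ZWSP.join(segment(text, lexicon))


if __name__ == "__main__":
    print(segment("breakatwordend", LEXICON))
```

    The marked-up string looks the same on screen as the input, because
    U+200B has no width; only the permitted break positions change.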

    Javier

    Atif Gulzar wrote:
    >> According to Section 11.1 on Thai in TUS 5.0 (p. 376), and Section 16.2 on
    >> layout controls (p. 535), U+200B ZERO WIDTH SPACE is the right character for
    >> marking word boundaries in languages like Thai which don't use visible
    >> spaces between words. I don't see why this would be different for Lao.
    >>
    >
    >
    > Lao script is close to Thai, but it has a different script block
    > (U+0E80 to U+0EFF) and different language processing rules. Unlike
    > Thai, Lao script can be broken at the syllable level at line breaks.
    >
    > http://www.panl10n.net/english/final%20reports/pdf%20files/Laos/LAO06.pdf
    >
    >
    > --
    > Best Regards,
    > Atif Gulzar
    >
    > I ◘◘◘◘ Unicode, ɹɐzlnƃ ɟıʇɐ
    >
    >
    >
    >
    > On Fri, Jan 30, 2009 at 11:59 AM, Doug Ewell <doug@ewellic.org> wrote:
    >
    >> ɹɐzlnƃ ɟıʇɐ <atif dot gulzar at gmail dot com> wrote:
    >>
    >>
    >>> I have checked and could not find any Unicode character for a word separator
    >>> (zero width space as WORD separator). This character/code is needed for
    >>> languages where space is not used as a word separator. The available zero
    >>> width characters are unable to address this issue, e.g.
    >>>
    >>> U+200B Zero Width Space: This character is intended for line break control
    >>> (in Lao, lines can be broken at the syllable level, and Lao uses U+200B to
    >>> mark syllable boundaries).
    >>> ...
    >>>
    >> According to Section 11.1 on Thai in TUS 5.0 (p. 376), and Section 16.2 on
    >> layout controls (p. 535), U+200B ZERO WIDTH SPACE is the right character for
    >> marking word boundaries in languages like Thai which don't use visible
    >> spaces between words. I don't see why this would be different for Lao.
    >>
    >> --
    >> Doug Ewell * Thornton, Colorado, USA * RFC 4645 * UTN #14
    >> http://www.ewellic.org
    >> http://www1.ietf.org/html.charters/ltru-charter.html
    >> http://www.alvestrand.no/mailman/listinfo/ietf-languages
    >>
    >>
    >>
    >
    >
    >
    >


