Re: Questions on ZWNBS - for line initial holam plus alef

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Aug 11 2003 - 15:26:44 EDT

    Peter Kirk wrote:

    > I think this may be a "Peter mistake". I meant to refer to spacing
    > diacritics. Sorry.
    >
    > It is certainly highly inappropriate for spacing diacritics to
    > be considered word boundaries.

    Why? It is entirely dependent on the orthography and conventions
    involved. There is probably as much (or more) bad ASCII usage
    of spacing diacritics like `this', where a grave accent character
    is being misapplied to make a directional quotation mark, as
    there is actual, linguistically appropriate use of spacing
    diacritics.

    Also, everyone should consider carefully the status of UAX #29,
    Text Boundaries.

    <quote>
    2 Conformance

    This is informative material. There are many different ways to
    divide text elements corresponding to grapheme clusters, words
    and sentences, and the Unicode Standard and this document do not
    restrict the ways in which implementations can do this.

    This specification is a <emphasis>default</emphasis> mechanism;
    more sophisticated engines can and should tailor it for particular
    locales or environments. ...
    </quote>

    The whole UAX is informative. It is a
    here's-how-you-can-approach-the-problem implementation guide
    with some suggestions for rules and classes.

    *If* you are working with an orthography that uses one or more
    spacing diacritics, and
    *If* those spacing diacritics need to be represented by
    <SPACE, NSM> sequences,

    then you are in the situation where your implementation of
    text boundaries should take <SPACE, NSM> sequences explicitly
    into account, so as to result in expected behavior for that
    orthography.
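
To illustrate the tailoring described above, here is a minimal Python sketch. It is not the UAX #29 algorithm: it assumes a deliberately naive segmenter that breaks only at U+0020 SPACE, and the names `is_nsm`, `split_words`, and `keep_space_nsm` are hypothetical. Treating a <SPACE, NSM> sequence as word-internal on both sides is also an assumption; a real orthography might attach it only to the preceding or only to the following text.

```python
import unicodedata


def is_nsm(ch):
    """True if ch is a nonspacing mark (General_Category Mn)."""
    return unicodedata.category(ch) == "Mn"


def split_words(text, keep_space_nsm=True):
    """Naive word splitter that breaks at U+0020 SPACE only.

    With keep_space_nsm=True, a SPACE immediately followed by a
    nonspacing mark is treated as the carrier of a spacing diacritic
    and does not cause a word break (an illustrative tailoring).
    """
    words = []
    current = []
    for i, ch in enumerate(text):
        if ch == " ":
            next_is_nsm = i + 1 < len(text) and is_nsm(text[i + 1])
            if keep_space_nsm and next_is_nsm:
                # <SPACE, NSM>: keep the space inside the current word.
                current.append(ch)
            elif current:
                # Ordinary space: close the current word.
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


# "ab", then <SPACE, U+0301 COMBINING ACUTE ACCENT>, then "cd ef"
sample = "ab \u0301cd ef"
print(split_words(sample, keep_space_nsm=False))  # untailored behavior
print(split_words(sample, keep_space_nsm=True))   # tailored behavior
```

Without the tailoring, the sample splits into three pieces and the second piece begins with a dangling combining mark; with it, the <SPACE, NSM> sequence stays inside a single word.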

    Everyone has had experiences with their platform UI producing
    bad results for text boundaries. The Solaris platform I am
    writing this on right now, for example, implements a double-click
    word selection that treats the string "`this'," above, including
    the grave accent, the apostrophe, and the comma, as a "word".
    Is that right or wrong? Well, it depends on what you are trying
    to do, I expect.
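
The two outcomes can be compared with a small Python sketch. It does not reproduce any platform's actual selection code: the function `select_word` and its `trim` flag are hypothetical. The selection is assumed to expand to the nearest whitespace, and the trimmed variant strips leading and trailing punctuation and symbol characters (General_Category P* or S*).

```python
import unicodedata


def select_word(text, index, trim=False):
    """Return the whitespace-delimited run containing text[index].

    With trim=True, leading and trailing punctuation or symbol
    characters are stripped from the selection.
    """
    if not text or text[index].isspace():
        return ""
    start = index
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    end = index
    while end < len(text) and not text[end].isspace():
        end += 1
    if trim:
        while start < end and unicodedata.category(text[start])[0] in "PS":
            start += 1
        while end > start and unicodedata.category(text[end - 1])[0] in "PS":
            end -= 1
    return text[start:end]


line = "diacritics like `this', where"
click = line.index("this")
print(select_word(line, click))             # `this',
print(select_word(line, click, trim=True))  # this
```

Neither result is wrong in itself: the first suits selecting a token to copy from a terminal, the second suits looking up a dictionary word.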

    But even the most sophisticated platform implementers can only
    do so much with processes like default word selection. It is
    bound to be wrong for one purpose or another and for one
    orthography or another. Ultimately you need to have tailored
    processes that can be orthography-specific if you want to
    get best results.

    --Ken


