Re: Questions on ZWNBS - for line initial holam plus alef

From: Mark Davis
Date: Mon Aug 11 2003 - 19:06:55 EDT

    Some of this seems to be in reference to an earlier contention that
    Text Boundaries (inc. Lines) break between the space and the
    non-spacing mark. I think this was attributed to Philippe.

    [This may not be true: I don't actually read his email, because the
    information content per line falls below my email threshold; not to
    say that there may not be information there, but I cannot afford to
    take the time to find out -- sadly, one of my character flaws.]

    All of the text boundaries preserve grapheme cluster boundaries, which
    never separate a base character (including space and NBSP) from a
    following NSM. In addition, each of the boundary types above grapheme
    clusters makes some statement about the behavior of the grapheme
    cluster. For example, with line boundaries a SPACE + NSM has a special
    behavior. With the others, the behavior is the same as the base
    character.
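
    To make the grapheme cluster point concrete, here is a rough sketch in
    Python. It is not the UAX #29 algorithm: it assumes that "NSM" can be
    approximated by general categories Mn and Me, and it ignores Hangul,
    CR LF, and every other rule. The function name simple_clusters and the
    sample string (space, holam, alef, taken from the subject line) are
    made up for illustration.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the marks that follow it.

    Simplified sketch: only general categories Mn and Me count as marks.
    The full default grapheme cluster rules are not implemented here.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if is_mark and clusters:
            # Never break between a base (SPACE and NBSP included)
            # and a following nonspacing mark.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "a", then SPACE + HEBREW POINT HOLAM (U+05B9), then ALEF (U+05D0)
sample = "a \u05B9\u05D0"
for cluster in simple_clusters(sample):
    print([f"U+{ord(c):04X}" for c in cluster])
```

    In this sketch the SPACE and the holam end up in one cluster, so no
    boundary type built on top of these clusters can split them.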

    As Ken points out, in any event these are default boundaries, and can
    be tailored. That being said, if the normal behavior of the default
    can be improved, and someone has a concrete proposal for doing so,
    then it can be considered.

    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Kenneth Whistler"
    Sent: Monday, August 11, 2003 12:26
    Subject: Re: Questions on ZWNBS - for line initial holam plus alef

    > Peter Kirk wrote:
    >
    > > I think this may be a "Peter mistake". I meant to refer to spacing
    > > diacritics. Sorry.
    > >
    > > It is certainly highly inappropriate for spacing diacritics to
    > > be considered word boundaries.
    >
    > Why? It is entirely dependent on the orthography and conventions
    > involved. There is probably as much (or more) bad ASCII usage
    > of spacing diacritics like `this', where a grave accent character
    > is being misapplied to make a directional quotation mark, as
    > there is actual, linguistically appropriate use of spacing
    > diacritics.
    >
    > Also, everyone should consider carefully the status of UAX #29,
    > Text Boundaries.
    >
    > <quote>
    > 2 Conformance
    >
    > This is informative material. There are many different ways to
    > divide text elements corresponding to grapheme clusters, words
    > and sentences, and the Unicode Standard and this document do not
    > restrict the ways in which implementations can do this.
    >
    > This specification is a <emphasis>default</emphasis> mechanism;
    > more sophisticated engines can and should tailor it for particular
    > locales or environments. ...
    > </quote>
    >
    > The whole UAX is informative. It is a here's-how-you-can-approach-
    > the-problem implementation guide with some suggestions for
    > rules and classes.
    >
    > *If* you are working with an orthography that uses one or more
    > spacing diacritics, and
    >
    > *If* those spacing diacritics need to be represented by
    > <SPACE, NSM> sequences,
    >
    > then you are in the situation where your implementation of
    > text boundaries should take <SPACE, NSM> sequences explicitly
    > into account, so as to result in expected behavior for that
    > orthography.
    >
    > Everyone has had experiences with their platform UI producing
    > bad results for text boundaries. The Solaris platform I am
    > writing this on right now, for example, implements a double-click
    > word selection that treats the string "`this'," above, including
    > the grave accent, the apostrophe, and the comma, as a "word".
    > Is that right or wrong? Well, it depends on what you are trying
    > to do, I expect.
    >
    > But even the most sophisticated platform implementers can only
    > do so much with processes like default word selection. It is
    > bound to be wrong for one purpose or another and for one
    > orthography or another. Ultimately you need to have tailored
    > processes that can be orthography-specific if you want to
    > get best results.
    >
    > --Ken
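
    As a rough illustration of the tailoring point in the quoted message,
    the sketch below shows a double-click style word selection with one
    tailoring knob. Nothing here comes from UAX #29 or from any platform:
    select_word and its extra_word_chars parameter are hypothetical names,
    and "word character" is assumed to mean str.isalnum() plus whatever
    the caller adds.

```python
def select_word(text, index, extra_word_chars=""):
    """Return the run of word characters around text[index].

    extra_word_chars is a hypothetical tailoring hook: characters listed
    there are treated as part of a word in addition to alphanumerics.
    """
    def is_word_char(ch):
        return ch.isalnum() or ch in extra_word_chars

    if not is_word_char(text[index]):
        return text[index]  # clicked on a non-word character

    start = index
    while start > 0 and is_word_char(text[start - 1]):
        start -= 1
    end = index + 1
    while end < len(text) and is_word_char(text[end]):
        end += 1
    return text[start:end]


line = "like `this', where"
click = line.index("this")

# Untailored: only the alphanumeric run is selected.
print(select_word(line, click))
# Tailored so that grave accent, apostrophe, and comma count as word
# characters, which reproduces the selection described for Solaris.
print(select_word(line, click, extra_word_chars="`',"))
```

    Neither result is "the" correct one; which set of extra characters is
    appropriate depends on the orthography and on what the selection is
    for.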
