From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Aug 11 2003 - 15:26:44 EDT
Peter Kirk wrote:
> I think this may be a "Peter mistake". I meant to refer to spacing
> diacritics. Sorry.
>
> It is certainly highly inappropriate for spacing diacritics to
> be considered word boundaries.
Why? It is entirely dependent on the orthography and conventions
involved. There is probably as much (or more) bad ASCII usage
of spacing diacritics like `this', where a grave accent character
is being misapplied to make a directional quotation mark, as
there is actual, linguistically appropriate use of spacing
diacritics.
Also, everyone should consider carefully the status of UAX #29,
Text Boundaries.
<quote>
2 Conformance
This is informative material. There are many different ways to
divide text elements corresponding to grapheme clusters, words
and sentences, and the Unicode Standard and this document do not
restrict the ways in which implementations can do this.
This specification is a <emphasis>default</emphasis> mechanism;
more sophisticated engines can and should tailor it for particular
locales or environments. ...
</quote>
The whole UAX is informative. It is a
here's-how-you-can-approach-the-problem implementation guide with
some suggestions for rules and classes.
*If* you are working with an orthography that uses one or more
spacing diacritics, and
*If* those spacing diacritics need to be represented by
<SPACE, NSM> sequences,
then you are in the situation where your implementation of
text boundaries should take <SPACE, NSM> sequences explicitly
into account, so as to result in expected behavior for that
orthography.
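As a rough illustration of that kind of tailoring, here is a minimal Python sketch. It is not the UAX #29 algorithm: it is a toy splitter that breaks only on U+0020 SPACE, with an optional rule that keeps a <SPACE, NSM> sequence inside the current word. The names split_words and tailor_space_nsm are hypothetical, and "NSM" is assumed here to mean General_Category Mn.

```python
import unicodedata


def is_nsm(ch):
    """True for nonspacing marks (General_Category Mn); an assumed definition of NSM."""
    return unicodedata.category(ch) == "Mn"


def split_words(text, tailor_space_nsm=True):
    """Toy splitter: break on U+0020 SPACE only.

    With tailor_space_nsm=True, a SPACE immediately followed by a
    nonspacing mark is treated as a spacing diacritic and kept inside
    the current word instead of ending it.
    """
    words, current = [], []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == " " and not (tailor_space_nsm and nxt and is_nsm(nxt)):
            # Ordinary space: close the current word, if any.
            if current:
                words.append("".join(current))
                current = []
        else:
            # Letters, marks, and tailored <SPACE, NSM> spaces all accumulate.
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


sample = "ab \u0301cd ef"  # <SPACE, U+0301 COMBINING ACUTE ACCENT> inside a word
tailored = split_words(sample)
untailored = split_words(sample, tailor_space_nsm=False)
```

With the tailoring on, the sample yields two words, the first of which contains the spacing diacritic; with it off, the same text breaks into three pieces at every space. Which result is "expected behavior" depends on the orthography.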
Everyone has had experiences with their platform UI producing
bad results for text boundaries. The Solaris platform I am
writing this on right now, for example, implements a double-click
word selection that treats the string "`this'," above, including
the grave accent, the apostrophe, and the comma, as a "word".
Is that right or wrong? Well, it depends on what you are trying
to do, I expect.
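To make the "it depends" concrete, here is a small Python sketch of double-click selection under two different policies. It does not reproduce what Solaris or any other platform actually does; select_word and both policies are hypothetical, chosen only to show how the same click gives different "words".

```python
def select_word(text, pos, is_word_char):
    """Return the maximal run around index pos whose characters satisfy is_word_char."""
    if not is_word_char(text[pos]):
        return ""
    start = pos
    while start > 0 and is_word_char(text[start - 1]):
        start -= 1
    end = pos + 1
    while end < len(text) and is_word_char(text[end]):
        end += 1
    return text[start:end]


line = "like `this', where"
click = line.index("this")  # pretend the user double-clicked on the "t"

# Policy 1: a word is any run of non-whitespace characters.
by_whitespace = select_word(line, click, lambda c: not c.isspace())

# Policy 2: a word is any run of alphanumeric characters.
by_alnum = select_word(line, click, str.isalnum)
```

The whitespace policy selects the grave accent, apostrophe, and comma along with the letters, which is handy for grabbing a whole token; the alphanumeric policy selects only "this", which is what a spell checker would want. Neither is right for every purpose.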
But even the most sophisticated platform implementers can only
do so much with processes like default word selection. It is
bound to be wrong for one purpose or another and for one
orthography or another. Ultimately you need to have tailored
processes that can be orthography-specific if you want to
get the best results.
--Ken