Word Selection (was: RE: [indic] Unicode Processing Requirements for Tamil)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Sep 02 2005 - 19:20:51 CDT

Next message: James Kass: "Re: [indic] Unicode Processing Requirements for Tamil (was: 28th IUC paper - Tamil Unicode New)"

Previous message: Rein: "Re: unsubscribe please"
Next in thread: Mark Davis: "Re: Word Selection (was: RE: [indic] Unicode Processing Requirements for Tamil)"
Reply: Mark Davis: "Re: Word Selection (was: RE: [indic] Unicode Processing Requirements for Tamil)"

Kent Spielmann wrote:

> When double clicking a word, I would want the whole
> word to be selected, not broken up at one of these "modifiers". This is not
> the case in most word processing programs. There is no standard behavior.

Correct. But your expectation that there should be one runs somewhat
afoul of the nature of the problem.

There is no universal definition of "a word" in the first place,
that could be defined purely on the basis of a character encoding,
independent of considerations of particular languages and particular
orthographic conventions.

[ Experimental data excised ]

> Note that no two pieces of software behave the same. It seems a standard
> behavior should be made clear in the Unicode standard

Well, I disagree in part with this assessment. How word processors choose
to implement double-click behavior is their concern, and may involve
a lot of factors and opinions regarding what is "right" and what
is "wrong" default selection behavior. It is not the place of
the Unicode Standard to dictate that, particularly in the
absence of any way of knowing what constraints implementations
may be operating under or what requirements their customers may
have.

The Unicode Standard *does*, however, supply a specification of a default
word boundary detection algorithm (in UAX #29), which can be
used, but it is expected that implementations will, in most cases,
choose to tailor it in one way or another, or in other cases
simply implement their own word selection.

If you work through that specification and apply it to the
particular characters you have chosen, you'll end up with
the following determinations:

Class ALetter: 02B0, 02BC, 02C6, 02D0, 207F

Class MidLetter: 0027, 003A

Class Numeric: 0031

Class Other: 00B9, 02C2, 02E9

And the default word break determinations are as follows,
where "x" means don't break here, and "÷" means break here.

ALetter x ALetter x ALetter
ALetter x MidLetter x ALetter
ALetter x Numeric x ALetter
ALetter ÷ Other ÷ ALetter

which means that, in terms of your chart, 00B9 (superscript 1),
02C2 (left arrowhead), and 02E9 (extra-low tone bar) would not be
judged "letterlike" enough to be counted within the "word"
(an "L" in your chart), whereas the other characters would
be included within the "word" (a "W" in your chart).
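
For anyone who wants to replay those determinations, here is a minimal
sketch in Python. It is not the full UAX #29 algorithm: it hard-codes only
the class assignments listed above, treats any other alphabetic character
as ALetter (an assumption made just for this illustration), and implements
only the letter, MidLetter, and Numeric rules. The function names are
hypothetical, and current versions of the Unicode data may classify some of
these characters differently.

```python
# Class assignments exactly as listed in the message; not read from the
# Unicode Character Database, and possibly different in later versions.
ALETTER = {0x02B0, 0x02BC, 0x02C6, 0x02D0, 0x207F}
MIDLETTER = {0x0027, 0x003A}
NUMERIC = {0x0031}
OTHER = {0x00B9, 0x02C2, 0x02E9}


def word_break_class(ch):
    """Return the (simplified) word break class of a single character."""
    cp = ord(ch)
    if cp in ALETTER:
        return "ALetter"
    if cp in MIDLETTER:
        return "MidLetter"
    if cp in NUMERIC:
        return "Numeric"
    if cp in OTHER:
        return "Other"
    # Assumption for this sketch only: remaining alphabetic characters are
    # ALetter, everything else is Other.
    return "ALetter" if ch.isalpha() else "Other"


def boundaries(text):
    """Indices i where a break (÷) falls between text[i-1] and text[i].

    The breaks at start and end of text are implied and not returned.
    """
    cls = [word_break_class(c) for c in text]
    breaks = []
    for i in range(1, len(text)):
        prev, cur = cls[i - 1], cls[i]
        nxt = cls[i + 1] if i + 1 < len(cls) else None
        prev2 = cls[i - 2] if i >= 2 else None
        # ALetter x ALetter, ALetter x Numeric, Numeric x ALetter,
        # Numeric x Numeric
        if {prev, cur} <= {"ALetter", "Numeric"}:
            continue
        # ALetter x MidLetter ALetter (MidLetter must be followed by a letter)
        if prev == "ALetter" and cur == "MidLetter" and nxt == "ALetter":
            continue
        # ALetter MidLetter x ALetter
        if prev2 == "ALetter" and prev == "MidLetter" and cur == "ALetter":
            continue
        breaks.append(i)  # otherwise: break here
    return breaks


def segments(text):
    """Split text at the boundaries found above."""
    cuts = [0] + boundaries(text) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


if __name__ == "__main__":
    for cp in sorted(ALETTER | MIDLETTER | NUMERIC | OTHER):
        sample = "a" + chr(cp) + "b"
        print(f"U+{cp:04X}", segments(sample))
```

Run on a letter, the test character, and another letter, the three
"Other" characters come out as their own segments and the rest stay inside
one segment. A tailoring of the kind mentioned earlier could be as simple
as moving a code point from one set to another.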

I actually think that is a pretty good default, as superscript
numerals, tone letters, and IPA non-letterlike diacritics
such as the left arrowhead are not common in actual, practical
orthographies. They occur occasionally, of course, and do
occur in transcriptional material, but I consider those to
be edge cases that I wouldn't expect generic software to have
to deal with. I don't expect a general purpose word processor
to allow me to double-click in the middle of a close
IPA transcription and correctly determine a "word" boundary
in such material, any more than I would expect it to be
able to parse out a mathematical expression or a particular
formal language construct. A special-purpose word processor
could, of course -- the way programming editors parse and
highlight C or Java constructs automatically. But that's
way beyond the requirements for something like Notepad.

Except for U+003A COLON, the UAX #29 specification apparently matches
the actual behavior of OpenOffice Writer as shown in your chart. From
that I surmise that it probably bases its word selection on
a word break iterator class from ICU, which in turn implements
UAX #29 word boundary detection. And COLON is a true
edge case -- for most purposes it is probably better to break
around it, but it does get used in some languages, including
Swedish, as part of words.

WorldPad is similar, but doesn't show the later UAX #29 changes
for U+0027 and U+0031, so it might have been based on earlier
published word boundary detection suggestions from Unicode 3.0.

--Ken


This archive was generated by hypermail 2.1.5 : Fri Sep 02 2005 - 19:27:04 CDT