Word Selection (was: RE: [indic] Unicode Processing Requirements for Tamil)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Sep 02 2005 - 19:20:51 CDT

    Kent Spielmann wrote:

    > When double clicking a word, I would want the whole
    > word to be selected, not broken up at one of these "modifiers". This is not
    > the case in most word processing programs. There is no standard behavior.

    Correct. But your expectation that there should be one runs somewhat
    afoul of the nature of the problem.

    There is no universal definition of "a word" in the first place,
    that could be defined purely on the basis of a character encoding,
    independent of considerations of particular languages and particular
    orthographic conventions.

    [ Experimental data excised ]

    > Note that no two pieces of software behave the same. It seems a standard
    > behavior should be made clear in the Unicode standard

    Well, I disagree in part with this assessment. How word processors choose
    to implement double-click behavior is their concern, and may involve
    a lot of factors and opinions regarding what is "right" and what
    is "wrong" default selection behavior. It is not the place of
    the Unicode Standard to dictate that, particularly in the
    absence of any way of knowing what constraints implementations
    may be operating under or what requirements their customers may
    have.

    The Unicode Standard *does*, however, supply a specification of a default
    word boundary detection algorithm (in UAX #29), which can be
    used, but it is expected that implementations will, in most cases,
    choose to tailor it in one way or another, or in other cases
    simply implement their own word selection.

    If you work through that specification and apply it to the
    particular characters you have chosen, you'll end up with
    the following determinations:

    Class ALetter: 02B0, 02BC, 02C6, 02D0, 207F

    Class MidLetter: 0027, 003A

    Class Numeric: 0031

    Class Other: 00B9, 02C2, 02E9

    And the default word break determinations are as follows,
    where "x" means don't break here, and "÷" means break here.

    ALetter x ALetter x ALetter
    ALetter x MidLetter x ALetter
    ALetter x Numeric x ALetter
    ALetter ÷ Other ÷ ALetter
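
    As a rough illustration of how those four determinations fall out,
    here is a minimal Python sketch. It is not the full UAX #29 algorithm:
    it hard-codes the class assignments listed above (they are not looked
    up in the Unicode Character Database), it only covers the
    ALetter/MidLetter/Numeric pairings shown in the table, and the helper
    names (word_break_class, no_break_between, split_words) are made up
    for this example.

```python
# Class assignments copied from the list above. They are hard-coded for
# illustration and are not read from the Unicode Character Database.
WORD_BREAK_CLASS = {
    0x02B0: "ALetter", 0x02BC: "ALetter", 0x02C6: "ALetter",
    0x02D0: "ALetter", 0x207F: "ALetter",
    0x0027: "MidLetter", 0x003A: "MidLetter",
    0x0031: "Numeric",
    0x00B9: "Other", 0x02C2: "Other", 0x02E9: "Other",
}


def word_break_class(ch):
    """Return the class for one character (hypothetical helper)."""
    cp = ord(ch)
    if cp in WORD_BREAK_CLASS:
        return WORD_BREAK_CLASS[cp]
    # Assumption: anything not in the table is classified crudely.
    if ch.isdecimal():
        return "Numeric"
    return "ALetter" if ch.isalpha() else "Other"


def no_break_between(classes, i):
    """True if there is no break between positions i and i + 1."""
    left, right = classes[i], classes[i + 1]
    core = ("ALetter", "Numeric")
    # ALetter x ALetter, ALetter x Numeric, Numeric x ALetter, Numeric x Numeric
    if left in core and right in core:
        return True
    # ALetter x MidLetter, but only when another ALetter follows
    if left == "ALetter" and right == "MidLetter":
        return i + 2 < len(classes) and classes[i + 2] == "ALetter"
    # MidLetter x ALetter, but only when an ALetter precedes
    if left == "MidLetter" and right == "ALetter":
        return i >= 1 and classes[i - 1] == "ALetter"
    # Everything else, including anything next to Other, breaks.
    return False


def split_words(text):
    """Split text into segments at every position where a break is found."""
    classes = [word_break_class(ch) for ch in text]
    segments, start = [], 0
    for i in range(len(text) - 1):
        if not no_break_between(classes, i):
            segments.append(text[start:i + 1])
            start = i + 1
    if text:
        segments.append(text[start:])
    return segments
```

    With this sketch, split_words("a\u02BCb") and split_words("a1b") each
    come back as a single segment, while split_words("a\u00B9b") comes back
    as three segments, with the superscript one standing alone.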

    which means by your chart, 00B9 (superscript 1), 02C2 (left
    arrowhead), and 02E9 (extra-low tone bar) would not be
    judged "letterlike" enough to be counted within the "word",
    (gets an "L" in your chart) whereas the other characters would
    be included within the "word" (gets a "W" in your chart).

    I actually think that is a pretty good default, as superscript
    numerals, tone letters, and IPA non-letterlike diacritics
    such as the left arrowhead are not common in actual, practical
    orthographies. They occur occasionally, of course, and do
    occur in transcriptional material, but I consider those to
    be edge cases that I wouldn't expect generic software to have
    to deal with. I don't expect a general purpose word processor
    to allow me to double-click in the middle of a close
    IPA transcription and correctly determine a "word" boundary
    in such material, any more than I would expect it to be
    able to parse out a mathematical expression or a particular
    formal language construct. A special-purpose word processor
    could, of course -- the way programming editors parse and
    highlight C or Java constructs automatically. But that's
    way beyond the requirements for something like Notepad.

    Except for U+003A COLON, the UAX #29 specification apparently matches
    the actual behavior of OpenOffice Writer in your chart. From that
    I surmise that it probably bases its word selection on
    a word break iterator class from ICU, which in turn implements
    UAX #29 word boundary detection. And COLON is a true
    edge case -- for most purposes it is probably better to break
    around it, but it does get used in some languages, including
    Swedish, as part of words.
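
    To make the tailoring point concrete, here is a small Python sketch
    in which the set of "mid-letter" characters is a parameter, so an
    implementation can keep or drop COLON as it sees fit. This is a
    deliberately crude stand-in for word selection, not UAX #29 and not
    ICU's implementation; the function name select_words and its
    mid_letters parameter are invented for the example, and "word
    character" is approximated with Python's str.isalnum().

```python
def select_words(text, mid_letters=frozenset("':")):
    """Return the word-like runs in text (hypothetical helper).

    A character from mid_letters is kept inside a word only when it sits
    directly between two alphabetic characters; otherwise it separates
    words, like any other punctuation or space.
    """
    def is_word_char(i):
        ch = text[i]
        if ch.isalnum():
            return True
        return (
            ch in mid_letters
            and 0 < i < len(text) - 1
            and text[i - 1].isalpha()
            and text[i + 1].isalpha()
        )

    words, current = [], ""
    for i, ch in enumerate(text):
        if is_word_char(i):
            current += ch
        elif current:
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Default: COLON is allowed inside a word, as in Swedish "USA:s".
default_result = select_words("USA:s ordning")

# Tailored: COLON always separates words; APOSTROPHE still joins them.
tailored_result = select_words("USA:s ordning", mid_letters=frozenset("'"))
```

    Under these assumptions the default call keeps "USA:s" together, while
    the tailored call splits it into "USA" and "s". Neither answer is right
    for every language, which is why this is left to tailoring.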

    WorldPad is similar, but doesn't show the later UAX #29 changes
    for U+0027 and U+0031, so it might have been based on earlier
    published word boundary detection suggestions from Unicode 3.0.

