RE: [indic] Unicode Processing Requirements for Tamil (was: 28th IUC paper - Tamil Unicode New)

Date: Fri Sep 02 2005 - 15:33:41 CDT

    > Peter Constable
    > Sure, because (unless you happen to notice this bit of text buried
    > in the standard), the Latin superscript digits are treated as *not*
    > being part of the same script run, and so the cluster is broken, etc.

    I have been lurking on this list for about two weeks and have not yet
    dared to jump in. However, this thread touches on a concern of mine, and
    I would like to find out what the Standard actually says about script
    runs and modifier letters.

    The reason is that in phonetic transcriptions and in many orthographies,
    a number of modifier letters and superscript numbers are used as
    word-building characters. When double-clicking a word, I would want the
    whole word to be selected, not broken up at one of these "modifiers". This
    is not the case in most word processing programs. There is no standard
    behavior. To illustrate, I ran an experiment to see what was selected when
    double-clicking on the first letter of each of the following
    word-modifier combinations.

    | Word  | M | U+   | Cat | Name Prefix        | Name               |
    | wʰord | ʰ | 02B0 | Lm  | Modifier Letter    | Small H            |
    | wⁿord | ⁿ | 207F | Ll  | Superscript Letter | N                  |
    | w˩ord | ˩ | 02E9 | Sk  | Modifier Letter    | Extra-Low Tone Bar |
    | w¹ord | ¹ | 00B9 | No  | Superscript        | One                |
    | w1ord | 1 | 0031 | Nd  | Digit              | One                |
    | woːrd | ː | 02D0 | Lm  | Modifier Letter    | Triangular Colon   |
    | wo:rd | : | 003A | Po  | -                  | COLON              |
    | wʼord | ʼ | 02BC | Lm  | Modifier Letter    | Apostrophe         |
    | w'ord | ' | 0027 | Po  | -                  | APOSTROPHE         |
    | wˆord | ˆ | 02C6 | Lm  | Modifier Letter    | Circumflex Accent  |
    | w˂ord | ˂ | 02C2 | Sk  | Modifier Letter    | Left Arrowhead     |
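
    As a rough cross-check of the Cat column, the general categories can be
    looked up programmatically. The following is a minimal sketch in Python
    using the standard unicodedata module; the names MODIFIERS, describe and
    category_table are only illustrative. The values reported depend on the
    version of the Unicode Character Database bundled with the interpreter,
    so they may differ from the table above (U+207F, for example, may be
    reported with a different category in later versions).

```python
import unicodedata

# Code points of the characters tested in the table above.
MODIFIERS = [0x02B0, 0x207F, 0x02E9, 0x00B9, 0x0031, 0x02D0,
             0x003A, 0x02BC, 0x0027, 0x02C6, 0x02C2]


def describe(cp):
    """Return (hex code point, general category, character name)."""
    ch = chr(cp)
    return (f"{cp:04X}", unicodedata.category(ch), unicodedata.name(ch))


def category_table(code_points=MODIFIERS):
    """Describe every code point in the list, preserving order."""
    return [describe(cp) for cp in code_points]


if __name__ == "__main__":
    for code, cat, name in category_table():
        print(f"U+{code}  {cat}  {name}")
```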

    I used the following word processors and text editors:
    | Name              | Version | Abbreviation |
    | MS Word           | 2003    | WD03         |
    | UltraEdit-32      | 11.10a  | UE11         |
    | MS Publisher      | 2003    | Pub03        |
    | Notepad           | 5.1     | Np5          |
    | WorldPad          | 2.0     | WP2          |
    | OpenOffice Writer | 2.0     | OOW2         |

    This is what was selected. W = the whole word, L = only the letters
    before the modifier, LM = the letters up to and including the modifier.

    | M | U+   | WD03 | UE11 | Pub03 | Np5 | WP2 | OOW2 |
    | : | 003A | L    | L    | L     | W   | L   | L    |
    | ¹ | 00B9 | L    | W    | LM    | W   | L   | L    |
    | ː | 02D0 | L    | W    | W     | L   | W   | W    |
    | ˆ | 02C6 | L    | W    | W     | L   | W   | W    |
    | ˩ | 02E9 | L    | W    | W     | W   | L   | L    |
    | ˂ | 02C2 | L    | W    | W     | W   | L   | L    |
    | ʰ | 02B0 | L    | W    | W     | W   | W   | W    |
    | ˈ | 02C8 | L    | W    | W     | W   | W   | W    |
    | ʼ | 02BC | L    | W    | W     | W   | W   | W    |
    | ⁿ | 207F | W    | W    | LM    | W   | W   | W    |
    | ' | 0027 | W    | L    | W     | W   | L   | W    |
    | 1 | 0031 | W    | W    | W     | W   | L   | W    |

    Note that no two programs behave the same way. It seems that a standard
    behavior should be made clear in the Unicode Standard.
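
    The observation that no two programs agree can be checked mechanically
    from the results table. The sketch below simply transcribes that table
    into Python and compares the per-program columns; the names RESULTS,
    behaviour_by_app, agreeing_pairs and whole_word_counts are illustrative
    only and are not part of any of the programs tested.

```python
from itertools import combinations

APPS = ["WD03", "UE11", "Pub03", "Np5", "WP2", "OOW2"]

# Transcription of the results table: code point -> outcome per program,
# in the same column order as APPS.
RESULTS = {
    "003A": ["L", "L", "L", "W", "L", "L"],
    "00B9": ["L", "W", "LM", "W", "L", "L"],
    "02D0": ["L", "W", "W", "L", "W", "W"],
    "02C6": ["L", "W", "W", "L", "W", "W"],
    "02E9": ["L", "W", "W", "W", "L", "L"],
    "02C2": ["L", "W", "W", "W", "L", "L"],
    "02B0": ["L", "W", "W", "W", "W", "W"],
    "02C8": ["L", "W", "W", "W", "W", "W"],
    "02BC": ["L", "W", "W", "W", "W", "W"],
    "207F": ["W", "W", "LM", "W", "W", "W"],
    "0027": ["W", "L", "W", "W", "L", "W"],
    "0031": ["W", "W", "W", "W", "L", "W"],
}


def behaviour_by_app(results=RESULTS, apps=APPS):
    """Turn the row-wise table into one outcome tuple per program."""
    return {app: tuple(row[i] for row in results.values())
            for i, app in enumerate(apps)}


def agreeing_pairs(results=RESULTS, apps=APPS):
    """List pairs of programs whose outcomes match on every character."""
    profiles = behaviour_by_app(results, apps)
    return [(a, b) for a, b in combinations(apps, 2)
            if profiles[a] == profiles[b]]


def whole_word_counts(results=RESULTS, apps=APPS):
    """Count, per program, how often the whole word was selected."""
    profiles = behaviour_by_app(results, apps)
    return {app: profile.count("W") for app, profile in profiles.items()}


if __name__ == "__main__":
    print("Agreeing pairs:", agreeing_pairs())
    print("Whole-word selections:", whole_word_counts())
```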

    Kent Spielmann

    SIL International, Dallas

