Re: GR and letter case

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 15 2009 - 14:38:55 CDT

    Christoph Burgmer wrote:

    > > How would we treat letter case as of UTR#21? Even using full stop for the
    > > compulsory neutral tone turns up wrong title case (example in Python):
    > >
    > > >>> "bu jy.daw".title()
    > > 'Bu Jy.Daw'
    > >
    > > Though in my eyes it should be
    > >
    > > 'Bu Jy.daw'
    > >
    > > Would UTR#21 even handle those cases? Would such a character fall into the
    > > "Letter Modifier" class?
    >
    > I'd like to re-raise this question more explicitly for the compulsory neutral
    > tone, as its usage seems to be official.
    >
    > Would one map this glyph to the full stop U+002E, as Y.R. Chao probably
    > designed it, and which is used in IPA to separate syllables, or rather look
    > for a character falling in the class "case-ignorable" so that the titlecase
    > algorithm from UTR#21 takes effect?

    In addition to the points made by Asmus, I'll add my own
    elaborations here:

    1. U+002E FULL STOP already *is* in the class "case-ignorable",
       as is made abundantly clear by the new derived case-related
       property Case_Ignorable, now included in DerivedCoreProperties.txt
       in the Unicode 5.2 beta. As of today, the file for review is:
       
    http://www.unicode.org/Public/5.2.0/ucd/DerivedCoreProperties-5.2.0d11.txt
       
       and the relevant entry is:
       
    # Derived Property: Case_Ignorable (CI)
    # As defined by Unicode Standard Definition D121
    # C is defined to be case-ignorable if
    # Word_Break(C) = MidLetter or MidNumLet, or
    # General_Category(C) = Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format (Cf),
    # Modifier_Letter (Lm), or Modifier_Symbol (Sk).

    0027 ; Case_Ignorable # Po APOSTROPHE
    002E ; Case_Ignorable # Po FULL STOP
    ...

    2. U+002E is Case_Ignorable=True because of its word-breaking
       behavior, which is defined in WordBreakProperty.txt.
       As of today, the file for review is:
       
    http://www.unicode.org/Public/5.2.0/ucd/auxiliary/WordBreakProperty-5.2.0d12.txt
       
       and the relevant entry is:
       
    0027 ; MidNumLet # Po APOSTROPHE
    002E ; MidNumLet # Po FULL STOP

    3. The impact of Word_Break=MidNumLet on the default word breaking algorithm
       documented in UAX #29 is defined in WB6 and WB7 in that document.
       As of today, the document for Unicode 5.2 beta review is:
       
    http://www.unicode.org/reports/tr29/tr29-14.html

       And the relevant summary of those rules is: "Do not break letters across
       certain punctuation." In the case of these two characters, the
       point of making them Word_Break=MidNumLet is so that the default
       algorithm would not break across contractions or elisions that
       use U+0027 (or U+2019) and would not break at a full stop used
       in common constructions like decimal number representations: "25.6%"
       and so on.
       
       What that means, in turn, is that a default implementation of UAX #29
       word breaking should identify word break boundaries in the string
       "bu jy.daw" as #bu# #jy.daw#, since the "." in "jy.daw" is between
       letters and would inhibit determination of a word break.
       
    4. This treatment of U+002E FULL STOP in UAX #29 for word breaking behavior
       is a relatively recent tweak to the algorithm. The apostrophe was
       Word_Break=MidLetter as of Unicode 5.0, but full stop was not. The
       introduction of Word_Break=MidNumLet and addition of full stop to
       that class came in Unicode 5.1. What *that* means is that for this
       particular edge case involving full stop, a conformant implementation
       of UAX #29 default word breaking behavior would behave differently
       for a Unicode 5.0 implementation than a Unicode 5.1 (or later)
       implementation. So a Unicode 5.0 implementation of default word
       breaking behavior would break around a full stop and in particular
       would break "bu jy.daw" as #bu# #jy#.#daw#
       
    5. I know this is getting long-winded ;-) but what the last point means is that
       any default titlecasing algorithm which itself is based on default
       word boundary determination will end up titlecasing "bu jy.daw"
       differently, depending on which version of Unicode it implements.
       
    6. As a general principle, all discussion of Unicode casing behavior
       should cease and desist from referring to UTR #21. As the
       web site clearly indicates, UTR #21 has been superseded (as of
       Unicode 4.0). This kind of discussion about default casing
       behavior in the standard should definitely be referring instead
       to Section 3.13 "Default Case Algorithms" in the standard itself.
       For the Unicode 5.0 version online, see:
       
    http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf

    7. Titlecasing is, in general, inherently quite variable. Different
       typographical traditions follow different rules, so best practice
       requires being able to adjust it for specific conventions. Unicode
       default titlecasing is an approximation at best, and there is
       no way it can or should be expected to be correct for all strings
       for all situations. In particular, for specialized orthographies
       that use punctuation characters such as U+002E FULL STOP in
       unusual contexts, there simply is no way for general software
       to "get it right" out of the box for all users, because
       these usages are inherently contradictory regarding issues such
       as word boundaries.
       
    8. And finally, Asmus is correct. There is no way that the UTC
       will clone a U+002E FULL STOP character in an attempt to create
       a new character that would guarantee correct titlecasing for
       Gwoyeu Romatzyh. Tailoring of algorithms is the answer for such
       a requirement. The good news, however, is that for *default*
       titlecasing behavior, if applications are implementing the
       Unicode default algorithms and move to Unicode 5.1 (or later),
       you should end up with the titlecasing you want for
       Gwoyeu Romatzyh without having to tailor it for the neutral
       tone mark.
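
To make points 3 through 5 concrete, here is a minimal Python sketch of how
the word-break class of full stop changes the titlecasing result. It is a toy,
not an implementation of UAX #29 or of Section 3.13: the helper names
(split_words, toy_titlecase) and the two character sets (MID_5_0, MID_5_1) are
hypothetical, the sets are small assumed subsets of the real property data, and
only the "do not break letters across certain punctuation" idea behind WB6 and
WB7 is modelled.

```python
def split_words(text, mid_chars):
    """Toy segmenter: a run of letters is one word, and a character from
    mid_chars stays inside the word only when it sits between two letters
    (the idea behind WB6/WB7). Every other character is its own segment."""
    segments = []
    i, n = 0, len(text)
    while i < n:
        if text[i].isalpha():
            j = i + 1
            while j < n:
                if text[j].isalpha():
                    j += 1
                elif text[j] in mid_chars and j + 1 < n and text[j + 1].isalpha():
                    j += 2  # keep the punctuation and the letter after it
                else:
                    break
            segments.append(text[i:j])
            i = j
        else:
            segments.append(text[i])
            i += 1
    return segments


def toy_titlecase(text, mid_chars):
    """Uppercase the first letter of each word segment, lowercase the rest."""
    out = []
    for seg in split_words(text, mid_chars):
        if seg[0].isalpha():
            out.append(seg[0].upper() + seg[1:].lower())
        else:
            out.append(seg)
    return "".join(out)


# Assumed, partial stand-ins for the property data discussed above.
MID_5_0 = {"'", "\u2019"}   # apostrophes only: full stop still breaks words
MID_5_1 = MID_5_0 | {"."}   # full stop added (Word_Break=MidNumLet)

print(toy_titlecase("bu jy.daw", MID_5_0))  # Bu Jy.Daw
print(toy_titlecase("bu jy.daw", MID_5_1))  # Bu Jy.daw
```

With the 5.0-style set the full stop splits "jy.daw" into two words and both
get a capital; with the 5.1-style set it stays one word and only the "J" is
capitalized. Note that Python's own str.title() does not do word-boundary
analysis at all (it treats any non-letter as a separator), so it would need a
tailored routine along these lines regardless of the Unicode version.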
       
    --Ken


