RE: Slots for Cyrillic Accented Vowels

From: Peter Constable (petercon@microsoft.com)
Date: Tue May 24 2011 - 19:27:02 CDT


    Uniscribe normalization is reasonably robust for Latin, Greek and Cyrillic. But it’s simply a fact that NFC normalization can have undesirable effects on various other scripts. In particular, the canonical ordering algorithm used in Unicode normalization can be a problem for various scripts. For example, in Biblical Hebrew, marks will get re-ordered into a sequence that is decidedly not what makes sense for users: the set of general classes (>= 200) and fixed-position classes (< 200) used for Hebrew leads to that result. There are issues for other scripts as well.
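A minimal sketch of the Hebrew reordering, using Python's standard `unicodedata` module. The choice of letter and marks, and the assumption that a writer enters the dagesh before the vowel, are illustrative only and not taken from the message. Canonical ordering sorts adjacent marks by combining class, so the mark with the lower fixed-position class moves first.

```python
import unicodedata

# Illustrative sequence (an assumption, not from the original message):
# a consonant, then dagesh, then a vowel point.
BET, DAGESH, PATAH = "\u05D1", "\u05BC", "\u05B7"
typed = BET + DAGESH + PATAH

# Show the canonical combining class of each mark.
for mark in (DAGESH, PATAH):
    print(unicodedata.name(mark), unicodedata.combining(mark))

# Canonical ordering sorts the marks by class, so the vowel
# ends up before the dagesh after normalization.
normalized = unicodedata.normalize("NFC", typed)
print([unicodedata.name(ch) for ch in normalized])
```

Both NFC and NFD apply the same canonical ordering step, so the reordering does not depend on which form is chosen.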

    These are issues inherent to normalization itself, regardless of the software in use. In those cases, Roozbeh’s point applies: emitting NFC “into the wild” can be as much of a problem as emitting NFD.

    The only places where Unicode normalization is totally safe are those places for which it was created: not transforming data that will get persisted or transmitted to other users and processes, but in internal processing for comparing strings for the kinds of equivalences that Unicode normalization defines.
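One way to read the paragraph above is "normalize only at comparison time, never in storage". The following is a sketch under that reading, using Python's standard `unicodedata` module. The helper name `canonically_equal` is hypothetical, and picking NFD for the comparison is an arbitrary choice (NFC would work equally well for this purpose).

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings for canonical equivalence.

    The normalized copies exist only inside this function; the
    caller's strings are never modified or written back anywhere.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

stored = "\u0633\u0651\u064E"    # SEEN, SHADDA, FATHA, kept as entered
incoming = "\u0633\u064E\u0651"  # SEEN, FATHA, SHADDA

print(canonically_equal(stored, incoming))
print(stored == "\u0633\u0651\u064E")  # the stored text is unchanged
```

The stored text keeps the mark order its author entered, while lookups and comparisons still treat canonically equivalent strings as the same.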


    Peter


    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Roozbeh Pournader
    Sent: Tuesday, May 24, 2011 4:28 PM
    To: Phillips, Addison
    Cc: Christoph Päper; Unicode Discussion
    Subject: RE: Slots for Cyrillic Accented Vowels

    On Mon, 2011-05-23 at 08:17 -0700, Phillips, Addison wrote:

    [...] you generally should not emit NFD "into the wild"

    In the real world, of course, you should actually not emit NFC either. A famous case that comes to bite me again and again is that some XP-era Microsoft applications don't render canonically equivalent strings the same way, so if you normalize something, you lose its preferred display and semantics. For example, the sequence <ARABIC LETTER SEEN, ARABIC SHADDA, ARABIC FATHA>, which is a very normal and rather common sequence in Arabic, becomes <SEEN, FATHA, SHADDA> if one actually normalizes it (to either NFC or NFD), and that normalized sequence is displayed wrongly by Windows XP's Uniscribe, in both Notepad and Word 2003.
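The reordering described above can be reproduced with Python's standard `unicodedata` module. This is a small sketch that shows only the normalization result. It says nothing about how any particular renderer displays either sequence.

```python
import unicodedata

SEEN, SHADDA, FATHA = "\u0633", "\u0651", "\u064E"
original = SEEN + SHADDA + FATHA

# FATHA has a lower canonical combining class than SHADDA,
# so canonical ordering moves it in front of the SHADDA.
print(unicodedata.combining(FATHA), unicodedata.combining(SHADDA))

results = {}
for form in ("NFC", "NFD"):
    results[form] = unicodedata.normalize(form, original)
    print(form, [unicodedata.name(ch) for ch in results[form]])
```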

    Roozbeh


