RE: Slots for Cyrillic Accented Vowels

From: Roozbeh Pournader (roozbeh@htpassport.com)
Date: Tue May 24 2011 - 20:20:30 CDT


    Peter,

    I agree with you that problems exist, but I don't agree that
    applications that support Unicode have a license to treat canonically
    equivalent strings in different ways, especially when displaying them.

    My point in writing the email was that the software "out there" still
    needs to catch up with some parts of Unicode. If you may be working
    with that software, make sure you know its shortcomings before you
    assume that it's a perfect world out there and convert everything to
    NFC. I am not saying that it's better for software to expect strings
    in the "logical" order that some users expect and to fail in other
    cases, especially when the string is in NFC.

    And I'm not saying this for philosophical reasons or as a standards
    purist. I write and use various pieces of software every day that
    would be much, much simpler if I could assume that commonly
    available software shows normalized strings the right way, instead
    of expecting them to be in a script-specific, software-specific
    order that I need to figure out for each piece of software I can't
    change.

    In this world, several of my "scripts" need to support different flavors
    of normalization and string comparison. In a future world where at
    least everyone can handle NFC strings, I can normalize early in the game
    and do much simpler comparisons everywhere else.
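
    As a rough illustration of the "normalize early, compare simply" idea,
    here is a minimal sketch using Python's standard unicodedata module.
    The helper name canonically_equal is hypothetical, chosen only for this
    example, and the Cyrillic letter is just a convenient test case.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings under canonical equivalence
    by normalizing both sides to NFC before comparing."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# CYRILLIC SMALL LETTER SHORT I, precomposed vs. decomposed
precomposed = "\u0439"        # й as a single code point
decomposed = "\u0438\u0306"   # и followed by COMBINING BREVE

print(precomposed == decomposed)                   # False: code points differ
print(canonically_equal(precomposed, decomposed))  # True: canonically equivalent
```

    If every consumer of the data handled NFC correctly, the normalize call
    could happen once at input time and plain == would be enough afterwards.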

    Roozbeh

    PS: For the record, I don't expect Microsoft to go and fix Windows XP
    and Word 2003 this late in the process. This is a reality for people
    like me today. I'm just hoping that it will be a smaller problem for
    people who are writing similar code a few years from now when some of
    these older applications go out of usage. (Or possibly, I'm just
    nagging!)

    On Wed, 2011-05-25 at 00:27 +0000, Peter Constable wrote:
    > Uniscribe normalization is reasonably robust for Latin, Greek and
    > Cyrillic. But it’s simply a fact that NFC normalization can have
    > undesirable effects on various other scripts. In particular, the
    > canonical ordering algorithm used in Unicode normalization can be a
    > problem for various scripts. For example, in Biblical Hebrew, marks
    > will get re-ordered into a sequence that is decidedly not what makes
    > sense for users—the set of general classes (>= 200) and fixed-position
    > classes (< 200) used for Hebrew lead to that result. There are issues
    > for other scripts as well.
    >
    > These are issues inherent to normalization itself, regardless of the
    > software in use. In those cases, Roozbeh’s point applies: emitting NFC
    > “into the wild” can be as much of a problem as emitting NFD.
    >
    > The only places where Unicode normalization is totally safe are those
    > places for which it was created: not transforming data that will get
    > persisted or transmitted to other users and processes, but in internal
    > processing for comparing strings for the kinds of equivalences that
    > Unicode normalization defines.
    >
    > Peter
    >
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
    > On Behalf Of Roozbeh Pournader
    > Sent: Tuesday, May 24, 2011 4:28 PM
    > To: Phillips, Addison
    > Cc: Christoph Päper; Unicode Discussion
    > Subject: RE: Slots for Cyrillic Accented Vowels
    >
    > On Mon, 2011-05-23 at 08:17 -0700, Phillips, Addison wrote:
    >
    > > [...] you generally should not emit NFD "into the wild"
    >
    > In the real world, of course, you should actually not emit NFC either.
    > A famous case that comes back to bite me again and again is that some
    > XP-era Microsoft applications don't render canonically equivalent
    > strings the same way, so if you normalize something, you lose its
    > preferred display and semantics. For example, the sequence <ARABIC
    > LETTER SEEN, ARABIC SHADDA, ARABIC FATHA>, which is a perfectly
    > normal and rather common sequence in Arabic, becomes <SEEN, FATHA,
    > SHADDA> if one actually normalizes it (to either NFC or NFD), and
    > Windows XP's Uniscribe displays that wrongly in both Notepad and
    > Word 2003.
    >
    > Roozbeh
    >
    >
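
    The reordering discussed in the quoted messages can be reproduced with
    Python's standard unicodedata module. This is only a sketch of what the
    canonical ordering step does to the code points; it says nothing about
    how any particular renderer displays the result. The Hebrew sequence is
    my own illustrative choice (an assumption, not taken from the thread),
    and the helper name classes is hypothetical.

```python
import unicodedata

def classes(s: str):
    """Hypothetical helper: list (code point, canonical combining class) pairs."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in s]

# Arabic: SEEN, SHADDA (ccc 33), FATHA (ccc 30), in the order given in the thread
SEEN, SHADDA, FATHA = "\u0633", "\u0651", "\u064e"
arabic = SEEN + SHADDA + FATHA
arabic_nfc = unicodedata.normalize("NFC", arabic)
arabic_nfd = unicodedata.normalize("NFD", arabic)

print(classes(arabic))      # SHADDA (33) precedes FATHA (30)
print(classes(arabic_nfc))  # canonical ordering puts FATHA (30) first

# Hebrew (illustrative assumption): BET, DAGESH (ccc 21), PATAH (ccc 17)
BET, DAGESH, PATAH = "\u05d1", "\u05bc", "\u05b7"
hebrew = BET + DAGESH + PATAH
hebrew_nfc = unicodedata.normalize("NFC", hebrew)

print(classes(hebrew))
print(classes(hebrew_nfc))  # PATAH (17) now precedes DAGESH (21)
```

    In both cases the marks are sorted by ascending combining class, which
    is why NFC and NFD agree on the mark order here.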


