Re: compatibility between unicode 2.0 and 3.0

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Feb 03 2003 - 21:02:27 EST

    Erik Ostermueller asked:

    > We have a large amount of C++ that currently has Unicode 2.0 support.
    >
    > Could you all help me figure out what types of operations will fail
    > if we attempt to pass Unicode 3.0 thru this code?
    >
    > I can start the list off with
    >
    > -sorting
    > -searching for text

    This depends greatly on what implementation you did for
    sorting and searching, and how it handles unassigned code points
    in your Unicode 2.0 code. If the code was designed to be
    forward compatible, it should do reasonable things with
    unassigned code points, and getting Unicode 3.0 data which
    is actually using those code points should not disturb your
    existing code. But, on the other hand, if you have built
    in a bunch of range checks or have used tables which cannot
    gracefully handle the appearance of unassigned code points
    in your data, then it could well blow up.
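
    As a rough illustration of the forward-compatible approach, here is
    a minimal sketch of a sort-weight lookup that falls back to a default
    for code points missing from its table. The table contents, the name
    sortWeight, and the fallback scheme are all hypothetical, not taken
    from any particular implementation.

```cpp
#include <cstdint>
#include <map>

// Hypothetical sort-weight table built from Unicode 2.0 data.
// Only a few illustrative entries are shown.
static const std::map<char32_t, std::uint32_t> kWeights = {
    {U'a', 100}, {U'b', 101}, {U'c', 102},
};

// Look up a weight. Code points absent from the table (unassigned in
// the data the table was built from) are not treated as an error: they
// sort after all tabulated weights, ordered by code point value.
std::uint32_t sortWeight(char32_t cp) {
    auto it = kWeights.find(cp);
    if (it != kWeights.end()) return it->second;
    return 0x80000000u + static_cast<std::uint32_t>(cp);
}
```

    With a fallback like this, newer data sorts in a stable if not
    linguistically ideal position until the table is regenerated.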

    The Unicode Collation Algorithm was not defined until after
    Unicode 2.0, and was first synched with Unicode 2.1. It has
    also been considerably updated since then -- the current version
    is aimed at Unicode 3.1. You should take a look at the
    current version to check for gotchas you may have in your
    current code.

    > -text comparison

    I assume here you are not talking about language-specific
    collation comparisons, but just Unicode analogs of strcmp()
    and the like. If so, those should behave well -- they aren't
    usually programmed in ways which make them sensitive to
    particular code point assignments.
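
    For instance, a strcmp()-style comparison over UTF-16 code units
    might look like the sketch below. The name compareCodeUnits is
    made up for illustration; the point is only that no table or code
    point assignment is consulted.

```cpp
#include <cstddef>
#include <string>

// Compare two UTF-16 strings code unit by code unit, like strcmp().
// Returns a negative value, 0, or a positive value.
int compareCodeUnits(const std::u16string& a, const std::u16string& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    }
    // All shared code units are equal: the shorter string sorts first.
    if (a.size() == b.size()) return 0;
    return a.size() < b.size() ? -1 : 1;
}
```

    Note that this gives code unit order, which for UTF-16 differs from
    code point order once supplementary characters are present.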

    > -other character classification (isSpace, isDigit, etc...).

    Again, these depend on what kinds of forward compatibility
    assumptions your original code made. If it provides
    meaningful results for unassigned code points in Unicode 2.0,
    then tossing Unicode 3.0 data at such APIs shouldn't cause
    any problem to existing code, other than not getting the
    right results for Unicode 3.0 additions until you have
    modified and updated your property tables.
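
    A minimal sketch of such a classification API, assuming a
    table-driven design: the Category enum, the table contents, and
    the function names are hypothetical. Anything not in the table
    is reported as Unassigned rather than causing a failure.

```cpp
#include <cstdint>
#include <unordered_map>

enum class Category : std::uint8_t { Unassigned, Letter, Digit, Space };

// Hypothetical property table; the entries are illustrative only.
static const std::unordered_map<char32_t, Category> kCategory = {
    {U' ', Category::Space},
    {U'0', Category::Digit},
    {U'9', Category::Digit},
    {U'A', Category::Letter},
};

// Code points missing from the table get a well-defined answer.
Category categoryOf(char32_t cp) {
    auto it = kCategory.find(cp);
    return it == kCategory.end() ? Category::Unassigned : it->second;
}

bool isDigit(char32_t cp) { return categoryOf(cp) == Category::Digit; }
bool isSpace(char32_t cp) { return categoryOf(cp) == Category::Space; }
```

    Newer characters simply answer false to isDigit() and isSpace()
    until the table is rebuilt from updated property data.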

    >
    > I understand that these operations probably won't work in ALL cases.
    > But how about basic plumbing code -- creating and copying strings?

    Constructors and copy constructors ought to work fine, unless
    you've done something odd.

    What you should be more concerned about, however, is
    how your code is going to get from Unicode 3.0 to
    Unicode 3.1 (or higher), because then you will have to
    deal with supplementary characters. Any assumptions that
    characters don't lie outside the range U+0000..U+FFFF
    will be broken. Whether this will be a small problem
    or a big problem for your code depends on whether you
    are effectively processing Unicode in UTF-8, UTF-16,
    or UTF-32 (or combinations of those). The biggest hit,
    when moving from Unicode 3.0 to Unicode 3.1 (or higher),
    is for UTF-16 APIs, since each supplementary character takes
    two 16-bit code units (a surrogate pair) there. See Unicode
    Technical Note #7, Migrating Software to Supplementary
    Characters, for some ideas:
    http://www.unicode.org/notes/tn7/
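
    To give a flavor of what changes for UTF-16 code, here is a small
    sketch of iterating by code point instead of by 16-bit unit. The
    function names are hypothetical, and returning an unpaired
    surrogate unchanged is an assumption of this sketch; some callers
    may prefer to substitute U+FFFD.

```cpp
#include <cstddef>
#include <string>

// Decode the code point starting at index i and advance i past it.
// The caller must ensure i < s.size().
// Assumption: an unpaired surrogate is returned as-is.
char32_t nextCodePoint(const std::u16string& s, std::size_t& i) {
    const char16_t hi = s[i++];
    if (hi >= 0xD800 && hi <= 0xDBFF && i < s.size()) {
        const char16_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++i;  // consume the low surrogate as well
            return 0x10000
                 + ((static_cast<char32_t>(hi) - 0xD800) << 10)
                 + (static_cast<char32_t>(lo) - 0xDC00);
        }
    }
    return hi;
}

// Count characters (code points), not 16-bit code units.
std::size_t countCodePoints(const std::u16string& s) {
    std::size_t i = 0, n = 0;
    while (i < s.size()) {
        nextCodePoint(s, i);
        ++n;
    }
    return n;
}
```

    Code that assumes s.size() is the character count, or that indexes
    one unit at a time, is the kind of thing that needs this treatment.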

    --Ken

    >
    > As I mentioned in my last post, I've enjoyed
    > listening in on this forum -- I've learned a whole lot.
    >
    > Thanks,
    >
    > --Erik Ostermueller
    >


