Re: Fw: Unicode & space in programming & l10n

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Sep 22 2006 - 17:09:37 CDT


    > Not quite. Unsigned int is only guaranteed a range of 0 to 0xffff and
    > therefore it can't normalise the string <U+FAD5> - the normalised form is
    > <U+25249> in all four normalisations.

    It *can*, if you abstract your type definitions correctly.

    > Of course, unsigned int is good
    > enough to hold UTF-16 code *units*, which might just be what Mike meant.
    > (I.e., the type supports UTF-16, but not UTF-32.)

    It is perfectly fine for UTF-32, if you do this correctly. For
    example:

    typedef unsigned short UShort16;
    typedef unsigned int UInt32;

    typedef UShort16 utf16char;
    typedef UInt32 utf32char;

    Put that stuff in a fundamental header file, and use "utf32char"
    everywhere you mean a UTF-32 code unit and "utf16char" everywhere
    you mean a UTF-16 code unit, rather than writing a bare
    "unsigned int" anywhere in the code.

    At that point, you can safely port your entire code to *any*
    platform, with at most one compiler-specific #ifdef in your
    fundamental header file.
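
    A minimal sketch of such a header follows. The file name
    unicode_types.h and the include guard are hypothetical, and the
    selection is done with standard <limits.h> tests rather than a
    compiler-specific #ifdef, which is an assumption of this sketch,
    not something stated above. It falls back to unsigned long when
    unsigned int is only 16 bits wide.

```c
/* unicode_types.h (hypothetical name for the fundamental header file) */
#ifndef UNICODE_TYPES_H
#define UNICODE_TYPES_H

#include <limits.h>

/* Select an unsigned type that is exactly 16 bits wide. */
#if USHRT_MAX == 0xFFFFU
typedef unsigned short UShort16;
#else
#error "No 16-bit unsigned type found; add a case for this compiler."
#endif

/* Select an unsigned type that is exactly 32 bits wide:
 * unsigned int where it is wide enough, otherwise unsigned long. */
#if UINT_MAX == 0xFFFFFFFFUL
typedef unsigned int UInt32;
#elif ULONG_MAX == 0xFFFFFFFFUL
typedef unsigned long UInt32;
#else
#error "No 32-bit unsigned type found; add a case for this compiler."
#endif

typedef UShort16 utf16char;   /* one UTF-16 code unit */
typedef UInt32   utf32char;   /* one UTF-32 code unit */

/* Compile-time size checks: the array size is -1 (an error) on a mismatch. */
typedef char utf16char_size_check[(sizeof(utf16char) * CHAR_BIT == 16) ? 1 : -1];
typedef char utf32char_size_check[(sizeof(utf32char) * CHAR_BIT == 32) ? 1 : -1];

#endif /* UNICODE_TYPES_H */
```

    With this in place, a utf32char can hold U+25249 even on a
    platform where unsigned int stops at 0xffff. Where C99 is
    available, <stdint.h> offers uint16_t and uint32_t for the same
    purpose.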

    > Of course, you may be able to create Unicode string constants - it all
    > depends what data structure is used. FFFF-terminated arrays would work,
    > e.g.
    >
    > static const unsigned int remark[] = {
    > LATIN_L, LATIN_o, LATIN_o, LATIN_k, EXCLAMATION_MARK, 0xffff};

    For C/C++ programmers, it is, of course, much easier to go with
    NUL-terminated arrays, because then all your 16-bit and 32-bit
    string processing can be cloned almost exactly from the logic of
    your 8-bit string processing routines.
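
    As a rough illustration of that cloning, here is a sketch of a
    length routine in 8-bit and UTF-32 versions. The function names
    my_strlen and utf32_strlen are hypothetical, and the local typedef
    assumes a platform where unsigned int is 32 bits wide.

```c
#include <stddef.h>

/* Assumption: unsigned int is 32 bits on this platform. */
typedef unsigned int utf32char;

/* 8-bit original: length of a zero-terminated char string. */
size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p != 0)
        p++;
    return (size_t)(p - s);
}

/* UTF-32 clone: identical logic, only the element type changes.
 * The result is a count of code units, not bytes. */
size_t utf32_strlen(const utf32char *s)
{
    const utf32char *p = s;
    while (*p != 0)
        p++;
    return (size_t)(p - s);
}
```

    The same substitution of element type carries over to copy,
    compare, and search routines, which is what makes the
    zero-terminated convention cheap to adopt.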

    Using a noncharacter as a string terminator isn't worth the
    trouble, because it makes your Unicode strings less portable
    to other people's libraries. And if you need to handle arbitrary
    buffers of Unicode character data, including embedded NULs
    and noncharacters, then you are better off tracking the buffer
    length separately anyway.
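
    A sketch of what separate length tracking might look like. The
    struct utf32buf and the function utf32buf_count are hypothetical
    names, and the typedef again assumes a 32-bit unsigned int.

```c
#include <stddef.h>

/* Assumption: unsigned int is 32 bits on this platform. */
typedef unsigned int utf32char;

/* A counted buffer: the length is stored alongside the data, so no
 * code unit value needs to be reserved as a terminator. */
typedef struct {
    const utf32char *data;
    size_t length;          /* number of code units in data */
} utf32buf;

/* Count how often a code unit occurs. Zero and noncharacters such as
 * 0xFFFF are treated like any other value. */
size_t utf32buf_count(const utf32buf *buf, utf32char target)
{
    size_t i;
    size_t n = 0;
    for (i = 0; i < buf->length; i++) {
        if (buf->data[i] == target)
            n++;
    }
    return n;
}
```

    Because the loop is bounded by the stored length, an embedded
    U+0000 or U+FFFF does not cut the scan short.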

    --Ken



    This archive was generated by hypermail 2.1.5 : Fri Sep 22 2006 - 17:12:57 CDT