Re: Unicode & space in programming & l10n

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Sep 20 2006 - 18:33:22 CDT

    William Poser wrote:

    > I'm confused as to the sense in which C and C++
    > "don't support the Unicode character model".

    Before Philippe weighs in with the inevitable opus ;-),
    I think the main point is that neither C nor C++ has
    a native CHARACTER datatype that is based on Unicode.
    And for many years partisans of C and C++ have claimed
    that that was a *good* thing, because it meant that
    programs could be written "portably", to not care
    what charset they were running under.

    Personally I always considered that a misconstrual of
    what it meant to write portable code, but that is
    perhaps for another thread...

    > It is
    > very easy to manipulate objects of type wchar_t,
    > arrays thereof, linked lists thereof, and so forth.

    > The main theoretical difficulty that I see with Unicode
    > processing in C is that you can't be sure that a wchar_t
    > is at least 21 bits wide. This is of course a general
    > defect of the C standard, which does not specify
    > object sizes. In practice, however, I haven't myself
    > encountered problems with this or heard of them.

    Recently, in part at the urging of the Unicode Consortium,
    the C and C++ standards have finally added data types with
    guaranteed 16-bit and 32-bit widths that are tied
    nominally to 10646/Unicode character semantics, although
    there is no built-in support for anything more than just
    the fixed-width nature of the data types -- and you would
    need a very recent compiler for them to be recognized
    at all.

    Personally, I consider manipulating Unicode characters
    as wchar_t to be a mistake, because of the portability
    issues. I've been writing and maintaining C libraries for
    Unicode support for years, but never use wchar_t *anywhere*
    in that code. Instead, I use fundamental declared datatypes that
    I can guarantee data width for via the compiler-specific
    makefiles.
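
    As a rough sketch of that idea (not the actual library code):
    the type names below are hypothetical, and the sketch assumes a
    compiler that provides <stdint.h> and C11 _Static_assert, where
    an older setup would select the underlying types per platform
    in the build files. The point is only that the width of each
    code unit type is pinned down and checked at compile time,
    which wchar_t does not give you.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* uint8_t, uint16_t, uint32_t */

/* Hypothetical names for the three code unit types. */
typedef uint8_t  utf8_unit;
typedef uint16_t utf16_unit;
typedef uint32_t utf32_unit;

/* Fail the build if a platform ever maps these to the wrong width. */
_Static_assert(sizeof(utf8_unit)  * CHAR_BIT == 8,  "UTF-8 code unit must be 8 bits");
_Static_assert(sizeof(utf16_unit) * CHAR_BIT == 16, "UTF-16 code unit must be 16 bits");
_Static_assert(sizeof(utf32_unit) * CHAR_BIT == 32, "UTF-32 code unit must be 32 bits");

/* Largest code point in the Unicode codespace. */
#define UNICODE_MAX 0x10FFFFu

/* Returns 1 if cp is a Unicode scalar value: in range and not a
 * surrogate code point (U+D800..U+DFFF); returns 0 otherwise. */
int is_scalar_value(utf32_unit cp)
{
    if (cp > UNICODE_MAX)
        return 0;
    if (cp >= 0xD800u && cp <= 0xDFFFu)
        return 0;
    return 1;
}
```

    Everything downstream is then written against utf8_unit,
    utf16_unit and utf32_unit, so the code never depends on how
    wide wchar_t happens to be on a given platform.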

    With this approach I have complex libraries that support
    UTF-8, UTF-16, *and* UTF-32 flawlessly across all Unix
    platforms and Windows and a variety of oddball platforms
    on occasion, for 32-bit and 64-bit processors, with internal
    change of form (e.g. UTF-16 <--> UTF-32) on an as-needed
    basis, depending on what kind of text processing is needed.
    And that code has an absolute minimum of platform-specific
    conditional compilation -- all of it related to concerns
    like file paths and such that have nothing to do with
    Unicode processing per se.
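
    To make the UTF-16 <--> UTF-32 change of form concrete, here
    is a minimal sketch of converting a single code point in each
    direction. The function names and the "return 0 on bad input"
    convention are assumptions made for illustration, not the
    interface of any existing library.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written.
 * Returns 0 if cp is a surrogate code point or is above U+10FFFF. */
size_t utf32_to_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        return 0;
    if (cp < 0x10000u) {            /* BMP: a single code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000u;                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800u + (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00u + (cp & 0x3FFu));  /* low surrogate  */
    return 2;
}

/* Decode one code point from the start of in[0..len).
 * Stores it in *cp and returns the number of code units consumed.
 * Returns 0 on empty input or an unpaired surrogate. */
size_t utf16_to_utf32(const uint16_t *in, size_t len, uint32_t *cp)
{
    uint32_t hi, lo;

    if (len == 0)
        return 0;
    hi = in[0];
    if (hi < 0xD800u || hi > 0xDFFFu) {   /* not a surrogate */
        *cp = hi;
        return 1;
    }
    if (hi >= 0xDC00u || len < 2)         /* lone low, or truncated */
        return 0;
    lo = in[1];
    if (lo < 0xDC00u || lo > 0xDFFFu)     /* high not followed by low */
        return 0;
    *cp = 0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u);
    return 2;
}
```

    For example, U+1F600 comes out as the surrogate pair
    0xD83D 0xDE00, and feeding that pair back in yields U+1F600
    again. Nothing here depends on the platform beyond the fixed
    widths of the two integer types.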
     
    > For the present, at least, there is also a good reason
    > to use C IN PREFERENCE to high level languages for
    > processing Unicode, for some applications. The
    > high-level languages that I know of all limit
    > Unicode support to the BMP. That is true of Python
    > and Tcl, for example. In contrast, in C there
    > is no such limitation.

    I concur with this, if you are rolling your own support.
    C also has very good performance characteristics, both
    for what you can do in minimizing memory usage and in
    maximizing speed.
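
    As a small illustration of what rolling your own support
    beyond the BMP can look like, here is a sketch of a UTF-8
    encoder that handles the full range up to U+10FFFF. The
    function name and return convention are hypothetical, chosen
    only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8.
 * Writes 1 to 4 bytes to out and returns how many were written.
 * Returns 0 if cp is a surrogate code point or is above U+10FFFF. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp < 0x80u) {                       /* 1 byte: ASCII */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800u) {                      /* 2 bytes */
        out[0] = (uint8_t)(0xC0u | (cp >> 6));
        out[1] = (uint8_t)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp < 0x10000u) {                    /* 3 bytes: rest of the BMP */
        if (cp >= 0xD800u && cp <= 0xDFFFu)
            return 0;                       /* surrogates are not encodable */
        out[0] = (uint8_t)(0xE0u | (cp >> 12));
        out[1] = (uint8_t)(0x80u | ((cp >> 6) & 0x3Fu));
        out[2] = (uint8_t)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    if (cp <= 0x10FFFFu) {                  /* 4 bytes: supplementary planes */
        out[0] = (uint8_t)(0xF0u | (cp >> 18));
        out[1] = (uint8_t)(0x80u | ((cp >> 12) & 0x3Fu));
        out[2] = (uint8_t)(0x80u | ((cp >> 6) & 0x3Fu));
        out[3] = (uint8_t)(0x80u | (cp & 0x3Fu));
        return 4;
    }
    return 0;                               /* outside the codespace */
}
```

    The supplementary-plane case is just one more branch; U+1F600,
    for instance, encodes as the four bytes F0 9F 98 80.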

    On the other hand, for most people looking for Unicode
    support, the most practical approach is to make use of
    a big, full-featured Unicode library -- most notably
    the Open Source ICU library, available both in C and
    Java versions. The developers of such libraries have
    already done an outstanding job of optimizing
    their behavior, and they keep them up to date and compliant
    with the current standard.

    --Ken


