Re: Getting A Newb Started

From: William J Poser (
Date: Mon Jul 07 2008 - 14:29:32 CDT


    There seem to be religious views on this question, but my own practice is
    to use UTF-32 internally in almost all cases. Yes, it takes more memory
    than UTF-8, but the modest additional memory usage doesn't really matter
    much. On the other hand, dealing with UTF-32 is much easier and less
    error-prone than dealing with UTF-8. Every four bytes is a character. You can do
    simple array arithmetic, simple calculations of how much memory you need
    to allocate, etc.
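    The array arithmetic described above can be sketched as follows (a
    minimal illustration, not code from the original post; the function
    names are made up for the example):

```c
#include <stddef.h>
#include <stdint.h>

/* With UTF-32, the i-th character is simply buf[i]; no decoding
   is needed to index into the string. */
static uint32_t utf32_char_at(const uint32_t *buf, size_t i) {
    return buf[i];
}

/* Allocation size for n characters plus a NUL terminator is a
   simple multiplication, unlike UTF-8 where the byte length
   depends on the characters themselves. */
static size_t utf32_alloc_size(size_t nchars) {
    return (nchars + 1) * sizeof(uint32_t);
}
```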

    Of course, if you are willing and able to do everything
    using library functions and have a suitable UTF-8 library, then
    you need not worry about the complications of operating on UTF-8.
    So the choice depends in part on what kind of processing you are doing.
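    To make the "complications" concrete, here is a minimal sketch of a
    single UTF-8 decoding step (my illustration, not from the post). A
    real library must additionally reject overlong encodings, surrogate
    code points, and truncated sequences; this sketch assumes
    well-formed input:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point starting at *p and advance *p past it.
   The lead byte determines the sequence length; continuation
   bytes each contribute six bits. */
static uint32_t utf8_next(const unsigned char **p) {
    const unsigned char *s = *p;
    uint32_t cp;
    size_t len;
    if (s[0] < 0x80)      { cp = s[0];        len = 1; }
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; len = 2; }
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; len = 3; }
    else                  { cp = s[0] & 0x07; len = 4; }
    for (size_t i = 1; i < len; i++)
        cp = (cp << 6) | (s[i] & 0x3F);
    *p += len;
    return cp;
}
```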

    If you don't want to use ICU, I don't know of a single cross-platform
    library that covers everything, but there are some that cover a lot.
    For example, for regular expressions I recommend the TRE library: it is
    lightweight and robust, provides POSIX regular expressions as well as
    extensions such as the best approximate-matching facilities that I am
    aware of, and has both multibyte and wide-character APIs.
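    Since TRE implements the standard POSIX interface, basic usage looks
    just like regcomp/regexec against any POSIX regex implementation
    (this sketch is mine, not from the post; TRE's wide-character and
    approximate-matching variants, tre_regwcomp and tre_regaexec, follow
    the same pattern):

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if pattern matches anywhere in text, 0 if not,
   -1 if the pattern fails to compile. */
static int matches(const char *pattern, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int r = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return r == 0 ? 1 : 0;
}
```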

    I don't write for MS Windows, so I don't worry too much about the fact
    that wchar_t is only two bytes there (is this still true on Vista, by
    the way?), but the fact that you can't assume that a wchar_t is large
    enough to hold any Unicode character is indeed a real problem. I'd like
    to see all of the wcs functions redone for fixed-width types, e.g. uint32_t.
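    What such a fixed-width API might look like can be sketched with a
    wcslen analogue over uint32_t (the function name is my invention for
    illustration): its behavior does not depend on whether the platform's
    wchar_t is two bytes or four.

```c
#include <stddef.h>
#include <stdint.h>

/* wcslen analogue on a fixed 32-bit code-unit type: counts
   code points up to, but not including, the NUL terminator.
   Every Unicode scalar value fits in one uint32_t, so no
   surrogate-pair handling is ever needed. */
static size_t u32len(const uint32_t *s) {
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```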


    This archive was generated by hypermail 2.1.5 : Mon Jul 07 2008 - 14:31:23 CDT