Re: Getting A Newb Started

From: Kenneth Whistler
Date: Mon Jul 07 2008 - 14:22:01 CDT


    > Things I've learned:
    > the wchar_t definition is platform dependent. That makes it completely
    > useless, which also makes all standard wide-char functions useless.

    I concur with that. I've been building platform-independent
    Unicode support code for years now -- and the first thing I do
    is define a 16-bit UTF-16 data type and a 32-bit UTF-32 data type
    and junk all the wchar_t's and anything that depends on them.
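
    As a rough sketch of that first step, assuming a C99 compiler with
    <stdint.h>: the names utf16_unit, utf32_unit, and utf16_len below are
    hypothetical, chosen only for illustration, and are not taken from ICU
    or any other library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type names; the point is only that the widths are fixed. */
typedef uint16_t utf16_unit;  /* one UTF-16 code unit, always 16 bits */
typedef uint32_t utf32_unit;  /* one UTF-32 code unit, always 32 bits */

/* Length in code units of a zero-terminated UTF-16 string.
   A stand-in for wcslen() whose behavior does not depend on how wide
   the platform decided wchar_t should be. */
size_t utf16_len(const utf16_unit *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

    Note that the length is counted in code units, so a character outside
    the BMP (stored as a surrogate pair) counts as two.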

    > Using wchar_t would double or quadruple the memory usage of my app.

    Even assuming well-defined 16-bit and 32-bit types rather than
    wchar_t, I presume you would still claim that. But it is seldom
    actually the case, unless you have a very unusual application that
    is completely dominated by raw text storage size and whose data
    content is, in turn, dominated by ASCII characters.
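
    To make the storage arithmetic concrete, here is a small sketch of
    the per-character cost in each encoding form. The function names are
    hypothetical, and the input is assumed to be a valid Unicode scalar
    value.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store one Unicode scalar value in UTF-8. */
size_t utf8_bytes(uint32_t cp)
{
    if (cp < 0x80)    return 1;  /* ASCII */
    if (cp < 0x800)   return 2;  /* Latin supplements, Greek, Cyrillic, ... */
    if (cp < 0x10000) return 3;  /* rest of the BMP, including most CJK */
    return 4;                    /* supplementary planes */
}

/* Bytes needed to store one Unicode scalar value in UTF-16. */
size_t utf16_bytes(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4; /* one code unit, or a surrogate pair */
}
```

    Only the ASCII range doubles in UTF-16. From U+0080 to U+07FF the two
    forms cost the same, and for the rest of the BMP UTF-16 is the smaller
    one.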

    > Only Asian countries benefit directly from defaulting to a wide
    > character for internal app use, where the west would benefit from UTF-8
    > internal use.

    That is basically false. It overvalues sheer memory storage
    issues (or byte stream lengths) over the other advantages that
    processing in Unicode internally can provide.

    > There seems to be *no* standard or de-facto standard libraries to deal
    > with unicode. It's either added ad-hoc to existing libraries or written
    > from scratch for each app.

    If you mean no C or C++ standard libraries, then that is pretty
    much the case, because the standard libraries for those weren't
    designed with Unicode in mind, don't scale well to Unicode, and
    are stuck with the wchar_t problems.

    > I've seen ICU from IBM which seems pretty good but it was built with the
    > "java mentality"

    The "java mentality" is one of the aspects of Unicode that
    you may have to get used to. Unicode *is* the character set,
    not one among many.

    > and doesn't conform very well to linux practices and
    > seems to be overkill for most apps (20M Runtime Download?!?!). It also
    > doesn't have a compiler for its resource files. Also, sticking all
    > character strings in a resource file doesn't sit well with me.

    It is work, but ICU is open source, and you can (and probably should)
    pare away all the stuff you don't want to deal with. If you
    remove the collation and localization parts of the source, you can
    compact it down to a much tighter runtime. It is up to you what
    kinds of functionality you need.

    > And So,
    > My question(s) to you all are:
    > If ICU is big, bloated and doesn't follow conventions, and glib doesn't
    > handle all the things necessary (string manipulation), is there a good
    > library that handles unicode well and doesn't come along with megs of
    > unnecessary things? (Glib has tons of stuff I wouldn't be using)

    Same thing with ICU. Pare away the stuff you don't need.

    And you'll increasingly find, I think, that ICU *is* the convention
    that others tend to follow. It is built that way for a reason.

    > I would like to use UTF-8 internally within my app as it seems much less
    > memory intensive,

    This is a mistake, I believe. UTF-8 works well if you are basically
    streaming data or working on strings with operations (copy, length,
    etc.) that don't care much about their internal content. Once you
    start parsing strings for content, UTF-8 gets much more complicated
    than UTF-16, and you start regressing in code complexity. Tables
    are also much more of a nightmare to construct, store, and access
    in UTF-8, if you try to do it directly in UTF-8. UTF-16 is much
    more tractable for table lookup.
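
    One way to see the difference is to compare what it takes to pull
    the next code point out of each form. The two functions below are a
    sketch only, with hypothetical names and a simple policy of returning
    U+FFFD for ill-formed input; both assume *i < len on entry. They are
    not how ICU or any particular library does it.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT 0xFFFDu

/* UTF-16: a single special case, the surrogate pair. */
uint32_t next_utf16(const uint16_t *s, size_t len, size_t *i)
{
    uint32_t u = s[(*i)++];
    if (u >= 0xD800 && u <= 0xDBFF && *i < len) {
        uint32_t lo = s[*i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            (*i)++;
            return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    if (u >= 0xD800 && u <= 0xDFFF)
        return REPLACEMENT;          /* unpaired surrogate */
    return u;
}

/* UTF-8: four sequence lengths, continuation-byte checks, and
   rejection of overlong forms and encoded surrogates. */
uint32_t next_utf8(const uint8_t *s, size_t len, size_t *i)
{
    uint32_t b = s[(*i)++];
    uint32_t cp, min;
    int extra;

    if (b < 0x80)
        return b;
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
    else
        return REPLACEMENT;          /* stray continuation or invalid lead */

    while (extra-- > 0) {
        if (*i >= len || (s[*i] & 0xC0) != 0x80)
            return REPLACEMENT;      /* truncated sequence */
        cp = (cp << 6) | (s[(*i)++] & 0x3Fu);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return REPLACEMENT;          /* overlong, out of range, or surrogate */
    return cp;
}
```

    The same asymmetry shows up in tables: a BMP character is one 16-bit
    unit that can index a table directly, whereas a UTF-8 sequence has to
    be decoded (or walked byte by byte) before any lookup can happen.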

    > but that would also mean I have to rewrite every
    > single string and char manipulation function myself to deal with UTF-8
    > (wow what a chore!). Is there a better way to deal with that?
    > Am I barking up the wrong tree here? A lot of people say to use UTF-16
    > internally and convert to UTF-8 for output...

    Yep. I recommend that.
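
    The conversion at the output boundary is small. Here is a minimal
    sketch, assuming the caller supplies the output buffer; the function
    name, the (size_t)-1 error return, and the choice to replace unpaired
    surrogates with U+FFFD are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert in_len UTF-16 code units to UTF-8.
   Returns the number of bytes written, or (size_t)-1 if out is too small.
   Unpaired surrogates are written as U+FFFD. */
size_t utf16_to_utf8(const uint16_t *in, size_t in_len,
                     uint8_t *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    while (i < in_len) {
        uint32_t cp = in[i++];

        if (cp >= 0xD800 && cp <= 0xDBFF && i < in_len &&
            in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
            /* Combine a surrogate pair into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i++] - 0xDC00u);
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;                     /* unpaired surrogate */
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (need > out_cap - o)
            return (size_t)-1;               /* output buffer too small */

        if (need == 1) {
            out[o++] = (uint8_t)cp;
        } else if (need == 2) {
            out[o++] = (uint8_t)(0xC0 | (cp >> 6));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (uint8_t)(0xE0 | (cp >> 12));
            out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (uint8_t)(0xF0 | (cp >> 18));
            out[o++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (uint8_t)(0x80 | (cp & 0x3F));
        }
    }
    return o;
}
```

    A buffer of 3 * in_len bytes is always large enough, since each
    16-bit unit expands to at most three bytes and a surrogate pair (two
    units) becomes four.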

    > I appreciate any help you can give me.

    I'm sure you'll get a flood of further advice of all sorts
    from the denizens of this list, shortly. ;-)

