Re: Getting A Newb Started

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 07 2008 - 14:22:01 CDT

Next message: William J Poser: "Re: Getting A Newb Started"

Previous message: John Hudson: "Normalisation and directionality (was: how to add all latin (and greek) subscripts)"
Maybe in reply to: J: "Getting A Newb Started"
Next in thread: William J Poser: "Re: Getting A Newb Started"

> Things I've learned:
> the wchar_t definition is platform dependent. That makes it completely
> useless, which also makes all standard wide-char functions useless.

I concur with that. I've been building platform-independent
Unicode support code for years now -- and the first thing I do
is define a 16-bit UTF-16 data type and a 32-bit UTF-32 data type
and junk all the wchar_t's and anything that depends on them.
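
To make that concrete, here is a minimal sketch of such definitions in C,
assuming a C99 compiler with <stdint.h>. The names utf16_unit, utf32_unit,
and utf16_len are hypothetical, chosen only for illustration; they are not
from any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type names. uint16_t and uint32_t are exact-width types,
 * so unlike wchar_t their size does not vary by platform. */
typedef uint16_t utf16_unit;   /* one UTF-16 code unit */
typedef uint32_t utf32_unit;   /* one UTF-32 code unit, i.e. one code point */

/* Length in code units of a zero-terminated UTF-16 string.
 * A stand-in for wcslen() that does not depend on wchar_t. */
size_t utf16_len(const utf16_unit *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}
```

Note that utf16_len counts code units, not characters: a supplementary
character stored as a surrogate pair counts as two.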

> Using wchar_t would double or quadruple the memory usage of my app.

I presume you would still claim that even with well-defined 16-bit
and 32-bit types, but it is seldom actually the case unless you have
a very unusual application that is completely dominated by raw text
storage size and whose data content is, in turn, dominated by ASCII
characters.
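
As a rough illustration of the per-character costs, the sketch below
computes the encoded size of a single code point in UTF-8 and in UTF-16.
The function names are hypothetical; the size boundaries come from the
definitions of the two encoding forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8. */
size_t utf8_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;   /* e.g. Latin-1 letters, Greek, Cyrillic */
    if (cp < 0x10000) return 3;   /* rest of the BMP, e.g. CJK */
    return 4;                     /* supplementary planes */
}

/* Bytes needed to encode one code point in UTF-16. */
size_t utf16_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;  /* one code unit, or a surrogate pair */
}
```

Only the ASCII range doubles in size when moving from UTF-8 to UTF-16.
From U+0080 to U+07FF the two are the same size, and for the rest of the
BMP UTF-16 is the smaller of the two.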

> Only Asian countries benefit directly from defaulting to a wide
> character for internal app use, where the west would benefit from UTF-8
> internal use.

That is basically false. It gives too much weight to sheer memory
storage (or byte stream length) and too little to the other
advantages that processing in Unicode internally can provide.

> There seems to be *no* standard or de-facto standard libraries to deal
> with unicode. It's either added ad-hoc to existing libraries or written
> from scratch for each app.

If you mean no C or C++ standard libraries, then that is pretty
much the case, because the standard libraries for those weren't
designed with Unicode in mind, don't scale well to Unicode, and
are stuck with the wchar_t problems.

> I've seen ICU from IBM which seems pretty good but it was built with the
> "java mentality"

The "java mentality" is one of the aspects of Unicode that
you may have to get used to. Unicode *is* the character set,
not one among many.

> and doesn't conform very well to linux practices and
> seems to be overkill for most apps (20M Runtime Download?!?!). It also
> doesn't have a compiler for its resource files. Also, sticking all
> character strings in a resource file doesn't sit well with me.

It is work, but ICU is open source, and you can (and probably should)
pare away all the stuff you don't want to deal with. If you
remove the collation and localization parts of the source, you can
compact it down to a much tighter runtime. It is up to you what
kinds of functionality you need.

> And So,
> My question(s) to you all are:
> If ICU is big, bloated and doesn't follow conventions, and glib doesn't
> handle all the things necessary (string manipulation), is there a good
> library that handles unicode well and doesn't come along with megs of
> unnecessary things? (Glib has tons of stuff I wouldn't be using)

Same thing with ICU. Pare away the stuff you don't need.

And you'll increasingly find, I think, that ICU *is* the convention
that others tend to follow. It is built that way for a reason.

> I would like to use UTF-8 internally within my app as it seems much less
> memory intensive,

This is a mistake, I believe. UTF-8 works well if you are basically
streaming data or working on strings with operations (copy, length,
etc.) that don't care much about their internal content. Once you
start parsing strings for content, UTF-8 gets much more complicated
than UTF-16, and you start regressing in code complexity. Tables
are also much more of a nightmare to construct, store, and access
in UTF-8, if you try to do it directly in UTF-8. UTF-16 is much
more tractable for table lookup.
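
To give a feel for the difference, here is a sketch of decoding one code
point from each form. The function names and the error policy (return
U+FFFD and consume one unit on any malformed input) are assumptions made
for this illustration, not a recommendation for production code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from UTF-16. Returns code units consumed (1 or 2). */
size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL && len > 0);
    uint32_t u = s[0];
    if (u >= 0xD800 && u <= 0xDBFF && len >= 2 &&
        s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        *cp = 0x10000 + ((u - 0xD800) << 10) + (s[1] - 0xDC00u);
        return 2;                       /* well-formed surrogate pair */
    }
    *cp = (u >= 0xD800 && u <= 0xDFFF) ? 0xFFFD : u;  /* lone surrogate */
    return 1;
}

/* Decode one code point from UTF-8. Returns bytes consumed (1 to 4). */
size_t utf8_decode(const uint8_t *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL && len > 0);
    uint8_t b0 = s[0];
    size_t need;       /* number of continuation bytes expected */
    uint32_t min, v;   /* smallest legal value for this length; accumulator */

    if (b0 < 0x80) { *cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { need = 1; min = 0x80;    v = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { need = 2; min = 0x800;   v = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { need = 3; min = 0x10000; v = b0 & 0x07; }
    else { *cp = 0xFFFD; return 1; }    /* stray continuation or bad lead */

    if (len < need + 1) { *cp = 0xFFFD; return 1; }   /* truncated */
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        v = (v << 6) | (s[i] & 0x3Fu);
    }
    /* Reject overlong forms, out-of-range values, and encoded surrogates. */
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF)) {
        *cp = 0xFFFD;
        return 1;
    }
    *cp = v;
    return need + 1;
}
```

The UTF-16 decoder has one special case (surrogates). The UTF-8 decoder
has four length classes plus checks for truncation, overlong forms, and
encoded surrogates. For BMP characters a UTF-16 code unit can also be
used directly as an index into a lookup table, which a multi-byte UTF-8
sequence cannot.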

> but that would also mean I have to rewrite every
> single string and char manipulation function myself to deal with UTF-8
> (wow what a chore!). Is there a better way to deal with that?
>
> Am I barking up the wrong tree here? A lot of people say to use UTF-16
> internally and convert to UTF-8 for output...

Yep. I recommend that.
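
The conversion at the output boundary is small. Here is a sketch, with a
hypothetical function name and two assumed policies: unpaired surrogates
are replaced with U+FFFD, and the function returns (size_t)-1 if the
destination buffer is too small. If you are using ICU anyway, its own
conversion routines would be the natural choice instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert src_len UTF-16 code units to UTF-8 in dst (capacity dst_cap bytes).
 * Returns the number of bytes written, or (size_t)-1 if dst is too small.
 * No terminator is written. */
size_t utf16_to_utf8(const uint16_t *src, size_t src_len,
                     uint8_t *dst, size_t dst_cap)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < src_len; i++) {
        uint32_t cp = src[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < src_len &&
            src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            /* Combine a surrogate pair into one supplementary code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00u);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;   /* unpaired surrogate: assumed replacement policy */
        }

        size_t n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (dst_cap - out < n) {
            return (size_t)-1;
        }
        if (n == 1) {
            dst[out++] = (uint8_t)cp;
        } else if (n == 2) {
            dst[out++] = (uint8_t)(0xC0 | (cp >> 6));
            dst[out++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else if (n == 3) {
            dst[out++] = (uint8_t)(0xE0 | (cp >> 12));
            dst[out++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else {
            dst[out++] = (uint8_t)(0xF0 | (cp >> 18));
            dst[out++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            dst[out++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            dst[out++] = (uint8_t)(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

A destination of three bytes per input code unit is always large enough,
since a BMP code unit needs at most three bytes and a surrogate pair (two
units) needs four.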

> I appreciate any help you can give me.

I'm sure you'll get a flood of further advice of all sorts
from the denizens of this list, shortly. ;-)

--Ken


This archive was generated by hypermail 2.1.5 : Mon Jul 07 2008 - 14:24:27 CDT