Unicode string functions

From: Theo (delete@softhome.net)
Date: Mon Aug 06 2001 - 18:16:12 EDT


Hi people,

this is my first post to this list.

I want to write a set of string functions for dealing with Unicode. I
am writing both an XML engine and an XML editor (which I have
imaginatively called "XML Engine" and "XML Editor"). Now, of course,
XML 1.0 specifies that I need to deal with Unicode, which my engine
doesn't currently do.

I've written a lot of string functions before. Most of my programs deal
with data, in fact, especially string data. I find dealing with strings
quite fun for some reason, and I've written a plugin for my language
(REALbasic) that speeds up many string operations and adds loads of new
string features to REALbasic.

Anyhow, I find this Unicode system rather confusing so far. I hope it
will be easy to deal with, but I am not sure.

Please correct me on the basic facts of Unicode here. I am just
repeating them in order to see if anyone tells me I am wrong :o). Ok:

From what I can tell, Unicode is simply a numbering system, plus an
encoding system for these numbers (a very crude summary, but it is
enough for me, for now). Unicode allows for "compression", in UTF8 and
UTF16, so that you can fit lower ASCII into 1 byte, while other numbers
that aren't lower ASCII may need many more bytes. UTF32 represents the
numbers "as is", no compression needed.

Now, I have heard that dealing with UTF8/UTF16 is a real nuisance,
because to extract, say, characters 6 through 8 of a string, you need
to start at the beginning and loop the entire way through, detecting
which bytes are "multiple bytes".
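
To make the scanning problem concrete, here is a minimal sketch in
Python (not REALbasic) of pulling characters 6 through 8 out of UTF8
data. The function name utf8_slice is made up for illustration, and the
sketch assumes the input is valid UTF8 and the positions are in range.

```python
def utf8_slice(data: bytes, start: int, end: int) -> bytes:
    """Return the bytes for characters start..end (1-based, inclusive).

    Hypothetical helper: assumes valid UTF-8 and in-range positions.
    """
    offsets = []  # byte offset at which each character begins
    for i, b in enumerate(data):
        # Continuation bytes have the bit pattern 10xxxxxx; any other
        # byte starts a new character.
        if (b & 0xC0) != 0x80:
            offsets.append(i)
    offsets.append(len(data))  # sentinel marking the end of the data
    return data[offsets[start - 1]:offsets[end]]


sample = "naïve café".encode("utf-8")
print(utf8_slice(sample, 6, 8))
```

The loop has to look at every byte before the requested range, which is
the cost described above. This version scans the whole string for
simplicity; stopping once the end position is reached would be enough.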

I'd rather avoid dealing with UTF compression schemes, for the sake of
RAM (more code), speed (more to do), and bugs (complex code is buggier
code).

OK, so I have an easy solution. My language, REALbasic, has a
"TextConverter" object that lets you convert between a huge range of
text encoding schemes! This feature is provided by Apple and Microsoft,
and REALbasic simply gives you access to it. So, I can convert from
UTF8/UTF16 to UTF32, and use some string functions I will write that
specifically deal with UTF32.
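
As a rough sketch of that plan, written in Python because its built-in
codecs are easy to show (this is not the TextConverter API), the round
trip might look like the following. The names to_utf32, from_utf32 and
char_at are made up for illustration, and little-endian UTF32 is an
arbitrary choice.

```python
def to_utf32(data: bytes, source_encoding: str) -> bytes:
    """Decode from the source encoding and re-encode as UTF-32 (LE)."""
    return data.decode(source_encoding).encode("utf-32-le")


def from_utf32(buf: bytes, target_encoding: str) -> bytes:
    """Convert a UTF-32 (LE) buffer back to another encoding for saving."""
    return buf.decode("utf-32-le").encode(target_encoding)


def char_at(buf: bytes, index: int) -> int:
    """Return the number stored at a 0-based character index.

    Every character occupies exactly 4 bytes, so no scanning is needed.
    """
    return int.from_bytes(buf[4 * index:4 * index + 4], "little")


buf = to_utf32("aé".encode("utf-8"), "utf-8")
print(len(buf), hex(char_at(buf, 1)))
```

The point of the fixed width is that char_at is a plain multiplication,
with no loop over the preceding bytes.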

OK, so in short:

If I am writing some string functions for UTF32 (with XML in mind), are
there stumbling blocks I may come across? I know almost nothing about
Unicode. Can I just treat each char as a 4-byte character, and not care
about any other Unicode special features? If there are special features
(zero-width characters, etc.), what do I have to do about them?
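
One example of such a special feature, sketched in Python with its
standard unicodedata module: an accented letter can be stored either as
one number or as a base letter followed by a combining mark, so one
4-byte unit is not always one character as the user sees it. The helper
name is_combining is made up for illustration.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    """True if the character attaches to the one before it."""
    return unicodedata.combining(ch) != 0


composed = "\u00e9"     # e with acute accent, as a single number
decomposed = "e\u0301"  # plain e followed by COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))
print(is_combining("\u0301"), is_combining("e"))
# Normalizing to NFC folds the two-number form into the single one.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Whether an XML engine needs to care about this depends on what it does
with the text; the sketch only shows that the two forms exist.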

The string functions I'll need are:

*Character set functions (search through a string, testing whether each
character is inside a given character set, and return the position of
the first character found that is, or isn't, in the character set)

*String searching functions (searches through one string, for another)

*Character getting functions (Get one character at a time)

*string extraction functions (get a segment of one string, and create a
new string that is a copy of it)

*string creation functions (create a string, that is just a number of
characters long, each of the same character value)
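
For illustration, the functions listed above could be sketched like
this in Python, treating a string as a plain list of numbers (one per
character, as in UTF32). All of the names and signatures here are made
up, positions are 0-based, and -1 means "not found".

```python
from typing import List, Set


def find_in_set(s: List[int], charset: Set[int], start: int = 0,
                want_member: bool = True) -> int:
    """Position of the first character that is (or isn't) in charset."""
    for i in range(start, len(s)):
        if (s[i] in charset) == want_member:
            return i
    return -1


def find_sub(haystack: List[int], needle: List[int], start: int = 0) -> int:
    """Position of the first occurrence of needle, by naive comparison."""
    n = len(needle)
    for i in range(start, len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


def char_at(s: List[int], index: int) -> int:
    """Get one character; a direct index, since each has a fixed size."""
    return s[index]


def extract(s: List[int], start: int, length: int) -> List[int]:
    """Copy a segment of s into a new string."""
    return s[start:start + length]


def repeat(ch: int, count: int) -> List[int]:
    """Create a string of count copies of the same character value."""
    return [ch] * count


doc = [ord(c) for c in "<tag attr>"]
print(find_in_set(doc, {ord(" "), ord(">")}))
print(find_sub(doc, [ord(c) for c in "attr"]))
```

None of these needs to know anything about byte lengths, which is the
attraction of working on fixed-width data.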

And then of course, I'll have to reconvert back to UTF8/UTF16 on
saving! Quite a bit of work, but nothing compared to what I have done
so far.

Could I forgo all characters that can't be expressed in UTF16 without
taking more than 2 bytes? That way, I can halve the RAM needed for
large XML documents. I'd rather not code two versions of each string
function; one is enough. So it is 2 bytes, or 4 bytes. I read on
unicode.org that all characters of almost all languages can be
contained in 2 bytes.
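
To show where the 2-byte limit falls, here is a small Python sketch.
Numbers up to FFFF (hex) fit in one 2-byte UTF16 unit; anything above
is stored as a pair of units called surrogates. The function names are
made up, and the sketch assumes the input numbers are valid (not
themselves in the surrogate range D800 to DFFF, and no larger than
10FFFF).

```python
from typing import List


def utf16_units(cp: int) -> List[int]:
    """Return the 16-bit units UTF-16 uses for one character number."""
    if cp <= 0xFFFF:
        return [cp]  # fits in a single 2-byte unit
    v = cp - 0x10000
    # Top 10 bits go in the high surrogate, low 10 bits in the low one.
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def needs_more_than_2_bytes(text: str) -> bool:
    """True if any character in text lies above FFFF."""
    return any(ord(c) > 0xFFFF for c in text)


print([hex(u) for u in utf16_units(0x1D11E)])  # MUSICAL SYMBOL G CLEF
print(needs_more_than_2_bytes("plain text"))
```

A check like needs_more_than_2_bytes could at least detect, at load
time, a document that a 2-byte-only engine would not handle.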

Which languages need 4 bytes or more per character? If they are spoken
only by some unheard-of tribe with 100 living members, perhaps I can
just forgo them?

--
    This email was probably cleaned with Email Cleaner, by:
    Theodore H. Smith - Macintosh Consultant / Contractor.
    My website: <www.elfdata.com/>


