From: D. Starner (shalesller@writeme.com)
Date: Mon Dec 06 2004 - 20:03:52 CST
(Sorry for sending this twice, Marcin.)
"Marcin 'Qrczak' Kowalczyk" writes:
> UTF-8 is poorly suitable for internal processing of strings in a
> modern programming language (i.e. one which doesn't already have a
> pile of legacy functions working on bytes, but which can be designed
> to make Unicode convenient at all). It's because code points have
> variable lengths in bytes, so extracting individual characters is
> almost meaningless (unless you care only about the ASCII subset, and
> sequences of all other characters are treated as non-interpreted bags
> of bytes). You can't even have a correct equivalent of C isspace().
That's assuming that the programming language is similar to C and Ada.
If you're talking about a language that hides the structure of strings
and has no problem with variable-length data, then it wouldn't matter
what the internal representation of the string looks like. You'd need to
use iterators and discourage arbitrary indexing, but arbitrary
indexing is rarely important.
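To make the iterator idea concrete, here's a minimal sketch (my own illustration, not anything from a real implementation) of walking a UTF-8 byte string one code point at a time, assuming well-formed input. The lead byte tells you the sequence length, so an iterator never needs random indexing:

```python
def codepoints(data: bytes):
    """Yield the code points of a well-formed UTF-8 byte string, one at a time.

    The lead byte determines how many bytes the sequence occupies,
    so sequential iteration is cheap even though indexing is not.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n = 1          # ASCII: single byte
        elif b < 0xE0:
            n = 2          # 2-byte sequence (lead byte 110xxxxx)
        elif b < 0xF0:
            n = 3          # 3-byte sequence (lead byte 1110xxxx)
        else:
            n = 4          # 4-byte sequence (lead byte 11110xxx)
        yield data[i:i + n].decode("utf-8")
        i += n
```

With an iterator like this, an isspace() equivalent is no problem either: you test each decoded code point, e.g. `any(c.isspace() for c in codepoints(data))` correctly catches a non-breaking space (U+00A0) that a byte-level scan for 0x20 would miss.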
You could hide combining characters, which would be extremely useful if
we were just using Latin and Cyrillic scripts. You'd have to be flexible,
since it would be natural to step through a Hebrew or Arabic string as if the
vowels were written inline, and people might want to look at the individual
combining characters (a need that would be incredibly rare if your language
already provided most standard Unicode functions).
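A rough sketch of what "hiding" combining characters could look like: group each base character with its trailing combining marks. This is only an approximation of grapheme clusters (real segmentation follows Unicode's boundary rules, UAX #29, and handles more cases), but it shows the idea:

```python
import unicodedata

def clusters(s: str):
    """Yield base characters together with their trailing combining marks.

    A crude approximation of grapheme clusters: any character with a
    nonzero canonical combining class is attached to the preceding base.
    """
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            # A new base character starts a new cluster.
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster
```

Stepping through "e" + COMBINING ACUTE ACCENT + "a" with this yields two units instead of three code points, which is what a user stepping through "characters" usually expects.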
This archive was generated by hypermail 2.1.5 : Mon Dec 06 2004 - 20:06:21 CST