From: Philippe Verdy (firstname.lastname@example.org)
Date: Sun Nov 30 2003 - 16:55:39 EST
> As long as you are sure that this will not leak out into the outside
> world, you are free to use the UTF-8 mechanism internally to represent
> any type of 31-bit data you like, including this private replacement for
> allkeys.txt. (You do know about allkeys.txt, don't you? And the fact
> that UCA is heavily customizable?)
Yes, I know allkeys.txt, and the fact that UCA is highly customizable.
This is still too complex to handle a lot of languages consistently,
and I prefer having rules that define a hierarchy tree of languages for
sorting or collating, so that a single reset of a language root will move
all its collation keys along with related characters that are normally
logically collated with them, even if they are not used in the typical
orthography of that language.
Also UCA still does not order very precisely all the characters in the
[variable] section: this is a mix of characters mostly sorted by script
type and then by code points, but many of them can be rearranged with
> It would seem to make sense primarily for retaining ASCII compatibility
> and representing smaller values in fewer bytes than larger values, so
> you would want to be sure these are your design goals too.
Unfortunately, this is IMPOSSIBLE! I need code positions between
successive ASCII positions. All I can do is to preserve 1 byte for
the ASCII character in the encoding scheme for the code position, but
other bytes will be prepended and appended.
Due to this constraint, any ASCII character will really be represented
by at least 3 bytes, and this is not intended to be used for interchange
of text, just for internal representation during processing, for lookup
tables or to extract some binary coded character properties (I have more
properties than those listed in Unicode, simply because I have inserted
properties needed for UCA and tailored collation).
> But things like this do have a tendency to leak into the outside world,
> and if this ever happens with your collation keys, you will have
> unleashed something like CESU-8 that fails the "duck test": it walks and
> talks like UTF-8, but it's not.
Be sure this won't leak out, simply because this internal encoding is
strictly for internal processing as an intermediate step. It is not
efficient enough to make it a true encoding, because it uses 1 code
per function, instead of packing several functions into bitfields.
As I have not determined the correct size of these bitfields, I need some
intermediate solution to pack them a little, and the UTF-8 TES (not the
UTF-8 CES used by Unicode) is convenient for now, until I change it to a better
encoding, which may or may not leak out (I am not sure that I need to
make the encoding accessible from an interface, except for debugging).
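
For readers following the thread, here is a minimal sketch of what "using
the UTF-8 mechanism for 31-bit data" can look like. It assumes the original
UTF-8 byte layout of up to 6 bytes per value (the one that covers
0..0x7FFFFFFF). The function names encode_31bit and decode_31bit are
hypothetical and are not taken from my implementation or from any library.
It only illustrates the byte packing, not the collation-specific code
positions discussed above.

```python
def encode_31bit(value: int) -> bytes:
    """Pack an integer in 0..0x7FFFFFFF with the 1-to-6-byte UTF-8 bit layout."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value must fit in 31 bits")
    if value < 0x80:
        return bytes([value])  # ASCII range stays a single byte
    # (exclusive upper limit, lead-byte marker, number of trail bytes)
    layouts = (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    )
    for limit, lead, n_trail in layouts:
        if value < limit:
            break
    out = bytearray()
    for _ in range(n_trail):
        out.append(0x80 | (value & 0x3F))  # 6 payload bits per trail byte
        value >>= 6
    out.append(lead | value)  # remaining high bits go into the lead byte
    return bytes(reversed(out))


def decode_31bit(data: bytes) -> int:
    """Unpack one value produced by encode_31bit (overlong forms are not rejected)."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first < 0x80:
        n_trail, value = 0, first
    elif first >> 5 == 0b110:
        n_trail, value = 1, first & 0x1F
    elif first >> 4 == 0b1110:
        n_trail, value = 2, first & 0x0F
    elif first >> 3 == 0b11110:
        n_trail, value = 3, first & 0x07
    elif first >> 2 == 0b111110:
        n_trail, value = 4, first & 0x03
    elif first >> 1 == 0b1111110:
        n_trail, value = 5, first & 0x01
    else:
        raise ValueError("invalid lead byte")
    if len(data) != n_trail + 1:
        raise ValueError("wrong sequence length")
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid trail byte")
        value = (value << 6) | (b & 0x3F)
    return value
```

Values above 0x10FFFF produce 4-, 5- or 6-byte sequences that a conformant
UTF-8 decoder must reject, which is one more reason to keep such data
strictly internal.
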
After all, the intermediate tables computed by the ICU builder are
completely internal, and their format is not guaranteed to be supported
elsewhere: these tables use their own encoding and convention, and are
strictly bound to the internal implementation of the ICU
runtime. It is the same for me.
This archive was generated by hypermail 2.1.5 : Sun Nov 30 2003 - 17:46:37 EST