From: Marcin 'Qrczak' Kowalczyk (firstname.lastname@example.org)
Date: Tue Apr 05 2005 - 03:09:09 CST
"Arcane Jill" <email@example.com> writes:
> In particular, I have played around with writing code-generators, of
> the ilk which Ken mentioned in another post on this thread, and I
> /never/ assumed that all (or indeed, any) generated codepoints would
> be 16-bits wide. That would be a really dumb thing to do. Why is
> anyone even mentioning this as a possibility?
Since code produced by my generator is embedded in every program
compiled by my compiler, the primary goal is small data and code size.
I can live with updating the code when a UCD change breaks some
assumptions. I mean only the tables which give raw decomposition data.
Strings are represented in ISO-8859-1 and UTF-32; there is no BMP bias
in interfaces - only in some internally used tables.
The representation I used before for canonical decomposition:
- An array of 256 pointers to arrays of 256 pairs of 16-bit words
gives decompositions of BMP characters. A pair is 0,0 for no
decomposition, X,0 for a single-char decomposition and X,Y for
two-char decomposition. All-zero pages are shared.
- An array of 32-bit words gives the single-character decompositions
of 542 characters starting from U+2F800.
- The remaining 13 characters with decompositions are handled by
a switch statement in the code.
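
For readers who want to see the shape of such a lookup, here is a minimal
sketch in C. It is not the generated code itself: the names (Pair16,
bmp_pages, supp_singles, canonical_decompose) are hypothetical, the tables
hold only a handful of sample entries instead of the full UCD data, and a
NULL page pointer stands in for the shared all-zero page to keep the
example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One BMP entry: {0,0} = no decomposition, {X,0} = one char, {X,Y} = two. */
typedef struct { uint16_t first, second; } Pair16;

/* The single all-zero page that every empty 256-character block shares. */
static const Pair16 zero_page[256] = {{0, 0}};

/* Sample pages; a real generator would emit every non-empty page. */
static const Pair16 page_00[256] = {
    [0xC0] = {0x0041, 0x0300},   /* U+00C0 -> U+0041 U+0300 */
    [0xC5] = {0x0041, 0x030A},   /* U+00C5 -> U+0041 U+030A */
};
static const Pair16 page_21[256] = {
    [0x26] = {0x03A9, 0},        /* U+2126 -> U+03A9 (single char) */
    [0x2B] = {0x00C5, 0},        /* U+212B -> U+00C5 (single char) */
};

/* 256 page pointers indexed by the high byte of a BMP code point.
   Assumption of this sketch: NULL means "use zero_page". */
static const Pair16 *const bmp_pages[256] = {
    [0x00] = page_00,
    [0x21] = page_21,
};

/* Single-character decompositions for the 542 characters from U+2F800.
   Only the first slot is filled in here; 0 means "not filled in". */
static const uint32_t supp_singles[542] = {
    [0] = 0x4E3D,                /* U+2F800 -> U+4E3D */
};

/* Writes up to two code points to out and returns how many were written
   (0 means cp has no canonical decomposition in these sample tables). */
static int canonical_decompose(uint32_t cp, uint32_t out[2])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        const Pair16 *page = bmp_pages[cp >> 8];
        if (page == NULL)
            page = zero_page;
        Pair16 e = page[cp & 0xFF];
        if (e.first == 0)
            return 0;
        out[0] = e.first;
        if (e.second == 0)
            return 1;
        out[1] = e.second;
        return 2;
    }
    if (cp >= 0x2F800 && cp < 0x2F800 + 542) {
        uint32_t d = supp_singles[cp - 0x2F800];
        if (d == 0)
            return 0;
        out[0] = d;
        return 1;
    }
    /* The few remaining non-BMP characters; one sample case shown. */
    switch (cp) {
    case 0x1D15E: out[0] = 0x1D157; out[1] = 0x1D165; return 2;
    default:      return 0;
    }
}
```

Each BMP lookup costs two array indexings and a comparison or two, and
the 16-bit pairs keep a page at 1 KB.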
A change needed for Unicode 4.1:
- When 0xFFFF is stored in the place for a single-character
decomposition, an additional switch statement finds the real
decomposition. This affects 6 characters.
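
A rough sketch in C of how such an escape value could be handled. The
names (ESCAPE, Pair16, escaped_decomposition, expand_entry) are
hypothetical, and only two of the six affected characters are listed,
with mappings quoted from memory of the UCD, so treat the data as
illustrative. U+FFFF is a noncharacter, so it cannot collide with a
real decomposition target.

```c
#include <assert.h>
#include <stdint.h>

/* Marker stored in a 16-bit slot when the real target is outside the BMP. */
#define ESCAPE 0xFFFFu

/* {0,0} = no decomposition, {X,0} = one char, {X,Y} = two chars. */
typedef struct { uint16_t first, second; } Pair16;

/* The additional switch: maps an escaped BMP character to its real
   (non-BMP) single-character decomposition. Returns 0 if cp is unknown.
   Only two sample cases are listed in this sketch. */
static uint32_t escaped_decomposition(uint32_t cp)
{
    switch (cp) {
    case 0xFACF: return 0x2284A;
    case 0xFAD0: return 0x22844;
    default:     return 0;
    }
}

/* Expands the table entry e found for cp into out and returns the
   number of code points written (0, 1 or 2). */
static int expand_entry(uint32_t cp, Pair16 e, uint32_t out[2])
{
    if (e.first == 0)
        return 0;
    if (e.first == ESCAPE && e.second == 0) {
        uint32_t real = escaped_decomposition(cp);
        /* The generator must emit a case for every escaped entry. */
        assert(real != 0);
        out[0] = real;
        return 1;
    }
    out[0] = e.first;
    if (e.second == 0)
        return 1;
    out[1] = e.second;
    return 2;
}
```

The common path pays one extra comparison; the table entries stay
16 bits wide.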
I claim that it was not a bad idea to use 16-bit entries in the
Compatibility decomposition is another story. A decomposition may be
longer (up to 18 characters), but currently only BMP characters are
produced (even for the range of 1024 characters with some holes starting
from U+1D400, the only non-BMP characters having compatibility
decompositions), so my code doesn't currently include a mechanism for
producing non-BMP
-- 
   __("<         Marcin Kowalczyk
   \__/       firstname.lastname@example.org
    ^^     http://qrnik.knm.org.pl/~qrczak/