RE: Unicode to UTF-8

From: Marco.Cimarosti@icl.com
Date: Thu Mar 16 2000 - 05:35:27 EST


John Cowan wrote:
> James E. Agenbroad wrote:
> > 00B1 0000 0000 1011 0001 1100 0010 1011 0001 C2B1
> 1100 0000 1011 0001 C0B1

James' value C2,B1 was correct, and it can be obtained even using your own
table:

> U+0080 to U+07FF: 110- ---- 10-- ----
                    1100 0010 1011 0001
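
For anyone who wants to check this by machine, here is a minimal Python sketch of that two-byte row. The function name utf8_two_byte is my own invention for illustration, and the sketch covers only the U+0080 to U+07FF row of the table, nothing else:

```python
def utf8_two_byte(scalar):
    """Fill in the pattern 110- ---- 10-- ---- for a scalar in U+0080..U+07FF.

    Hypothetical helper: it handles only the two-byte row of the table.
    """
    if not 0x80 <= scalar <= 0x7FF:
        raise ValueError("the two-byte form covers U+0080 to U+07FF only")
    lead = 0b11000000 | (scalar >> 6)           # 110 followed by the top 5 bits
    trail = 0b10000000 | (scalar & 0b00111111)  # 10 followed by the low 6 bits
    return bytes([lead, trail])


print(utf8_two_byte(0x00B1).hex().upper())  # C2B1
```

The lead byte takes the two bits above the low six (here 10), which is where the C2 rather than C0 comes from.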

The confusion was probably caused by that 1011 nibble, which by pure
coincidence occurs in the same position in both the scalar value and the
UTF-8 form.

By the way, thank you for this manual method: I will print it and keep it in
my wallet, next to my blood group.

By the way #2, have you noticed that the least-significant 4 bits of the
UTF-8 form are always the same as those of the corresponding scalar value?
As these 4 bits correspond exactly to the rightmost hex digit, we could
simply ignore the last digit.

This allows building a scalar-value-to-UTF-8 table containing only 4096
entries, one per group of 16 code points (well, provided we ignore all the
"new" code points U-00010000 to U-0010FFFF). E.g.:

        ...
        U+00B? = C2,B?
        ...
        U+266? = E2,99,A?
        ...

Such a list could easily be folded into one's wallet, so that one can
calculate UTF-8 conversions even on a desert island with no computers.
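
For those who do have a computer at hand before leaving for the island, the rows of such a list could be generated along the following lines. This is only a sketch: the name wallet_entry is hypothetical, it relies on Python's built-in UTF-8 encoder rather than on the manual table, and since Python refuses to encode the surrogate range U+D800 to U+DFFF, those 128 rows are skipped:

```python
def wallet_entry(prefix):
    """Return one row covering the 16 scalar values prefix*16 .. prefix*16+15.

    Hypothetical helper; prefix is the scalar value without its last hex digit.
    """
    first = prefix << 4
    encoded = chr(first).encode("utf-8")
    # Check the observation: all 16 values share every byte except the low
    # nibble of the last byte, which equals the last hex digit of the scalar.
    for low in range(16):
        other = chr(first | low).encode("utf-8")
        assert other[:-1] == encoded[:-1]
        assert other[-1] == encoded[-1] | low
    head = [f"{b:02X}" for b in encoded[:-1]]
    last = f"{encoded[-1] >> 4:X}?"
    return f"U+{prefix:03X}? = " + ",".join(head + [last])


# Build every row below U+10000, skipping the surrogate prefixes D80..DFF.
table = [wallet_entry(p) for p in range(0x1000) if not 0xD80 <= p <= 0xDFF]

print(wallet_entry(0x00B))  # U+00B? = C2,B?
print(wallet_entry(0x266))  # U+266? = E2,99,A?
```

The assertions inside the loop would fail if the low-nibble observation did not hold for some row, so a clean run doubles as a check of the claim for everything below U+10000.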

Ciao. Marco


