UTF-64 [warning: contains bits & bytes humor] (was RE: [OT] bits and bytes)

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue May 29 2001 - 15:45:47 EDT


I originally thought this could be a way of storing Unicode text in databases.

However, after some thinking, I decided that idea was completely bogus, so I
thought to turn it into a joke for geeks. But it wasn't even amusing, so it
went in the "Deleted Items" folder.

However, I see that illogical ideas seem quite popular these days in the
field of databases, even "despite the logic of the arguments presented
against" them, so perhaps someone will like it.

Christopher JS Vance wrote (on May 18, 2001):
> On DEC-10, with a 36-bit word, a byte was anywhere between 1 and 36
> bits. They typically packed 5 ASCII-7 characters into a word with the
> extra bit unused.

So that's what the "packed" keyword in Pascal was for!

I was wondering: could something like this be revived in the age of 64-bit
words and Unicode?

A block of 64 bits can fit 9 ASCII 7-bit characters (0.888889 octets per
character: more efficient than DEC-10's packed ASCII!), or 3 Unicode 21-bit
characters (2.666667 octets per character, which is not so bad for a
millionaire character set).

Both options leave one bit free (9*7 = 3*21 = 63), and that 64th bit can be
used to distinguish the two options, so that both can coexist in the same text
stream. So, let's say that high bit 0 identifies 9*7 blocks, and high bit 1
identifies 3*21 blocks.
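
As a rough sketch of the two block layouts just described, in Python. The function names pack_ascii9 and pack_wide3 are made up for illustration, and putting the first character in the most significant position is an assumption: the message only fixes the meaning of the high bit, not the order of the fields below it.

```python
def pack_ascii9(chars: str) -> int:
    """Pack exactly 9 characters below U+0080 into one 64-bit block.

    9 * 7 = 63 bits are used, so bit 63 (the high bit) stays 0.
    """
    assert len(chars) == 9 and all(ord(c) < 0x80 for c in chars)
    block = 0
    for c in chars:
        block = (block << 7) | ord(c)
    return block


def pack_wide3(chars: str) -> int:
    """Pack exactly 3 code points into one 64-bit block.

    Starting from 1 and shifting left by 3 * 21 = 63 bits in total
    leaves the flag in bit 63, marking a 3*21 block.
    """
    assert len(chars) == 3 and all(ord(c) < (1 << 21) for c in chars)
    block = 1
    for c in chars:
        block = (block << 21) | ord(c)
    return block
```

Note that Python spells the escapes with eight hex digits, e.g. "\U00010300" for the character written \U0010300 in this message.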

E.g., a string like "Good day \U0010300\U0010305\U0010304" can be packed in
only two 64-bit blocks, or 16 octets (a big saving compared to the 48 octets
needed in UTF-32, the 30 octets needed in UTF-16, or even the 21 octets
needed in UTF-8):

        "Good day ": 9 characters * 7 bits = 1 block
        "\U0010300\U0010305\U0010304": 3 characters * 21 bits = 1 block

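The octet counts quoted for this example can be checked with a few lines of Python. This is only a sanity check, assuming the BOM-less big-endian forms of UTF-16 and UTF-32; the UTF-64 figure is simply two blocks of 8 octets, since no codec for it exists.

```python
# The example string: 9 ASCII characters followed by 3 Old Italic letters.
text = "Good day \U00010300\U00010305\U00010304"

sizes = {
    "UTF-32": len(text.encode("utf-32-be")),  # 4 octets per character
    "UTF-16": len(text.encode("utf-16-be")),  # surrogate pairs above U+FFFF
    "UTF-8": len(text.encode("utf-8")),       # 1 octet ASCII, 4 octets above U+FFFF
    "UTF-64": 2 * 8,                          # one 9*7 block plus one 3*21 block
}

for name, octets in sizes.items():
    print(f"{name}: {octets} octets")
```
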
Of course, I have been slightly cheating by choosing a phrase that has exactly
9 7-bit characters and 3 21-bit characters. In reality, boundaries
between runs of characters in different ranges occur wherever they please.
This means that some characters in the ASCII range have to be encoded in 3*21
blocks:

E.g., a string like "Good night \U0010300\U0010305\U0010304" is not so
lucky:

        "Good nigh": 9 characters * 7 bits = 1 block
        "t \U0010300": 3 characters * 21 bits = 1 block
        "\U0010305\U0010304": (2 characters + 1 unused position) * 21 bits = 1 block

Notice that one position is unused in the last block. For this reason, a bit
combination must be reserved as a padding code.

This is not a big problem, because the highest Unicode character is
0x10FFFD, much less than the highest 21-bit number. Code 0x1FFFFF is one
nice choice for the filler value.

The basic rules for encoding Unicode with these 64-bit blocks could then be:

        1) If there are at least 9 more characters to encode from the current
position, and the next 9 are all less than U+0080, pack them in a 9*7 block and
move the current position 9 positions forward. Go back to point 1.

        2) Else, if there are at least 3 more characters to encode from the
current position, pack the next 3 in a 3*21 block and move the current position
3 positions forward. Go back to point 1.

        3) Else, if there are 1 or 2 more characters to encode from the
current position, pack them in a 3*21 block, padding the unused 21-bit
positions with 0x1FFFFF. The encoding process is ended.

        4) Else the encoding process is ended.
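
The rules above could be sketched in Python roughly as follows. This is an illustration, not a reference implementation: the name utf64_encode is invented here, blocks are returned as plain integers, and the field order inside a block (first character in the most significant position) is an assumption the message does not spell out. After each block the position advances by the number of characters packed into it.

```python
PAD = 0x1FFFFF  # filler value suggested in the message


def utf64_encode(text: str) -> list[int]:
    """Encode text as a list of 64-bit blocks (one int per block)."""
    blocks = []
    i = 0
    while i < len(text):
        run = text[i:i + 9]
        if len(run) == 9 and all(ord(c) < 0x80 for c in run):
            # Rule 1: nine ASCII characters, high bit left at 0.
            block = 0
            for c in run:
                block = (block << 7) | ord(c)
            i += 9
        else:
            # Rules 2 and 3: up to three code points, padded to three.
            codes = [ord(c) for c in text[i:i + 3]]
            codes += [PAD] * (3 - len(codes))
            block = 1  # shifted up to bit 63 by the loop below
            for cp in codes:
                block = (block << 21) | cp
            i += 3
        blocks.append(block)
    # Rule 4: nothing left to encode.
    return blocks
```

With this sketch, the "Good day" example yields two blocks and the "Good night" example yields three, the last one ending in the padding code.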

For the joy of those who collect unconventional and/or aborted UTF's, I will
name this "UTF-64".

UTF-64 has a single CES (let's say big-endian). The reason is that, if you
don't know where the high bit is, there is no way of making sense of those
64-bit packs.
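
To make the byte-order point concrete, here is a small Python sketch that serializes a block big-endian and decodes it again. The names block_to_bytes and decode_block are hypothetical, and the field order inside a block is assumed to be "first character in the most significant position". With a fixed big-endian order, the high bit is always the top bit of the first octet.

```python
import struct

PAD = 0x1FFFFF  # filler value suggested in the message


def block_to_bytes(block: int) -> bytes:
    """Serialize one 64-bit block as 8 octets, most significant first."""
    return struct.pack(">Q", block)


def decode_block(raw: bytes) -> str:
    """Decode one 8-octet big-endian block back into characters."""
    (block,) = struct.unpack(">Q", raw)
    if raw[0] & 0x80:
        # High bit 1: three 21-bit fields, padding positions skipped.
        # chr() raises ValueError for non-padding values above 0x10FFFF.
        codes = [(block >> shift) & 0x1FFFFF for shift in (42, 21, 0)]
        return "".join(chr(cp) for cp in codes if cp != PAD)
    # High bit 0: nine 7-bit fields.
    return "".join(chr((block >> shift) & 0x7F) for shift in range(56, -1, -7))
```

Reading the same 8 octets in the opposite order would move the flag to a different bit, which is why a second byte order would be unreadable without extra information.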

Of course, if super-intelligent aliens arrive on our planet, bearing a
writing system with billions of characters, I will withdraw this proposal and
donate the name "UTF-64" to the Unicode Consortium.

_ Marco


