Re: Name Compression (was: Longest Names)

From: Torsten Mohrin (mohrin@sharmahd.com)
Date: Wed May 10 2000 - 10:25:36 EDT


BTW, in case somebody is interested...

In SC UniPad we use a compressed name table. The names are compressed
by encoding the words either in one or two bytes. The separators
(space and hyphen-minus) are encoded in a special way. It works as
follows:

Scalar values are assigned to all the words (ca. 3700 in Unicode 3.0)
depending on their frequency. The most frequent word ("LETTER") has
value 0, "WITH" has value 1 and so on.

The compressed character name is a byte stream. The lower 6 bits of
each byte are used to encode the scalar values. Values from 0 to 63
can be encoded in 1 byte, values from 64 to 4095 must be encoded in 2
bytes. Although not necessary at the moment, this theoretically works
with a 3rd byte (for the values from 4096 to 262143).

The upper 2 bits are used to encode the separators and to signal a
second byte or the end of the name:

00: this is the last byte of the name
    (no additional string terminator is needed)
01: the word is followed by a space
10: the word is followed by a hyphen minus
11: a second byte follows to encode values greater than 63

For example:

name: CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
values: 0x0017 0x0008 0x0000 0x05E7 0x02FF 0x0035

-> 0x57 0x48 0x40 0xE7 0x97 0xFF 0x4B 0x35

 0x57 01 010111 "CYRILLIC "
 0x48 01 001000 "CAPITAL "
 0x40 01 000000 "LETTER "
 0xE7 0x97 11 100111 10 010111 "BYELORUSSIAN-"
 0xFF 0x4B 11 111111 01 001011 "UKRAINIAN "
 0x35 00 110101 "I\0"
    
All the uncompressed names take ca. 262,000 bytes (ASCII, including
'\0' for each name). The compressed names take ca. 59,000 bytes. The
word table takes ca. 29,000 bytes. Together this is 1/3 of the
uncompressed size.

Comments and suggestions are welcome.

--
Torsten Mohrin
Sharmahd Computing GmbH, Hannover, Germany
Phone: +49-511-13780, Fax: +49-511-13450
http://www.sharmahd.com, mohrin@sharmahd.com



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT