From: Doug Ewell (dewell@adelphia.net)
Date: Sat May 07 2005 - 12:36:30 CDT
N. Ganesan <naa dot ganesan at gmail dot com> wrote:
> Can you tell a little more on SCSU. Any pointers, URLs to how it works
> on texts, say Tamil unicode text? Tamil letters are not conjuncts,
> something similar in this sense to Latin script of Europe.
As Philippe said, SCSU is just a specification for encoding sequences of
Unicode code points. It does not make any difference whether a given
code point represents a conjunct, or even whether there is a character
assigned to that code point at all. SCSU does not change the encoding
model used for Tamil; it simply specifies a different, usually more
efficient, way of converting Unicode code points into bytes for storage
and transmission.
The quick answer is that a text consisting of only characters in the
Tamil and Basic Latin blocks can be encoded in only one byte per
character. Tamil characters in the range U+0B80 to U+0BFF are encoded
using the bytes from 80 to FF, while Basic Latin characters are encoded
using the bytes from 00 through 7F. There is an initial two-byte
sequence to specify that the high range is to be used for Tamil, and
not, say, Cyrillic or Devanagari.
The Tamil word
வணக்கம்
is encoded using the bytes
1B 17 B5 A3 95 CD 95 AE CD
The initial 1B 17 switches into "Tamil mode," so to speak, and the
remaining bytes are simply the low-order bytes of the Unicode code
points (U+0BB5 becomes B5, U+0BA3 becomes A3, and so on). Basic Latin
(including CR and LF, but not all control characters) would be encoded
using their ASCII values; for example, an ordinary space would be 20.
If other blocks are used besides Tamil and Basic Latin, there is
additional overhead to switch between the blocks.
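To make the one-byte-per-character case concrete, here is a minimal
Python sketch. It is not a full SCSU encoder: the name
encode_tamil_latin is made up for illustration, it always emits the
1B 17 window definition even if no Tamil follows, and it simply rejects
anything outside Tamil and the pass-through part of Basic Latin instead
of switching windows or quoting.

```python
SD3 = 0x1B            # SCSU tag: define dynamic window 3 and select it
TAMIL_START = 0x0B80  # first code point of the Tamil block
# Control characters that pass through unchanged in single-byte mode.
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}


def encode_tamil_latin(text: str) -> bytes:
    """Encode text limited to Tamil and Basic Latin (hypothetical helper)."""
    # Window offsets in this range are index * 0x80, so Tamil is index 0x17.
    out = bytearray([SD3, TAMIL_START // 0x80])
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F or cp in PASS_THROUGH_CONTROLS:
            out.append(cp)                         # ASCII value as-is
        elif TAMIL_START <= cp <= TAMIL_START + 0x7F:
            out.append(0x80 + (cp - TAMIL_START))  # byte in 80..FF
        else:
            raise ValueError(f"U+{cp:04X} is outside this sketch's scope")
    return bytes(out)


word = "\u0bb5\u0ba3\u0b95\u0bcd\u0b95\u0bae\u0bcd"  # the Tamil word above
print(encode_tamil_latin(word).hex().upper())
```

A real encoder would also have to handle the other control characters
and any additional blocks, which is where the extra overhead comes in.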
Philippe Verdy <verdy underscore p at wanadoo dot fr> replied:
> It could be a valid UTF because it preserves all codepoints in an
> original string, without even altering its normalization form (so no
> code point are reordered, even if the original string is not in any
> normalized form), and also because it still allows encoding invalid
> code points.
All text compression schemes must be lossless.
> But, unlike UTF-8, UTF-16, UTF-32 standard encoding schemes...
> SCSU does NOT guarantee a unique encoding for the same represented
> codepoints: there are several alternatives, which allow SCSU
> compressors to be implemented with simple algorithms, or with more
> complex algorithms with better compression level;
This is described in UTN #14. There is a mirror of the relevant section
at http://users.adelphia.net/~dewell/compression.html#scsu.
It is possible to build an "OK" compressor or a "really good" compressor
within the same spec. This is also true for some types of non-text
compression.
> however the SCSU decompressor is fully predictive and can be parsed
> into only one valid sequence of codepoints from a valid SCSU
> compressed stream.
It would have to be, if compression is to be lossless.
> This means that you can't check the "equality" of two encoded SCSU
> streams, without first decompressing them to streams of code points.
> (You can safely check encoded strings for equality with UTF-8, UTF-16,
> UTF-32, UTF-EBCDIC, and CESU-8).
I guess it depends on whether this is a common and desirable thing to
do. In some cases it might make more sense to decompress the streams
before comparing anything. The tradeoff is that a relatively
simple SCSU encoder can be implemented. The world's greatest SCSU
encoder would probably have to implement full-text lookahead and
language-specific predictive algorithms, but thankfully most encoders
don't have to go that far.
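As a small illustration of the equality point, the sketch below decodes
only a subset of SCSU (the SDn window-definition tags and single-byte
mode) and shows two different byte streams that decode to the same
Tamil word. The name decode_simple is made up for this example, and
every other tag is simply rejected; a real decoder handles many more
cases.

```python
def decode_simple(data: bytes) -> str:
    """Decode a small SCSU subset: SDn tags plus single-byte mode."""
    # Default dynamic window offsets from the SCSU specification.
    windows = [0x0080, 0x00C0, 0x0400, 0x0600,
               0x0900, 0x3040, 0x30A0, 0xFF00]
    active = 0  # window 0 is selected at the start of a stream
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if 0x18 <= b <= 0x1F:            # SDn: define window n, select it
            if i >= len(data):
                raise ValueError("truncated SDn sequence")
            index = data[i]
            i += 1
            if not 0x01 <= index <= 0x67:
                raise ValueError("window index not handled by this sketch")
            active = b - 0x18
            windows[active] = index * 0x80
        elif b >= 0x80:                  # character in the active window
            out.append(chr(windows[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            out.append(chr(b))           # Basic Latin passes through
        else:
            raise ValueError(f"tag {b:02X} not handled by this sketch")
    return "".join(out)


# Same word, Tamil window defined in slot 3 versus slot 0.
stream_a = bytes.fromhex("1B17B5A395CD95AECD")
stream_b = bytes.fromhex("1817B5A395CD95AECD")
print(stream_a == stream_b)                                # False
print(decode_simple(stream_a) == decode_simple(stream_b))  # True
```

The two streams differ in their first byte, so a byte-for-byte
comparison fails even though both are valid encodings of the same code
points.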
-- Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/