From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 16:44:37 CST
From: "Doug Ewell" <dewell@adelphia.net>
> Here is a string, expressed as a sequence of bytes in SCSU:
>
> 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E
>
> See how long it takes you to decode this to Unicode code points.  (Do
> not refer to UTN #14; that would be cheating. :-)
Without looking at it, it's easy to see that this stream is separated into
three sections, introduced by 05 1C, then 05 1D, then 12. I can't remember
without looking at the UTN what they do (i.e. which range of Unicode code
points they select), but the other bytes are simple offsets relative to the
start of the selected ranges. Also, the third section ends with a regular
dot (2E) in the ASCII range selected for the low half-page, and the other
bytes are offsets into the script block selected by 12.
Immediately I can identify this string, without looking at any table:
"Moscow?" is ??????.
where " is some opening or closing quotation mark and where each ?
replaces a character that I can't decipher from my defective memory
alone. (I don't need to remember the details of the standard table of
ranges, because I know that this table is complete in a small and easily
available document.)
A computer can do this much better than I can (it can also "know" much
better than I can what corresponds to a given code point like U+6327, if it
is actually assigned; I would have to look into a specification or use a
charmap tool if I'm not used to entering this character in my texts).
The decoder part of SCSU still remains extremely trivial to implement, given
the small but complete list of codes that can alter the state of the
decoder, because there's no choice in their interpretation and because the
set of variables needed to store the decoder state is very limited, as is the
number of decision tests at each step. This is a basic finite-state
automaton.
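
To make the finite-state description concrete, here is a minimal Python
sketch of a decoder for a subset of SCSU single-byte mode. The tag values and
default window offsets are as I understand them from the SCSU specification
(UTS #6) and should be checked against it. The name decode_scsu_subset is
hypothetical, and the tags that redefine windows or switch to Unicode mode are
deliberately left out, so this is an illustration and not a complete decoder.

```python
# Window offsets as I understand them from UTS #6 (assumption: verify there).
STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DEFAULT_DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_scsu_subset(data: bytes) -> str:
    """Decode single-byte-mode SCSU that uses only SQn, SCn and SQU tags."""
    # The whole decoder state: the dynamic window offsets and the active index.
    # This subset never changes the offsets; a full decoder would update them
    # when it meets an SDn or SDX tag.
    windows = list(DEFAULT_DYNAMIC_WINDOWS)
    active = 0
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b >= 0x80:
            # High half: offset into the active dynamic window.
            out.append(chr(windows[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # Low half: ASCII (and NUL, TAB, LF, CR) passes through unchanged.
            out.append(chr(b))
        elif 0x01 <= b <= 0x08:
            # SQn: quote one character from window n, without changing state.
            if i >= len(data):
                raise ValueError("truncated SQn sequence")
            n = b - 0x01
            q = data[i]
            i += 1
            if q < 0x80:
                out.append(chr(STATIC_WINDOWS[n] + q))
            else:
                out.append(chr(windows[n] + (q - 0x80)))
        elif b == 0x0E:
            # SQU: quote one 16-bit code unit given as two big-endian bytes.
            if i + 2 > len(data):
                raise ValueError("truncated SQU sequence")
            out.append(chr((data[i] << 8) | data[i + 1]))
            i += 2
        elif 0x10 <= b <= 0x17:
            # SCn: make dynamic window n the active one.
            active = b - 0x10
        else:
            raise NotImplementedError(f"tag 0x{b:02X} is outside this subset")
    return "".join(out)


example = bytes.fromhex("05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E")
print(decode_scsu_subset(example))  # “Moscow” is Москва.
```

Under these assumptions, 05 is SQ4 (so 05 1C and 05 1D each quote a single
character from static window 4 at U+2000), and 12 is SC2, which selects the
default Cyrillic window at U+0400.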
Only the encoder may be a bit complex to write (if one wants to generate the
optimal, smallest result size), but even a moderate programmer could find a
simple, working scheme with a still excellent compression rate (around 1
to 1.2 bytes per character on average for any Latin text, and around 1.2 to
1.5 bytes per character for Asian texts, which would still be a good
application of SCSU compared with UTF-32 or even UTF-8).
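
As an illustration of such a simple, non-optimal scheme, here is a Python
sketch of a greedy encoder. It stays in single-byte mode, never redefines a
window, and falls back to quoting a 16-bit code unit with SQU when no default
window fits. The name encode_scsu_simple and the greedy strategy are my own
assumptions for this example, the tag values and window offsets should be
checked against UTS #6, and characters beyond the BMP are simply rejected.

```python
# Window offsets as I understand them from UTS #6 (assumption: verify there).
STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def _window_index(windows, cp, start=0):
    """Return the first window index (from start) whose 128 slots contain cp."""
    for n in range(start, len(windows)):
        if windows[n] <= cp < windows[n] + 0x80:
            return n
    return None


def encode_scsu_simple(text: str) -> bytes:
    """Greedy single-byte-mode encoder using only the default windows."""
    out = bytearray()
    active = 0  # index of the active dynamic window
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F or cp in (0x00, 0x09, 0x0A, 0x0D):
            out.append(cp)  # passes through as a single byte
            continue
        if cp < 0x20:
            out += bytes([0x01, cp])  # SQ0: quote a control from static window 0
            continue
        base = DYNAMIC_WINDOWS[active]
        if base <= cp < base + 0x80:
            out.append(0x80 + cp - base)  # offset into the active window
            continue
        n = _window_index(DYNAMIC_WINDOWS, cp)
        if n is not None:
            out.append(0x10 + n)  # SCn: switch the active window
            active = n
            out.append(0x80 + cp - DYNAMIC_WINDOWS[n])
            continue
        n = _window_index(STATIC_WINDOWS, cp, start=1)
        if n is not None:
            out += bytes([0x01 + n, cp - STATIC_WINDOWS[n]])  # SQn: quote once
            continue
        if cp <= 0xFFFF:
            out += bytes([0x0E, cp >> 8, cp & 0xFF])  # SQU: quote a code unit
            continue
        raise ValueError("characters beyond the BMP are not handled in this sketch")
    return bytes(out)


text = "\u201cMoscow\u201d is \u041c\u043e\u0441\u043a\u0432\u0430."
encoded = encode_scsu_simple(text)
print(encoded.hex(" ").upper())
print(len(encoded), len(text.encode("utf-8")), len(text.encode("utf-32-le")))
```

For the sample sentence this greedy scheme produces the same 22 bytes as the
quoted message for 19 characters (about 1.16 bytes per character), against 29
bytes in UTF-8 and 76 in UTF-32. Text dominated by SQU quoting would cost 3
bytes per character here, which is where a real encoder would define its own
windows or switch to Unicode mode.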