Re: Nicest UTF

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Dec 05 2004 - 22:14:08 CST

Next message: Doug Ewell: "Re: Nicest UTF"

Previous message: Doug Ewell: "SCSU as internal encoding (was: Re: Nicest UTF)"
Maybe in reply to: Theodore H. Smith: "Nicest UTF"
Next in thread: Doug Ewell: "Re: Nicest UTF"
Reply: Doug Ewell: "Re: Nicest UTF"

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

>> Here is a string, expressed as a sequence of bytes in SCSU:
>>
>> 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E
>> M o s s o v SP i s SP .
>
> Without looking at it, it's easy to see that this stream is separated
> in three sections, initiated by 05 1C, then 05 1D, then 12. I can't
> remember without looking at the UTN what they perform (i.e. which
> Unicode code points range they select), but the other bytes are simple
> offsets relative to the start of the selected ranges. Also the third
> section is ended by a regular dot (2E) in the ASCII range selected for
> the low half-page, and the other bytes are offsets for the script
> block initiated by 12.

05 is a static-quote tag which modifies only the next byte. It doesn't
really initiate a new section; it's intended for isolated characters
where initiating a new section would be wasteful. The sequences <05 1C>
and <05 1D> encode the matching double-quote characters U+201C and
U+201D respectively.

12 switches to a new dynamic window -- in this case, window 2, which is
predefined to point to the Cyrillic block -- so it does select a range
as you said. Also, the ASCII bytes do represent Basic Latin characters.

> Immediately I can identify this string, without looking at any table:
>
> "Mossov?" is ??????.
>
> where each ? replaces a character that I can't decipher only through
> my defective memory. (I don't need to remember the details of the
> standard table of ranges, because I know that this table is complete
> in a small and easily available document).

Actually "Moscow," not "Mossov" -- but as you said, this is not
important because a computer would have gotten this arithmetic right.
The actual string is:

“Moscow” is Москва.
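
The decoding walked through above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not a
conforming SCSU decoder: it handles single-byte mode, the SQn (01..08)
and SCn (10..17) tags, and the default window positions, and it rejects
every other tag. The function name decode_scsu_subset is made up for
this example, and the two offset tables are typed in from memory of
UTS #6, so they should be checked against the standard.

```python
# Sketch of a partial SCSU decoder (single-byte mode, default windows only).
# Assumption: offset tables as recalled from UTS #6; verify before relying on them.
STATIC_OFFSETS = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC_OFFSETS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_scsu_subset(data: bytes) -> str:
    """Decode the small SCSU subset used in the example (hypothetical helper)."""
    active = 0  # dynamic window 0 is active at the start of a stream
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if 0x01 <= b <= 0x08:
            # SQn: quote exactly one byte from window n, state unchanged.
            if i >= len(data):
                raise ValueError("SQn tag at end of input")
            n = b - 0x01
            q = data[i]
            i += 1
            if q < 0x80:
                out.append(chr(STATIC_OFFSETS[n] + q))            # static window n
            else:
                out.append(chr(DYNAMIC_OFFSETS[n] + (q - 0x80)))  # dynamic window n
        elif 0x10 <= b <= 0x17:
            # SCn: make dynamic window n the active window.
            active = b - 0x10
        elif b >= 0x80:
            # High half: offset into the active dynamic window.
            out.append(chr(DYNAMIC_OFFSETS[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # Low half: Basic Latin and the pass-through controls.
            out.append(chr(b))
        else:
            # SDn, SDX, SQU, SCU and the reserved byte are out of scope here.
            raise ValueError(f"tag byte 0x{b:02X} not handled in this sketch")
    return "".join(out)


example = bytes.fromhex(
    "05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E"
)
print(decode_scsu_subset(example))
```

Here 05 is SQ4, so <05 1C> resolves to 0x2000 + 0x1C = U+201C, and 12 is
SC2, so 9C resolves to 0x0400 + 0x1C = U+041C.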

> The decoder part of SCSU still remains extremely trivial to implement,
> given the small but complete list of codes that can alter the state of
> the decoder, because there's no choice in its interpretation and
> because the set of variables to store the decoder state is very
> limited, as well as the number of decision tests at each step. This is
> a "finite state automata".

I think "extremely trivial" is overstating the case a bit. It is
straightforward and not very difficult, but still somewhat more complex
than a UTF. (There had better not be any choice in interpretation, if
we want lossless decompression!)

BTW, the singular is "automaton."

> Only the encoder may be a bit complex to write (if one wants to
> generate the optimal smallest result size), but even a moderate
> programmer could find a simple and working scheme with a still
> excellent compression rate (around 1 to 1.2 bytes per character on
> average for any Latin text, and around 1.2 to 1.5 bytes per character
> for Asian texts, which would still be a good application of SCSU
> compared with UTF-32 or even UTF-8).

UTN #14 contains pseudocode for an encoder that beats the Japanese
example in UTS #6 (by one byte, big deal) and can be easily translated
into working code.
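
To give a rough idea of what a "simple and working" encoder might look
like, here is a deliberately naive sketch in Python. It is NOT the
UTN #14 pseudocode and makes no attempt at optimal output. The
assumptions: windows are never redefined, a character in one of the
default dynamic windows triggers an SCn switch, a character in a static
window is quoted with SQn, and any other BMP character is quoted with
SQU (0E) followed by its two bytes. The names encode_scsu_naive and
_find_window are invented for this example, and the offset tables are
typed in from memory of UTS #6.

```python
# Naive SCSU encoder sketch (single-byte mode, default windows, BMP only).
# Assumption: offset tables as recalled from UTS #6; verify before relying on them.
STATIC_OFFSETS = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC_OFFSETS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SQU = 0x0E  # quote one 16-bit value


def _find_window(cp, offsets):
    """Return the index of the first 128-wide window containing cp, or None."""
    for n, off in enumerate(offsets):
        if off <= cp < off + 0x80:
            return n
    return None


def encode_scsu_naive(text: str) -> bytes:
    active = 0  # dynamic window 0 is active at the start of a stream
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            if cp >= 0x20 or cp in (0x00, 0x09, 0x0A, 0x0D):
                out.append(cp)             # passes through unchanged
            else:
                out += bytes([0x01, cp])   # SQ0: quote a control character
            continue
        off = DYNAMIC_OFFSETS[active]
        if off <= cp < off + 0x80:
            out.append(0x80 + cp - off)    # already in the active window
            continue
        n = _find_window(cp, DYNAMIC_OFFSETS)
        if n is not None:
            active = n
            out.append(0x10 + n)           # SCn: switch the active window
            out.append(0x80 + cp - DYNAMIC_OFFSETS[n])
            continue
        n = _find_window(cp, STATIC_OFFSETS)
        if n is not None:
            out += bytes([0x01 + n, cp - STATIC_OFFSETS[n]])  # SQn
            continue
        if cp > 0xFFFF:
            raise ValueError("supplementary characters not handled in this sketch")
        out += bytes([SQU, cp >> 8, cp & 0xFF])
    return bytes(out)


text = "\u201cMoscow\u201d is \u041c\u043e\u0441\u043a\u0432\u0430."
print(encode_scsu_naive(text).hex(" ").upper())
```

On the example string this reproduces the 22-byte sequence quoted at the
top of this message (19 characters; the same text is 29 bytes in UTF-8).
A real encoder would also redefine windows and use Unicode mode where
that pays off.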

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 22:17:37 CST