Re: Nicest UTF

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Dec 05 2004 - 22:14:08 CST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    >> Here is a string, expressed as a sequence of bytes in SCSU:
    >>
    >> 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E
    >> M o s s o v SP i s SP .
    >
    > Without looking at it, it's easy to see that this stream is separated
    > in three sections, initiated by 05 1C, then 05 1D, then 12. I can't
    > remember without looking at the UTN what they perform (i.e. which
    > Unicode code points range they select), but the other bytes are simple
    > offsets relative to the start of the selected ranges. Also the third
    > section is ended by a regular dot (2E) in the ASCII range selected for
    > the low half-page, and the other bytes are offsets for the script
    > block initiated by 12.

    05 is a static-quote tag which modifies only the next byte. It doesn't
    really initiate a new section; it's intended for isolated characters
    where initiating a new section would be wasteful. The sequences <05 1C>
    and <05 1D> encode the matching double-quote characters U+201C and
    U+201D respectively.

    12 switches to a new dynamic window -- in this case, window 2, which is
    predefined to point to the Cyrillic block -- so it does select a range
    as you said. Also, the ASCII bytes do represent Basic Latin characters.
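    The two tag types described above can be sketched in a few lines of
    Python. This is only an illustrative subset of a single-byte-mode
    decoder, not a complete SCSU implementation: the function name
    decode_scsu_subset is hypothetical, the window tables are the
    predefined static and initial dynamic offsets as I understand them
    from UTS #6, and every tag other than SQn and SCn is rejected.

```python
# Predefined window offsets (assumed to match UTS #6).
STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
INITIAL_DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_scsu_subset(data: bytes) -> str:
    """Decode single-byte-mode SCSU that uses only SQn and SCn tags."""
    dynamic = list(INITIAL_DYNAMIC_WINDOWS)
    active = 0  # dynamic window 0 is active at the start of a stream
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if 0x01 <= b <= 0x08:
            # SQn: quote exactly one byte from window n, then carry on.
            if i >= len(data):
                raise ValueError("SQn tag at end of input")
            n = b - 0x01
            quoted = data[i]
            i += 1
            if quoted < 0x80:
                out.append(chr(STATIC_WINDOWS[n] + quoted))
            else:
                out.append(chr(dynamic[n] + (quoted - 0x80)))
        elif 0x10 <= b <= 0x17:
            # SCn: make dynamic window n the active one.
            active = b - 0x10
        elif b >= 0x80:
            # High-half byte: offset into the active dynamic window.
            out.append(chr(dynamic[active] + (b - 0x80)))
        elif b in (0x00, 0x09, 0x0A, 0x0D) or b >= 0x20:
            # Pass-through control characters and Basic Latin.
            out.append(chr(b))
        else:
            raise NotImplementedError(f"SCSU tag 0x{b:02X} is outside this sketch")
    return "".join(out)


example = bytes.fromhex(
    "05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E"
)
decoded = decode_scsu_subset(example)
```

    Here <05 1C> is SQ4 followed by 1C, giving 0x2000 + 0x1C = U+201C,
    and after the 12 tag (SC2) the byte 9C gives 0x0400 + 0x1C = U+041C.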

    > Immediately I can identify this string, without looking at any table:
    >
    > "Mossov?" is ??????.
    >
    > where each ? replaces a character that I can't decipher only through
    > my defective memory. (I don't need to remember the details of the
    > standard table of ranges, because I know that this table is complete
    > in a small and easily available document).

    Actually "Moscow," not "Mossov" -- but as you said, this is not
    important because a computer would have gotten this arithmetic right.
    The actual string is:

    “Moscow” is Москва.

    > The decoder part of SCSU still remains extremely trivial to implement,
    > given the small but complete list of codes that can alter the state of
    > the decoder, because there's no choice in its interpretation and
    > because the set of variables to store the decoder state is very
    > limited, as well as the number of decision tests at each step. This is
    > a "finite state automata".

    I think "extremely trivial" is overstating the case a bit. It is
    straightforward and not very difficult, but still somewhat more complex
    than a UTF. (There had better not be any choice in interpretation, if
    we want lossless decompression!)

    BTW, the singular is "automaton."

    > Only the encoder may be a bit complex to write (if one wants to
    > generate the optimal smallest result size), but even a moderate
    > programmer could find a simple and working scheme with a still
    > excellent compression rate (around 1 to 1.2 bytes per character on
    > average for any Latin text, and around 1.2 to 1.5 bytes per character
    > for Asian texts, which would still make SCSU a good choice compared
    > to UTF-32 or even UTF-8).

    UTN #14 contains pseudocode for an encoder that beats the Japanese
    example in UTS #6 (by one byte, big deal) and can be easily translated
    into working code.
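
    As a rough check on the bytes-per-character figures quoted above, the
    example string from earlier in this message can be measured in each
    encoding form with the Python standard library. This is a single
    short mixed-script sample, so it says nothing about average rates;
    the variable names are purely illustrative.

```python
# The example string and its SCSU byte sequence, as given in this message.
text = "\u201cMoscow\u201d is \u041c\u043e\u0441\u043a\u0432\u0430."
scsu = bytes.fromhex(
    "05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E"
)

# Encoded size in bytes; the "-le" codecs are used so that no BOM is counted.
sizes = {
    "SCSU": len(scsu),
    "UTF-8": len(text.encode("utf-8")),
    "UTF-16": len(text.encode("utf-16-le")),
    "UTF-32": len(text.encode("utf-32-le")),
}

for name, size in sizes.items():
    print(f"{name:7} {size:3d} bytes  {size / len(text):.2f} bytes/char")
```

    For this 19-character string the SCSU form is 22 bytes, against 29
    for UTF-8, 38 for UTF-16 and 76 for UTF-32.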

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


