Fw: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 16:44:37 CST


    From: "Doug Ewell" <dewell@adelphia.net>
    > Here is a string, expressed as a sequence of bytes in SCSU:
    > 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E
    > See how long it takes you to decode this to Unicode code points. (Do
    > not refer to UTN #14; that would be cheating. :-)

     Without looking it up, it's easy to see that this stream is split into
     three sections, introduced by 05 1C, then 05 1D, then 12. Without the UTN
     I can't remember what these tags do (i.e. which Unicode code range each
     one selects), but the other bytes are simple offsets relative to the
     start of the selected range. The third section ends with a regular
     dot (2E) in the ASCII range used for the low half-page, and the bytes
     before it are offsets into the script block selected by 12.

     Immediately I can identify this string, without looking at any table:

     "Moscow" is ??????.

     where " is some opening or closing quotation mark and where each ?
     replaces a character that I cannot decipher from my defective memory
     alone. (I don't need to remember the details of the standard
     table of ranges, because I know that this table is complete in a small and
     easily available document).

     A computer can do this much better than I can (it can even "know" much
     better than I can what a given code point like U+6327 corresponds to, if
     it is actually assigned; I would have to look into a specification or use
     a charmap tool, unless I'm used to entering this character in my texts).

     The decoder part of SCSU still remains extremely trivial to implement,
     given the small but complete list of codes that can alter the state of
     the decoder, because there's no choice in their interpretation and because
     the number of variables needed to store the decoder state is very limited,
     as is the number of decision tests at each step. This is a basic "finite
     state automaton".
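
     To illustrate how small that state is, here is a minimal sketch of a
     decoder in Python. It is not a complete SCSU implementation: it assumes
     single-byte mode only, the default window positions, and just the SQn
     (quote) and SCn (change window) tags, which is all the example string
     needs. The function name decode_scsu_subset is made up for this sketch,
     and the window offsets are the defaults as I understand them from the
     SCSU specification; check them against it before relying on this.

```python
# Sketch of an SCSU decoder restricted to single-byte mode.
# Assumption: default window offsets, no window redefinition, no Unicode mode.
STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DEFAULT_DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_scsu_subset(data: bytes) -> str:
    """Decode the SQn / SCn / pass-through subset of SCSU (hypothetical helper)."""
    dynamic = list(DEFAULT_DYNAMIC_WINDOWS)
    active = 0          # the whole decoder state: index of the active window
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if 0x01 <= b <= 0x08:
            # SQn: quote exactly one character from window n.
            if i >= len(data):
                raise ValueError("truncated SQn sequence")
            q = data[i]
            i += 1
            n = b - 0x01
            if q < 0x80:
                out.append(chr(STATIC_WINDOWS[n] + q))
            else:
                out.append(chr(dynamic[n] + (q - 0x80)))
        elif 0x10 <= b <= 0x17:
            # SCn: make dynamic window n the active one.
            active = b - 0x10
        elif b >= 0x80:
            # High half: offset into the active dynamic window.
            out.append(chr(dynamic[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # Low half: ASCII and a few controls pass through unchanged.
            out.append(chr(b))
        else:
            raise NotImplementedError(f"tag 0x{b:02X} is outside this sketch")
    return "".join(out)


example = bytes.fromhex(
    "05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E"
)
print(decode_scsu_subset(example))
```

     Under those assumptions, 05 quotes from the static window at U+2000
     (the two quotation marks) and 12 switches to the Cyrillic window at
     U+0400, so the six unknown characters come out as the Russian name of
     the city.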

     Only the encoder may be a bit complex to write (if one wants to generate
     the smallest possible output), but even a moderate programmer could find a
     simple, working scheme with a still excellent compression rate (around 1
     to 1.2 bytes per character on average for any Latin text, and around 1.2 to
     1.5 bytes per character for Asian texts, which would still be a good
     application of SCSU compared with UTF-32 or even UTF-8).
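
     One such simple scheme can be sketched as follows. This is a deliberately
     naive greedy encoder, not an optimal one: it never redefines windows and
     never enters Unicode mode. The name encode_scsu_simple is hypothetical,
     the window offsets are the defaults as I understand them from the SCSU
     specification, and characters outside the BMP are rejected to keep the
     sketch short.

```python
# Sketch of a naive SCSU encoder: single-byte mode, default windows only.
STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DEFAULT_DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def find_window(offsets, cp):
    """Return the index of the first 128-character window containing cp, or None."""
    for n, off in enumerate(offsets):
        if off <= cp < off + 0x80:
            return n
    return None


def encode_scsu_simple(text: str) -> bytes:
    """Greedy encoder (hypothetical helper): switch, else quote, else SQU."""
    active = 0
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("characters outside the BMP are not handled here")
        if 0x20 <= cp <= 0x7F or cp in (0x00, 0x09, 0x0A, 0x0D):
            out.append(cp)                      # passes through as itself
            continue
        base = DEFAULT_DYNAMIC_WINDOWS[active]
        if base <= cp < base + 0x80:
            out.append(0x80 + cp - base)        # already in the active window
            continue
        n = find_window(DEFAULT_DYNAMIC_WINDOWS, cp)
        if n is not None:
            active = n
            out.append(0x10 + n)                # SCn: switch window
            out.append(0x80 + cp - DEFAULT_DYNAMIC_WINDOWS[n])
            continue
        n = find_window(STATIC_WINDOWS, cp)
        if n is not None:
            out.append(0x01 + n)                # SQn: quote from static window
            out.append(cp - STATIC_WINDOWS[n])
            continue
        out.extend((0x0E, cp >> 8, cp & 0xFF))  # SQU: one quoted UTF-16 unit
    return bytes(out)


text = "\u201cMoscow\u201d is \u041c\u043e\u0441\u043a\u0432\u0430."
scsu = encode_scsu_simple(text)
for name, size in (("SCSU", len(scsu)),
                   ("UTF-8", len(text.encode("utf-8"))),
                   ("UTF-32", len(text.encode("utf-32-be")))):
    print(f"{name}: {size} bytes, {size / len(text):.2f} bytes per character")
```

     For the example string from the quoted message, this greedy scheme
     reproduces the same 22 bytes, against 29 bytes in UTF-8. A better encoder
     would look ahead before switching windows and would define new windows
     for scripts that the defaults do not cover.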
