Fw: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 16:44:37 CST

Next message: D. Starner: "Re: Unicode for words?"

Previous message: Peter Kirk: "Re: No Invisible Character - NBSP at the start of a word"
Maybe in reply to: Theodore H. Smith: "Nicest UTF"
Next in thread: Doug Ewell: "Re: Nicest UTF"

From: "Doug Ewell" <dewell@adelphia.net>
> Here is a string, expressed as a sequence of bytes in SCSU:
>
> 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E
>
> See how long it takes you to decode this to Unicode code points. (Do
> not refer to UTN #14; that would be cheating. :-)

Without looking at it, it's easy to see that this stream is separated into
three sections, introduced by 05 1C, then 05 1D, then 12. I can't remember
without looking at the UTN what they do (i.e. which range of Unicode code
points they select), but the other bytes are simple offsets relative to the
start of the selected ranges. Also, the third section ends with a regular
dot (2E) in the ASCII range selected for the low half-page, and the other
bytes are offsets into the script block selected by 12.

Immediately I can identify this string, without looking at any table:

"Moscow" is ??????.

where " is some opening or closing quotation mark and where each ?
replaces a character that I can't decipher from my defective memory alone.
(I don't need to remember the details of the standard table of ranges,
because I know that this table is complete in a small and easily available
document.)

A computer can do this much better than I can (it can also "know" much
better than I can what a given code point such as U+6327 corresponds to, if
it is actually assigned; I would have to look into a specification or use a
charmap tool if I'm not used to entering this character in my texts).

The decoder part of SCSU still remains extremely trivial to implement,
given the small but complete list of codes that can alter the state of the
decoder, because there's no choice in its interpretation and because the
set of variables needed to store the decoder state is very limited, as is
the number of decision tests at each step. This is a basic finite-state
automaton.
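
To make the "finite-state automaton" point concrete, here is a minimal
sketch of such a decoder in Python. It is not a complete SCSU decoder: it
assumes single-byte mode only and handles just pass-through bytes, the
"quote from window" tags (01..08), the "change window" tags (10..17) and
bytes 80..FF as offsets into the active dynamic window. The window tables
are the default static and dynamic offsets as given in the SCSU
specification; the function name is made up for illustration, and every
other tag is rejected rather than guessed at.

```python
# Sketch of an SCSU decoder restricted to a subset of single-byte mode.
# The whole decoder state is one variable: the active dynamic window.

STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300,
                  0x2000, 0x2080, 0x2100, 0x3000]
# Default dynamic windows (never redefined here: SDn is unsupported).
DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600,
                   0x0900, 0x3040, 0x30A0, 0xFF00]


def scsu_decode_subset(data: bytes) -> str:
    """Decode the supported subset of SCSU (hypothetical helper)."""
    active = 0          # index of the active dynamic window
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b >= 0x80:
            # Offset into the active dynamic window.
            out.append(chr(DYNAMIC_WINDOWS[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # ASCII and the few control codes that pass through unchanged.
            out.append(chr(b))
        elif 0x01 <= b <= 0x08:
            # SQn: quote one character from window n, state unchanged.
            if i >= len(data):
                raise ValueError("truncated input after quote tag")
            n = b - 0x01
            q = data[i]
            i += 1
            if q < 0x80:
                out.append(chr(STATIC_WINDOWS[n] + q))
            else:
                out.append(chr(DYNAMIC_WINDOWS[n] + (q - 0x80)))
        elif 0x10 <= b <= 0x17:
            # SCn: make dynamic window n the active one.
            active = b - 0x10
        else:
            raise NotImplementedError(f"tag 0x{b:02X} not handled in this sketch")
    return "".join(out)
```

Fed the byte sequence quoted at the top of this message, the sketch reads
05 as "quote from window 4", so 05 1C and 05 1D yield U+201C and U+201D,
and 12 switches to the window starting at U+0400, so the six following
bytes land in the Cyrillic block.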

Only the encoder may be a bit complex to write (if one wants to generate
the smallest possible output), but even a moderately skilled programmer
could find a simple, working scheme that still gives an excellent
compression rate (around 1 to 1.2 bytes per character on average for any
Latin text, and around 1.2 to 1.5 bytes per character for Asian texts, which
would still be a good application of SCSU compared with UTF-32 or even
UTF-8).
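
As an illustration of what a "simple and working scheme" might look like,
here is a greedy encoder sketch in Python. It is one possible strategy, not
an optimal one and not taken from any reference implementation: it stays in
single-byte mode, never redefines a window, switches to a default dynamic
window when the character falls into one, quotes from a static window
otherwise, and falls back to quoting the UTF-16 code unit (tag 0E) for the
rest of the BMP. The function name is hypothetical, and supplementary
characters are deliberately left out.

```python
# Greedy SCSU encoder sketch: single-byte mode, default windows only.

STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300,
                  0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600,
                   0x0900, 0x3040, 0x30A0, 0xFF00]


def find_window(windows, cp):
    """Return the index of the first 128-code-point window holding cp, or -1."""
    for n, start in enumerate(windows):
        if start <= cp < start + 0x80:
            return n
    return -1


def scsu_encode_greedy(text: str) -> bytes:
    """Encode text with a simple, non-optimal strategy (hypothetical helper)."""
    active = 0                      # active dynamic window, as in the decoder
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F or cp in (0x00, 0x09, 0x0A, 0x0D):
            out.append(cp)          # passes through unchanged
            continue
        start = DYNAMIC_WINDOWS[active]
        if start <= cp < start + 0x80:
            out.append(0x80 + cp - start)
            continue
        n = find_window(DYNAMIC_WINDOWS, cp)
        if n >= 0:
            # SCn: switch window, then emit the offset byte.
            active = n
            out += bytes([0x10 + n, 0x80 + cp - DYNAMIC_WINDOWS[n]])
            continue
        n = find_window(STATIC_WINDOWS, cp)
        if n >= 0:
            # SQn: quote a single character from static window n.
            out += bytes([0x01 + n, cp - STATIC_WINDOWS[n]])
            continue
        if cp > 0xFFFF:
            raise NotImplementedError("supplementary characters not handled")
        # SQU: quote one UTF-16 code unit, big-endian.
        out += bytes([0x0E, cp >> 8, cp & 0xFF])
    return bytes(out)
```

On the 19-character string that the quoted byte sequence decodes to, this
greedy scheme produces those same 22 bytes, about 1.16 bytes per character,
where UTF-8 would need 29 bytes. Text that keeps jumping between windows, or
that falls outside the default windows, would do worse with this sketch; a
better encoder would define its own windows or switch to Unicode mode.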


This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 16:48:34 CST