Re: Nicest UTF

From: Doug Ewell (
Date: Sun Dec 05 2004 - 02:16:53 CST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    >> I appreciate Philippe's support of SCSU, but I don't think *even I*
    >> would recommend it as an internal storage format. The effort to
    >> encode and decode it, while by no means Herculean as often perceived,
    >> is not trivial once you step outside Latin-1.
    > I said: "for immutable strings", which means that these Strings are
    > instantiated for the long term and reused many times. In that sense,
    > what is really significant is the decoding, not the effort to encode
    > (which is minimal for ISO-8859-1 encoded source texts, or Unicode
    > UTF-encoded texts that only use characters from the first page).
    > Decoding SCSU is very straightforward, even though it is stateful (at
    > the internal character level). But for immutable strings, there's no
    > need to handle various initial states, and the states associated with
    > each component character of the string have no importance (strings
    > being immutable, only the decoding of the string as a whole makes
    > sense).

    Here is a string, expressed as a sequence of bytes in SCSU:

    05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E

    See how long it takes you to decode this to Unicode code points. (Do
    not refer to UTN #14; that would be cheating. :-)

    It may not be rocket science, but it is not trivial.
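
    For readers who would rather check the answer by machine, here is a
    minimal sketch of a decoder that handles only the parts of SCSU this
    string uses: single-byte mode, the SQn "quote" tags, and the SCn
    "change window" tags. The function name decode_scsu_subset is made up
    for illustration, and the window offsets are the static and default
    dynamic values as I understand them from the SCSU specification
    (UTS #6). Anything else (SDn, SDX, SQU, Unicode mode) is deliberately
    rejected rather than guessed at, so this is nowhere near a complete
    implementation.

```python
# Partial SCSU decoder sketch: single-byte mode only, no window redefinition.
# Window offsets are the static and default dynamic values from UTS #6.
STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300,
                  0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600,
                   0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_scsu_subset(data: bytes) -> str:
    """Decode SCSU bytes that use only literals, SQn and SCn tags."""
    active = 0          # index of the active dynamic window (initially 0)
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b >= 0x80:
            # High byte: offset into the active dynamic window.
            out.append(chr(DYNAMIC_WINDOWS[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # ASCII and the four pass-through control characters.
            out.append(chr(b))
        elif 0x01 <= b <= 0x08:
            # SQn: quote one character from window n; the next byte is
            # data, even if its value looks like a tag.
            if i >= len(data):
                raise ValueError("truncated SQn sequence")
            n = b - 0x01
            q = data[i]
            i += 1
            if q < 0x80:
                out.append(chr(STATIC_WINDOWS[n] + q))
            else:
                out.append(chr(DYNAMIC_WINDOWS[n] + (q - 0x80)))
        elif 0x10 <= b <= 0x17:
            # SCn: make dynamic window n the active one.
            active = b - 0x10
        else:
            # SDn, SDX, SQU, SCU and reserved tags are out of scope here.
            raise NotImplementedError(f"tag 0x{b:02X} not handled in this sketch")
    return "".join(out)


sample = bytes.fromhex(
    "05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E"
)
print(" ".join(f"U+{ord(c):04X}" for c in decode_scsu_subset(sample)))
```

    The catch in the sample is that 1C and 1D look like SD4 and SD5
    (define-window) tags, but here each follows an SQ4 (05) and is
    therefore a data byte quoted from static window 4 at U+2000, giving
    U+201C and U+201D. The 12 is SC2, which selects the Cyrillic window at
    U+0400 for the six bytes that follow.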

    -Doug Ewell
     Fullerton, California

    This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 02:19:07 CST