RE: UTF-c

From: Doug Ewell (doug@ewellic.org)
Date: Tue Feb 22 2011 - 09:31:20 CST

Next message: Doug Ewell: "Posting attachments (was: Re: [unicode] UTF-c)"

Previous message: Philippe Verdy: "Re: [unicode] UTF-c"
Maybe in reply to: Thomas Cropley: "UTF-c"
Next in thread: Philippe Verdy: "Re: UTF-c"
Reply: Philippe Verdy: "Re: UTF-c"

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

>> Both the complexity of SCSU, and the importance of the complexity of
>> SCSU, continue to be highly overrated.
>
> It is complex because of the codepage switching mechanism of SCSU and
> its internal "magic" codepage tables.

OK, at least that's a different "complexity" argument from the usual
ones. But Cropley's "Alphabet" table is certainly no improvement over
the SCSU tables in this regard.

>> Part of the apparent simplicity of Cropley's algorithm, as viewed
>> from his "Preliminary Proposal" HTML page, is that it omits a proper
>> description of the code-page switching mechanism, as well as the
>> "magic number" definitions of the code pages and the control bytes
>> needed to introduce them. These are present in the sample code, but
>> to see them, you have to paw through the UTF-8 conversion code and
>> UI.
>
> Yes, but he did not implement any codepage switching mechanism at all
> (the only thing is that it created in fact a single code for producing
> dozens of distinct encodings, each one requiring its distinctive
> "BOM-like" prefix, but ill-designed in my opinion).

So a given document in this encoding can encode only one additional
64-block with one byte per character? Then it's not a replacement for
anything. Even ISO 2022 lets you switch blocks.

>>> and it is still very highly compressible with standard compression
>>> algorithms while still allowing very fast processing in memory in
>>> its decompressed encoded form :
>>
>> I see no metrics or sample data to back this up.
>
> I've played with the code myself on this. It's easy to read, but lots of
> improvements can be done on it (also to secure it, because it's not
> safe the way it is implemented).

I'm focusing on Cropley's algorithm, partially defined as it is by his
sample code (which is always a red flag for a specification), not his
coding skills. What sort of numbers do your tests show for compression
speed and size?

> Forget the FS/GS/RS/US hack for his "BOM", it's not even needed in his
> proposal (and it would also make his encoding incompatible with MIME
> restrictions and with lots of transport protocols), just like the
> magic table which looks more like a sample and is not evolutive enough
> to be really self-adaptive to various texts and to newer versions of
> the Unicode standard and new scripts (SCSU also has the same latter
> caveat, which has also been known since long in ISO 2022 using similar
> magic values for code page switching, one good reason for which it
> became unmaintainable).

"Unmaintainable," at least in the case of SCSU, is not the same as "we
choose not to maintain it." And indeed, for all its problems, ISO 2022
was maintained continuously through 2004 via its International Register.

> A good candidate for replacement of UTF-8 should not need any magic
> table, and should be self-adaptive.

Which is one of many reasons why Cropley's algorithm cannot be
considered as a replacement for UTF-8, if such a thing is even possible.
It can be considered as a new compression scheme, but then it has to
measure up favorably to the existing ones, and I don't see any real
improvements there.

> It is possible to do that, but code switching also has its own caveats
> (notably in fast search/compare algorithms, such as those used in
> "diff" programs and in versioning systems for code repositories,
> because of multiple and unpredictable binary representations of the
> same characters: it's something that immediately disqualifies SCSU if
> code switching can occur at any place in encoded texts).

Section 1 of the SCSU spec says, "It is not intended as processing
format or as general purpose interchange format." There's little value
in beating up SCSU for something it is not meant to do.

(That said, I've been playing with an implementation I call "line-safe
SCSU" that is fully conformant to UTS #6, but adds the constraint that
the state of the SCSU machine (modes and windows) must be reset at the
end of each line. That removes at least some of the nondeterminism.)
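
To make the two points above concrete, here is a minimal Python sketch. It
is not the author's "line-safe SCSU" implementation; it is a hypothetical
decoder for a small subset of UTS #6 (pass-through bytes, window bytes,
and the SQU, SCU, SCn, UCn and UQU tags, BMP only). The window offsets are
the initial dynamic-window values from UTS #6. The "line-safe" check is an
assumption about what the constraint means: the decoder must be back in
single-byte mode with window 0 active at the moment it decodes a LF.
Because the subset has no window-definition tags, the window offsets
never change and need no reset here.

```python
# Initial dynamic window offsets from UTS #6 (windows 0..7).
INITIAL_OFFSETS = (0x0080, 0x00C0, 0x0400, 0x0600,
                   0x0900, 0x3040, 0x30A0, 0xFF00)
SQU, SCU, UQU = 0x0E, 0x0F, 0xF0  # tags handled by this subset


def decode(data):
    """Decode a small SCSU subset; return (text, line_safe).

    line_safe is True when every LF is decoded in the initial state
    (single-byte mode, window 0). That reading of "line-safe" is an
    assumption made for this sketch.
    """
    out = []
    unicode_mode, window = False, 0      # initial SCSU state
    line_safe = True
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if unicode_mode:
            if 0xE0 <= b <= 0xE7:        # UCn: single-byte mode, window n
                unicode_mode, window = False, b - 0xE0
                continue
            if b == UQU:                 # quote: next two bytes are literal
                b = data[i]
                i += 1
            elif 0xE8 <= b <= 0xF2:      # UDn, UDX, reserved: not in subset
                raise NotImplementedError(hex(b))
            cp = (b << 8) | data[i]      # big-endian 16-bit code unit
            i += 1
        elif b in (0x00, 0x09, 0x0A, 0x0D) or 0x20 <= b <= 0x7F:
            cp = b                       # passed through unchanged
        elif b >= 0x80:                  # offset into the active window
            cp = INITIAL_OFFSETS[window] + (b - 0x80)
        elif b == SQU:                   # quote one 16-bit code unit
            cp = (data[i] << 8) | data[i + 1]
            i += 2
        elif b == SCU:                   # switch to Unicode mode
            unicode_mode = True
            continue
        elif 0x10 <= b <= 0x17:          # SCn: select dynamic window n
            window = b - 0x10
            continue
        else:                            # SQn, SDn, SDX: not in subset
            raise NotImplementedError(hex(b))
        out.append(chr(cp))
        if cp == 0x0A and (unicode_mode or window != 0):
            line_safe = False
    return "".join(out), line_safe


# Four different byte sequences that all decode to U+00E9.
E_ACUTE_FORMS = [b"\xE9", bytes([SQU, 0x00, 0xE9]),
                 bytes([0x11, 0xA9]), bytes([SCU, 0x00, 0xE9])]

# U+0434 U+0430 LF via window 2, with SC0 emitted before the LF.
SAFE_LINE = bytes([0x12, 0xB4, 0xB0, 0x10, 0x0A])

# Same text twice, but the second line relies on the window selected
# in the first line.
UNSAFE = bytes([0x12, 0xB4, 0xB0, 0x0A, 0xB4, 0xB0, 0x0A])
```

Under these assumptions, SAFE_LINE decodes to the same text whether it is
decoded alone or after other lines, which is the property a line-oriented
tool needs. The second line of UNSAFE decodes to different characters
when taken out of context. The reset does not make the encoding of a
given line unique, as E_ACUTE_FORMS shows.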

Now if Cropley's algorithm is being presented as a replacement for, or
an alternative to, UTF-8, then it does need to be evaluated on criteria
like these, and Suzuki-san's observations become very relevant.

Some readers know that I created lots of encodings like this, about 10
years ago. Since that time, UTF-8 has extended its lead as the dominant
Unicode interchange format, and the usage profile for compressing
Unicode text has continued to move toward general-purpose compression,
with SCSU still available as the only unencumbered, byte-based Unicode
compression format. An initiative like Cropley's needs to be "better
enough" than either UTF-8 or SCSU to displace either one, and this one
very simply is not.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s

