Re: Compression through normalization

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Dec 06 2003 - 12:49:52 EST

Next message: Peter Jacobi: "Transcoding Tamil in the presence of markup"

Previous message: Peter Kirk: "Re: Compression through normalization"
In reply to: Philippe Verdy: "RE: Compression through normalization"
Next in thread: Peter Kirk: "Re: Compression through normalization"
Reply: Peter Kirk: "Re: Compression through normalization"

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> First, C10 only restricts modifications in order to preserve all the
> semantics of the encoded text in any context. There are situations
> where this restriction does not apply: when performing text
> transformations (such as folding, or even substringing, which may or
> may not respect canonical equivalence: case folding applied to
> substrings does not, as the concatenation of folded substrings does
> not always return canonically equivalent results, even if canonical
> cluster bounds are preserved when substringing).

Of course. C10 talks about a process that "purports not to modify the
interpretation" of its input. I'm not sure how explicit transformations
like folding and substringing came up.

> Compression of an existing text is not viewed as being a text
> transformation, so the intent of C10 should be observed, but only if
> the compressor claims that _it preserves_ canonical equivalence.

It had better do at least that. Any text compressor that transforms its
input in a way not allowed by C10 would be a complete failure.
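
To make "allowed by C10" concrete, here is a minimal sketch in Python of the test a round trip through a compressor would have to pass. The helper name canonically_equivalent is my own invention for illustration; the only library call is the standard unicodedata.normalize.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    # Hypothetical helper: two strings are canonically equivalent
    # exactly when their NFD forms are identical.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

print(composed == decomposed)                        # False
print(canonically_equivalent(composed, decomposed))  # True
print(canonically_equivalent(composed, "e"))         # False
```

A compressor whose output decompresses to something that fails this test against the original input has changed the text in a way C10 does not permit.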

> But as C10 does not mandate any normalized form (just canonical
> equivalence of the results), I don't think that it requires that a
> compressor should produce its result in either NFC or NFD form

Right. I know that. But Mark and Ken said it should, and so I'm trying
to find out if this *SHOULD* be in C10, or alternatively how I would
expect to find out about this extra recommendation if I weren't on the
Unicode mailing list.

Note that this is a devil's advocate argument. I'm not necessarily
disagreeing with Mark and Ken's recommendations. I'm just trying to
reconcile the differences between what they say (NFC or NFD only) and
what C10 says (any canonically equivalent sequence).

> Instead I think that it's up to the next process to determine which
> normalization form best fits its need: if the compressor was designed
> to recompose to NFC, and then the next process prefers NFD, the last
> renormalization in the compressor will be superfluous.
>
> So for me, a compressor can choose its own normalization on input and
> apply it before compressing, and the decompressor needs nothing else
> than just decompressing and keeping the string in the form that was
> accepted or forced on input by the compressor and encoded in the
> compressed stream.
>
> With this view, normalization of strings should not be done on output
> from a process but on its input.

Yes. Correct. To paraphrase Peter Kirk, according to C9, if my Unicode
text is going through Process A (which outputs NFD) and on to Process B
(which wants NFC input), A does not have to convert the text to NFC just
to appease B. Instead, it is up to B to do the work of converting to
NFC.
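
A small sketch of that division of labor, in Python. The names process_a and process_b are placeholders matching the paragraph above, and what B does with its input (counting code points) is an arbitrary stand-in chosen only so the effect is visible.

```python
import unicodedata

def process_a(text: str) -> str:
    # Hypothetical Process A: happens to emit NFD and makes no
    # effort to guess what its consumer would prefer.
    return unicodedata.normalize("NFD", text)

def process_b(text: str) -> int:
    # Hypothetical Process B: wants NFC, so it normalizes its own
    # input instead of relying on the producer to have done so.
    nfc = unicodedata.normalize("NFC", text)
    return len(nfc)  # stand-in for B's real work

from_a = process_a("caf\u00e9")
print(len(from_a))        # 5 code points: c, a, f, e, U+0301
print(process_b(from_a))  # 4 code points once B has recomposed
```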

So, in a compressor/decompressor environment, the compressor is the one
that has to do any normalization work. The decompressor just
decompresses. This is consistent with what I wrote a few days ago about
wanting this normalized-compression bit to work with existing
decompressors.
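
Roughly, in Python, with zlib over UTF-8 standing in for whatever compression scheme is actually in use (that substitution is mine, as are the function names and the choice of NFC as the default):

```python
import unicodedata
import zlib

def compress(text: str, form: str = "NFC") -> bytes:
    # All normalization work happens here, on the compressor's input.
    normalized = unicodedata.normalize(form, text)
    return zlib.compress(normalized.encode("utf-8"))

def decompress(blob: bytes) -> str:
    # An unmodified decompressor: it knows nothing about normalization.
    return zlib.decompress(blob).decode("utf-8")

original = "re\u0301sume\u0301"   # decomposed accents
restored = decompress(compress(original))

print(restored == original)       # False: the code points changed
print(unicodedata.normalize("NFD", restored)
      == unicodedata.normalize("NFD", original))  # True: still equivalent
```

The round trip is not code-point-for-code-point lossless, but the result is canonically equivalent to the input, and the decompressor needed no changes.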

> One can save unnecessary normalizations across multiple processes in
> one system, provided that the strings produced on output are reliably
> marked (out of band with some meta-data) to indicate their current
> supported normalization forms (i.e. if the string is already in one or
> more of the 4 standardized normalization forms): this just requires 8
> bits of information on output of any process, in addition to the
> output string, with 2 bits per form to mean: YES, NO, UNKNOWN/UNTESTED
> (and possibly MAYBE, if the input is also MAYBE and the process does
> not force or check any normalization form on input, this test being
> left for the next process if it needs it).
>
> If this is not indicated (in the output from an external and not
> directly supported process), then a fast-check on input may be used if
> this saves work.
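
For illustration, one way such an 8-bit tag could be packed, sketched in Python. The bit layout, the numeric values of the four states, and every name below are assumptions of this sketch; the proposal above does not specify any of them. It uses unicodedata.is_normalized, which needs Python 3.8 or later.

```python
import unicodedata
from enum import IntEnum

class Status(IntEnum):
    # Assumed 2-bit encoding of the four states.
    UNKNOWN = 0
    NO = 1
    YES = 2
    MAYBE = 3

# Assumed order: NFC in the low two bits, NFKD in the high two.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def pack(statuses: dict) -> int:
    byte = 0
    for i, form in enumerate(FORMS):
        byte |= int(statuses.get(form, Status.UNKNOWN)) << (2 * i)
    return byte

def unpack(byte: int) -> dict:
    return {form: Status((byte >> (2 * i)) & 0b11)
            for i, form in enumerate(FORMS)}

def tag(text: str) -> int:
    # Fully test the string against each form and record YES or NO.
    return pack({form: Status.YES if unicodedata.is_normalized(form, text)
                 else Status.NO for form in FORMS})

flags = unpack(tag("\u00e9"))
print(flags["NFC"].name, flags["NFD"].name)   # YES NO
```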

Subprocesses within a closed system may be able to make certain
assumptions for efficiency. Process B, for example, may know that its
only source of input is Process A, which is guaranteed always to produce
NFC. For any other situation, the algorithm described in Annex 8 of UAX
#15 should be employed (not re-invented).
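
As a sketch of "employed, not re-invented": in Python the standard library already exposes a check, unicodedata.is_normalized (3.8 and later), so an input stage can test first and normalize only when needed. I am assuming here that the library check is an acceptable stand-in for the UAX #15 detection algorithm; ensure_nfc is a made-up name.

```python
import unicodedata

def ensure_nfc(text: str) -> str:
    # Check first; pay for a full normalization pass only when the
    # input is not already in NFC.
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

print(ensure_nfc("abc") == "abc")            # True, returned unchanged
print(ensure_nfc("e\u0301") == "\u00e9")     # True, recomposed
```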

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

This archive was generated by hypermail 2.1.5 : Sat Dec 06 2003 - 13:42:12 EST