Re: Compression through normalization

From: Peter Kirk (peterkirk@qaya.org)
Date: Sat Dec 06 2003 - 13:58:26 EST

Next message: Doug Ewell: "Re: Compression through normalization"

Previous message: Peter Jacobi: "Transcoding Tamil in the presence of markup"
In reply to: Doug Ewell: "Re: Compression through normalization"
Next in thread: Doug Ewell: "Re: Compression through normalization"
Reply: Doug Ewell: "Re: Compression through normalization"

On 06/12/2003 09:49, Doug Ewell wrote:

> ...
>
>>But as C10 does not mandate any normalized form (just canonical
>>equivalence of the results), I don't think that it requires that a
>>compressor should produce its result in either NFC or NFD form
>>
>>
>
>Right. I know that. But Mark and Ken said it should, and so I'm trying
>to find out if this *SHOULD* be in C10, or alternatively how I would
>expect to find out about this extra recommendation if I weren't on the
>Unicode mailing list.
>
>
My feeling is that it should NOT be in C10. Normalisation is a
convenience and a recommendation, not a conformance issue. The only
conformance rules that mention it are C14-C16, and they do not specify
that normalisation should be done, only how it must be done if it is
done. I would also argue that adding such a rule to C10 would conflict
with C9, or at least with the principle underlying it: that for
conformance purposes all canonically equivalent forms are equal and
indistinguishable.
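
To make the C9 principle concrete, here is a minimal Python sketch,
assuming the standard unicodedata module. The helper name
canonically_equivalent is hypothetical, introduced only for this
illustration. It compares two strings by canonical equivalence rather
than by code point sequence.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two strings are canonically equivalent
    if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# The code point sequences differ...
print(precomposed == decomposed)
# ...but the two spellings are canonically equivalent.
print(canonically_equivalent(precomposed, decomposed))
```

Neither spelling is "more conformant" than the other; a process that
respects canonical equivalence treats them alike.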

>Note that this is a devil's advocate argument. I'm not necessarily
>disagreeing with Mark and Ken's recommendations. I'm just trying to
>reconcile the differences between what they say (NFC or NFD only) and
>what C10 says (any canonically equivalent sequence).
>
>
>
I would agree with these recommendations as long as they remain
recommendations rather than conformance requirements, and are perhaps
added to section 5.6 of TUS (4.0), which already includes the sentence
(in the context of canonically equivalent alternative spellings):

> Implementations that are “liberal” in what they accept, but
> “conservative” in what they issue, will have the fewest compatibility
> problems.

>>...
>>With this view, normalization of strings should not be done on output
>>from a process but on its input.
>>
>>
>
>Yes. Correct. To paraphrase Peter Kirk, according to C9, if my Unicode
>text is going through Process A (which outputs NFD) and on to Process B
>(which wants NFC input), A does not have to convert the text to NFC just
>to appease B. Instead, it is up to B to do the work of converting to
>NFC.
>
>
A more accurate paraphrase would be that process A may output any
canonically equivalent form, perhaps one which is not in any
normalisation form at all, e.g. because it contains precomposed
characters that are composition exclusions.
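
As an illustration of such a form, the sketch below (Python, standard
unicodedata module) uses U+0958 DEVANAGARI LETTER QA, a precomposed
character that is excluded from composition. The choice of this
character is my own example, not something taken from the thread. A
string containing it is in neither NFC nor NFD, yet it is canonically
equivalent to both.

```python
import unicodedata

# U+0958 DEVANAGARI LETTER QA decomposes canonically to U+0915 U+093C
# and is a composition exclusion, so NFC does not recompose it.
text = "\u0958"

nfc = unicodedata.normalize("NFC", text)
nfd = unicodedata.normalize("NFD", text)

print(text == nfc)   # is the original already in NFC?
print(text == nfd)   # is the original already in NFD?
print(nfc == nfd)    # here both forms are the decomposed sequence
```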

>So, in a compressor/decompressor environment, the compressor is the one
>that has to do any normalization work. The decompressor just
>decompresses. This is consistent with what I wrote a few days ago about
>wanting this normalized-compression bit to work with existing
>decompressors.
>
>
>
For "any normalization work" I would prefer to say "any canonically
equivalent transformation work". The compressor performs whatever
canonically equivalent transformations it chooses, perhaps because they
produce the best compression. The decompressor is only obliged to
decompress; C9 implies that it is not obliged to normalise and that no
other process can rely on it doing so - although it is free to do so and
may choose to do so if that is expected to improve overall efficiency.
Mark and Ken's recommendation is indeed that it should do so, and I have
no quarrel with that.
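
A rough sketch of this division of labour, in Python: the compressor
tries two canonically equivalent spellings of the input and keeps
whichever compresses smaller, while the decompressor only decompresses.
The function names, the use of zlib and the choice between just NFC and
NFD are all assumptions made for illustration; they are not the scheme
discussed in this thread.

```python
import unicodedata
import zlib


def compress_text(text: str) -> bytes:
    """Hypothetical compressor: try two canonically equivalent forms
    and keep whichever yields the smaller compressed output."""
    candidates = (
        unicodedata.normalize(form, text).encode("utf-8")
        for form in ("NFC", "NFD")
    )
    return min((zlib.compress(c) for c in candidates), key=len)


def decompress_text(blob: bytes) -> str:
    """The decompressor only decompresses; it does not normalise."""
    return zlib.decompress(blob).decode("utf-8")


original = "re\u0301sum\u00e9"   # mixed spelling: neither NFC nor NFD
restored = decompress_text(compress_text(original))

# The round trip need not be code-point identical, only canonically
# equivalent to the input.
same_text = (
    unicodedata.normalize("NFD", restored)
    == unicodedata.normalize("NFD", original)
)
print(same_text)
```

A downstream process that needs a particular form still has to
normalise for itself, since nothing above promises which form comes out.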

> ...
>
>Subprocesses within a closed system may be able to make certain
>assumptions for efficiency. Process B, for example, may know that its
>only source of input is Process A, which is guaranteed always to produce
>NFC. ...
>
Does C9 actually allow this? Well, perhaps within a closed system, but
then standardisation, and so Unicode, is irrelevant to data transfer
between sub-processes within a closed system. Outside a closed system,
Process B's best assumption for efficiency may be that Process A has
*probably* normalised, and so it is worth performing a quick check
first, before a full normalisation.
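
The "check first, normalise only if needed" idea might look like this
in Python. This is a sketch assuming Python 3.8 or later, where
unicodedata.is_normalized is available; the function name ensure_nfc is
hypothetical.

```python
import unicodedata


def ensure_nfc(text: str) -> str:
    """Hypothetical input filter for Process B: return text in NFC,
    doing the full normalisation only when the checks fail."""
    # Pure ASCII text is already in NFC, so it can be passed through.
    if text.isascii():
        return text
    # Cheaper test than unconditionally building a normalised copy.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Fall back to full normalisation.
    return unicodedata.normalize("NFC", text)


print(ensure_nfc("abc") == "abc")
print(ensure_nfc("e\u0301") == "\u00e9")
```

Process B makes no assumption about Process A here; it merely takes the
cheap path when A's output happens to be normalised already.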

>... For any other situation, the algorithm described in Annex 8 of UAX
>#15 should be employed (not re-invented).
>
>
>

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

