From: Asmus Freytag (email@example.com)
Date: Thu Sep 20 2007 - 07:07:25 CDT
On 9/19/2007 6:04 PM, Philippe Verdy wrote:
> Asmus Freytag [mailto:firstname.lastname@example.org] wrote:
>> You realize, also, that it is not (in the general case) possible to
>> apply normalization piece-meal. Because of that, breaking the text into
>> runs and then normalizing can give different results (in some cases),
>> which makes pre-processing a dicey option.
> That's not my opinion.
The fact that for many strings s and t, NFxx(s) + NFxx(t) != NFxx(s +
t) is not a matter of opinion. For such strings, you cannot normalize
them separately and then concatenate, and expect the result to be the
normalized form of the combined string. UAX #15 is quite clear about that.
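The non-closure under concatenation is easy to demonstrate with any stock library; a minimal sketch in Python, using the standard unicodedata module:

```python
import unicodedata

# "e" followed by a lone COMBINING ACUTE ACCENT: each piece is
# already in NFC on its own ...
s, t = "e", "\u0301"
piecewise = unicodedata.normalize("NFC", s) + unicodedata.normalize("NFC", t)

# ... but NFC of the concatenation composes the pair into U+00E9.
whole = unicodedata.normalize("NFC", s + t)

print(piecewise == whole)   # False: NFC(s) + NFC(t) != NFC(s + t)
print(whole == "\u00e9")    # True
```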
> At least the first step of the conversion (converting
> to NFC) is very safe and preserves differences, using standard programs
> (which are widely available, so this step represents no risk). Detecting
> compatibility characters and mapping them to annotated forms can be applied
> after this step as a very straightforward thing.
I had written:
> > Since none of the common libraries that implement normalization forms
> > perform the necessary mappings to markup out of the box, anyone
> > contemplating such a scheme would be forced to implement either a
> > pre-processing step, or their own normalization logic. This is a
> > downright scary suggestion, since such an approach would lose the
> > benefit of using well-tested implementations. Normalization is tricky
> > enough that one should try not to implement it from scratch if at all
> > possible.
Your approach confirms what I suspected. By suggesting an approach like
this, you are advocating a de-novo implementation of the normalization
transformation. Incidentally, NFC would be a poor starting point for your
scheme, since all normalization forms begin with an (implied) first step
of applying *de*composition. But you cannot even start from NFD: the
minute your following step decomposes any compatibility characters, you
can in principle create sequences that denormalize the surrounding,
previously NFD, text. The work needed to handle these exceptions amounts,
logically speaking, to a full implementation of normalization.
In other words, you've lost the benefit of your library.
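The denormalization hazard is concrete, not hypothetical. A sketch in Python of the naive "decompose compatibility characters after NFD" step such a scheme would need; U+FF9E (halfwidth katakana voiced sound mark) is one convenient trigger:

```python
import unicodedata

# A string already in NFD: KA (U+304B) + COMBINING DOT BELOW
# (ccc 220) + HALFWIDTH KATAKANA VOICED SOUND MARK, a
# compatibility character with ccc 0 and no canonical
# decomposition, so NFD leaves it alone.
text = "\u304b\u0323\uff9e"
print(unicodedata.normalize("NFD", text) == text)   # True: in NFD

# Naive follow-up step: replace the compatibility character
# in place with its compatibility decomposition, U+3099 (ccc 8).
mapped = text[:-1] + unicodedata.normalize("NFKD", "\uff9e")

# The combining marks now run ccc 220 then ccc 8 -- canonical
# ordering is violated, so the surrounding text has been
# denormalized by the substitution.
print(unicodedata.normalize("NFD", mapped) == mapped)  # False
```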
It is precisely because normalization is unexpectedly tricky in its
details that anything other than using an established library to apply
NFC or NFD should not be contemplated in a *practical* implementation.
Ken's suggestions on how to deal with the overall situation were far
less speculative and far more to the point; they also spoke from real
experience. His suggestion, to recap it here, was to apply NFC for
internal storage, and to apply the necessary foldings when analyzing
or presenting the data (depending on the view). By using foldings that
are themselves designed to preserve NFC (see UTS #30), an implementer
remains in control of how the data is massaged, without having to
re-implement and re-test normalization.
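That division of labor is easy to sketch. The Python below uses NFKC plus case folding as a stand-in for UTS #30-style foldings; the function names and the particular folding chosen are illustrative assumptions, not part of Ken's proposal:

```python
import unicodedata

def store(s: str) -> str:
    # Internal storage: plain NFC via a well-tested library call.
    return unicodedata.normalize("NFC", s)

def fold_for_matching(s: str) -> str:
    # Applied only when analyzing/searching a view of the data; the
    # stored NFC text itself is never rewritten. NFKC + casefold is
    # an illustrative folding; a real implementation would pick the
    # specific foldings each view needs.
    return unicodedata.normalize("NFKC", s).casefold()

stored = store("\u00c5ngstr\u00f6m")   # "Ångström", kept in NFC
query = "\u212b"                       # U+212B ANGSTROM SIGN
print(fold_for_matching(query) in fold_for_matching(stored))  # True
```

The folding maps the compatibility character U+212B and the stored text onto a common form at comparison time, so matching works without ever storing a compatibility-decomposed string.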
> But I won't recommend applying
> blindly an NFKC/D transformation ...
Well at least we agree on that much.
This archive was generated by hypermail 2.1.5 : Thu Sep 20 2007 - 07:09:53 CDT