RE: Normalization in panlingual application

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Sep 19 2007 - 20:04:00 CDT

Next message: Philippe Verdy: "RE: New Public Review Issue: Proposed Update UTS #18"

Previous message: Asmus Freytag: "Re: Normalization in panlingual application"
In reply to: Asmus Freytag: "Re: Normalization in panlingual application"
Next in thread: Asmus Freytag: "Re: Normalization in panlingual application"
Reply: Asmus Freytag: "Re: Normalization in panlingual application"

Asmus Freytag [mailto:asmusf@ix.netcom.com] wrote:
> Sent: Thursday, September 20, 2007 02:41
> To: verdy_p@wanadoo.fr
> Cc: 'Jonathan Pool'; unicode@unicode.org
> Subject: Re: Normalization in panlingual application
>
> On 9/19/2007 5:21 PM, Philippe Verdy wrote:
> > Asmus Freytag wrote:
> >
> >> What is an invalid distinction is defined by your application. If you
> >> case-fold, case is an invalid distinction. If your goal is to be able to
> >> represent text faithfully, then the "K" series of normalizations has no
> >> place in your design (It's too haphazard - for example, also, 5¼ would
> >> be turned into 51/4, which is decidedly not the same thing).
> >>
> >
> > This is only a problem if the converted text has to be plain-text. If the
> > target of the project is to allow building rich-text documents from the
> > corpus, then conversion using NFKC becomes possible,
> A conversion that takes plain text to rich text, by definition, is not
> one of the Unicode Normalization Forms, but, at best, a higher-level
> protocol built on top of a Normalization form. More on that below.
> > for example using some
> > XML annotation, the conversion would be something like
> > "5<fraction>1/4</fraction>" and will be in NFKC form (also in NFC form by
> > definition, as well as, here, in NFKD and NFD forms).
> >
> Since none of the common libraries that implement normalization forms
> perform the necessary mappings to markup out of the box, anyone
> contemplating such a scheme would be forced to implement either a
> pre-processing step, or their own normalization logic. This is a
> downright scary suggestion, since such an approach would lose the
> benefit of using a well-tested implementation. Normalization is tricky
> enough that one should try not to implement it from scratch if at all
> possible.
>
> You realize, also, that it is not (in the general case) possible to
> apply normalization piece-meal. Because of that, breaking the text into
> runs and then normalizing can give different results (in some cases),
> which makes pre-processing a dicey option.

That's not my opinion. At least the first step of the conversion (converting
to NFC) is very safe and preserves differences, using standard programs
(which are widely available, so this step represents no risk). Detecting
compatibility characters and mapping them to annotated forms can be done
after this step in a very straightforward way. But I won't recommend blindly
applying an NFKC/D transformation to a large corpus of texts in different
formats with different conventions without asking oneself why some
differences were encoded like this, using compatibility characters (not in
NFKC/D form).
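
As a rough illustration of that two-step idea, here is a minimal Python
sketch using only the standard unicodedata module. The function name
annotate_compat and the choice of reusing the decomposition tag as the
element name are my own assumptions for the example, not an established
convention:

```python
import unicodedata
from xml.sax.saxutils import escape

def annotate_compat(text):
    """NFC-normalize, then wrap each compatibility character in markup."""
    out = []
    # Step 1: canonical normalization of the whole text (lossless).
    for ch in unicodedata.normalize("NFC", text):
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith("<"):
            # Compatibility mappings carry a tag such as "<fraction>".
            name = decomp.split(" ", 1)[0].strip("<>")
            # Step 2: fold only this character, keep the tag as annotation.
            folded = unicodedata.normalize("NFKC", ch)
            out.append(f"<{name}>{escape(folded)}</{name}>")
        else:
            out.append(escape(ch))
    return "".join(out)

print(annotate_compat("5\u00bc"))
```

Note that the folded fraction contains U+2044 FRACTION SLASH rather than the
ASCII "/", so a real conversion would still have to decide what to do with
it.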

> > Such annotation changes the nature of the texts: they are no longer
> > linear sequences of characters, because a structure is mapped on top of
> > them.
> Precisely the point I made above.
> > Given the
> > expected usage, parsing texts to map a structure on top of them will
> > probably be a bonus, as it will ease later reuse of the converted corpus
> > within different contexts.
> >
> >
> That may be, but the better choice is to normalize according to NFC/NFD
> and then apply particular foldings, and/or substitutions of markup. The
> "K" series of normalization forms, by default, just "k"orrupt the data.

That's exactly what I suggested: avoiding blind usage of the K series on a
large corpus of texts. Before applying any particular folding, first look at
how much it affects the corpus: test for differences, count them, extract
them to check the results, and only then apply the folding to the selected
sub-corpus, step by step. This can't be fully automated, as it requires some
knowledge of the language or subject of the texts, and review by humans.
Using NFKD blindly is very risky for a large corpus, as there's no way to
revert the lossy changes...
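
The "extract the differences" step could look something like the sketch
below. It changes nothing in the corpus; it only lists, for human review,
each character that NFKC would rewrite, with a bit of context. The function
name review_samples, the dict-of-texts input and the context width are
assumptions made for the example:

```python
import unicodedata

def review_samples(docs, width=10):
    """Yield (doc_id, original, folded, context) for each lossy NFKC change."""
    for doc_id, text in docs.items():
        nfc = unicodedata.normalize("NFC", text)
        for i, ch in enumerate(nfc):
            folded = unicodedata.normalize("NFKC", ch)
            if folded != ch:
                # Keep a few characters on each side for the reviewer.
                context = nfc[max(0, i - width): i + width + 1]
                yield doc_id, ch, folded, context

docs = {"a": "5\u00bc cups", "b": "plain"}
for row in review_samples(docs):
    print(row)
```

A reviewer can then decide, per sub-corpus, which of the listed changes are
acceptable before anything is rewritten.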

Computing statistics about the compatibility characters remaining in the
converted text will help manage the work to do, by helping to make
subselections. Also, if you keep in the corpus some other metadata
indicating the source, usage, or subject of each text, this may help make
those subselections more relevant. This way, a mapping need not be applied
to the whole corpus, but only where it is relevant. It's quite easy to
compute such statistics, because the compatibility characters are
exhaustively known, and there are not so many of them in Unicode. A very
basic filter can count their occurrences in the indexed texts.
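
Such a filter might be sketched as follows, again with Python's unicodedata
module. The (source, text) record shape and the name compat_stats are
hypothetical; the counts are grouped by the decomposition tag found in the
Unicode Character Database (for example <fraction> or <super>):

```python
import unicodedata
from collections import Counter, defaultdict

def compat_stats(records):
    """Count compatibility characters per source, keyed by decomposition tag."""
    stats = defaultdict(Counter)
    for source, text in records:
        for ch in unicodedata.normalize("NFC", text):
            decomp = unicodedata.decomposition(ch)
            if decomp.startswith("<"):
                tag = decomp.split(" ", 1)[0]
                stats[source][tag] += 1
    return stats

records = [("recipes", "5\u00bc and 2\u00bd"), ("math", "x\u00b2")]
for source, counts in compat_stats(records).items():
    print(source, dict(counts))
```

Sources that never appear in the result contain no compatibility characters
and can be left out of any later folding pass.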

NFKC has its uses, but in limited contexts: as a last-resort step when
human review is not possible, such as in full-text search engines, to
extend searches to nearly equivalent texts with approximate matches. In
such usage, no text is actually produced and stored in NFKC and the corpus
is not modified; the NFK(C/D) forms are used only to generate indexing
keys. I think this is its only safe application, provided that exact
searches of canonically equivalent strings remain possible through a
separate search entry.
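
To make the indexing idea concrete, here is a toy sketch, not a real search
engine. The class name Index, the whitespace tokenization and the two
dictionaries are assumptions for the example. The stored text is never
modified: NFKC is used only for the loose keys, and NFC keys give the
separate exact entry:

```python
import unicodedata
from collections import defaultdict

class Index:
    def __init__(self):
        self.docs = {}                  # original texts, stored unchanged
        self.exact = defaultdict(set)   # NFC keys: canonical equivalence only
        self.loose = defaultdict(set)   # NFKC keys: approximate matching

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for word in text.split():
            self.exact[unicodedata.normalize("NFC", word)].add(doc_id)
            self.loose[unicodedata.normalize("NFKC", word)].add(doc_id)

    def search(self, term, exact=False):
        form, table = ("NFC", self.exact) if exact else ("NFKC", self.loose)
        return table.get(unicodedata.normalize(form, term), set())

idx = Index()
idx.add(1, "\ufb01le")   # spelled with the U+FB01 ligature
idx.add(2, "file")
print(idx.search("file"), idx.search("file", exact=True))
```

The loose search finds both documents, while the exact search finds only the
one whose text is canonically equivalent to the query.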

This archive was generated by hypermail 2.1.5 : Wed Sep 19 2007 - 20:05:39 CDT