Re: Rendering Raised FULL STOP between Digits

From: Asmus Freytag
Date: Fri, 22 Mar 2013 18:01:14 -0700

On 3/22/2013 12:08 PM, Karl Williamson wrote:
> On 03/21/2013 04:48 PM, Richard Wordingham wrote:
>> For linguistic analysis, you need the normalisation appropriate to the
>> task.

Linguistic analysis (in general) being a hugely complex undertaking,
mere normalization pales in comparison, so wrapping normalization into
the processing isn't going to make it that much more complicated.

>> This is a case where Unicode normalisation generally throws away
>> information (namely, how the author views the characters),

Canonical normalization is supposed to take care of distinctions that
fit within the same view of the character by the author and concern
principally distinctions that could be said to be "artifacts of the

The same is emphatically NOT true for COMPATIBILITY normalization.
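
As a small illustration of that difference (the two characters are
picked here for the example, and Python's standard unicodedata module
is assumed as the normalization library): a canonical singleton such as
U+212B ANGSTROM SIGN is folded by NFC, while a ligature such as U+FB01,
which an author may well have chosen on purpose, is only folded by the
compatibility forms.

```python
import unicodedata

# U+212B ANGSTROM SIGN: canonical (singleton) decomposition to U+00C5.
ANGSTROM_SIGN = "\u212B"
# U+FB01 LATIN SMALL LIGATURE FI: compatibility decomposition only.
LIGATURE_FI = "\uFB01"


def forms(s):
    """Return the (NFC, NFKC) forms of s."""
    return (unicodedata.normalize("NFC", s),
            unicodedata.normalize("NFKC", s))


for ch in (ANGSTROM_SIGN, LIGATURE_FI):
    nfc, nfkc = forms(ch)
    # Print code points so the difference is visible in plain text.
    print([f"U+{ord(c):04X}" for c in ch],
          "NFC:", [f"U+{ord(c):04X}" for c in nfc],
          "NFKC:", [f"U+{ord(c):04X}" for c in nfkc])
```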

>> whereas in
>> analysing Burmese you may want to ignore the order of non-interacting
>> medial signs even though they have canonical combining class 0. I have
>> found it useful to use a fake UnicodeData.txt to perform a non-Unicode
>> normalisation using what were intended to be routines for performing
>> Unicode normalisation. Fake decompositions are routinely added to the
>> standard ones when generating the default collation weights for the
>> Unicode Collation Algorithm - but there the results still comply with
>> the principle of canonical equivalence.

This description seems to capture an "implementation technique" that
could be a shortcut - assuming that normalization wasn't a separate,
up-front pass. Some algorithms may need to normalize data in ways
that make folding the standard Unicode Normalization steps into them
attractive from a performance point of view (even if not from a
maintenance point of view).

>> However, distinguishing U+00B7 and U+0387 would fail spectacularly
>> if the text had been converted to form NFC before you received it.

That's a claim for which the evidence isn't yet solid; if it could be
made solid, that would make the claim very interesting.

> This is the first time I've heard someone suggest that one can
> "tailor" normalizations. Handling Greek shouldn't require having to
> fake UnicodeData.txt. And writing normalization code is complex and
> tricky, so people use pre-written code libraries to do this. What
> you're suggesting says that one can't use such a library as-is, but
> you would have to write your own. I suppose another option is to
> translate all the characters you care about into non-characters before
> calling the normalization library, and then translate back afterwards,
> and hope that the library doesn't use the same non-character(s)
> internally.

"Handling" Greek in the context of run-of-the-mill algorithms should
probably not be done by folding Normalization into them (for the
excellent reasons given). But for some performance sensitive and rather
complex types of detailed linguistic analysis I might accept the
suggestion as a possible shortcut (over a two-pass process). Given the
existence of such a shortcut, "modifying" the normalization part of the
combined algorithm is an interesting suggestion as an implementation

"Tunneling" through an existing normalization library would be a hack,
which should never be necessary except where normalization is broken
(see compatibility Han characters).
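
For concreteness, the "tunneling" hack described in the quoted text
might look roughly like the sketch below. Everything in it is an
assumption made for illustration: the function name is hypothetical,
U+FDD0 is an arbitrary choice of noncharacter, and Python's unicodedata
module stands in for "the normalization library". Note that the result
is, by definition, not in NFC.

```python
import unicodedata

ANO_TELEIA = "\u0387"    # GREEK ANO TELEIA
# Hypothetical placeholder: a noncharacter assumed to be absent from the
# input and not used internally by the normalization library.
PLACEHOLDER = "\uFDD0"


def nfc_preserving_ano_teleia(text):
    """Apply NFC to text but keep U+0387 from being mapped to U+00B7."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder already present in input")
    # Hide U+0387 behind the placeholder, normalize, then restore it.
    tunneled = text.replace(ANO_TELEIA, PLACEHOLDER)
    normalized = unicodedata.normalize("NFC", tunneled)
    return normalized.replace(PLACEHOLDER, ANO_TELEIA)
```

This only protects data at the one point where the function is called;
it does nothing for text that was already normalized upstream.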

However, even if standard canonical decompositions can be mistaken,
tunneling isn't really a fool-proof answer, because it assumes that data
didn't get normalized en route. There's nothing that reliably prevents
that from happening in a distributed system (unless all the parts are
under your tight control, which would seem to make it a distributed
system in name only).

> And the question I have is under what circumstances would better
> results be obtained by doing this normalization? I suspect that the
> answer is only for backward compatibility with code written before
> Unicode came into existence. If I'm right, then it would be better
> for most normalization routines to ignore/violate the Standard, and
> not do this normalization.

Let's get back to the interesting question:

Is it possible to correctly process text that uses 00B7 for ANO TELEIA,
or is this fundamentally impossible? If so, under what scenario?
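
As background for that question, a quick check (using Python's
unicodedata module, which is an assumption about tooling and not
something prescribed anywhere in this thread) shows that U+0387 has a
canonical singleton decomposition to U+00B7, so text in any of the four
normalization forms can only contain 00B7:

```python
import unicodedata

MIDDLE_DOT = "\u00B7"   # MIDDLE DOT
ANO_TELEIA = "\u0387"   # GREEK ANO TELEIA


def survives(form):
    """True if U+0387 is left unchanged by the given normalization form."""
    return unicodedata.normalize(form, ANO_TELEIA) == ANO_TELEIA


# The decomposition field from UnicodeData.txt; no <tag> means canonical.
print("decomposition of U+0387:", unicodedata.decomposition(ANO_TELEIA))

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, ANO_TELEIA)
    print(form, "->", f"U+{ord(result):04X}", "survives:", survives(form))
```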

Received on Fri Mar 22 2013 - 20:03:52 CDT
