Re: Rendering Raised FULL STOP between Digits from Karl Williamson on 2013-03-22 (Unicode Mail List Archive)

From: Karl Williamson <public_at_khwilliamson.com>
Date: Fri, 22 Mar 2013 13:08:01 -0600

On 03/21/2013 04:48 PM, Richard Wordingham wrote:
> For linguistic analysis, you need the normalisation appropriate to the
> task. This is a case where Unicode normalisation generally throws away
> information (namely, how the author views the characters), whereas in
> analysing Burmese you may want to ignore the order of non-interacting
> medial signs even though they have canonical combining class 0. I have
> found it useful to use a fake UnicodeData.txt to perform a non-Unicode
> normalisation using what were intended to be routines for performing
> Unicode normalisation. Fake decompositions are routinely added to the
> standard ones when generating the default collation weights for the
> Unicode Collation Algorithm - but there the results still comply with
> the principle of canonical equivalence.
>
> However, distinguishing U+00B7 and U+0387 would fail spectacularly
> if the text had been converted to form NFC before you received it.
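
To make the quoted point concrete: U+0387 GREEK ANO TELEIA has a canonical singleton decomposition to U+00B7 MIDDLE DOT, so every normalization form replaces it. The following is a minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

ano_teleia = "\u0387"  # GREEK ANO TELEIA
middle_dot = "\u00b7"  # MIDDLE DOT

# Normalize the single character under each of the four forms.
results = {
    form: unicodedata.normalize(form, ano_teleia)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, normalized in results.items():
    # Show the resulting code point and whether it is now MIDDLE DOT.
    print(form, f"U+{ord(normalized):04X}", normalized == middle_dot)
```

Once text has passed through any of these forms, nothing remains to indicate which of the two characters the author originally typed.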

This is the first time I've heard someone suggest that one can "tailor"
normalizations. Handling Greek shouldn't require having to fake
UnicodeData.txt. And writing normalization code is complex and tricky,
so people use pre-written code libraries to do this. What you're
suggesting implies that one can't use such a library as-is, but would
have to write one's own. I suppose another option is to translate all
the characters you care about into non-characters before calling the
normalization library, and then translate back afterwards, and hope that
the library doesn't use the same non-character(s) internally.
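
The non-character workaround described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `normalize_preserving`, the `PLACEHOLDERS` table, and the choice of U+FDD0 as the stand-in are all hypothetical, and it relies on the library (here Python's `unicodedata`) passing non-characters through unchanged.

```python
import unicodedata

# Hypothetical mapping: each character to protect gets its own
# non-character stand-in (U+FDD0..U+FDEF are non-characters).
PLACEHOLDERS = {"\u0387": "\ufdd0"}


def normalize_preserving(text, form="NFC", placeholders=PLACEHOLDERS):
    """Normalize text while keeping the protected characters intact."""
    # Refuse input that already contains a stand-in, since it could not
    # be told apart from a protected character afterwards.
    for sentinel in placeholders.values():
        if sentinel in text:
            raise ValueError(f"input already contains U+{ord(sentinel):04X}")

    # Swap protected characters out, normalize, then swap them back in.
    shielded = text.translate(str.maketrans(placeholders))
    normalized = unicodedata.normalize(form, shielded)
    restore = {sentinel: char for char, sentinel in placeholders.items()}
    return normalized.translate(str.maketrans(restore))


sample = "\u03b1\u0387 e\u0301"  # alpha, ano teleia, space, e + combining acute
plain = unicodedata.normalize("NFC", sample)
kept = normalize_preserving(sample)
```

Here `plain` contains U+00B7 where the ano teleia was, while `kept` still contains U+0387; in both, the `e` and its combining acute compose to U+00E9. The swap is safe for U+0387 because neither it nor the stand-in combines with neighbouring characters; a protected character that does interact with adjacent marks would need more care.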

And the question I have is: under what circumstances would better results
be obtained by normalizing U+0387 to U+00B7? I suspect that the answer is
only for backward compatibility with code written before Unicode came
into existence. If I'm right, then it would be better for most
normalization routines to ignore/violate the Standard, and not do this
normalization.
Received on Fri Mar 22 2013 - 14:11:29 CDT
