Re: Rendering Raised FULL STOP between Digits from Richard Wordingham on 2013-03-22 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 23 Mar 2013 01:04:08 +0000

On Fri, 22 Mar 2013 13:08:01 -0600
Karl Williamson <public_at_khwilliamson.com> wrote:

> This is the first time I've heard someone suggest that one can
> "tailor" normalizations.

I think the officially acceptable term is 'folding'. One would
not be 'tailoring a Unicode normalisation', but subverting the
code to do what you need. However, in my cases I've also wanted
rearrangement as though the characters had what I consider useful
canonical combining classes.

> Handling Greek shouldn't require having to fake UnicodeData.txt.

It does if you have problems with the normalisations. Now it can be
argued that the problem is with you if you have difficulty treating
U+003B SEMICOLON as indicating a question, but there are many ways of
doing most tasks.
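
The underlying issue is that U+037E GREEK QUESTION MARK has a canonical singleton decomposition to U+003B, so every normalisation form replaces it. A minimal check, assuming Python's standard unicodedata module (the sample string is only an illustration):

```python
import unicodedata

GREEK_QUESTION_MARK = "\u037e"
text = "\u03c4\u03af" + GREEK_QUESTION_MARK   # tau, iota with tonos, U+037E

results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalised = unicodedata.normalize(form, text)
    results[form] = normalised
    # The last character comes back as U+003B SEMICOLON in every form.
    print(form, [f"U+{ord(c):04X}" for c in normalised])

# The decomposition field is a single code point with no <tag>,
# i.e. a canonical singleton.
decomposition = unicodedata.decomposition(GREEK_QUESTION_MARK)
print(decomposition)
```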

> And
> writing normalization code is complex and tricky, so people use
> pre-written code libraries to do this. What you're suggesting says
> that one can't use such a library as-is, but you would have to write
> your own.

From your description of what you were doing, I assumed you were in
charge, rather than the subcontractor being in charge. However, some
utilities have the nasty habit of hiding the key data where users can't
get at it. One very legitimate reason for changing the data is to test
a proposed change to the standard. Myself, I've been pleasantly
surprised at how quick it is to parse UnicodeData.txt or even to loop
through all codepoints.
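
As a rough sketch of both approaches, assuming Python: the first function splits one record in the UnicodeData.txt format, and the second loops over every code point using the data bundled with the interpreter. The function names are hypothetical, and the parser ignores the First/Last range records that the real file uses for large blocks.

```python
import unicodedata

def parse_unicodedata_line(line):
    """Split one UnicodeData.txt record into the fields used here."""
    fields = line.rstrip("\n").split(";")
    return {
        "code": int(fields[0], 16),
        "name": fields[1],
        "ccc": int(fields[3]),         # canonical combining class
        "decomposition": fields[5],    # empty if the character has none
    }

# A record in the UnicodeData.txt format (15 semicolon-separated fields).
sample = "037E;GREEK QUESTION MARK;Po;0;ON;003B;;;;N;;;;;"
record = parse_unicodedata_line(sample)

def canonical_singletons():
    """Loop over all code points and collect those whose canonical
    decomposition is a single code point, according to the Unicode
    version bundled with this Python."""
    found = {}
    for cp in range(0x110000):
        decomp = unicodedata.decomposition(chr(cp))
        # Compatibility decompositions start with a <tag>; skip them.
        if decomp and not decomp.startswith("<") and " " not in decomp:
            found[cp] = int(decomp, 16)
    return found

singletons = canonical_singletons()
print(len(singletons), "canonical singletons")
print(f"U+{record['code']:04X} -> U+{singletons[record['code']]:04X}")
```

To test a proposed change, one would edit the parsed records (or the resulting table) before building the normalisation data from them.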

> I suppose another option is to translate all the
> characters you care about into non-characters before calling the
> normalization library, and then translate back afterwards, and hope
> that the library doesn't use the same non-character(s) internally.

With over two planes of Private Use Area at your disposal, you needn't
resort to non-characters.
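
A minimal sketch of that translate-and-restore approach, assuming Python and an unmodified normalisation library. The function name and the choice of plane 15 as the stand-in range are assumptions for illustration, and the input is assumed not to use those code points already.

```python
import unicodedata

PUA_BASE = 0xF0000   # start of the plane 15 private use area; arbitrary choice

def shielded_normalize(form, text, protected):
    """Normalise text while leaving the characters in `protected` alone.

    Each protected character is swapped for a private use stand-in,
    the library normalises the result, and the stand-ins are swapped back.
    """
    to_pua = {ord(ch): PUA_BASE + i for i, ch in enumerate(sorted(protected))}
    from_pua = {pua: original for original, pua in to_pua.items()}
    if any(ord(ch) in from_pua for ch in text):
        raise ValueError("input already uses the stand-in code points")
    shielded = text.translate(to_pua)
    normalised = unicodedata.normalize(form, shielded)
    return normalised.translate(from_pua)

# tau, iota, combining acute, Greek question mark
text = "\u03c4\u03b9\u0301\u037e"
plain = unicodedata.normalize("NFC", text)
kept = shielded_normalize("NFC", text, {"\u037e"})
print([f"U+{ord(c):04X}" for c in plain])   # ends in U+003B
print([f"U+{ord(c):04X}" for c in kept])    # ends in U+037E
```

Private use characters have canonical combining class 0, so a stand-in behaves as a starter: this protects a character from being replaced, but it does not give the rearrangement one would get from a different combining class.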

> If I'm right, then it would be better
> for most normalization routines to ignore/violate the Standard, and
> not do this normalization.

It is certainly true that normalising everything can be a bad idea.
Normalising CJK compatibility characters is a very good way of
preventing round-tripping! As to normalisation in general, if one's
input were normalised immediately upon receipt, one could not keep
track of how many deletions were needed to cancel a keystroke, and
some input methods would go badly wrong. In general, one
Unicode-compliant process cannot instruct another to do something like
'delete the last 5 characters' - sometimes a process needs to not be
Unicode-compliant.
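
Both effects can be seen in a few lines, assuming Python's unicodedata module. The input method here is hypothetical: it inserts two code points for one keystroke and remembers that two deletions undo it.

```python
import unicodedata

# U+F900 is a CJK compatibility ideograph with a canonical singleton
# decomposition, so even NFC replaces it and the original is lost.
compat = "\uf900"
unified = unicodedata.normalize("NFC", compat)
print(f"U+{ord(compat):04X} -> U+{ord(unified):04X}")   # U+F900 -> U+8C48

# The hypothetical input method's keystroke: e + COMBINING ACUTE ACCENT.
keystroke = "e\u0301"
deletions_remembered = len(keystroke)

# If the receiving buffer is normalised to NFC on receipt, the count is stale.
buffer = unicodedata.normalize("NFC", "caf" + keystroke)
print(deletions_remembered, len(buffer) - len("caf"))    # 2 1

# Deleting the remembered number of code points removes too much.
after_undo = buffer[:-deletions_remembered]
print(after_undo)    # ca
```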

Richard.
Received on Fri Mar 22 2013 - 20:04:08 CDT
