From: Asmus Freytag (firstname.lastname@example.org)
Date: Wed Oct 15 2003 - 11:48:47 CST
I'm going to answer some of Peter's points, leaving aside the interesting
digressions into Java subclassing etc. that have developed later in the
At 04:19 AM 10/15/03 -0700, Peter Kirk wrote:
>I note the following text from section 5.13, p.127, of the Unicode
>>Canonical equivalence must be taken into account in rendering multiple
>>accents, so that any two canonically equivalent sequences display as the same.
This statement goes to the core of Unicode. If it is followed, it
guarantees that normalizing a string does not change its appearance (and
therefore it remains the 'same' string as far as the user is concerned.)
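
As a small illustration of what "canonically equivalent" means in practice, here is a sketch in Python. The language, its standard unicodedata module, and the helper name canonically_equivalent are my choices for the example, not anything the standard prescribes. It compares two strings after bringing both to the same normalization form:

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b are canonically equivalent.

    Two strings are canonically equivalent when they have the same
    canonical decomposition (NFD).
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Precomposed versus decomposed spelling of the same letter.
precomposed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # e + COMBINING ACUTE ACCENT

# Multiple accents: dot above and dot below in either order.
order_1 = "q\u0307\u0323"   # q + DOT ABOVE + DOT BELOW
order_2 = "q\u0323\u0307"   # q + DOT BELOW + DOT ABOVE

# The code point sequences differ, but each pair is canonically equivalent,
# so a renderer following the guideline would display each pair identically.
same_letter = canonically_equivalent(precomposed, decomposed)
same_marks = canonically_equivalent(order_1, order_2)
```
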
>The word "must" is used here. But this is part of the "Implementation
>Guidelines" chapter which is generally not normative. Should this sentence
>with "must" be considered mandatory, or just a recommendation although in
>certain cases a "particularly important" one?
If you read the conformance requirements you deduce that any normalized or
unnormalized form of a string must represent the same 'content' on
interchange. However, the designers of the standard wanted to make even
specialized uses, such as 'reveal character codes' explicitly conformant.
Therefore you are free to show to a user whether a string is precomposed or
composed of combining characters, e.g. by using a different font color for
each character code.
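
A 'reveal character codes' view could be sketched roughly as follows. This is Python with the standard unicodedata module, and the function name reveal_codes is made up for the example. It deliberately does not normalize, so the precomposed and decomposed spellings show up differently:

```python
import unicodedata

def reveal_codes(text: str) -> list:
    """Hypothetical 'reveal codes' view: one entry per code point, no normalization."""
    return [
        # Fall back to a placeholder for code points that have no character name.
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]

precomposed_view = reveal_codes("\u00e9")
decomposed_view = reveal_codes("e\u0301")
```
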
The guidelines are concerned with the average case: displaying the
characters as *text*.
[The use of the word 'must' in a guideline is always awkward, since that
word has such a strong meaning in the normative part of the standard.]
>>Rendering systems should handle any of the canonically equivalent orders
>>marks. This is not a performance issue: The amount of time necessary to
>>marks is insignificant compared to the time necessary to carry out other
The interesting digressions on string libraries aside, the statement made
here is in the context of the tasks needed for rendering. If you take a
rendering library and add a normalization pass on the front of it, you'll
be hard-pressed to notice a difference in performance, especially for any
So we conclude: "rendering any string as if it was normalized" is *not* a
However, from the other messages on this thread we conclude: normalizing
*every* string, *every time* it gets touched, *is* a performance issue.
A few things: Unicode provides data that allow you to perform a 'Normalization
Quick Check', which simply determines whether there is anything that might
be affected by normalization. (For example, nothing in this e-mail message
is affected by normalization, no matter to which form, since it's all in
With a quick check like that you should be able to reduce the cost of
normalization dramatically, unless your data needs normalization
throughout. Even then, if there is a chance that the data is
already normalized, verifying that is faster than normalizing (since
verification doesn't re-order).
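
The check-before-normalize idea might look like this in Python (3.8 or later, where unicodedata.is_normalized is available). The function name normalize_if_needed and the default of NFC are assumptions for the sketch, and how the library performs the check internally is an implementation detail I am not relying on:

```python
import unicodedata

def normalize_if_needed(text: str, form: str = "NFC") -> str:
    """Hypothetical helper: normalize only when the cheap checks fail."""
    # ASCII-only text is unaffected by normalization, whatever the form.
    if text.isascii():
        return text
    # Verify first: if the text is already in the requested form,
    # hand back the original string and skip the normalization pass.
    if unicodedata.is_normalized(form, text):
        return text
    return unicodedata.normalize(form, text)

ascii_text = "nothing to normalize here"
already_nfc = "caf\u00e9"
needs_work = "cafe\u0301"

result_ascii = normalize_if_needed(ascii_text)
result_nfc = normalize_if_needed(already_nfc)
result_fixed = normalize_if_needed(needs_work)
```

In the first two cases the original string object comes back unchanged; only the third pays for an actual normalization.
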
Then, after that, as others have pointed out, if you can keep track of a
normalized state, either by recordkeeping or by having interfaces inside
which the data is guaranteed to be normalized, then you cut your costs further.
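
One way to get an interface "inside which the data is guaranteed to be normalized" is to normalize once at the boundary and carry the guarantee in a type. The following Python sketch assumes NFC as the internal form; the class NFCText and the function same_text are hypothetical names for the illustration:

```python
import unicodedata

class NFCText:
    """Hypothetical wrapper: every instance holds NFC-normalized text."""
    __slots__ = ("value",)

    def __init__(self, raw: str):
        # Pay for normalization once, at the boundary, and only if needed.
        if not unicodedata.is_normalized("NFC", raw):
            raw = unicodedata.normalize("NFC", raw)
        self.value = raw

    def __eq__(self, other):
        return isinstance(other, NFCText) and self.value == other.value

    def __hash__(self):
        return hash(self.value)

def same_text(a: NFCText, b: NFCText) -> bool:
    # Inside the interface no normalization is needed:
    # both arguments are already NFC by construction.
    return a == b

left = NFCText("e\u0301")
right = NFCText("\u00e9")
equal_inside = same_text(left, right)
```

Note that a guarantee like this has to be re-established after operations such as concatenation, since joining two normalized strings does not always yield a normalized string.
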