Comments on Case Misalignments
Date: Tue, 08 May 2001 12:47:05 -0700
From: Asmus Freytag <email@example.com>
Subject: Re: UTC Agenda Item: Case Misalignments
At 09:33 PM 5/7/01 -0700, you wrote:
>I would like the following document to be added to the docket for the next
I looked at your document and checked with existing implementations to see
how far they adhere to the capitalization that we imply in the
compatibility mappings. So far, I have found four deviations:
1) 2121 TEL
There are two variations of this symbol. One in CAPs and one in SMALLCAPS.
The latter is quite common and its decomposition would be T e l, not T E L.
(See Ken Lunde's book for more info)
2) 3396 SQUARE ML
This is consistently depicted as Ml (with script l) in Windows fonts that
show it. Ken Lunde's book shows it as ml (with script l) for the
KanjiTalk set. The use of the capital M seems a mistake if you consider
the rules of the SI, but it's hard to argue using SI rules here, since the
script l is also not SI conformant.
3) 33AC SQUARE GPA
This is consistently depicted as GPA (not GPa) in those fonts that show
it. If we assert that it is gigapascal, then by SI rules it would have to
be GPa. I have no idea whether GPA (as used in English) is a common enough
abbreviation in EA context, but the fact is that the fonts show it as GPA
not GPa. I have not been able to find examples in Ken's book.
4) 33D7 SQUARE PH
This is consistently depicted as pH (not PH) in those fonts that show it.
pH would be the correct spelling for pH values, ph would be the correct
spelling for the obsolete unit phot, and PH is neither. Asserting the
correctness of our decomposition PH would strike me as particularly
incongruous in this instance. Especially if you consider that Unicode 3.0
images this character as pH (!). In fact, had we not prematurely frozen
the compatibility mappings, we could have corrected that one.
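For anyone who wants to check the four mappings above against a current
database, here is a minimal Python sketch using the standard unicodedata
module. It reports whatever UCD version ships with the interpreter, which is
not necessarily the version under discussion here, and it says nothing about
how fonts draw the glyphs. The helper name compat_mapping is only an
illustration.

```python
import unicodedata

def compat_mapping(cp):
    """Return (tag, mapped string) from the decomposition field.

    Returns (None, "") when the character has no decomposition at all.
    The tag is None for a canonical decomposition.
    """
    field = unicodedata.decomposition(chr(cp))
    if not field:
        return None, ""
    parts = field.split()
    tag = None
    if parts[0].startswith("<"):
        # Compatibility mappings carry a tag such as <compat> or <square>.
        tag = parts.pop(0)
    return tag, "".join(chr(int(p, 16)) for p in parts)

# The four code points listed as deviations above.
for cp in (0x2121, 0x3396, 0x33AC, 0x33D7):
    tag, text = compat_mapping(cp)
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {tag} {text!r}")
```

Comparing the printed strings with the glyphs in the fonts at hand is then a
manual step.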
There is a danger here in going too far down the road of relying on the
compatibility mappings. When it comes to compatibility characters, and
most of the functionally cased characters are, implementors have happily
mapped the code provided to the best matching existing characters,
accepting a certain variability, and variance from the compatibility
mappings. Asserting specific behavior based on these mappings could create
I have also found some omissions:
i) 2114 L B BAR
This should probably be treated as if it decomposed to l b.
ii) 2118 CAPITAL SCRIPT P
This is the lowercase calligraphic p (Weierstrass symbol); it should be
treated as a lowercase form of p for sorting.
These seem to be based on the fact that the input to the definition is the
compatibility mappings, and that these characters do not have such a
mapping. [Aside: 210F has a rather unmotivated compatibility mapping to
0127: the former is not a font variant of the latter, but adds a stroke to
There may be other characters that would need to be treated 'as if' they
were cased in a particular way - finding missing ones is always harder
than reviewing the proposed set, of course.
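One way an implementation might handle such 'as if' treatment is a small
override table that is consulted before the compatibility mapping. The sketch
below is only an assumption about how that could look, in Python with the
standard unicodedata module. The names AS_IF, case_basis and
is_functionally_lower are hypothetical, and the two table entries merely
restate the omissions noted above.

```python
import unicodedata

# Hypothetical overrides for characters that lack a compatibility mapping.
AS_IF = {
    0x2114: "lb",  # L B BAR SYMBOL: treat as if it decomposed to l b
    0x2118: "p",   # SCRIPT CAPITAL P: treat as a lowercase form of p
}

def case_basis(ch):
    """String used to judge the case of ch: override first, else NFKD."""
    override = AS_IF.get(ord(ch))
    if override is not None:
        return override
    return unicodedata.normalize("NFKD", ch)

def is_functionally_lower(ch):
    """True if every letter in the basis string is lowercase."""
    letters = [c for c in case_basis(ch) if c.isalpha()]
    return bool(letters) and all(c.islower() for c in letters)

print(is_functionally_lower("\u2118"))
print(is_functionally_lower("\u2121"))
```

Keeping the overrides in a separate table leaves the frozen compatibility
mappings untouched and makes each exception easy to review on its own.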