Re: Question on U+33D7

From: Ken Whistler <kenw_at_sybase.com>
Date: Thu, 23 Feb 2012 17:37:23 -0800

On 2/23/2012 2:44 PM, António Martins-Tuválkin wrote:
>> It is defined as
>> "33D7;SQUARE PH;So;0;L;<square> 0050 0048;;;;N;SQUARED PH;;;;"
>> in UnicodeData.txt, but it is shown as "pH" in code chart. Should it be
>> "0070 0048" or "PH"?
> It should certainly be "pH", i.e., "<square>0070 0048</square>",
> because that's the peculiar casing in widespread (universal, really)
> use for this basic Chemistry concept (AFAIK it means "power of
> Hydrogen"). See <http://en.wikipedia.org/wiki/pH#History>.
>
> While there's no surprise at "PH" Unicode names being all caps, I’m
> surprised that the decomposition mapping is wrongly set to 0050 0048
> instead of to 0070 0048.

O.k., folks, I guess it's time for everybody to gather around the fire
for another episode of "Every Character Has a Story".

First, to answer Matt Ma's original question, no, the decomposition should
*not* be "<square> 0070 0048". The reason for that is simple: no matter what
the glyph looks like, or what people think the character might mean, the
decomposition mapping is immutable -- constrained by the stability
guarantees for Unicode normalization. U+33D7 had that decomposition mapping
as of Unicode 3.1, which defines the base for normalization stability, so
right or wrong, come hell or high water, it stays that way forever.
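
For anyone who wants to check the mapping on their own machine, here is a
minimal sketch using Python's standard unicodedata module. The choice of
Python and the helper name compat_decomposition are assumptions made for
illustration only; nothing in this thread prescribes them.

```python
import unicodedata

SQUARE_PH = "\u33d7"  # U+33D7 SQUARE PH


def compat_decomposition(ch: str) -> tuple[str, str]:
    """Split a UnicodeData-style mapping into (tag, decomposed text).

    Hypothetical helper: it only handles the "<tag> XXXX XXXX" and
    "XXXX XXXX" shapes returned by unicodedata.decomposition().
    """
    fields = unicodedata.decomposition(ch).split()
    tag = fields[0] if fields and fields[0].startswith("<") else ""
    code_points = fields[1:] if tag else fields
    return tag, "".join(chr(int(cp, 16)) for cp in code_points)


print(unicodedata.name(SQUARE_PH))              # the character name
print(compat_decomposition(SQUARE_PH))          # tag and mapped text
print(unicodedata.normalize("NFKC", SQUARE_PH)) # compatibility-normalized form
```

Because the mapping is frozen, the compatibility forms (NFKD and NFKC) of
U+33D7 come out as capital "P" plus capital "H", whatever the chart glyph
shows.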

But that raises the question of how it got to be that way in the first
place. To answer that, we have to dig deeper into the history of the
encoding.

If you will now pull down your copies of Unicode 1.0 off the shelf and
turn to p. 362, you will see that U+33D7 was included in Unicode 1.0. Lo
and behold, the
glyph shown in the charts for U+33D7 is "PH", with a capital "P", rather
than a lowercase "p". (The character was also named "SQUARED PH", rather
than the current "SQUARE PH", but the explanation for that will have to wait
for another evening.)

Unicode 1.0 didn't have any formal decompositions, but Unicode 1.*1* did.
In Unicode 1.1, on p. 75, the decomposition for U+33D7 is given as
"[0050] & [0048]", reflecting the glyph shown for the character in
Unicode 1.0.

It was Unicode 2.0 which changed the glyph for U+33D7 to "pH", on the
assumption that the character must have been intended as an East Asian
square symbol representation of the chemical symbol "pH". The
decomposition for U+33D7 was not adjusted, however, although its format was
shifted to "<square> + 0050 P + 0048 H" in the charts. Now tracking down
the details of the decision process that was involved in changing the glyph
for U+33D7 for Unicode 2.0 is pretty difficult. The development of the
suite of fonts for printing Unicode 2.0 was a pretty wild and woolly
process, as that was the first attempt to print the entire set of charts
with outline fonts. Unicode 1.0 had been printed with a bitmap font
developed at Xerox in the early, early days. Some of the glyph changes
between Unicode 1.0 and 2.0 "just happened", despite the care which was
taken to try to check everything.

I'm pretty sure that the glyph change for U+33D7 was discussed by the
editors at some point (in either late 1995 or very early 1996), but at that
stage in the development of the standard that kind of thing was usually not
recorded on an item-by-item basis. Remember, there was a *lot* going on
then which was much more important to the UTC than the glyph for some East
Asian compatibility character that nobody used: the design of UTF-16 for
example!

Speaking of use of the character, where *did* it come from exactly, and
what was it intended for? Well, that is also problematical. *Most* of the
characters in the CJK Compatibility block in the range U+3380..U+33DD can
easily be traced to KS X 1001:1992 (then known as KS C 5601) or CNS 11643.
But U+33D7, U+33D9, and U+33DA are anomalous. They didn't have any
mappings (that I knew about) as of Unicode 1.0. They may have come from
some early draft of a Korean standard, or from some Asian company private
registry of character extensions, or maybe just from a paper copy of
"character stuff" sitting around at Xerox circa 1989. Nobody really seemed
to be sure what they were -- they were just more ill-advised East Asian
squared abbreviation "dreck" that was added to the pile and not examined
very carefully, because everybody knew that such symbols for SI units (and
other scientific and math symbols of their ilk, such as "ln" for natural
logarithm) should just be spelled out with regular characters.

We can presume, in hindsight, that U+33D7 *may* have been originally
intended as an East Asian character set abbreviation symbol for the
chemical concept of "pH". U+33D9 was presumably intended for "parts per
million", although I don't recall that anybody has actually bothered to
think about that, and if they had, they might have suggested that the glyph
for *that* symbol also be changed, to the more usual lowercase "ppm". And
U+33DA "PR"? Who knows? My guess would be an abbreviation for "per radian",
as in 57.2957 degrees per radian, but your guess is as good as mine. I
suppose it could have been intended for a "picoroentgen", but that seems
even less likely. Maybe it was an escapee from a font for a periodic table
and was intended for "Praseodymium".
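
To see what the data files actually record for the three characters
discussed in this paragraph, a short sketch along the following lines can
be run against Python's standard unicodedata module. Again, Python and the
helper name describe are assumptions for illustration, not anything taken
from the standard or from this thread.

```python
import unicodedata


def describe(cp: int) -> str:
    """Format one code point as 'U+XXXX NAME: decomposition mapping'."""
    ch = chr(cp)
    return f"U+{cp:04X} {unicodedata.name(ch)}: {unicodedata.decomposition(ch)}"


# The three squared abbreviations discussed above.
for cp in (0x33D7, 0x33D9, 0x33DA):
    print(describe(cp))
    # NFKD applies the compatibility mapping, dropping the <square> tag.
    print("  NFKD:", unicodedata.normalize("NFKD", chr(cp)))
```

All three decompose to capital letters only, so the data itself offers no
hint as to whether "pH", "ppm", or something else entirely was meant.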

Beyond that, it is just turtles all the way down. And speaking of turtles...
But, no, that is also a topic for another evening. ;-)

--Ken
Received on Thu Feb 23 2012 - 19:40:51 CST
