Re: U+01AB LATIN SMALL LETTER T WITH PALATAL HOOK

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jul 07 1999 - 18:35:24 EDT


John Cowan asked:

>
> Why does this character not have a canonical decomposition into
> LATIN SMALL LETTER T and COMBINING PALATALIZED HOOK BELOW?
>
> This point appears in Jonathan's "Atomic Theory of Unicode"
> document.
>

Basically because it was a unique case. The cedillas and ogoneks
appear on many encoded characters, and are also independently
encoded as separate pieces in some character encodings -- including
bibliographic encodings in which they are intended to be used
as combining characters. This is why the canonical equivalences
were made for all the letters with cedillas and ogoneks.

But U+0321 COMBINING PALATALIZED HOOK BELOW is unique to Unicode.
And U+01AB LATIN SMALL LETTER T WITH PALATAL HOOK is also unique
to Unicode. Both are derived from Pullum and Ladusaw's discussion
of phonetic symbols. U+01AB is an obsolete form, of little use--
though it does appear occasionally in old works. The recommended
IPA diacritic to indicate palatalization is U+02B2 MODIFIER LETTER
SMALL J, so that a palatalized t would be represented as
U+0074 U+02B2, rather than U+01AB, as documented in the standard.
Americanist usage is U+0074 U+02B8. Slavicist usage is U+0074 U+02B9.
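
The contrast between the cedilla/ogonek letters and U+01AB can be checked against the character database. What follows is only a minimal sketch, assuming Python's standard unicodedata module, which ships data for a later version of the standard than the one under discussion; the variable names are purely illustrative.

```python
import unicodedata

# Letters with cedilla or ogonek carry a canonical decomposition.
cedilla_decomp = unicodedata.decomposition("\u00E7")  # c with cedilla
ogonek_decomp = unicodedata.decomposition("\u0105")   # a with ogonek
print(cedilla_decomp)  # base letter, then U+0327 COMBINING CEDILLA
print(ogonek_decomp)   # base letter, then U+0328 COMBINING OGONEK

# U+01AB has an empty decomposition field, so NFD leaves it untouched.
t_hook = "\u01AB"
t_hook_decomp = unicodedata.decomposition(t_hook)
t_hook_nfd = unicodedata.normalize("NFD", t_hook)
print(repr(t_hook_decomp), t_hook_nfd == t_hook)

# The recommended IPA spelling of a palatalized t is a two-character sequence.
ipa_palatalized_t = "\u0074\u02B2"
print([unicodedata.name(c) for c in ipa_palatalized_t])
```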

However, we were aware that t was not the only letter to which the
palatalized hook had ever been applied. In an obsolete Latin
transcription for palatalized Slavic consonants, Pullum and Ladusaw
cite letters showing the palatalized hook applied to p, b, m, f,
v, t, d, n, l, r, s, and z. Rather than also encode all these
extremely low-frequency, rather useless letters, we encoded the
palatalized hook as a combining mark, so that all of these forms
are representable in the standard.
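
To make the combining-mark approach concrete, here is a rough sketch (again assuming Python's unicodedata module; the names HOOK, bases and palatalized are just for illustration) that spells each of the cited letters as base + U+0321, and shows that normalization does not fold t + U+0321 into U+01AB.

```python
import unicodedata

HOOK = "\u0321"  # COMBINING PALATALIZED HOOK BELOW
bases = ["p", "b", "m", "f", "v", "t", "d", "n", "l", "r", "s", "z"]

# Each palatalized form is spelled as base letter + combining hook.
palatalized = {b: b + HOOK for b in bases}

for base, seq in palatalized.items():
    codes = " ".join(f"U+{ord(c):04X}" for c in seq)
    print(base, codes)

# Because U+01AB has no canonical decomposition, NFC does not turn
# t + U+0321 into U+01AB; the two spellings stay distinct.
t_seq = palatalized["t"]
t_nfc = unicodedata.normalize("NFC", t_seq)
print(t_nfc == t_seq, t_nfc == "\u01AB")
```

An implementation that wants to treat the two spellings of the hooked t as the same thing has to do so with its own folding, since normalization will not do it.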

The situation is somewhat similar to that for the retroflex hook,
U+0322, except that there are a number of current-use characters
with this diacritic, and only a few oddball unencoded forms to
which the diacritic has been applied in corpora.

What you are seeing is the explicit drawing of the line. We were
aware of all these relations 10 years ago when the repertoire was
first pulled together. We were aware of these relations in 1993
when the canonical equivalences were first formalized in Unicode 1.1.
The question came up: which attached diacritics below should be
included in canonical decompositions and which should not. An
explicit decision was made: include cedilla and ogonek; exclude
palatal and retroflex hooks. This was a matter of expediency and
likely usefulness, amidst a mass of other decisions that had to
be made for equivalences. But the line was drawn, and we've lived
with it since 1993. Was the decision completely consistent? Probably
not. Was the decision arbitrary? Somewhat. Can we expect that
anybody could have produced a non-arbitrary, consistent decision?
I doubt it.

In dealing with a standard as complex as one whose scope is *all*
of the writing systems of the world, living and dead, there are
going to be hundreds of little crannies where sweeping, principled
consistency is unattainable. Writing systems are just too damn
diverse for that. They are the accumulated results of thousands
of clever innovators over the centuries finding ways to modify marks
to express new distinctions in new contexts. If you cut the Unicode
Standard loose from that deep historical context of the development
of the writing systems, and also sever it from its recent legacy
context of existing character encodings, no doubt systematizers
can find more consistent ways to take it apart and put it all back
together again -- but that is the project for the *next* standard,
not for *this* standard.

In the meantime, for implementations' sake, it is more important
to hold to the stability of most of the somewhat arbitrary, but
explicit decisions that have been made in dealing with all the edge
cases.

To paraphrase Jimmy Breslin, there are 49,194 characters in the
standard, and every one has a story. You've just heard the story
of U+01AB.

--Ken

Jonathan Coxhead's complete discussion of this issue is:

> COMBINING PALATALIZED HOOK BELOW
> --------- ----------- ---- -----
>
> The following decomposition is missing. I imagine this is an error.
>
> LATIN SMALL LETTER T WITH PALATAL HOOK = LATIN SMALL LETTER T +
> COMBINING PALATALIZED HOOK BELOW

Discussed above.

>
> The absence of the following may also be an error---I don't know enough
> to be sure.
>
> LATIN CAPITAL LETTER N WITH LEFT HOOK = LATIN CAPITAL LETTER N +
> COMBINING PALATALIZED HOOK BELOW
> LATIN SMALL LETTER N WITH LEFT HOOK = LATIN SMALL LETTER N +
> COMBINING PALATALIZED HOOK BELOW

Not an error either. This is a good example of another kind of edge case.
The n with left hook, a standard representation for a palatal nasal, has
its diacritic conceptually related to the U+0321 palatal hook, but the
form with the hook on the *left* leg of the n is not the same as what you
get applying U+0321 to an n, which results in a left hook on the *right*
leg of the n. Those two forms are not interchangeable.
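
The character data agrees with this: no equivalence is recorded between the two forms. A small sketch, assuming Python's unicodedata module (variable names are illustrative only):

```python
import unicodedata

n_left_hook = "\u0272"        # LATIN SMALL LETTER N WITH LEFT HOOK
n_plus_hook = "n" + "\u0321"  # n + COMBINING PALATALIZED HOOK BELOW

# No decomposition is recorded for either case of the letter.
small_decomp = unicodedata.decomposition(n_left_hook)
capital_decomp = unicodedata.decomposition("\u019D")

# Normalization therefore never maps one spelling onto the other.
same_under_nfc = (unicodedata.normalize("NFC", n_plus_hook)
                  == unicodedata.normalize("NFC", n_left_hook))
print(repr(small_decomp), repr(capital_decomp), same_under_nfc)
```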

>
> COMBINING RETROFLEX HOOK
> --------- --------- ----
>
> Some decompositions involving this character are also missing:
>
> LATIN CAPITAL LETTER T WITH RETROFLEX HOOK (LATIN CAPITAL LETTER
> T)
> LATIN SMALL LETTER D WITH TAIL (LATIN SMALL LETTER D)
> LATIN SMALL LETTER EZH WITH TAIL (LATIN SMALL LETTER EZH)
> LATIN SMALL LETTER L WITH RETROFLEX HOOK (LATIN SMALL LETTER L)
> LATIN SMALL LETTER N WITH RETROFLEX HOOK (LATIN SMALL LETTER N)
> LATIN SMALL LETTER R WITH TAIL (LATIN SMALL LETTER R)
> LATIN SMALL LETTER T WITH RETROFLEX HOOK (LATIN SMALL LETTER T)
> LATIN SMALL LETTER Z WITH RETROFLEX HOOK (LATIN SMALL LETTER Z)

The correct list is:

01AE;LATIN CAPITAL LETTER T WITH RETROFLEX HOOK;Lu;0;L;;;;;N;LATIN CAPITAL LETTER T RETROFLEX HOOK;;;0288;
0256;LATIN SMALL LETTER D WITH TAIL;Ll;0;L;;;;;N;LATIN SMALL LETTER D RETROFLEX HOOK;;0189;;0189
026D;LATIN SMALL LETTER L WITH RETROFLEX HOOK;Ll;0;L;;;;;N;LATIN SMALL LETTER L RETROFLEX HOOK;;;;
0273;LATIN SMALL LETTER N WITH RETROFLEX HOOK;Ll;0;L;;;;;N;LATIN SMALL LETTER N RETROFLEX HOOK;;;;
027B;LATIN SMALL LETTER TURNED R WITH HOOK;Ll;0;L;;;;;N;LATIN SMALL LETTER TURNED R HOOK;;;;
027D;LATIN SMALL LETTER R WITH TAIL;Ll;0;L;;;;;N;LATIN SMALL LETTER R HOOK;;;;
0282;LATIN SMALL LETTER S WITH HOOK;Ll;0;L;;;;;N;LATIN SMALL LETTER S HOOK;;;;
0288;LATIN SMALL LETTER T WITH RETROFLEX HOOK;Ll;0;L;;;;;N;LATIN SMALL LETTER T RETROFLEX HOOK;;01AE;;01AE
0290;LATIN SMALL LETTER Z WITH RETROFLEX HOOK;Ll;0;L;;;;;N;LATIN SMALL LETTER Z RETROFLEX HOOK;;;;
02B5;MODIFIER LETTER SMALL TURNED R WITH HOOK;Lm;0;L;<super> 027B;;;;N;MODIFIER LETTER SMALL TURNED R HOOK;;;;

0322;COMBINING RETROFLEX HOOK BELOW;Mn;202;NSM;;;;;N;NON-SPACING RETROFLEX HOOK BELOW;;;;
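
For anyone reading these records by eye: the decomposition is the sixth semicolon-delimited field, and it is empty for every letter in the list except U+02B5, whose entry is a compatibility (<super>) mapping rather than a canonical one. A minimal sketch of pulling that field out, using three of the records above; parse_record and kind are hypothetical helper names, not part of any library.

```python
# Three records copied from the list above; the others have the same layout.
RECORDS = """\
01AE;LATIN CAPITAL LETTER T WITH RETROFLEX HOOK;Lu;0;L;;;;;N;LATIN CAPITAL LETTER T RETROFLEX HOOK;;;0288;
02B5;MODIFIER LETTER SMALL TURNED R WITH HOOK;Lm;0;L;<super> 027B;;;;N;MODIFIER LETTER SMALL TURNED R HOOK;;;;
0322;COMBINING RETROFLEX HOOK BELOW;Mn;202;NSM;;;;;N;NON-SPACING RETROFLEX HOOK BELOW;;;;
"""


def parse_record(line):
    """Split one semicolon-delimited record into the fields of interest."""
    fields = line.split(";")
    return {
        "code": fields[0],
        "name": fields[1],
        "category": fields[2],
        "combining_class": int(fields[3]),
        "decomposition": fields[5],  # empty string means no decomposition
    }


def kind(decomposition):
    """Classify a decomposition field as none, compatibility, or canonical."""
    if not decomposition:
        return "none"
    # A leading <tag> marks a compatibility mapping.
    return "compatibility" if decomposition.startswith("<") else "canonical"


parsed = [parse_record(line) for line in RECORDS.splitlines()]
for rec in parsed:
    print(rec["code"], rec["name"], "->", kind(rec["decomposition"]))
```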

By the way, the naming inconsistencies drive other systematizers nuts,
too. This set should all be named "XXX WITH RETROFLEX HOOK", but it
isn't. Why, you ask? Well, that's another story for another time...

>
> (The forms "with tail" are speculation on my part, but the visual
> appearances match. This seems to be enough for combining marks, as in
> the case of umlaut vs diaeresis.)
>


