From: Asmus Freytag (firstname.lastname@example.org)
Date: Mon Jan 09 2006 - 17:24:29 CST
On 1/9/2006 11:30 AM, Jukka K. Korpela wrote:
> On Mon, 9 Jan 2006, Kent Karlsson wrote:
>>> Theoretically, U+0132 is a compatibility character with U+0049 U+004A
>>> as the compatibility decomposition.
>> It has the *standardised* (non-theoretical) decomposition: <compat> 0049
> The word "Theoretically" meant that I first considered how things are
> in principle, by the Unicode standard.
"Is theoretically", and 'is defined as' are different, "in principle",
but it seems the latter is what you meant.
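For anyone who wants to check the standardised decomposition directly, here is a minimal sketch using Python's standard unicodedata module. The variable names are only illustrative, and the result reflects whatever Unicode version the local Python ships with.

```python
import unicodedata

IJ_UPPER = "\u0132"  # LATIN CAPITAL LIGATURE IJ

# The decomposition field as recorded in the Unicode Character Database.
decomp = unicodedata.decomposition(IJ_UPPER)
print(unicodedata.name(IJ_UPPER), "->", decomp)

# Canonical normalization leaves the character alone; only the
# compatibility forms (NFKD/NFKC) replace it with I + J.
nfd = unicodedata.normalize("NFD", IJ_UPPER)
nfkd = unicodedata.normalize("NFKD", IJ_UPPER)
print(nfd == IJ_UPPER, nfkd)
```

The <compat> tag is what makes this a compatibility mapping rather than a canonical one.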
>>> Being a compatibility decomposable
>>> character, it is not recommended except in the representation
>> No, it does not say that.
> "Compatibility decomposable characters are a subset of compatibility
> characters included in the Unicode Standard to represent distinctions
> in other base standards. They support transmission and processing of
> legacy data. Their use is discouraged other than for legacy data or
> other special circumstances."
> Definition D21 in section 3,
>> There are exceptions to that interpretation
>> of compatibility characters (and compatibility decomposable characters),
>> the IJ LIGATURE and the LONG S are among them. I think it is perfectly
>> fine to recommend their use in situations like this
> I think so too; we seem to agree on the practical point. But I
> discussed what the standard says (in a somewhat odd place, but the
> same general idea can be seen elsewhere in the standard, too).
I agree that the language as written is too strong. The problem is that
such statements are perfectly fine for a large set of these characters,
but totally inappropriate for the bulk of them - unless, that is, your
definition of 'special circumstances' is totally elastic.
The history of this statement is interesting. It was first introduced in
2.0, without any discouragement expressed. The latter was added in 3.0,
but in 4.0, the reservation for 'special circumstances' was added.
The number of compatibility characters in the standard has changed over time:
3.0.0 2237 (of which <font> 37, <super/sub> 63, <compat> 660)
3.1.0 3230 (of which <font> 1028, <super/sub> 63, <compat> 662)
3.2.0 3282 (of which <font> 1037, <super/sub> 64, <compat> 669)
4.1.0 3363 (of which <font> 1038, <super/sub> 124, <compat> 673)
4.1.0 3422 (of which <font> 1041, <super/sub> 169, <compat> 673)
5.0.0 beta 3424 (of which <font> 1043, <super/sub> 169, <compat> 673)
In other words, over time, about 1,000 new compatibility characters with
<font> type decompositions have been added and about 100 with
<super/sub>. These are characters that form an integral part of
mathematical and phonetic notation, a use that is certainly 'specialized'
compared to general text use, but perhaps ill-described by the use of
the term 'special circumstance' in the text.
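Counts of this kind can be approximated by tallying decomposition tags over the whole code space. The sketch below is only that: it uses Python's unicodedata module, which carries a later Unicode version than any listed above, so the numbers will not match the table, and count_compat_tags is a hypothetical helper name. How the <super> and <sub> tags were grouped for the figures above is an assumption on my part.

```python
import sys
import unicodedata
from collections import Counter

def count_compat_tags():
    """Tally characters by the <tag> that opens their decomposition field."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Canonical decompositions have no tag; compatibility ones start with '<'.
        if decomp.startswith("<"):
            tag = decomp.split(" ", 1)[0]
            counts[tag] += 1
    return counts

counts = count_compat_tags()
total = sum(counts.values())
print("Unicode", unicodedata.unidata_version, "compatibility decomposables:", total)
for tag in ("<font>", "<super>", "<sub>", "<compat>"):
    print(tag, counts[tag])
```

The total also includes tags not broken out above, such as <wide>, <narrow> and <circle>.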
>> ZWJ could be used to "recommend" the use of a typographic ligature, but
>> should not (IMO) be used to form *orthographic* ligatures
> Such a distinction does not exist in the Unicode standard, and as you
> mention, the IJ ligature would be a borderline case anyway.
Typographically there is a clear difference between a ligature and a
digraph. ZWJ, if implemented in a Latin rendering engine, would make the
engine look for a ligated glyph. If the font lied and presented the digraph
as a ligature, you might get what you want. However, since there is a
*code point* for the IJ, fonts would most likely not offer a ligature glyph.
Therefore, the use of ZWJ would have no effect, other than to introduce
potential problems in all those rendering engines that do not
support it for Latin.
> Especially considering the classification of the ij ligature as a
> letter in Dutch, we might say that it should really have been defined
> as a primary (non-compatibility) character, much the same way as the
> oe ligature and the ae ligature (which is now even called "letter ae",
> not a ligature, though it's still effective used as a ligature, too).
> But it's too late to change that now. (Maybe some official statement,
> constituting an explicit exception to the principle of avoiding
> compatibility decomposable characters, would be in order.)
The problem with the IJ is that you end up with both usages, as i+j will
give the intended result in many cases, and since an IJ key is lacking
on most keyboards, i+j is what people will enter. As the example shows,
i+j will not give the intended result in some cases, so people will use
ĳ or Ĳ (U+0133 or U+0132) to ensure that the case or spacing is what they
want. About the only thing that can be done is to document this thoroughly
so that search engines and databases can do the right thing. (For example,
I assume, but have not verified, that i+j and ĳ in fact sort the same in
the DUCET.)
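To make both halves of this concrete, here is a rough sketch in Python. It assumes a search system is willing to fold compatibility characters with NFKC plus case folding; search_key is a hypothetical helper, not an API of any particular search engine. It says nothing about DUCET ordering, which remains unverified.

```python
import unicodedata

def search_key(text):
    """Fold compatibility characters and case so both spellings match."""
    # NFKC maps U+0132/U+0133 to I+J / i+j; casefold() then removes case.
    return unicodedata.normalize("NFKC", text).casefold()

with_ligature = "\u0132sselmeer"   # single-character IJ
with_two_letters = "IJsselmeer"    # plain I + J
same_key = search_key(with_ligature) == search_key(with_two_letters)
print(same_key)

# Casing is where i+j goes wrong: default capitalization only touches
# the first letter, while the single character maps to its capital form.
two_letter_caps = "ijsselmeer".capitalize()
ligature_caps = "\u0133sselmeer".capitalize()
print(two_letter_caps, ligature_caps)
```

Folding at index and query time keeps the stored text untouched, so whichever spelling the author chose for case or spacing survives.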