Re: encoding ext Latin for PNG

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Mar 10 2000 - 15:16:37 EST


Peter,

>
> The first question relates to the characters 0x8D and 0x8E, L/l
> with equal sign overlay. These are not currently defined in
> Unicode, neither is there a combining equal sign overlay
> character. Would it be preferable to propose addition of one
> combining character or of a pair of composite characters (with
> no canonical decomposition)?

In my opinion, it is preferable to just ask for the two characters
as units. As John Cowan pointed out, the overlays are not used in
any canonical decompositions. And adding a double bar combining overlay
would just raise questions about equivalency for various currency
signs, for example.

The overlay diacritics were right on the fuzzy border area for
definition of canonical equivalence -- and the decision was made,
way back when, not to use them in canonical equivalences.
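
To illustrate the point for Latin letters, here is a minimal sketch using Python's standard unicodedata module. The choice of sample letters is an assumption made for illustration, and the results depend on the Unicode data shipped with the interpreter. It looks up the decomposition field of a few letters drawn with an overlaid stroke, bar, or tilde:

```python
import unicodedata

# Illustrative sample (an assumption, not an exhaustive list): Latin letters
# whose glyphs carry an overlaid stroke, bar, or tilde.
SAMPLES = ["\u0142", "\u00F8", "\u0111", "\u026B", "\u0289", "\u019A"]


def decompositions(chars):
    """Map each character to its decomposition field ('' when it has none)."""
    return {c: unicodedata.decomposition(c) for c in chars}


for ch, decomp in decompositions(SAMPLES).items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {decomp!r}")
```

Each of these letters reports an empty decomposition, so none of them is canonically equivalent to a base letter plus a combining overlay.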

>
> The second question relates to the following pairs of
> characters:
>
> 0x8F, 0x90 L/l with tilde overlay
> 0x9A, 0x9B U/u with middle bar
> 0xD0, 0xF0 L/l with middle bar
>
> For each of these pairs, the lower case character - and only
> the lower case character - is already defined in the standard:
>
> U+026B LATIN SMALL LETTER L WITH MIDDLE TILDE
> U+0289 LATIN SMALL LETTER U BAR
> U+019A LATIN SMALL LETTER L WITH BAR
>
> All three of these characters could potentially have canonical
> decompositions using existing characters, but in fact none of
> these three characters has a canonical decomposition.
>
> The upper case counterparts to all three could be encoded using
> combining sequences as follows:
>
> L with tilde overlay: 004C + 0334
> U with middle bar: 0055 + 0335
> L with middle bar: 004C + 0335
>
> (It's not entirely clear that U+0335 is the appropriate
> combining mark for the latter two; the distinction between
> U+0335 and U+0336 appears to be purely visual. U+0335 seems to
> me to be the better choice here. I think it would be good to
> clarify which should be used for cases like this.)

U+0335 is the one intended for the kind of letter overlay you get for
barred-i, barred-o, barred-u, barred-l, etc.
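
For reference, a small sketch (Python, standard unicodedata module) that prints the names and combining classes of the overlay marks discussed here and builds the barred sequences with U+0335. The helper name barred() is hypothetical and only serves the illustration:

```python
import unicodedata

# Overlay marks mentioned in this thread.
OVERLAYS = (0x0334, 0x0335, 0x0336)

for cp in OVERLAYS:
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)} ccc={unicodedata.combining(ch)}")


def barred(base):
    """Hypothetical helper: append U+0335 to a base letter."""
    return base + "\u0335"


# Capital U with bar and capital L with bar as combining sequences.
for base in ("U", "L"):
    seq = barred(base)
    print(" ".join(f"{ord(c):04X}" for c in seq))
```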

>
> The question is this: Is there any potential problem having a
> Ll character with no decomposition that gets case mapped to an
> Lu character that is defined only as a (decomposed) sequence?

Yes. A naive case mapping would result in U-bar (0055 + 0335) --> 0075 + 0335,
which would *look* like u-bar (0289) in a proper rendering, but not
be canonically equivalent to it.
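
The mismatch can be reproduced with a short sketch in Python. The function names are hypothetical, and the "naive" mapper is an assumption standing in for any implementation that lowercases one code point at a time:

```python
import unicodedata

U_BAR_SEQUENCE = "U\u0335"  # 0055 + 0335, capital U with combining bar overlay
SMALL_U_BAR = "\u0289"      # LATIN SMALL LETTER U BAR


def naive_lower(text):
    """Lowercase each code point independently (hypothetical naive mapper)."""
    return "".join(ch.lower() for ch in text)


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


lowered = naive_lower(U_BAR_SEQUENCE)
print(" ".join(f"{ord(c):04X}" for c in lowered))
print(canonically_equivalent(lowered, SMALL_U_BAR))
```

The lowered sequence is 0075 + 0335, and because U+0289 has no canonical decomposition, neither NFD nor NFC will ever bring the two spellings together.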

> The alternative would be to propose the upper case characters
> as additions to the standards, but if added they would
> certainly have to be added without canonical decompositions.

Yes, I think this is cleaner. It would be parallel to the way case
extensions for IPA characters adapted into African orthographies were
done. It would also result in less special treatment for case pairs,
which is likely to mean more correct behavior (in the long run, after
implementations supporting the new characters finally roll out).

>
> >By policy, Unicode doesn't do canonical decompositions that
> >involve overlays, probably because the exact position of the
> >overlay varies too much depending on the underlying letter.
>
> If there's a problem with positioning of overlays, does that
> constitute an argument for encoding LATIN CAPITAL LETTER U BAR,
> etc rather than encoding these as combining sequences? But
> then, what's the point of having the combining overlay
> characters at all?

The combining overlays were added to the standard *before* canonical
equivalence tables were proposed and worked out for all characters in
the standard. The implications of the combining overlays were not
clear to everyone at the time. And the interaction between canonical
decompositions and normalization was unknown. They were encoded
nevertheless since they do exist as diacritics for the Latin script.

It is, of course, possible to use them to represent characters -- like
the PNG examples -- that do not have unitary encoded characters. (But
with the kinds of case-mapping and normalization issues you are worried
about.)

At this point the main uses for a combining overlay character that
I know about are:

   A. Metacitation: "The character i-bar has a combining-bar-overlay."
        (You can do this with 0020 + 0335, for example.)

   B. Collation weighting: The overlay diacritics all get distinct
        secondary weightings in the collation tables. Particular
        tailorings of the table can choose to treat overlay diacritics
        for particular characters as if they were secondary, diacritic
        weights for generating collation keys. Having these characters
        weighted in the table makes this a little easier to do.

--Ken


