Re: Weighting ae and oe (was the unbelievably wandering Re: Decimal separator...)

From: Jim Allan (
Date: Tue May 20 2003 - 01:42:26 EDT


    Ken Whistler posted:

    > The difficulty for _ae_, which many people who opine about this
    > issue tend to overlook, is that the Unicode Standard also
    > includes, from Nordic standards, a number of accented _ae_
    > characters as precomposed characters. These make the table
    > considerably more complicated if the default treatment for
    > _ae_ is to weight it as an <a,e> sequence, since you then
    > have to figure out what to do with the accented forms, for which
    > you have just drained the base character weighting.
    > In any case, inconsistent as it is for these two characters,
    > the allkeys.txt table was constructed as it is for a reason,
    > (or several reasons, actually),
    > and I'm disinclined to suggest that its handling of _ae_
    > and _oe_ should be restructured, since that ripples out to
    > cause further destabilization of tailorings based on the
    > current values in the table.
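
    The accented forms mentioned above can be inspected in the character
    database. What follows is only a minimal sketch, assuming Python's
    standard unicodedata module; the helper name describe() is made up
    for illustration. It shows that each precomposed accented _ae_
    decomposes to the conjoined _ae_ plus a combining mark, not to _a_
    and _e_, so its collation depends on how the conjoined _ae_ itself
    is weighted.

```python
import unicodedata

# Precomposed accented ae characters already encoded in Unicode.
ACCENTED_AE = ["\u01E2", "\u01E3", "\u01FC", "\u01FD"]


def describe(ch):
    """Return (character name, canonical decomposition as hex code points)."""
    # describe() is a hypothetical helper, not part of any standard API.
    return unicodedata.name(ch), unicodedata.decomposition(ch).split()


for ch in ACCENTED_AE:
    name, parts = describe(ch)
    print(f"U+{ord(ch):04X} {name}: {' '.join(parts)}")

# The conjoined ae itself (U+00E6) has no decomposition at all,
# so this prints an empty string.
print(repr(unicodedata.decomposition("\u00E6")))
```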

    The webpage presents
    examples of various Old Norse letters/ligatures, many with diacritics,
    not yet implemented in Unicode, though a proposal is being drawn up (see

    Among these characters is the conjoined _oe_ character with both single
    acute and with double acute.

    There are also other two-character ligature-type combinations with
    diacritics centered over the total combined character.

    Though not counted as letters in any alphabet, except for the conjoined
    _ae_ and conjoined _oe_, they carry diacritics in a manner which
    indicates those who penned them considered them single characters.

    The easiest answer might be to count all these characters as full
    letters *of a kind*, a *kind* of letter not usually recognized as part
    of an alphabet, and not decomposable, first because Unicode desires no
    more decomposable characters, second because I don't think anyone wants
    to have to add rules for determining whether a combining acute accent
    following (for example) _aa_ ought to fall between the two letters or
    over the last one depending on whether the two characters are ligated.
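
    As a rough illustration of why such rules would be awkward: in
    current Unicode a combining mark simply attaches to the single base
    character before it. The sketch below assumes Python's standard
    unicodedata module; compose() is a made-up helper name.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT


def compose(text):
    """Normalize to NFC (hypothetical helper for this illustration)."""
    return unicodedata.normalize("NFC", text)


def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in text]


# Two separate a's plus an acute: the accent belongs to the second a only.
print(code_points(compose("aa" + ACUTE)))       # ['U+0061', 'U+00E1']

# Conjoined ae plus an acute: composes to the single character U+01FD.
print(code_points(compose("\u00E6" + ACUTE)))   # ['U+01FD']

# Conjoined oe plus an acute: no precomposed form, so it stays a sequence,
# but the accent still applies to the ligature as a whole.
print(code_points(compose("\u0153" + ACUTE)))   # ['U+0153', 'U+0301']
```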

    The characters, like the conjoined _oe_ in current Unicode, might still
    be identified as ligatures in their official names.

    As for the default sorting of these characters ...

    It would seem that there are four different kinds of Latin base letters
    of ligature form that can carry diacritics:

    1. Base letters that are almost never thought to be anything but a
    single letter: _w_ and some linguistic character combinations. These should
    sort according to some regular alphabetical sequencing (which for the
    linguistic characters is a somewhat arbitrary but reasonable arrangement
    first devised in Unicode itself).

    2. Base letters like the conjoined _ae_, conjoined _oe_ and U+0223 (in
    origin an _ou_ ligature) which in some environments are considered
    simple ligatures rather than letters and in some are not, which are
    sometimes counted as part of the alphabet by their users and sometimes
    are not, and which should sort accordingly. This means whatever Unicode
    decides for the default, tailoring will often be necessary for
    individual use.

    3. Base letters found in Old Norse which are never counted as part of an
    alphabet and sort as though broken into their parts(?). I suppose some
    such sort order as _aa_, _a_, _a_, conjoined _aa_, conjoined _aa_
    might make sense(?). I think might belong to this class.
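
    To make the difference between class 2 and class 3 concrete, here
    is a toy sketch of the two behaviours. It is not the Unicode
    Collation Algorithm and does not use allkeys.txt; the function
    names, the expansion table and the word list are all invented for
    illustration. One key sorts a ligature as though broken into its
    parts, with the ligature form used only as a tie-break; the other
    treats the ligatures as extra letters at the end of the alphabet,
    as some Nordic alphabets do with the conjoined _ae_.

```python
# Hypothetical expansion table: ligature -> the letters it sorts as.
EXPANSIONS = {"\u00E6": "ae", "\u0153": "oe"}


def key_as_parts(word):
    """Sort a ligature as its parts; ligature vs. plain is only a tie-break."""
    primary = "".join(EXPANSIONS.get(ch, ch) for ch in word)
    tie_break = tuple(1 if ch in EXPANSIONS else 0 for ch in word)
    return (primary, tie_break)


def key_as_letter(word, extra_letters="\u00E6\u0153"):
    """Tailored key: the extra letters sort after z (lowercase input only)."""
    alphabet = "abcdefghijklmnopqrstuvwxyz" + extra_letters
    return tuple(alphabet.index(ch) for ch in word)


words = ["\u00E6ble", "aeon", "afar", "zebra", "\u00E6on"]

# Ligature interleaved with plain a+e words:
print(sorted(words, key=key_as_parts))
# ['æble', 'aeon', 'æon', 'afar', 'zebra']

# Ligature treated as a separate letter after z:
print(sorted(words, key=key_as_letter))
# ['aeon', 'afar', 'zebra', 'æble', 'æon']
```

    Whichever behaviour is the default, the other has to be reachable
    by tailoring, which is the point made under class 2.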

    Something will eventually have to be defined to take account of this. :-0

    I hope there are not also cases of conjoined letters where a diacritic
    may sometimes be applied to one of the parts of the conjoined letters,
    and sometimes to the letter as a whole.

    Currently conjoined _ae_ is assigned to class 1 and conjoined _oe_ is
    assigned to class 3, allowing class 2 to be omitted.

    Class 3 is the difficult one.

    Jim Allan.

    This archive was generated by hypermail 2.1.5 : Tue May 20 2003 - 02:38:03 EDT