RE: Character identities

From: Kent Karlsson
Date: Thu Oct 31 2002 - 09:03:44 EST

    Let me take a few comparable examples:

    1. Some (I think font makers) a few years ago argued
       that the Lithuanian i-dot-circumflex was just a
       glyph variant (Lithuanian specific) of i-circumflex,
       and a few other similar characters.

       Still, the Unicode standard now does not regard those as
       glyph variants (anymore, if it ever did), and embodies
       that the Lithuanian i-dot-circumflex is a different
       character in its casing rules (see SpecialCasing.txt).
       There are special rules for inserting (when lowercasing)
       or removing (when uppercasing) dot-aboves on i-s and I-s
       for Lithuanian. I can only conclude that it would be
       wrong even for a Lithuanian-specific font to display an
       i-circumflex character as an i-dot-circumflex glyph,
       even though an i-circumflex glyph is never used for
       Lithuanian.

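       The Lithuanian rule mentioned above can be sketched in a
       few lines of Python (a deliberate simplification: only the
       "uppercase I followed by an above accent" case from
       SpecialCasing.txt, and only for an immediately adjacent
       combining mark; full implementations such as ICU handle
       the complete condition):

```python
# Sketch of the Lithuanian lowercasing rule from SpecialCasing.txt:
# lowercasing "I" inserts a COMBINING DOT ABOVE when an accent of
# combining class 230 follows. Simplified: only an immediately
# adjacent mark, and only a small illustrative set of accents.
COMBINING_DOT_ABOVE = "\u0307"

# Tiny illustrative subset of "above" accents (grave, acute,
# circumflex, tilde) -- not the full class-230 set.
ABOVE_ACCENTS = {"\u0300", "\u0301", "\u0302", "\u0303"}

def lower_lt(text):
    """Lowercase text, inserting COMBINING DOT ABOVE after 'i'
    when an uppercase I is directly followed by an above accent
    (the Lithuanian rule, simplified)."""
    out = []
    for i, ch in enumerate(text):
        if ch == "I" and i + 1 < len(text) and text[i + 1] in ABOVE_ACCENTS:
            out.append("i" + COMBINING_DOT_ABOVE)
        else:
            out.append(ch.lower())
    return "".join(out)

# Decomposed I-circumflex: U+0049 + U+0302 COMBINING CIRCUMFLEX ACCENT
print(lower_lt("I\u0302"))  # i + dot above + circumflex: "i\u0307\u0302"
```

       Note how the dot above survives as a separate character,
       which is exactly why i-dot-circumflex cannot be a mere
       glyph variant of i-circumflex.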
    2. The Khmer script got allocated a "KHMER SIGN BEYYAL".
       It stands (stood...) for "any abbreviation of the
       Khmer equivalent of 'etc.'"; there are at least four
       different abbreviations, much like "etc", "etc.", "&c",
       "et c.", ... It would be up to the font maker to decide
       exactly which abbreviation to show, and it would vary by font.

       However, it is now targeted for deprecation for precisely
       that reason: it is *not* the font (maker) that should
       decide which abbreviation convention to use in a document,
       it is the *"author"* of the document who should decide.
       Just as for the Latin script, the author decides how to
       abbreviate "et cetera". The way of abbreviating should stay
       the same *regardless of font*. Note that the font may be
       chosen at a much later time, and not out of any wish to
       change abbreviation convention. One may well want that
       convention to stay the same throughout a document even
       when using several different fonts in it, without having
       to carefully consider abbreviation conventions when
       choosing each font.

    3. Marco would even allow (by default; I cannot get away
       from that caveat since some (not all) font technologies
       do what they do) displaying the ROMAN NUMERAL ONE THOUSAND
       C D (U+2180) as an M, and it would be up to the font
       designer. While the code chart glyphs are informative, this glyphic
       substitution definitely goes too far. If the author
       chose to use U+2180, a glyph having at least some
       similarity to the sample glyph should be shown, unless
       and until someone makes a (permanent or transient)
       explicit character change.

    4. Some people write è instead of é (I claim they cannot
       spell...). So is it up to a font designer to display
       é as è if the font is made for a context where many
       people do not make the distinction? Can a correctly
       spelled name (say) be turned into an apparent misspelling
       just by choosing such a font? And would that still be a
       Unicode-conformant font?

    5. I can't leave out the ö vs. ø case; these are just different
       ways of writing "the same" letter; and it is not
       the case that ø is used instead of ö for any
       7-bit reasons. It is conventional to use ø for ö
       in Norway and Denmark for any Swedish name (or
       word) containing it. The same goes for ä vs. æ.
       Why shouldn't this one be up to the font makers too?
       If the font is made purely for Norwegian, why not
       display ö as ø, as is the convention? This is
       *exactly* the same situation as with ä vs. a^e.

    I say, let the *"author"* decide in all these cases, and
    let that decision stand, *regardless of font changes*.
    [There is an implicit qualification there, but I'm
    tired of writing it.]

    > Kent Karlsson wrote:
    > > > I insist that you can talk about character-to-character
    > > > mappings only when
    > > > the so-called "backing store" is affected in some way.
    > >
    > > No, why? It is perfectly permissible to do the equivalent
    > > of "print(to_upper(mystring))" without changing the backing
    > > store ("mystring" in the pseudocode); to_upper here would
    > > return a NEW string without changing the argument.
    > And that, conceptually, is a character-to-glyph mapping.

    Now I have lost you. How can it be that? The "print"
    part, yes. But not the to_upper part; that is a
    character-to-character mapping, inserted between the
    "backing store" and "mapping characters to glyphs".
    It is still an (apparent) character-to-character
    mapping even if it is not stored in the "backing store".
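    To illustrate (a minimal Python sketch; "to_upper" here is just
    str.upper, standing in for a real case mapping):

```python
def to_upper(s):
    """Character-to-character mapping: returns a NEW string,
    leaving its argument (the "backing store") untouched."""
    return s.upper()

mystring = "etc."          # the backing store
print(to_upper(mystring))  # prints "ETC." -- a transient, derived string
print(mystring)            # still "etc.": the backing store is unchanged
```

    The mapping is character-to-character in every sense that
    matters, even though no stored string is ever modified.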

    > In my mind, you are so much into the OpenType architecture,
    > and so much used
    > to the concept that glyphization is what a font "does", that
    > you can't view the big picture.

    Now I have lost you again. Some fonts (in some font
    technologies) do more than "pure" glyphization. This
    is why I have been putting in caveats, since many people
    seem to think that all fonts *only* do glyphization,
    which is not the case.

    But to be general I was referring to such mappings regardless
    of whether the mapping is built into some font (using character
    code points or, as in OT/AAT, using glyph indices) or (better)
    is external to the font.

    I was trying to use general formulations, but I cannot
    avoid having caveats for certain mappings that certain
    technologies do (since those are so popular). But I would
    agree that those particular forms of mappings *should not*
    be done by fonts (but they are), and instead be done
    external to the fonts (even when transient, as part
    of the "rendering"). An advantage would be that if
    a particular (named) mapping was asked for (to_upper say),
    it would be done the same way regardless of which font
    is chosen. But alas...
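
    As a sketch of that last point (all names here are hypothetical;
    "shape" merely stands in for the font's character-to-glyph step):

```python
# Hypothetical rendering pipeline in which named character-to-character
# mappings live OUTSIDE the font, so a mapping like "to_upper" behaves
# identically no matter which font is later chosen.
MAPPINGS = {
    "to_upper": str.upper,   # stand-in; real casing is locale-aware
    "identity": lambda s: s,
}

def shape(text, font):
    # Placeholder for the font's character-to-glyph mapping.
    return [(font, ch) for ch in text]

def render(text, font, mapping="identity"):
    # Apply the named mapping transiently; the caller's string
    # (the "backing store") is never modified.
    return shape(MAPPINGS[mapping](text), font)

# The character-level result of "to_upper" is font-independent:
a = render("etc.", "FontA", "to_upper")
b = render("etc.", "FontB", "to_upper")
print([ch for _, ch in a] == [ch for _, ch in b])  # True
```

    Only the glyphs differ between the two fonts; the characters
    fed to them are the same.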

                    Kind regards
                    /kent k

    This archive was generated by hypermail 2.1.5 : Thu Oct 31 2002 - 09:57:02 EST