RE: Character identities

From: Kent Karlsson (
Date: Tue Oct 29 2002 - 06:08:22 EST

    > -----Original Message-----
    > From: Marco Cimarosti []
    > Sent: den 28 oktober 2002 16:23
    > To: 'Kent Karlsson'; Marco Cimarosti
    > Cc:
    > Subject: RE: Character identities
    > Kent Karlsson wrote:
    > > > > For this reason it is quite impermissible to render the
    > > > > combining letter small e as a diaeresis
    > > >
    > > > So far so good. There would be no reason for doing such a thing.
    > > ...
    > > > > or, for that matter, the diaeresis as a combining
    > > > > letter small e (however, you see the latter version
    > > > > sometimes, very infrequently, in advertisement).
    > > >
    > > > This is the case I thought we were discussing, and it is a
    > > > very different case.
    > >
    > > No, the claim was that diaeresis and overscript e are the same,
    > The claim was that dieresis and overscript e are the same in *modern*
    > *standard* German. Or, better stated, that overscript e is
    > just a glyph
    > variant of dieresis, in *modern* *standard* German typeset in Fraktur.

    Well, we strongly disagree about that then. Marc and I clearly see them
    as different. More about this below.

    > Sorry if I haven't stated this clearly enough.

    You have several times. No need to emphasise it anymore. We still
    don't agree.

    > > Some of them (overscript e in particular) should be(come)
    > > quite commonly occurring in any Fraktur Unicode font.
    > "Commonly" sounds funny near "Fraktur"...

    We were talking about Fraktur fonts (which may not be all that

    > > > Using such a character to encode 21st century advertisements
    > > > is doomed to cause problems:
    > > >
    > > > 1) The glyph for U+0364 is more likely found in the font
    > > > collection of the
    > > > Faculty of Germanic Studies than on the PC of people wishing
    > > > to read the
    > > > advertisement for "Ye Olde Küster Pub". So, most people will
    > > > be unable to
    > > > view the advertisement correctly.
    > > >
    > > > 2) The designer of the advertisement will be unable to use
    > > > his spell-checker and hyphenator on the advertisement's text.
    > >
    > > Advertisements should invariably be final spell-checked and
    > > hyphenated by humans! Automated spell checkers and hyphenators
    > > for German (as well as Scandinavian languages) have (so far)
    > > not been good enough even for running text that you want to
    > > publish...
    > This has no connection with this discussion.

    Well, you brought it up. I'm usually rather picky about spelling,
    so a spell checker can only suggest "corrections", often to be
    rejected as wrong or even silly.

    > However, IMHO, the presence of U+0364 (COMBINING LATIN SMALL
    > LETTER E) in a
    > modern German or Swedish text is just a plain spelling error,
    > and even the
    > naivest spellchecker should flag it as such.

    So what? Naïve spell checkers flag all kinds of correctly spelled

    > > Most modern uses of Fraktur seem to use diaeresis or double
    > > acute for this.
    > U+0308 (COMBINING DIAERESIS) should be the only "umlaut" to
    > be found in
    > modern German text. What that diacritic *looks* like (two
    > dots, an "e", a
    > double acute, a macron, Mickey Mouse's ears), is a choice of the font
    > designer.

    Not quite. Please note that some characters are defined to have
    very specific glyphs, e.g. the estimated sign, where there is no
    shape variability except for size. Others are "glyphically
    allocated/unified", like the diacritics, and some glyphic
    variability is expected. But a diaeresis is two dots (of some
    shape, and it would be a marginal case to have them elongated),
    never a tilde, macron or overscript e. Those are other characters,
    not just a glyph variation. Other characters have more glyphic
    variability (informally) associated with them, like A, but some of
    them have compatibility variants that have a somewhat more
    restricted glyphic variability, like the Math Fraktur A in plane 1.

    Some scripts have by tradition some very "strong" ligatures;
    "strong" in the sense that it may be hard to recognise the ligated
    pieces in the resulting glyph. That does not mean that you can
    legitimately use an M glyph for One Thousand C D, just because
    they "mean" the same. Nor does it mean that diacritics can be
    substituted for each other, so that you ask for a diaeresis and
    get a tilde. Yes, it is common practice for many people to use a
    tilde instead of a diaeresis in handwriting, but that is still
    character substitution, not a glyphic variant (since that is the
    way diacritics are allocated in Unicode).
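
The character-level distinction argued for here can be checked against the Unicode Character Database. The following is a minimal sketch in Python using only the standard `unicodedata` module; the constant and variable names are illustrative and not taken from this thread. It shows that the two marks carry different names, and that canonical composition has a precomposed form for u with diaeresis but none for u with overscript e.

```python
import unicodedata

DIAERESIS = "\u0308"      # COMBINING DIAERESIS
OVERSCRIPT_E = "\u0364"   # COMBINING LATIN SMALL LETTER E

# Character names as recorded in the Unicode Character Database.
names = [unicodedata.name(c) for c in (DIAERESIS, OVERSCRIPT_E)]

# NFC composes u + diaeresis into the single code point U+00FC,
# but leaves u + overscript e as a two-code-point sequence.
u_umlaut = unicodedata.normalize("NFC", "u" + DIAERESIS)
u_over_e = unicodedata.normalize("NFC", "u" + OVERSCRIPT_E)

print(names)
print([f"U+{ord(c):04X}" for c in u_umlaut])
print([f"U+{ord(c):04X}" for c in u_over_e])
```

None of this says anything about what a font draws; it only confirms that the encoding treats the two marks as distinct characters.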

    > > (But the web designer could use a dynamically
    > > downloaded font fragment, if there is worry that all glyphs
    > > might not be supported by the fonts used by the vast majority
    > > of the target audience.)
    > This too has no connection with this discussion, and is OT. Unicode is
    > concerned with how text is *encoded*; the details of fonts and display
    > technology are out of scope.

    We were talking about fonts.

    > What Unicode really mandates is that the encoding should not change to
    > obtain a certain graphic effect.

    You can do any character mappings you like before you apply any
    font, or make it into graphics...
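
A character mapping applied before the font sees the text might look like the minimal Python sketch below. The function name `with_overscript_e` is hypothetical and not from this thread. The point is only that the substitution is an explicit mapping on a display copy, while the stored text keeps its diaeresis.

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"
COMBINING_SMALL_E = "\u0364"

def with_overscript_e(text):
    """Return a display copy with every diaeresis replaced by overscript e."""
    # Decompose first so that precomposed letters such as U+00FC expose
    # their combining diaeresis as a separate character.
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.replace(COMBINING_DIAERESIS, COMBINING_SMALL_E)

stored = "K\u00fcster"             # the encoded text, left untouched
shown = with_overscript_e(stored)  # what would be handed to the font
print([f"U+{ord(c):04X}" for c in shown])
```

This sketch replaces every diaeresis, including a true diaeresis as in "naïve"; a real mapping would need to know which marks are umlauts.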

    > > And overscript small e will also vary with the font,
    > > looking like a shrunken ordinary e glyph of (ideally) the same font.
    > > But never like two dots (in the default mode of a Unicode font).
    > You haven't yet defined your meaning of "Unicode font" and,
    > now, you add a
    > new fancy term: "default mode"!
    > What's a "default mode"? Unicode does not require fonts to
    > have any kind of
    > "modes". You seem to be talking about the "features", which
    > may exist in
    > *some* font technologies (e.g., OpenType), and are not a
    > requirement for
    > Unicode.

    I was trying to be general (not fancy) and not just talk about
    OpenType. But yes, I meant (at least) the case where no
    "features" (or similar) are invoked. But to be more precise,
    I would allow purely typographic "features" though, like
    degree of ligation, lowercase digits (sometimes incorrectly
    called "old form" digits), or different angles of acutes.

    What I was aiming at excluding were "features" that implicitly
    involve character mappings, like the "hist" someone mentioned,
    or "smallcaps" (which *implicitly* involves a mapping to uppercase,
    and then the use of x-height glyphs for the uppercase letters).
    Or any "feature" ("hist"?) that maps diaeresis to (say) overscript e.
    (I know, there is no literal character mapping involved in AAT or
    OT fonts; it either goes directly to glyph indices or maps glyph
    indices to glyph indices, but the *net effect* is a font-internal
    character-to-character mapping.)

    A font that by default (that is ordinary English, not a fancy
    term) maps lowercase letters to uppercase (or smallcap) glyphs
    is not a Unicode font (whatever the technology). If it by
    special invocation ("features", "modes", call-it-whatever) does
    an implicit (or explicit) character mapping, then that is what
    it does: a character mapping paired with a mapping to glyphs.
    Likewise, a font that (implicitly or explicitly) does other
    character-to-character mappings (like diaeresis to overscript e)
    should not do so by default ("in default mode") if it is a
    Unicode font.

    OT: Personally, I think it is a bad idea to try to make fonts
    do (in effect) character mappings (e.g. lowercase to uppercase
    for smallcaps). Those mappings, I think, should be done outside
    of fonts. But the contrary seems to be in fashion for certain
    mappings. They should not be done by default though.
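
As an illustration of doing the smallcaps mapping outside the font, here is a minimal Python sketch. The function name `smallcap_runs` and the style tags "small" and "normal" are hypothetical, invented for this example. The lowercase-to-uppercase mapping is made explicitly in the text layer, and the font is then only asked to draw uppercase letters at a smaller size.

```python
def smallcap_runs(text):
    """Split text into (string, style) pairs for a smallcaps rendering.

    Lowercase letters are mapped to uppercase here, outside the font,
    and tagged "small"; everything else passes through as "normal".
    """
    runs = []
    for ch in text:
        if ch.islower():
            # Explicit character mapping; note that str.upper() may
            # return more than one character (e.g. for U+00DF).
            runs.append((ch.upper(), "small"))
        else:
            runs.append((ch, "normal"))
    return runs

print(smallcap_runs("Ab"))
```

The renderer would then pick x-height glyphs for the runs tagged "small", with no case mapping hidden inside the font.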

    > > > graphic designer to change the *encoding* of their text in
    > > > order to get the desired result.
    > >
    > > A graphic designer is likely to turn the whole thing into 2-d
    > > or 3-d graphics, probably distorted, possibly animated, to get
    > > the desired result! At which point the original, or intermediary,
    > > encoding of any text elements is not very relevant to the
    > > end result.
    > We are not talking about printed text or picture containing
    > text: we are
    > talking about *electronic* text *encoded* in Unicode. Or else
    > we are OT.

    You were talking about "desired effect" in ads. That is often not
    achievable without involving graphics... (You brought that up, not me!)

    > Well, the first and only time I have seen that "Thousand C D"
    > was on the
    > Unicode charts... However, if I'd be asked which glyph is
    > more appropriate
    > for that character, I would say: the same as capital "M".

    No, definitely not! They look very different, and I am sure
    anyone (except you) using Thousand C D would never want it
    displayed as an M. (If so, then you've done a character mapping!
    Or perhaps you want to do a morphing ;-)
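
The difference between the two "thousand" characters is also visible in the character data. The following minimal Python sketch uses only the standard `unicodedata` module, with illustrative constant names. U+216F ROMAN NUMERAL ONE THOUSAND has a compatibility decomposition to the letter M, while U+2180 ROMAN NUMERAL ONE THOUSAND C D has none, although both carry the numeric value 1000.

```python
import unicodedata

THOUSAND_C_D = "\u2180"   # ROMAN NUMERAL ONE THOUSAND C D
THOUSAND = "\u216F"       # ROMAN NUMERAL ONE THOUSAND

name = unicodedata.name(THOUSAND_C_D)
value = unicodedata.numeric(THOUSAND_C_D)

# Compatibility normalization folds U+216F to "M" but leaves U+2180 alone.
folded_c_d = unicodedata.normalize("NFKC", THOUSAND_C_D)
folded_m = unicodedata.normalize("NFKC", THOUSAND)

print(name, value)
print(f"U+{ord(folded_c_d):04X}", folded_m)
```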

    > > > The difference must be preserved when it
    > > > is useful -- e.g., U+0308 should not look like U+0364 in a
    > >
    > > "should not" --> "must never"
    > OK. U+0308 must never look like U+0364 in a font designed for
    > publishing books on the history of German.

    Not only then.

    > However, this is just a requirement of common sense, *not* of
    > the Unicode Standard.

    Everything about Unicode and fonts is about common sense.
    It is very hard to make formal requirements in this area. So
    all requirements on fonts are informal, and are not rigidly stated.

    > Perhaps for an "Unicode font in default mode" all this is true.
    > You are the
    > only person who knows, since you seem the inventor of this term...

    It is plain English, Marco!

    > "When rendered in the context of a language or script,
    > like ordinary
    > letters, combining marks may be subjected to systematic
    > stylistic variation.
    > [...] U+030C COMBINING CARON is normally rendered as an
    > apostrophe when used
    > with certain letterforms. U+0326 COMBINING COMMA BELOW is
    > sometimes rendered
    > ("7.9 Combining Marks", "Glyphic Variation", page 180)

    These are particular forms of ligation, done for typographic
    reasons. Not at all like the cases we were talking about.
    (Replacing comma below by a cedilla (or the other way around)
    is a bad idea though. Users (apparently) care!)

    > "Each character in these code charts is shown with a
    > representative
    > glyph. ...

    Yes. But that does not mean you can use arbitrary glyphs.
    And, as I mentioned, some characters are glyphically more
    constrained than others. This does not rule out decorative
    or handwriting fonts in any way. But if you go too far,
    you're losing connection with Unicode. So if you make
    a font where diaeresis is mapped to Mickey Mouse ears, you
    should not call it a Unicode font, even if you can apply
    that font to Unicode text ("has a Unicode cmap" as it is
    called in some font technologies). (Snowcaps are easier
    to appreciate... ;-)

                    /Kent K
