RE: Character identities

From: Marco Cimarosti
Date: Tue Oct 29 2002 - 10:14:58 EST

    Kent Karlsson wrote:
    > > The claim was that dieresis and overscript e are the same in *modern*
    > > *standard* German. Or, better stated, that overscript e is just a glyph
    > > variant of dieresis, in *modern* *standard* German typeset in Fraktur.
    > Well, we strongly disagree about that then. Marc and I
    > clearly see them as different. More about this below.

    We could simply agree to disagree, were it not for the fact that each of us
    claims that the other's view violates the principles of Unicode.

    I have tried to show that glyphic variation is part of the principles of
    Unicode, as per TUS 3.0. You might wish to point us to where the current
    Unicode Standard supports your view, or contradicts mine.

    > > However, IMHO, the presence of U+0364 (COMBINING LATIN SMALL LETTER E) in a
    > > modern German or Swedish text is just a plain spelling error, and even the
    > > naivest spellchecker should flag it as such.
    > So what? Naïve spell checkers flag all kinds of correctly spelled
    > words!

    Yes but, IMHO, in this case they would be right: I have never heard that U+0364
    (COMBINING LATIN SMALL LETTER E) is part of the spelling of modern German or
    Swedish.
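
    A minimal sketch of the kind of check meant here, in Python with the
    standard unicodedata module. The helper name flag_combining_e and the
    policy of treating every U+0364 as an error are assumptions made for
    illustration; a real spell checker would of course do much more.

```python
import unicodedata

COMBINING_E = "\u0364"  # COMBINING LATIN SMALL LETTER E


def flag_combining_e(text):
    """Return (index, base letter) for each U+0364 in the NFD form of text.

    Hypothetical helper, not taken from any existing spell checker.
    """
    decomposed = unicodedata.normalize("NFD", text)
    hits = []
    for i, ch in enumerate(decomposed):
        if ch == COMBINING_E:
            base = decomposed[i - 1] if i > 0 else ""
            hits.append((i, base))
    return hits


print(unicodedata.name(COMBINING_E))
print(flag_combining_e("scho\u0364n"))  # o + U+0364: flagged
print(flag_combining_e("sch\u00f6n"))   # ö decomposes to o + U+0308: not flagged
```

    Note that the precomposed umlaut letters decompose to a base letter plus
    U+0308, never U+0364, so ordinary German text passes the check untouched.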

    > Not quite. Please note that some characters are defined to have
    > very specific glyphs, e.g. the estimated sign, there is no shape
    > variability except for size.

    A small set of *symbols*, like the estimated sign and some dingbats, are an
    exception to the rule that Unicode encodes characters, not glyphs.

    > Others are "glyphically allocated/
    > unified", like the diacritics, and some glyphic variability is
    > expected. But a diaeresis is two dots (of some shape, and it would
    > be a margin case to have them elongated), never a tilde, macron
    > or overscript e.

    Would you care to go to Germany and have a look at shop signs? The umlaut is
    more often a straight line than not. But this doesn't make it a "macron":
    there is no macron in German.

    > Those are other characters, not just a glyph variation.

    So I was wrong: German orthography uses macrons! Can you please explain the
    German pronunciation of "ā", "ō" and "ū"?

    > Other characters have more glyphic variability
    > (informally) associated with them, like A, but some of them
    > have compatibility variants that have a somewhat more restricted
    > glyphic variability, like the Math Fraktur A in plane 1.

    More *symbol* characters which escape the general rule.

    > Some scripts have by tradition some very "strong" ligatures;
    > "strong" in the sense that may be hard to recognise the ligated
    > pieces in the result glyph. That does not mean that you can
    > legitimately use an M glyph for One Thousand C D, just because
    > they "mean" the same.

    Perhaps. It could have been a poor example. But the opposite is much more
    important: you cannot use a character in place of another which "means" a
    different thing just because you want a different "look".

    > Nor does that mean that diacritics can be
    > substituted for each other, asking for a diaeresis and get a tilde.

    Substituting diacritics for each other is what *you* seem to suggest!

    > Yes, it is common practice with many to use a tilde instead of
    > a diaeresis in handwriting, but it is still character substitution,
    > not a glyphic variant (since that is the way diacritics are
    > allocated in Unicode).

    So, German orthography uses tildes too! Can you please explain the German
    pronunciation of "ã", "õ" and "ũ"?

    > > What Unicode really mandates is that the encoding should not change to
    > > obtain a certain graphic effect.
    > You can do any character mappings you like before you apply any
    > font, or make it into graphics...

    There can be no character-to-character mapping inside a font or a display
    engine! Applications are allowed to do character-to-character mappings only
    when they want to *change* the text in some way (e.g., a case conversion, a
    transliteration, etc.), not when they want to display it.

    Displaying Unicode only implies character-to-glyph mappings. Internally,
    there can be some glyph-to-glyph mapping, but never a character-to-character
    mapping. Even character-to-character mappings done on a temporary copy of
    the text are, conceptually, a step in the character-to-glyph mapping.
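
    To make the distinction concrete, here is a toy sketch in Python. The
    three "fonts", their glyph names and the function to_glyphs are all
    invented for this illustration; no real font format or rendering engine
    works exactly like this. The point is only that the stored characters
    stay the same while each font picks its own glyph for U+0308.

```python
# Toy "fonts": each maps a character to a glyph name. All glyph names and
# font tables here are invented for illustration only.
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

ANTIQUA = {"a": "a", DIAERESIS: "dieresis.dots"}
FRAKTUR = {"a": "a.fraktur", DIAERESIS: "dieresis.small_e"}
SHOP_SIGN = {"a": "a.sign", DIAERESIS: "dieresis.bar"}


def to_glyphs(text, font):
    """Character-to-glyph mapping: returns glyph names, never edits text."""
    return [font.get(ch, ".notdef") for ch in text]


encoded = "a" + DIAERESIS  # the stored text, identical for every font

for font in (ANTIQUA, FRAKTUR, SHOP_SIGN):
    print(to_glyphs(encoded, font))

# The encoding is unchanged by display: no U+0364, U+0304 or U+0303 appears.
assert encoded == "a\u0308"
```

    Changing the look means choosing another table, not another character.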

    This fundamental error runs through your whole post, and makes it
    impossible to go into the details without repeating over and over: you can't do
    any character-to-character mappings during display; you can't do any
    character-to-character mappings during display; you can't do any...

    > I was trying to be general (not fancy) and not just talk about
    > Opentype. But yes, I meant (at least) the case where no
    > "features" (or similar) are invoked.

    Who says that there are any "features" to be invoked? There is no
    such requirement in Unicode!

    > What I was aiming at excluding were "features" that implicitly
    > involve character mappings, [...]

    You see? "You can't do any character-to-character mappings during display."
    For simplicity, I will simply cut off all passages where you assume this.

    > A font that by default (that is ordinary English, not a fancy
    > term)

    Who tells you that our font has more than one "mode"? You are arbitrarily
    generalizing the architecture of *some* OpenType fonts.

    > > We are not talking about printed text or picture containing text: we are
    > > talking about *electronic* text *encoded* in Unicode. Or else we are OT.
    > You were talking about "desired effect" in ads. That is often not
    > achievable without involving graphics... (You brought that
    > up, not me!)

    No, Marc (with no final "-o") brought that up. To achieve the desired
    graphical effect, you choose a font which ensures that effect, possibly
    turning on any "feature" or "option" or "markup" that can help achieve
    this. You should NOT be forced to change the content of the text!

    > > However, this is just a requirement of common sense, *not* of
    > > the Unicode Standard.
    > Everything about Unicode and fonts is about common sense.

    No, sir. There is a big book and several technical reports that explain
    (among other things) how a font (or, more generally, a display engine)
    should behave in order to conform to the Unicode Standard. Many of
    these rules are not "common sense" at all: e.g., the separation between
    abstract characters and visible glyphs is one of the most counter-intuitive
    concepts I have ever met. But it is a key feature of Unicode, and you go nowhere
    with this standard if you refuse to understand or to consider it.

    > It is very hard to make formal requirements in this area. So
    > all requirements on fonts are informal, and are not rigidly stated.

    Many requirements are formally and explicitly described. Other things are
    formally and explicitly left unspecified. There certainly are some gray
    areas, but I feel that this issue is not one of them.

    > > Perhaps for an "Unicode font in default mode" all this true. You are the
    > > only person who knows, since you seem the inventor of this term...
    > It is plain English, Marco!

    Nope! The expression "Unicode font" is not to be found in any English
    dictionary, nor in the Unicode Standard. It is plain English only if you
    take it as a simple determiner+noun phrase, in which case it simply means
    "font used to typeset text encoded in Unicode".

    But you clearly don't use it in that plain sense, as demonstrated by
    assumptions such as: the glyphs in a Unicode font must be such-and-such; a
    Unicode font must have two or more "modes"; a Unicode font in its "default
    mode" is not allowed to do that-and-that...

    > > "When rendered in the context of a language or script, like ordinary
    > > letters, combining marks may be subjected to systematic stylistic variation.
    > > [...] U+030C COMBINING CARON is normally rendered as an apostrophe when used
    > > with certain letterforms. U+0326 COMBINING COMMA BELOW is sometimes rendered
    > > as U+0312 COMBINING TURNED COMMA ABOVE [...]"
    > > ("7.9 Combining Marks", "Glyphic Variation", page 180)
    > These are particular forms of ligation, done for typographic
    > reasons. Not at all like the cases we were talking about.

    They are examples of glyph variations for diacritic characters, exactly
    analogous to the case we are discussing: the different shapes that umlaut
    may take in German typography.
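
    The caron case from the quotation can be sketched the same way, again in
    Python. The set of base letters (d, t, l, L, as in Czech and Slovak
    typography), the glyph names and the function names are my assumptions
    for the example; the Standard only says "certain letterforms". The
    character stays U+030C throughout, and only the chosen glyph depends on
    the base letter.

```python
import unicodedata

CARON = "\u030c"  # COMBINING CARON

# Assumption for this sketch: bases whose caron is drawn apostrophe-like.
APOSTROPHE_BASES = set("dtlL")


def caron_glyph(base):
    """Pick a (hypothetical) glyph name for U+030C from its base letter."""
    return "caron.apostrophe" if base in APOSTROPHE_BASES else "caron.hacek"


def glyphs_for(text):
    """Return base letters and glyph names; the input text is not changed."""
    nfd = unicodedata.normalize("NFD", text)
    out = []
    for i, ch in enumerate(nfd):
        if ch == CARON and i > 0:
            out.append(caron_glyph(nfd[i - 1]))
        else:
            out.append(ch)
    return out


print(glyphs_for("\u010f"))  # d with caron
print(glyphs_for("\u010d"))  # c with caron
```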

    > > "Each character in these code charts is shown with a
    > > representative glyph. ...
    > Yes. But that does not mean you can use arbitrary glyphs.

    I don't know if it does not mean that. However, I am not talking about
    arbitrary glyphs, but about variant glyphs that can be seen in an hour's walk
    in any German city.

    > And, as I mentioned, some characters are glyphically more
    > constrained than others.

    Chapter, paragraph, page?

    _ Marco
