Re: What is the principle?

From: Philippe Verdy (
Date: Fri Mar 26 2004 - 16:31:28 EST

    From: "Arcane Jill" <>
    > Ignoring all compatibility characters; ignoring everything that has gone
    > before; and considering only present and future characters (that is,
    > characters currently under consideration for inclusion in Unicode, and
    > characters which will be under consideration in the future), which of
    > the following is the PRINCIPLE which decides whether or not a character
    > is suitable:
    > (A) A proposed character will be rejected if its glyph is identical in
    > appearance to that of an extant glyph, regardless of its semantic
    > meaning, or
    > (B) A proposed character will be rejected if its semantic meaning is
    > identical to that of an extant character, regardless of the appearance
    > of its glyph, or
    > (C) A proposed character will be rejected if either (A) or (B) are true, or
    > (D) None of the above
    > ?
    > Although this is a question about the future, no clairvoyance is
    > required, since I am asking about the principle behind decisions, not
    > about specific characters.

    Response (D), unambiguously. There is no normative glyph in Unicode: the
    standard only shows a single representative glyph to exhibit the identity of
    the character and to distinguish it from the other encoded characters in the
    same script (or sometimes in other scripts as well, but there are many
    counterexamples where even these representative glyphs for distinct
    characters will look the same).

    My opinion is that the main reason why a new "similar" character needs to be
    encoded is that the normative properties of the existing character do not fit
    some linguistic usages, or cause text in some language to be misinterpreted.
    Or the character's glyph was borrowed from another script which globally
    behaves very differently (see the various symbols or letters that look like a
    Greek uppercase Lambda but have very distinct histories of use, and very
    different applications and properties, and would be used inconsistently if they
    were simply borrowed from a foreign script without gaining a new identity in
    the new script).

    If something can't be corrected by adding more glyph substitution rules to
    fonts to render the text the way the authors want, or if basic text handling
    produces wrong results because of a normative behavior (for example Bidi
    properties, case mappings, decompositions and canonical reordering of
    diacritics), then new characters need to be added.
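
    To make the "normative behavior" above concrete, here is a minimal Python
    sketch using the standard unicodedata module. The characters chosen (Hebrew
    Alef, the combining acute accent and dot below) are only my own
    illustrations, not cases taken from the discussion.

```python
import unicodedata

# Bidi class: Latin letters are strong left-to-right ('L'),
# Hebrew letters are strong right-to-left ('R').
bidi_latin_a = unicodedata.bidirectional("A")
bidi_hebrew_alef = unicodedata.bidirectional("\u05D0")

# Canonical combining classes drive the canonical reordering of diacritics.
ccc_acute = unicodedata.combining("\u0301")      # COMBINING ACUTE ACCENT
ccc_dot_below = unicodedata.combining("\u0323")  # COMBINING DOT BELOW

# NFD sorts a run of combining marks by ascending combining class,
# so the dot below moves in front of the acute accent.
typed = "a\u0301\u0323"
reordered = unicodedata.normalize("NFD", typed)

# Canonical decomposition: precomposed e-acute becomes 'e' + combining acute.
decomposed_e_acute = unicodedata.normalize("NFD", "\u00E9")

print(bidi_latin_a, bidi_hebrew_alef, ccc_acute, ccc_dot_below)
print([hex(ord(c)) for c in reordered])
```

    None of these results depends on any font: they come from the character
    properties alone, which is why a glyph substitution rule cannot repair them.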

    Look, for example, at how the various D-with-stroke letters, which look very
    similar or identical in uppercase, are given distinct code points: this is
    needed because they have very distinct lowercase mappings, and because the
    lowercase versions should not be mixed as they have different identities.
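
    The distinct lowercase mappings can be checked with a short Python sketch. I
    am assuming the three letters meant here are U+00D0 (Eth), U+0110 (D with
    stroke) and U+0189 (African D); the variable names are purely illustrative.

```python
import unicodedata

# Three capitals whose glyphs are near-identical in many fonts
# (my assumption about which letters are meant).
capital_d_lookalikes = ["\u00D0", "\u0110", "\u0189"]

# Map each capital to its lowercase counterpart via the default case mapping.
lowercase_of = {c: c.lower() for c in capital_d_lookalikes}

for upper, lower in lowercase_of.items():
    print(f"U+{ord(upper):04X} {unicodedata.name(upper)}"
          f" -> U+{ord(lower):04X} {unicodedata.name(lower)}")
```

    Each capital maps to a different small letter, so unifying the capitals
    would make the lowercase mapping ambiguous.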

    Another example is some Greek letters whose letterforms were borrowed into
    Latin, but with distinct case mappings too: see the uppercase version of Latin
    Esh, which looks very similar or identical to the Greek uppercase Sigma.
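
    The same kind of check, sketched in Python for this pair. The code points
    U+01A9 (Latin capital Esh) and U+03A3 (Greek capital Sigma) are my reading
    of which characters are meant.

```python
import unicodedata

latin_capital_esh = "\u01A9"
greek_capital_sigma = "\u03A3"

# Default lowercase mappings of the two look-alike capitals.
esh_lower = latin_capital_esh.lower()
sigma_lower = greek_capital_sigma.lower()

print(unicodedata.name(latin_capital_esh), "->", unicodedata.name(esh_lower))
print(unicodedata.name(greek_capital_sigma), "->", unicodedata.name(sigma_lower))
```

    The Latin letter lowercases to the small Esh used in phonetic and African
    orthographies, the Greek one to small sigma, so each stays within its own
    script.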

    Another example is the new mathematical symbols, for which no case mappings
    are acceptable, as the lowercase and uppercase versions need to remain
    distinct symbols.
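
    A small Python sketch of that behavior, assuming the symbols in question are
    the mathematical alphanumerics (I picked U+1D400 and U+1D41A, the bold
    capital and small A, as examples):

```python
import unicodedata

math_bold_capital_a = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A
math_bold_small_a = "\U0001D41A"    # MATHEMATICAL BOLD SMALL A

# Neither symbol has a case mapping: case conversion leaves them unchanged.
lowered = math_bold_capital_a.lower()
uppered = math_bold_small_a.upper()

# By contrast, an ordinary Latin letter does change under case conversion.
plain_lowered = "A".lower()

print(unicodedata.name(lowered))
print(unicodedata.name(uppered))
```

    A case-insensitive comparison or a "convert to uppercase" command therefore
    cannot turn one mathematical variable into another.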

    We will probably soon see new characters added to Hebrew because of problems
    in the interpretation of Biblical texts, or simply because the need cannot be
    met with symbols or letters borrowed from other scripts, as those have the
    wrong character properties for use in Hebrew.

    Unicode just needs to encode what is needed to preserve the identity of the
    encoded text without losing parts of its semantics.

    Unicode will also make efforts to ensure that a single script is enough to
    represent a given language, at least at the lexical level (exceptions exist,
    for example Japanese, which mixes several scripts in the same text: Hiragana,
    Katakana and Han, but I think that this does not affect the lexical level), so
    that a text in some language need not mix characters from all blocks. This
    simplifies the work, as it reduces the number of code point blocks to support
    for a language (and I see it as a good reason why letters borrowed into
    romanized text from other scripts, such as Cyrillic and Greek, were added to
    the Latin block with separate code points).

    Maybe Unicode members have different views about it, but this seems to be what
    is needed to allow consistent handling of text in its encoded form, without
    reference to any graphical considerations such as glyph processing,
    positioning, or reordering, as this allows a renderer to use whatever font
    design respects the character identity (see the wide range of glyph styles
    which can exist in Latin or Arabic, for which a very rich and complex set of
    calligraphic designs has been created throughout centuries and millennia).
