Re: Missing capital H from Unicode range (see 1E96)

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Wed Jul 06 2005 - 15:45:48 CDT

    On Wed, 6 Jul 2005, Leiter Phelix wrote:

    >> I am seeking some advice on the use of a capital H with a bar under
    >> (see lowercase character 1E96).
    >>
    >> The character seems it ought to be a valid one and is used in Hefa in
    >> Israel - this is made more likely by the inclusion of the lowercase
    >> character in the Unicode range (1E96).

    As far as I know, the only documented usage for U+1E96 is in some
    transliteration systems for Semitic languages, such as transliteration of
    Arabic according to ISO 233. Although the Arabic script makes no case
    distinction, it is customary and normal to use mixed case in
    transliterated words and texts, using e.g. a capital letter at the start
    of a proper noun. Thus, I too find it strange that the corresponding
    capital letter has no code position in Unicode and that U+1E96 has no
    uppercase mapping.

    >> Could somebody please advise me:
    >> 1) how to construct the character by using floating marks - as my
    >> results do not provide as good a representation as the lowercase
    >> version

    As Clark Cox replied, U+0048 U+0331 is the Unicode representation of
    capital H with bar under, or actually with line below, to use the Unicode
    name*). You can write the character in Unicode even though it has no code
    position of its own, and the standard _could_ specify U+0048 U+0331 as
    the uppercase mapping of U+1E96. I don't understand why it doesn't.
    *) The naming is somewhat odd, because the combining diacritic U+0331
    is named "combining macron below".
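
    As a minimal sketch of the above (assuming Python 3 and its standard
    unicodedata module, neither of which is part of the original discussion;
    the variable names are only illustrative), the capital letter is written
    as two code points, and the character names can be looked up:

```python
import unicodedata

# Capital H with line below, written as a base letter plus a combining mark.
capital_h_line_below = "\u0048\u0331"

# List the code points in the sequence together with their Unicode names.
for ch in capital_h_line_below:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The precomposed lowercase letter says "line below" in its name,
# while the combining mark itself is called "macron below".
lowercase_name = unicodedata.name("\u1e96")
mark_name = unicodedata.name("\u0331")
print(lowercase_name)
print(mark_name)
```

    How the two-code-point string is displayed still depends entirely on the
    program and font used.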

    Anyway, when you use U+0048 U+0331, you are asking programs to construct a
    rendering by adding a line under "H", whereas for U+1E96, programs may
    use a glyph from a suitable font. So the rendering mechanisms can be
    rather different. In current software, the dynamic construction of
    characters with diacritic marks is usually of poor quality and does not
    really correspond to the Unicode standard's idea of such construction.

    You may well get different renderings by using U+1E96 and U+0068 U+0331,
    even though they are canonically equivalent. Programs typically render
    U+0068 U+0331 using their rather primitive method for dynamic
    construction, instead of recognizing the sequence as identical to U+1E96
    and using its glyph instead.
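
    The canonical equivalence mentioned above can be checked with a short
    sketch (again assuming Python 3 and its standard unicodedata module; the
    variable names are hypothetical). It only compares code point sequences
    and says nothing about how a given program renders them:

```python
import unicodedata

precomposed = "\u1e96"        # lowercase h with line below, one code point
decomposed = "\u0068\u0331"   # lowercase h + combining macron below

# Canonically equivalent strings become identical after normalization.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed
print(same_after_nfc, same_after_nfd)

# The capital sequence has no precomposed form, so NFC leaves it
# as two code points.
capital = "\u0048\u0331"
capital_nfc = unicodedata.normalize("NFC", capital)
print(len(capital_nfc))
```

    A program could normalize text to NFC before rendering so that the
    lowercase sequence picks up the precomposed glyph, but that would not
    help with the capital letter.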

    I'm afraid there is not much you can do, except perhaps try another
    program and/or another font, if possible. (In a simple test, I noticed
    that in Arial Unicode MS, U+0048 U+0331 looks rather bad, with the line
    under positioned too far to the left, whereas in Times New Roman it looks
    acceptable to me. Your mileage will most probably vary.)

    >> 2) why this character is not in the Unicode range or whether it is/has
    >> been considered for inclusion

    As I wrote, I don't know this piece of history. But given the fact that it
    has no code position now, it is very probable that it will not be added.
    The general policy is to avoid adding new precomposed characters; we are
    supposed to use combining diacritic marks instead.

    -- 
    Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
    

