BUG: Windows "Verdana font" and COMBINING DOT BELOW (was:Missing capital H from Unicode range (see 1E96))

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Aug 15 2005 - 16:42:24 CDT

    From: "Gregg Reynolds" <unicode@arabink.com>
    >> Off-topic note: did you notice the placement problem with the
    >> COMBINING DOT BELOW in the Verdana font on Windows XP, as shown in my
    >> previous message?
    > Yep. Verdana COMBINING DOT BELOW is definitely flaky. I looked in
    > MSWord and Babelpad.

    It's strange because I just noticed it today, and not before. It seems that
    there has been an update in my new distribution of Windows, or through
    Windows Update, because in the past I could see this COMBINING DOT BELOW
    correctly placed, and I used it to encode Latin-based African languages
    that make heavy use of consonants with dot below, where the precomposed
    character is most often absent from fonts, unlike the decomposed combining
    dot below.
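
    As a minimal sketch of the precomposed/decomposed distinction, the Python
    snippet below uses the standard unicodedata module. The helper name
    describe is only illustrative. It shows that <h, dot below> composes to a
    single code point under NFC, while a letter with no precomposed dotted
    form (c is used here as an example) stays as two code points.

```python
import unicodedata

# Decomposed form: base letter followed by U+0323 COMBINING DOT BELOW.
decomposed = "h\u0323"

# NFC composes the pair into one precomposed code point when one exists.
composed = unicodedata.normalize("NFC", decomposed)


def describe(text):
    """Return the Unicode name of every code point in text (illustrative helper)."""
    return [unicodedata.name(ch) for ch in text]


print(describe(decomposed))
print(describe(composed))

# Unicode has no precomposed "c with dot below", so NFC leaves two code points.
uncomposable = unicodedata.normalize("NFC", "c\u0323")
print(len(uncomposable))
```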

    Now if I look at some pages I composed in the past, I see that all these
    dots below consonants appear shifted under the following letter (for
    example a vowel, or even the word-separating space that follows a dotted
    word-final consonant). So the text of these pages is now broken. I am
    sure that I tested these pages in the past with Verdana, in addition to
    Arial, Times and Courier New.

    What is strange is that the Verdana font seems to correctly *center* the
    combining dot below the following character, so that the horizontal position
    of this dot depends on the width of the following character.

    For example if I code <h, dot below, o> or <h, dot below, i>, the dot
    appears under the center of o or i, not under the left of o, and not under
    the right of i or after it.
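
    For reference, a combining mark applies to the base character that
    precedes it, so the dot in <h, dot below, o> belongs to h. The sketch
    below groups a string that way using unicodedata.combining. The function
    name clusters is an assumption for illustration, and this is a simplified
    grouping, not full grapheme-cluster segmentation.

```python
import unicodedata


def clusters(text):
    """Group each combining mark with the base character that precedes it."""
    groups = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if unicodedata.combining(ch) and groups:
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


# The dot stays with "h"; "o" and "i" remain separate, undotted groups.
print(clusters("h\u0323o"))
print(clusters("h\u0323i"))
```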

    This means that the Verdana font was explicitly instructed to create a
    ligature of this combining dot and a base letter. But the combination was
    incorrectly encoded in the recent version, and the internal glyph
    composition tables are broken there.

    I think that Microsoft made changes to its Verdana font to support some
    other languages and mixed up this dot below with other combining dots below
    when it unified its glyph with those of other characters (for example with
    the glyph used for the Hebrew combining meteg point).

    This visual bug does not occur in Arial, Arial Unicode MS, Times New Roman,
    Tahoma, and Courier New.

    If you read a plain-text email in Outlook or Outlook Express (and probably
    other mail tools as well), the rendered text will be incorrect if you have
    set up your mail reader with Verdana as the default font for the Latin
    script (because it is more comfortable to read than the default Arial font).

    Unfortunately, Microsoft does not offer in Outlook or Outlook Express a way
    to temporarily select the font used to render emails that are in plain
    text or that do not specify a font. You have to set and save new
    preferences before reading such an email. The only thing that Microsoft
    and others propose is to select an alternate charset to decode the
    message. Why not have, in the same menu, an option to set another font for
    reading the email? For example, if the text appears unreadable because the
    default font does not render some characters correctly or lacks glyphs for
    them, selecting another font would solve the problem.

    For the same reason, I feel irritated when I have to reread an email or page
    and the mail reader or browser incorrectly re-guesses its encoding and
    reuses the default font. Why doesn't the email reader or browser keep
    these preferences attached to the email or page, as additional metadata in
    its local cache or mail storage?

    I also feel irritated when an all-English or all-French website is encoded
    only with ISO-8859-1, but does not specify it in the HTML or HTTP headers.
    When such a page contains VERY FEW non-ASCII letters (notably personal
    names containing vowels with diaeresis), IE for example will use its
    "autodetect mechanism" and will guess incorrectly that the page is encoded
    with Chinese GB2312: this may completely break the HTML structure, or the
    text will not be rendered correctly, showing ideographs instead of pairs
    or triplets of Latin-1 letters.
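
    The effect can be reproduced in a few lines of Python. This is only a
    sketch: the sample page text is made up, and Python's "gbk" codec is
    assumed here as a stand-in for the wrong Chinese guess (it does not
    reproduce what IE does internally). errors="replace" keeps the demo from
    raising on byte pairs that are invalid in that encoding.

```python
# A mostly-ASCII Latin-1 page with a single accented name (made-up sample).
page = "<p>Contact: Joël Martin</p>"
raw = page.encode("iso-8859-1")

# Decoding with the encoding the page was written in restores the text.
as_latin1 = raw.decode("iso-8859-1")

# Decoding the same bytes as a Chinese multibyte encoding does not: the byte
# for "ë" is treated as the start of a two-byte character.
as_chinese = raw.decode("gbk", errors="replace")

print(as_latin1 == page)
print(as_chinese == page)
print(repr(as_chinese))
```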
    The problem here is that the "autodetect" mechanism has overly lax detection
    thresholds: it can guess that the page is in Chinese because it has found
    just one apparent ideograph within a page that contains tens of kilobytes
    of plain ASCII. Although this is not strictly related to Unicode, it shows
    that the autodetection of encodings has been tuned mostly for Asian
    charsets, and not trained to support European charsets and languages
    (including ISO-8859-* encodings).
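
    A stricter threshold could look like the following sketch. Everything in
    it is hypothetical: the function name guess_encoding, the 5% default
    ratio, and the choice of candidate encodings are assumptions made for
    illustration, not a description of any real detector. The idea is simply
    to accept a multibyte Chinese guess only when non-ASCII bytes make up a
    meaningful share of the data, and to fall back to ISO-8859-1 otherwise.

```python
def guess_encoding(data, min_high_ratio=0.05):
    """Guess an encoding for bytes (hypothetical heuristic, not a real detector)."""
    high = sum(1 for b in data if b >= 0x80)
    if high == 0:
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Accept a multibyte guess only when non-ASCII bytes are common enough;
    # one stray accented letter in kilobytes of ASCII does not qualify.
    if high / len(data) >= min_high_ratio:
        try:
            data.decode("gbk")
            return "gbk"
        except UnicodeDecodeError:
            pass
    return "iso-8859-1"


mostly_ascii = ("x" * 10000 + "Joël").encode("iso-8859-1")
print(guess_encoding(mostly_ascii))
print(guess_encoding("中文".encode("gbk")))
```

    A real detector would also need per-language byte-frequency profiles; the
    ratio test only illustrates why a single suspicious byte pair should not
    be enough to switch to a Chinese charset.
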
    There is a real need to add non-Asian language/charset profiles to the
    encoding autodetection mechanism, and to review that mechanism (at least
    for correct determination of the encoding, even if some ambiguity remains
    about the actual language, which would require more advanced techniques
    such as lexical lookups).

    Until this happens, charset selection will remain a nightmare for users,
    and applications should behave more smartly by letting users select
    rendering preferences, including font and actual encoding, and by storing
    these preferences along with the page cache or mail store (as this cannot
    be a global configuration for all pages or emails).
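
    One possible shape for such per-resource preferences is sketched below.
    The names RenderPrefs and PrefStore, the "msg:..." identifiers, and the
    JSON serialization are all hypothetical; the sketch only shows global
    defaults combined with overrides keyed by page or message.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RenderPrefs:
    encoding: str
    font: str


class PrefStore:
    """Global defaults plus per-resource overrides (hypothetical design)."""

    def __init__(self, default):
        self.default = default
        self.overrides = {}

    def set_override(self, resource_id, prefs):
        # Remember the user's choice for this one page or message.
        self.overrides[resource_id] = prefs

    def lookup(self, resource_id):
        # Fall back to the global preference when no override was saved.
        return self.overrides.get(resource_id, self.default)

    def to_json(self):
        # A client could save this next to its page cache or mail store.
        return json.dumps({k: asdict(v) for k, v in self.overrides.items()})


store = PrefStore(RenderPrefs("windows-1252", "Verdana"))
store.set_override("msg:1234", RenderPrefs("iso-8859-1", "Arial"))
print(store.lookup("msg:1234"))
print(store.lookup("msg:9999"))
```
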
    Conclusion: user preferences are good for software accessibility, so that
    programs work the way users want for most of the content they handle, but
    these global settings cannot solve all problems. Internationalized
    software must be smarter and should provide ways for users to override
    their preferences for specific resources, and then remember these
    decisions as a way to effectively "train" the automatic mechanisms such
    programs offer.
