RE: Unicode biwidth fonts

From: Chris Pratley (chrispr@microsoft.com)
Date: Wed Dec 09 1998 - 15:14:58 EST


Just a note: it looks like the correct URL is:
http://www.unicode.org/unicode/reports/tr11/

Asmus, it is great that there is a report addressing this problem. It is one
of the bigger issues we had when implementing Unicode for Word97. One
particularly stark example I use is the horizontal ellipsis (U+2026). This
character is displayed quite differently in East Asian fonts from what is
used in Latin-script languages. East Asian users essentially treat the two
presentations of this character as two different characters, and tend to be
shocked if they casually apply an Asian font to some mixed Latin/Asian text
and find this glyph jumping to the other "character".

In Word we ended up addressing this issue of ambiguous characters by
defining a special "hint" property that is attached to the characters when
they are first input based on some heuristics. The heuristics include the
source legacy encoding (as you suggest), as well as things like the
preceding character (for the ellipsis). We feel it is important that these
characters be fixed unambiguously at entry time to avoid surprising the user
as I mentioned above. Unfortunately, the drawback is that we have to
maintain a rich format for the text in order to maintain this property. If
the text ever goes to plain Unicode, as it may in a copy/paste scenario,
then the property is lost. Likewise, opening a plain Unicode text file is a
problem - the only heuristic you can use is the user's locale, which is
rather vague. The text file may be entirely Cyrillic text, or simply have a
single Greek symbol (such as alpha) in the midst of a great amount of Asian
text.
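
The entry-time hint described above can be sketched roughly as follows. This
is only an illustration under stated assumptions, not Word's actual logic:
the names width_hint and EAST_ASIAN_ENCODINGS, the list of encodings, and the
order in which the heuristics are tried are all hypothetical. It relies on
Python's unicodedata.east_asian_width to find the ambiguous (category A)
characters.

```python
import unicodedata

# Hypothetical list of legacy East Asian encodings (Python codec names).
EAST_ASIAN_ENCODINGS = {"shift_jis", "euc_jp", "gb2312", "big5", "euc_kr"}


def width_hint(ch, prev_ch=None, source_encoding=None):
    """Return 'FW' or 'HW' for an ambiguous character, or None otherwise.

    The hint is computed once, when the character is entered, and is
    meant to be stored with the character as a rich-text property.
    """
    if unicodedata.east_asian_width(ch) != "A":
        return None  # not ambiguous, so no hint is needed

    # Heuristic 1 (assumed to take priority): the source legacy encoding.
    if source_encoding is not None:
        if source_encoding.lower() in EAST_ASIAN_ENCODINGS:
            return "FW"
        return "HW"

    # Heuristic 2: the preceding character, e.g. for the ellipsis.
    if prev_ch is not None:
        if unicodedata.east_asian_width(prev_ch) in ("W", "F"):
            return "FW"

    # Assumed default when nothing else is known.
    return "HW"
```

For example, width_hint("\u2026", prev_ch="\u3042") gives "FW" and
width_hint("\u2026", prev_ch="a") gives "HW". Once the text is reduced to
plain Unicode there is nowhere to keep the returned value, which is the
drawback noted above.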

Another approach would be to re-evaluate the heuristics at display time,
such as with the ellipsis. You could always check the preceding character
and if it was unambiguously FW or HW, use that type of font for the
ellipsis. This leads to undesirable behavior, however. You may have a
sentence of Japanese followed by a FW ellipsis. If the user types a single
HW character immediately before the ellipsis, they would be quite shocked to
see the ellipsis morph into another "character". This type of heuristic
fails for characters that are not so unambiguously connected with nearby
characters.
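
A minimal sketch of the display-time approach makes the "morphing" problem
concrete. The function name display_widths is hypothetical, and letting an
ambiguous character inherit the resolved width of whatever precedes it
(including another ambiguous character) is an assumption made here for
simplicity.

```python
import unicodedata


def display_widths(text):
    """Recompute 'FW'/'HW' for every character each time text is shown."""
    widths = []
    previous = "HW"  # assumed default at the start of the text
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            current = "FW"
        elif eaw == "A":
            current = previous  # ambiguous: follow the preceding character
        else:
            current = "HW"
        widths.append(current)
        previous = current
    return widths


# Japanese text followed by an ellipsis: the ellipsis resolves to FW.
before = display_widths("\u3042\u2026")
# The user types a single HW letter in front of the ellipsis.
after = display_widths("\u3042a\u2026")
# The same ellipsis character now resolves to HW.
ellipsis_changed = before[-1] != after[-1]
```

Nothing about the ellipsis itself was edited, yet its presentation changed,
which is the surprise described above.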

One other problem with implementation is that there is no easy and fast way
to know how a particular font implements these ambiguous characters. You could
query the width of all such characters, but the performance hit is too
large. Currently Word uses some heuristics based on the set of known East
Asian fonts for Windows which until recently exclusively used FW forms for
the ambiguous ranges, but this is starting to change as some newer East
Asian fonts implement the Cyrillic/Greek ranges of the legacy East Asian
encodings as HW characters. This is actually Unicode's influence I believe,
reflecting the very low usage of these FW forms in those East Asian
countries, and a growing amount of document interchange (especially web
pages) with Greek and especially Cyrillic-using colleagues who obviously are
using fonts and document layouts that assume HW for these characters. It may
be necessary to broaden the heuristic by looking at a single character in a
range and extrapolating from there.
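
As a rough sketch of that last idea, one could measure a single sample
character per range and cache the answer for the whole range. Everything
named here is an assumption: measure_advance stands in for whatever
font-metrics call the platform offers, the two ranges are approximate
illustrations, and treating "advance width at least one em" as full-width is
only one plausible test.

```python
# Illustrative, approximate ranges; a real list would follow the legacy
# East Asian encodings more closely.
AMBIGUOUS_RANGES = {
    "greek": (0x0391, 0x03C9),
    "cyrillic": (0x0410, 0x044F),
}


class FontWidthProbe:
    """Guess FW/HW per range for one font from a single measured sample."""

    def __init__(self, measure_advance, em_size):
        # measure_advance is a hypothetical callable: str -> advance width.
        self._measure = measure_advance
        self._em = em_size
        self._cache = {}

    def range_is_fullwidth(self, range_name):
        if range_name not in self._cache:
            first, _last = AMBIGUOUS_RANGES[range_name]
            # Measure only the first code point and extrapolate.
            advance = self._measure(chr(first))
            self._cache[range_name] = advance >= self._em
        return self._cache[range_name]

    def is_fullwidth(self, ch):
        """True/False for characters in a known range, None otherwise."""
        cp = ord(ch)
        for name, (first, last) in AMBIGUOUS_RANGES.items():
            if first <= cp <= last:
                return self.range_is_fullwidth(name)
        return None
```

This costs one width query per range per font instead of one per character,
at the price of being wrong for any font that mixes FW and HW forms within a
range.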

Anyway, it is an interesting topic and definitely a challenge for
implementers as you are well aware.

Chris Pratley
Microsoft Office Program Manager

-----Original Message-----
From: Asmus Freytag [mailto:asmusf@ix.netcom.com]
Sent: Tuesday, December 08, 1998 5:39 PM
To: Unicode List
Subject: Re: Unicode biwidth fonts

At 04:20 PM 12/8/98 -0800, Markus Kuhn wrote:

First of all, please reference the TR, not the DTR:

http://www.unicode.org/unicode/reports/r11.html

>I guess the best convention is to declare the characters of category W
>and F in Unicode Draft Technical Report #11 <http://www.unicode.org/
>unicode/reports/dtr11.html>, i.e. the characters 1100..11F9, 3000..30FE,
>3131..33FE, 4E00..9FA5, AC00..D7A3, E000..E757, F900..FA2D, FE30..FE44,
>FE49..FE52, FE54..FE6B, FF01..FF5E, FFE0..FFE6, to be wide characters
>occupying the space of 2 ASCII characters, and all the remaining Unicode
>characters are as wide as ASCII characters in a monospaced font.

I realize that a single world wide font is restrictive. But following your
convention could lead to some surprises, as an EA user would expect to see
all the category A characters rendered as double cells as well.
The only 'correct' answer would be to have two fonts, one with A as single
cell and the other with A as double cell and to switch between them
depending on whether you are in EA legacy mode or not.
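
In terms of cell counts, the two-font idea amounts to something like the
sketch below. The names cell_width, line_width and ea_legacy_mode are
hypothetical, the width classes come from Python's
unicodedata.east_asian_width rather than from the tables in the report as it
stood at the time, and combining and control characters are ignored for
brevity.

```python
import unicodedata


def cell_width(ch, ea_legacy_mode=False):
    """Number of terminal cells for ch: wide is 2, ambiguous depends on mode."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A":
        # Category A: double cell only when in EA legacy mode.
        return 2 if ea_legacy_mode else 1
    return 1


def line_width(text, ea_legacy_mode=False):
    """Total cells occupied by text under the chosen mode."""
    return sum(cell_width(ch, ea_legacy_mode) for ch in text)
```

With the string "\u3042\u03b1a" (a hiragana letter, a Greek alpha and an
ASCII letter), line_width gives 4 in the default mode and 5 in EA legacy
mode.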

>What I don't like about the tables in DTR11 is that class X is so large.

Class X has been removed.

>If there is a single code point free in the middle of a large number of
>W characters, then this code point should be reserved to future W
>characters and should not be listed as unassigned.

Unicode never assigns properties to unassigned code points.

>The distinction
>between narrow and wide characters should be possible efficiently in
>software in an if statement that fits on three lines in C without a
>table lookup. The given intervals should be a bit more generous to
>simplify implementation.

Implementations can simplify anything they want to, since EA Width is
_informative_ at this point.
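
For illustration, the interval list quoted earlier in this message can be
turned into a width test along these lines. This is a sketch that does use a
small table with a binary search, not the three-line if statement asked for;
the name is_wide_dtr11 is hypothetical, and the ranges are copied as quoted
from the draft, so they should not be taken as current data.

```python
from bisect import bisect_right

# W and F ranges exactly as quoted above from the draft report.
WIDE_RANGES = [
    (0x1100, 0x11F9), (0x3000, 0x30FE), (0x3131, 0x33FE),
    (0x4E00, 0x9FA5), (0xAC00, 0xD7A3), (0xE000, 0xE757),
    (0xF900, 0xFA2D), (0xFE30, 0xFE44), (0xFE49, 0xFE52),
    (0xFE54, 0xFE6B), (0xFF01, 0xFF5E), (0xFFE0, 0xFFE6),
]
_STARTS = [first for first, _last in WIDE_RANGES]


def is_wide_dtr11(code_point):
    """True if code_point falls inside one of the quoted wide ranges."""
    # Find the last range whose start is <= code_point, then check its end.
    index = bisect_right(_STARTS, code_point) - 1
    return index >= 0 and code_point <= WIDE_RANGES[index][1]
```

An implementation that prefers a few comparisons to a table could merge
neighbouring ranges, at the cost of treating the gaps between them as wide.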

A./


