Re: Unicode repertoire of X11 fonts

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Sun Dec 06 1998 - 06:37:03 EST


Erik van der Poel wrote on 1998-12-05 23:23 UTC:
> I have a question about your Unicode fonts for X. As you know, the X
> Windows font names (XLFD) end with the character encoding name. In your
> example, it says:
>
> -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1
>
> This tells us that the *encoding* is iso10646-1, presumably UCS-2.

Yes. (Minor terminology nitpicking: I think talking about UCS-2 doesn't
make sense in the context of a font, because glyphs in fonts have just
integer numbers and not byte sequences assigned to them, and UCS-2 (at
least for me) refers only to how to store a sequence of characters as a
byte sequence.)

> However, it does not tell us which *characters* (actually, glyphs) are
> included in the font.

Correct. When we had the discussion that led to the official
registration of *-iso10646-1 in the XLFD scheme, we specifically agreed
that X fonts with the properties

  CHARSET_REGISTRY "ISO10646"
  CHARSET_ENCODING "1"

should refer to *any* subset of ISO 10646-1. We left the option for future
standardization to indicate a subset (such as MES2) in the
CHARSET_ENCODING field.
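
As an illustration of how an application might pick these two fields out of a font name, here is a minimal Python sketch. It assumes a fully specified 14-field XLFD name with no wildcards; the helper names parse_xlfd and is_iso10646_1 are made up for this example and are not part of Xlib.

```python
# Field names as listed in the XLFD specification (FontName fields).
XLFD_FIELDS = (
    "FOUNDRY", "FAMILY_NAME", "WEIGHT_NAME", "SLANT", "SETWIDTH_NAME",
    "ADD_STYLE_NAME", "PIXEL_SIZE", "POINT_SIZE", "RESOLUTION_X",
    "RESOLUTION_Y", "SPACING", "AVERAGE_WIDTH",
    "CHARSET_REGISTRY", "CHARSET_ENCODING",
)


def parse_xlfd(name):
    """Split a fully specified XLFD name into a dict of its 14 fields.

    Hypothetical helper: it does not handle wildcards or names whose
    fields themselves contain a hyphen.
    """
    if not name.startswith("-"):
        raise ValueError("XLFD names start with '-'")
    parts = name[1:].split("-")
    if len(parts) != len(XLFD_FIELDS):
        raise ValueError("expected 14 fields, got %d" % len(parts))
    return dict(zip(XLFD_FIELDS, parts))


def is_iso10646_1(name):
    """True if the name claims the ISO10646 registry with encoding 1."""
    fields = parse_xlfd(name)
    return (fields["CHARSET_REGISTRY"].upper() == "ISO10646"
            and fields["CHARSET_ENCODING"] == "1")
```

Note that a positive answer only says the glyph numbering follows ISO 10646-1; it says nothing about which glyphs the font actually contains.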

> Is the application supposed to look at the per_char array of the
> XFontStruct to find out which characters are available? My guess is
> that's the only way...

You can do either this, or leave it to Xlib to display the DEFAULT_CHAR
specified in the font for all characters that are not in the font. The
DEFAULT_CHAR property in an X11 ISO 10646-1 font should always be set to
65533 (U+FFFD), so that the replacement character is displayed
automatically for characters whose glyphs are not available.
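
For those who want to probe the per_char array themselves, the lookup can be sketched as follows. This is only a Python model of the indexing rules documented for XFontStruct, not a real Xlib binding: FontInfo, has_glyph and displayed_code are names invented for the example, and the metrics tuple merely stands in for the XCharStruct members. It relies on the Xlib conventions that a nonexistent character has all of its metrics set to zero, and that a NULL per_char means every character in the index range exists.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (lbearing, rbearing, width, ascent, descent), standing in for XCharStruct.
Metrics = Tuple[int, int, int, int, int]


@dataclass
class FontInfo:
    """Hypothetical stand-in for the XFontStruct fields used below."""
    min_byte1: int
    max_byte1: int
    min_char_or_byte2: int
    max_char_or_byte2: int
    per_char: Optional[List[Metrics]]


def has_glyph(font, code):
    """Return True if the font has a glyph at character index `code`."""
    if font.min_byte1 == 0 and font.max_byte1 == 0:
        # Linear (single-row) indexing.
        if not font.min_char_or_byte2 <= code <= font.max_char_or_byte2:
            return False
        index = code - font.min_char_or_byte2
    else:
        # Matrix indexing: byte1 selects the row, byte2 the cell.
        byte1, byte2 = code >> 8, code & 0xFF
        if not (font.min_byte1 <= byte1 <= font.max_byte1
                and font.min_char_or_byte2 <= byte2 <= font.max_char_or_byte2):
            return False
        cols = font.max_char_or_byte2 - font.min_char_or_byte2 + 1
        index = ((byte1 - font.min_byte1) * cols
                 + (byte2 - font.min_char_or_byte2))
    if font.per_char is None:
        # No per-character table: every index in range exists.
        return True
    # A nonexistent character has all metrics set to zero.
    return any(font.per_char[index])


def displayed_code(font, code, default_char=0xFFFD):
    """Model the DEFAULT_CHAR fallback: which glyph index gets drawn."""
    return code if has_glyph(font, code) else default_char
```

The same arithmetic applies in C against a real XFontStruct; the model above just makes the two indexing cases easy to try out.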

We discussed using the CHARSET_ENCODING field to
indicate the minimum subset of ISO 10646-1 that has been implemented
(note that one documented use of this field is specifically to indicate
subsets of character sets).

For instance, we could propose the following hierarchy of Unicode
subsets

 set 1: CP1252 (218 characters)
 set 2: CP1252+ISO6937 (349 characters, covers ISO 8859-1/2/3/4/9/10/15)
 set 3: WGL4 (653 characters, covers in addition all non-Asian MS code pages)
 set 4: MES2 (~1000 characters, forthcoming CEN/TC304 WGL4 superset
              repertoire with polytonic Greek and more Cyrillic/Math/8859-14.)

plus of course also other subsets such as the JIS repertoire, etc.

Although I was originally enthusiastic about the idea, I am not sure any
more whether it is a good idea to encode the implemented subset this way
in the XLFD. In the end, I expect the users to select the character set
that covers a sufficient subset for their needs. For instance, the
normal European and American email user will usually be perfectly happy
with a font that covers WGL4 or, better, MES2, which is what I expect
every ISO 10646-1 font to cover soon anyway. Some academic email users will
want to have row 22 (math) and a few other symbols covered as well,
Japanese users will probably be happy with a WGL4+JIS subset, etc.

If the font in use already covers the subset that the user understands,
then the application can stay ignorant of the repertoire available in
the font and use just the X DEFAULT_CHAR mechanism for the rest. For
instance, my 6x13 font now contains much more than all level 1
characters that I can recognize. I am not able to read Tamil, Thai, or
Japanese, therefore there is no information loss for me if X replaces
these characters with the REPLACEMENT CHARACTER on display, as long as
operations like cut&paste preserve the information.

I hope that tools such as xfontsel will be extended to probe XFontStruct
and give the user an indication of which repertoire is fully implemented
(e.g., a CP1252, WGL4, or MES2 flag showing up in a corner if the
selected font covers this repertoire).

Who is currently maintaining xfontsel and xfd?

If any of you need compact data files on character set repertoires, I
can very easily create those for you with software that I wrote for
reviewing the CEN/MES project proposals. Just let me know.

E.g., the CP1252+ISO6937 repertoire is

# Rows Positions (Cells)

  00 20-7E A0-FF
  01 00-13 16-2B 2E-4D 50-7E 92
  02 C6-C7 D8-DD
  20 13-15 18-1A 1C-1E 20-22 26 30 39-3A AC
  21 22 26 5B-5E 90-93
  26 6A
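
A data file in the format above is easy to turn into a set of code points. The following Python sketch assumes each non-comment line holds a row number followed by single cells or inclusive cell ranges, all in hexadecimal; parse_repertoire and covers are names chosen for this example, not the software mentioned above.

```python
CP1252_ISO6937 = """\
# Rows Positions (Cells)

  00 20-7E A0-FF
  01 00-13 16-2B 2E-4D 50-7E 92
  02 C6-C7 D8-DD
  20 13-15 18-1A 1C-1E 20-22 26 30 39-3A AC
  21 22 26 5B-5E 90-93
  26 6A
"""


def parse_repertoire(text):
    """Return the set of code points described by a rows/cells listing."""
    chars = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        row, *cells = line.split()
        base = int(row, 16) << 8  # row number is the high byte
        for cell in cells:
            first, _, last = cell.partition("-")
            lo = int(first, 16)
            hi = int(last, 16) if last else lo
            chars.update(range(base + lo, base + hi + 1))
    return chars


def covers(available, repertoire):
    """True if every character of `repertoire` is in `available`."""
    return repertoire <= available


repertoire = parse_repertoire(CP1252_ISO6937)
print(len(repertoire))  # 349, matching the count given for set 2
```

A tool like xfontsel could build the `available` set by probing the font and then call covers() once per known repertoire to decide which flags to show.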

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


