Re: [long] Use of Unicode in AbiWord

From: Erik van der Poel
Date: Thu Mar 18 1999 - 13:26:10 EST

"Eric W. Sink" wrote:
> We know that we need the notion of a font list, as opposed to just "a
> font". In fact, since a lot of our design inspiration came from the
> CSS2 spec, a "font list" has been in our plans all along. However, we
> haven't really figured out yet how to expose such a notion in the GUI.
> Users are accustomed to a paradigm where they choose a single font for
> a single piece of text. If anyone has any ideas or examples of how to
> do this in a non-user-scary fashion, we'd appreciate info or pointers.

Yes, this is a really hard problem. We (Mozilla) want to keep the UI
really simple, but at the same time, we want to give the user some
control over the font lists.

We do have an example of a priority list in our UI already: the
Languages list for the HTTP Accept-Language request header. You can find
it in Netscape under Edit.Prefs.Navigator.Languages. This is one way of
presenting a list that can be reordered, augmented, etc.

However, for fonts, we have a rather large number of lists, because
there are two axes (font type and language, where font type is serif,
sans-serif and the other CSS2 generics, and language is e.g. zh, ja, ko
(for Han disambiguation)). I don't know whether the new Mozilla (5.0)
should present all this in the "normal" UI, or in some relatively hidden
part of the UI for "Advanced" users, or in some configuration file.
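
To make the two axes concrete, here is a rough sketch (in Python, purely
for illustration) of a table keyed by generic font type and language.
The font names, the FONT_LISTS table and the font_list() helper are all
hypothetical; this is not how Mozilla stores its preferences.

```python
# Hypothetical font names; a real table would come from user preferences.
FONT_LISTS = {
    ("serif", "ja"): ["MS Mincho", "Times New Roman"],
    ("serif", "zh"): ["SimSun", "Times New Roman"],
    ("sans-serif", "ja"): ["MS Gothic", "Arial"],
    ("serif", None): ["Times New Roman"],       # language-neutral fallback
    ("sans-serif", None): ["Arial"],            # language-neutral fallback
}


def font_list(generic, lang=None):
    """Return the prioritized font list for a generic family and language."""
    # Reduce a tag such as "ja-JP" to its primary subtag "ja".
    primary = lang.split("-")[0].lower() if lang else None
    # Try the language-specific list first, then the neutral one.
    for key in ((generic, primary), (generic, None)):
        if key in FONT_LISTS:
            return FONT_LISTS[key]
    return []
```

Even this toy table has one list per (font type, language) pair, which
is why exposing all of them in a simple UI is awkward.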

> On Windows, if the current font is SHIFTJIS_CHARSET,
> then the text passed to TextOut() needs to be in that encoding.
> Likewise, in an X11 world, XDrawString[16] expects the text to match
> the encoding of the font in the GC. (Somebody tell me if I've still
> got this wrong).

Yes, for X there is a single encoding, indicated at the end of the long
"XLFD" font name.
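
As an illustration, the last two of the 14 XLFD fields are the charset
registry and the charset encoding. A minimal sketch of pulling them out
(xlfd_charset() is a made-up helper, and it assumes a fully specified
name with no wildcards):

```python
def xlfd_charset(xlfd):
    """Return (charset_registry, charset_encoding) from a full XLFD name."""
    fields = xlfd.split("-")
    # A full XLFD name starts with "-" and has 14 fields, so splitting
    # on "-" yields one empty string followed by the 14 fields.
    if len(fields) != 15 or fields[0] != "":
        raise ValueError("not a fully specified XLFD name: %r" % xlfd)
    return fields[-2], fields[-1]


name = "-misc-fixed-medium-r-normal--14-130-75-75-c-140-jisx0208.1983-0"
registry, encoding = xlfd_charset(name)
```

For the name above this gives the registry "jisx0208.1983" and the
encoding "0", so text drawn with that font would have to be in JIS X
0208.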

Windows actually provides multiple ways to access fonts, and the details
are rather OS-version specific. NT4 has pretty good support for Unicode.
You can either pass 16-bit Unicode to the "W" functions (e.g.
ExtTextOutW), or multibyte text such as Shift-JIS to the "A" functions.
The regular ExtTextOut is #define'd to be the "W" or "A" version
depending on the "UNICODE" (and "_UNICODE"?) #define's. I'm skimming
over the surface here. There are many more details.
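
To show what differs between the two paths, here is the same text in
the two forms (Python is used only to display the data; it does not call
any Windows API). I am assuming code page 932 as the Shift-JIS variant
for the "A" side and UTF-16 little-endian for the "W" side:

```python
text = "\u65e5\u672c\u8a9e"  # the three characters of "Nihongo"

# Multibyte form, as an "A" function would expect on a Japanese system
# (assumption: code page 932, Microsoft's Shift-JIS variant).
multibyte = text.encode("cp932")

# 16-bit Unicode form, as a "W" function would expect
# (assumption: UTF-16 little-endian, without a byte order mark).
wide = text.encode("utf-16-le")

# Both forms decode back to the same characters.
same = multibyte.decode("cp932") == wide.decode("utf-16-le")
```

Here both buffers happen to be six bytes long, but the byte values are
unrelated, so handing one form to the other kind of function produces
garbage.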

> We also think that we should switch our representation to UTF-8. On
> every platform we currently plan to support, this would eliminate the
> encoding conversion step (as well as a lot of memory usage) for any
> run of text which includes only ASCII characters. For obvious
> reasons, and with no offense intended to the majority of the world who
> primarily use double-byte encoded characters, we believe this to be a
> common case worth optimizing.

I think using 16-bit Unicode is better. UTF-8 is harder to move around
in. However, the new Mozilla has taken a different approach. I still
consider it debatable (it was conceived by somebody else) but it
basically uses 8-bit strings when the text is US-ASCII (< 128) and
16-bit strings otherwise. This does allow some memory and speed
optimization, at the cost of added code complexity.
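
A rough sketch of the idea, not Mozilla's actual code (HybridString and
its methods are invented for this example): keep one byte per character
when everything is below 128, and 16-bit units otherwise. Either way the
units are fixed width, so indexing stays a simple multiplication, which
is the part that is harder with UTF-8. Surrogate pairs are ignored here.

```python
class HybridString:
    """Stores ASCII-only text as 8-bit units, anything else as 16-bit units."""

    def __init__(self, text):
        # str.isascii() needs Python 3.7 or later.
        self.is_ascii = text.isascii()
        self._codec = "ascii" if self.is_ascii else "utf-16-le"
        self.data = text.encode(self._codec)

    @property
    def unit_size(self):
        """Width of one code unit in bytes."""
        return 1 if self.is_ascii else 2

    def __len__(self):
        """Number of code units (not bytes)."""
        return len(self.data) // self.unit_size

    def unit_at(self, index):
        """Return the code unit at index by direct offset arithmetic."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.unit_size
        return int.from_bytes(self.data[start:start + self.unit_size], "little")

    def __str__(self):
        return self.data.decode(self._codec)


narrow = HybridString("AbiWord")         # 7 bytes of storage
wide = HybridString("Abi\u65e5\u672c")   # 10 bytes of storage
```

The added complexity shows up as soon as two such strings are
concatenated or compared, since every operation has to handle both
widths.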

> Obviously, we're now going to need a whole bunch of code to convert
> back and forth between various encodings.

Owen Taylor (GTK) and I have also been discussing this. Mozilla 5.0 is
working on a bunch of Unicode converters. GTK is also working on Unicode
support for version 1.3. If both Mozilla and GTK have these large
tables, it would be wasteful. So it might be worth coming up with an
Open Source Unicode conversion library that can be shared by all these
Open Source projects.

> 1. Mozilla. Reading this code base is interesting, but the license
> restrictions of the NPL prevent us from using any of the actual
> code. Furthermore, it looks like Mozilla's i18n strategy for
> document content is not Unicode, but rather, a representation
> which supports a variety of encodings with tags for same.

You may have been looking at the old Mozilla code base. It has been
abandoned for the new effort, centered around "NGLayout". This new code
base is based on 16-bit Unicode internally (except for the US-ASCII hack
I mentioned earlier).

I have implemented Unicode and font lists on Windows for Mozilla 5.0 but
the code hasn't been checked in yet. I'm currently working on the same
for X. Slightly out-of-date documentation can be found at:

Erik van der Poel
Netscape, Client I18N
