[long] Use of Unicode in AbiWord

From: Eric W. Sink (eric@postman.abisource.com)
Date: Thu Mar 18 1999 - 10:44:39 EST


                        17 March 1999
                        Eric W. Sink


The purpose of this note is to communicate our plans and approach for
using Unicode within the design of AbiWord. The goal is to obtain
feedback from people knowledgeable about Unicode, in order to refine
our approach to best support future internationalization efforts.

This note is being sent to the abiword-dev mailing list, as well as to
certain other places where Unicode gurus are likely to encounter it.
Given the diversity of the audience for this note, we'll start with a
few definitions:

"AbiWord" -- an Open Source (GPL), cross-platform word processor.
It's currently at version 0.5.1. Source code and lots of related info
are available at www.abisource.com.

"Eric Sink" -- The author of this note. A person who is constantly
learning about i18n and Unicode, but nonetheless feels that each new
bit of gained knowledge moves him no closer to knowing anything
useful. :-)

"Unicode" -- If you don't know what Unicode is, none of the rest of
this note will make any sense.

"/dev/null" -- The place to send any flames regarding this note. Over
time, we want AbiWord to be as internationalized as possible, and we
want to make sure that we're building on a well-designed foundation at
each step. To get there, we need to ask for guidance from people who
know more about this stuff than we do. AbiWord is a GPL word
processor, so any contributions you make will simply make AbiWord a
better app for the community. If you don't want to help, no problem,
but I don't think we have to apologize for asking.


This note addresses only a small part of what i18n for
AbiWord should really mean, and it does not deal with l10n issues at
all. We're talking only about content of AbiWord documents, with the
eventual goal of supporting multilingual documents. For the purpose
of this note, we're ignoring things like locale-specific date and
time, internationalized toolbar icons, and how to localize AbiWord's
menu bar.

In fact, our more pressing motivation is not really related to i18n at
all. We want to ensure that our support for things like bullets,
symbols and dingbats is done right, believing that doing so will lay
part of the foundation for future support for multilingual documents.
Since AbiWord is an Open Source project, we hope that in the future it
will benefit from the assistance of developers in the community who
can add their expertise to improve its i18n support.

So for right now, we just want to get the very basic architectural
issues right.


When we started development of AbiWord, we wanted to make sure that we
were giving early consideration to i18n issues. All of the primary
hackers on AbiWord are U.S. born and bred, and with the exception of
some pretty decent Spanish abilities, we're basically monolingual. We
have all worked on projects before which dealt with i18n and l10n
issues, but we knew we lacked the hands-on experience to confidently
design and implement support for things like CJK or Arabic.

However, we're not *completely* ignorant. We know the difference
between a character and a glyph. We know what encodings are. We know
the difference between ASCII and ISO-8859-1. We've done a lot of
reading. And one of the things we knew, right from the beginning, was
that we certainly should not assume that all characters are 1 byte.

We decided to try to build AbiWord's architectural foundations around
Unicode. We knew that real i18n support would require more work than
simply using a 16-bit quantity to represent a character. However, we
figured that doing so would at least help, and would save us work
later. Besides, choosing to implement with a Unicode-based core
seemed like a very "forward-thinking" thing to do.


So, that's how AbiWord got to where it is now. All document
characters are defined to be part of the Unicode space. Our internal
data structures always use a 16-bit integer to represent a character.

But that's about as far as we go. Our support for encodings and font
matching is nonexistent, but well-intentioned. For example, we
currently allow the user to type a string of English text, select it,
and choose Dingbats as the font. In my opinion, this is a bug. The
dingbat characters have their own character codes which should be
honored. Using a font to reinterpret a character code as a totally
unrelated glyph seems like a very non-Unicode-friendly thing to do.
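
To make the distinction concrete, here is a minimal Python sketch (not
AbiWord code). It assumes a tiny, partial table covering only four
positions of a Dingbats-style font encoding; the table and the function
name are hypothetical and exist only for illustration. The idea is that
the document stores the dingbat's own Unicode code point rather than an
ordinary letter plus a font change.

```python
# Hypothetical, partial table: byte position in a Dingbats-style font
# encoding -> the Unicode code point that names the same symbol.
DINGBATS_FONT_TO_UNICODE = {
    0x21: 0x2701,  # UPPER BLADE SCISSORS
    0x22: 0x2702,  # BLACK SCISSORS
    0x23: 0x2703,  # LOWER BLADE SCISSORS
    0x24: 0x2704,  # WHITE SCISSORS
}


def dingbats_bytes_to_unicode(data: bytes) -> str:
    """Translate font-encoded dingbat bytes into real Unicode characters.

    Positions missing from the table become U+FFFD (REPLACEMENT CHARACTER)
    instead of being silently kept as unrelated letters.
    """
    return "".join(chr(DINGBATS_FONT_TO_UNICODE.get(b, 0xFFFD)) for b in data)
```

With a complete table, text stored this way would keep its meaning even
if the font were later changed.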


We're ready to get this aspect of our design done right, to lay a
foundation for continuing to improve our support for multilingual
documents.

One of the things we've learned is that Unicode just isn't a panacea.
I don't think we ever really thought it was, but I'm starting to
believe that in order to appreciate the benefits of Unicode, it helps
to have developed a well-internationalized app using some other

Unicode tempts you to think that it is a 64k character space, the
members of which can all be treated the same. In a very naive view,
this would spare us the need to handle a whole bunch of different
character sets differently. As it happens, Unicode seems to be a
character set with a whole bunch of little character sets inside it,
each of which needs to be treated differently. In that sense, Unicode
solves very few problems -- it simply offers a reasonably consistent
approach to solving some problems which are going to be hard no matter
what happens.
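
As a small illustration of how differently characters in the same
space behave, the following Python sketch queries the standard
unicodedata module for a few properties. The helper name describe() is
made up for this example; the properties themselves come from the
Unicode Character Database that ships with Python.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few per-character properties that affect text handling."""
    return {
        "category": unicodedata.category(ch),       # letter, mark, ...
        "bidi": unicodedata.bidirectional(ch),      # L, R, NSM, ...
        "combining": unicodedata.combining(ch),     # 0 unless a combining mark
        "width": unicodedata.east_asian_width(ch),  # W for wide CJK, ...
    }


# Latin letter, Hebrew letter, combining acute accent, CJK ideograph.
for ch in ["A", "\u05d0", "\u0301", "\u6f22"]:
    print(f"U+{ord(ch):04X}", describe(ch))
```

A Hebrew letter reports a right-to-left direction, the accent reports a
nonzero combining class, and the ideograph reports a wide cell, so a
layout engine cannot treat the four alike even though each is "just a
Unicode character".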

Choosing a 16-bit integer to represent all characters hasn't helped us
much. We believed, incorrectly, that it would be possible to call
underlying graphics APIs to draw those characters, without an
intermediate step. This is basically untrue in our cross-platform

As it happens, part of our Unicode support works, but the result feels
somewhat accidental. We can manually construct AbiWord document files
with arbitrary Unicode characters in them, since AbiWord files are
XML-based. If we make sure that the font is set to "Lucida Sans
Unicode", our Windows NT version will display and print a surprising
number of Unicode characters, including part of Cyrillic, Greek, and
Hebrew. We want to do better than this.
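
For readers unfamiliar with how an XML file can carry arbitrary Unicode
characters, here is a minimal Python sketch using numeric character
references. The element names in SAMPLE are hypothetical and are not
taken from the actual AbiWord file format.

```python
import xml.etree.ElementTree as ET

# Hypothetical document fragment; the element names are invented here.
# The three references are Greek Omega, Cyrillic Zhe, and Hebrew Alef.
SAMPLE = "<doc><p>&#x3A9; &#x416; &#x5D0;</p></doc>"


def code_points(xml_text: str) -> list:
    """Parse the XML and list the code points of its non-space text."""
    root = ET.fromstring(xml_text)
    text = "".join(root.itertext())
    return [f"U+{ord(c):04X}" for c in text if not c.isspace()]
```

The parser hands back ordinary Unicode characters, so the file itself
can stay plain ASCII while the document content does not.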

We know that we need the notion of a font list, as opposed to just "a
font". In fact, since a lot of our design inspiration came from the
CSS2 spec, a "font list" has been in our plans all along. However, we
haven't really figured out yet how to expose such a notion in the GUI.
Users are accustomed to a paradigm where they choose a single font for
a single piece of text. If anyone has any ideas or examples of how to
do this in a non-user-scary fashion, we'd appreciate info or pointers.

We also know now that fonts dictate the encoding of the characters
passed to the graphics API. I was always very puzzled by the fact
that the text drawing calls on various platforms do not specify what
encoding is assumed for the string passed to it. It was only recently
(gulp) that I realized that the encoding depends entirely on the
current font. On Windows, if the current font is SHIFTJIS_CHARSET,
then the text passed to TextOut() needs to be in that encoding.
Likewise, in an X11 world, XDrawString[16] expects the text to match
the encoding of the font in the GC. (Somebody tell me if I've still
got this wrong).
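
The dependency can be sketched in Python as follows. This is only an
illustration under stated assumptions: the table that pairs a font
charset label with a Python codec name is hypothetical (in particular,
treating ANSI_CHARSET as code page 1252 is an assumption), and
bytes_for_font() is an invented helper, not a platform API.

```python
# Hypothetical mapping from a font's charset label to a Python codec name.
FONT_CHARSET_TO_CODEC = {
    "ANSI_CHARSET": "cp1252",         # assumption: Western European Windows
    "SHIFTJIS_CHARSET": "shift_jis",
    "iso8859-1": "latin-1",           # e.g. the tail of an X11 font name
}


def bytes_for_font(text: str, font_charset: str) -> bytes:
    """Encode Unicode text into the byte sequence the current font expects.

    Raises UnicodeEncodeError if the font's encoding cannot represent
    some character, which signals that another font is needed.
    """
    codec = FONT_CHARSET_TO_CODEC[font_charset]
    return text.encode(codec)
```

The same Unicode string therefore turns into different bytes depending
on which font is selected before the drawing call is made.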

Our goals are made more complex by the fact that we want to be able to
support almost any platform. AbiWord currently runs on Windows
95/98/NT and Linux (as well as several other Unix-like systems).
Ports to MacOS and BeOS are both underway.

We think that staying with a Unicode-oriented design makes sense. The
alternative appears to be modifying our text representation to handle
text in a wide variety of encodings, tagged as such. This looks ugly
and worth avoiding.

We also think that the proper way to handle text output is:

1. Make sure that each character is always being rendered in a font
    which actually has that Unicode character in it, if such a font is
    available.

2. Before display, convert the characters to the encoding expected by
    the font.

This will require us to modify our font abstraction to communicate
font encodings and offer facilities for verifying the presence of
various characters in the font.
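
The two steps above can be sketched in Python as follows. This is a
rough model, not a proposed implementation: the Font type, the font
names, and the use of a Python codec as a stand-in for "the characters
a font contains" are all assumptions made for the example. A real font
abstraction would have to ask the platform which glyphs are present.

```python
from typing import List, NamedTuple, Optional, Tuple


class Font(NamedTuple):
    name: str   # hypothetical font name
    codec: str  # Python codec standing in for the font's encoding


def has_char(font: Font, ch: str) -> bool:
    """Step 1 helper: can this font's encoding represent the character?"""
    try:
        ch.encode(font.codec)
        return True
    except UnicodeEncodeError:
        return False


def pick_font(fonts: List[Font], ch: str) -> Optional[Font]:
    """Return the first font in the list that covers ch, or None."""
    for font in fonts:
        if has_char(font, ch):
            return font
    return None


def _flush(font: Optional[Font], chars: str) -> Tuple[Optional[str], bytes]:
    """Step 2: convert a run to the bytes its font expects."""
    if font is None:
        return (None, b"?" * len(chars))  # placeholder for uncovered text
    return (font.name, chars.encode(font.codec))


def layout_runs(text: str, fonts: List[Font]) -> List[Tuple[Optional[str], bytes]]:
    """Split text into runs, each tagged with a font and already encoded."""
    runs = []
    current: Optional[Font] = None
    buffer = ""
    for ch in text:
        font = pick_font(fonts, ch)
        if buffer and font != current:
            runs.append(_flush(current, buffer))
            buffer = ""
        current = font
        buffer += ch
    if buffer:
        runs.append(_flush(current, buffer))
    return runs
```

Given a list such as [Font("Latin", "latin-1"), Font("Greek",
"iso8859_7")], a string mixing Latin and Greek letters comes back as
two runs, each ready to hand to a drawing call with the matching font
selected. The order of the list decides which font wins when several
cover the same character.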

We also think that we should switch our representation to UTF-8. On
every platform we currently plan to support, this would eliminate the
encoding conversion step (as well as a lot of memory usage) for any
run of text which includes only ASCII characters. For obvious
reasons, and with no offense intended to the majority of the world who
primarily use double-byte encoded characters, we believe this to be a
common case worth optimizing.
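
A short Python sketch of the trade-off, with invented helper names. It
compares the storage needed for the same text in UTF-8 and in 16-bit
units, and checks whether a UTF-8 run is pure ASCII and could therefore
be passed unchanged to any ASCII-compatible font encoding.

```python
def storage_sizes(text: str) -> dict:
    """Bytes needed to store text in UTF-8 and in 16-bit units (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


def needs_no_conversion(utf8_bytes: bytes) -> bool:
    """True if every byte is below 0x80, i.e. the run is pure ASCII."""
    return all(b < 0x80 for b in utf8_bytes)
```

For "AbiWord" the UTF-8 form takes 7 bytes against 14 in 16-bit units.
The cost shows up elsewhere: a two-character CJK string takes 6 bytes
in UTF-8 against 4 in 16-bit units, which is the trade-off conceded
above.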

Obviously, we're now going to need a whole bunch of code to convert
back and forth between various encodings. Furthermore, we'll need
code in our graphics abstraction layer to manage fonts in a more
Unicode-friendly fashion. If we do our job right, the cross-platform
parts of our code will remain clean, with little knowledge of anything
beyond the Unicode character space, while the platform-specific layers
of our code will need to take responsibility for presenting a
Unicode-oriented API.

We've done some searching for information and/or source code which
might be of use to us. We've found the following:

1. Mozilla. Reading this code base is interesting, but the license
    restrictions of the NPL prevent us from using any of the actual
    code. Furthermore, it looks like Mozilla's i18n strategy for
    document content is not Unicode, but rather, a representation
    which supports a variety of encodings with tags for same.

2. Tcl/Tk 8.1. Tcl and Tk, as of version 8.1, use Unicode for
    representation of all strings. Currently this code is in beta,
    and a review of the code reveals a state which is consistent with
    that designation. However, a lot of the code we want is there,
    including several charset conversions and management of font lists
    and Unicode character lookup within them. I take this as some
    level of confirmation that the approach I've described is a
    reasonable one. Furthermore, since the Tcl and Tk libraries are
    Open Source under a BSD-ish license, it looks very likely that
    we'll be able to leverage some of the code itself.

3. Java. Java's string types are entirely Unicode. Supposedly, as
    of JDK 1.1, text display is no longer limited to the Latin-1
    subset. If this is the case, they must have implemented a fair
    portion of the same kinds of things we'll need to do. I have not
    looked at the JDK 2 source code. Even if I do so, we won't
    actually be able to use any of it, due to license restrictions.
    However, we might learn something.

So, the approach we're going to implement, subject to the feedback we
hope to receive in response to this note, is:

1. Continue to use the Unicode character space to represent all text
    in AbiWord documents.

2. Change our internal data structures to use UTF-8 to represent that
    text.

3. Change our handling of fonts to implement a *list* of fonts, with
    the goal that a Unicode character will always be rendered with an
    appropriate glyph if an appropriate font can be found.

4. Change our rendering code to allow for the conversion (if any)
    between the font's encoding, whatever it may be, and the UTF-8
    representation of Unicode characters which we'll be using for our
    data structures.


Writing this email has given me a good opportunity to collect my
thoughts. However, the primary point of writing and distributing
these thoughts is to solicit feedback from people more knowledgeable.
If you think we're on the wrong track, please let us know. If you
think we're on the right track, please let us know.

Eric W. Sink, Software Craftsman
