Re: displaying Unicode text (was Re: Transcriptions of "Unicode")

From: Erik van der Poel
Date: Thu Dec 07 2000 - 03:48:20 EST

Mark Davis wrote:
> Let's take an example.
> - The page is UTF-8.
> - It contains a mixture of German, dingbats and Hindi text.
> - My locale is de_DE.
> From your description, it sounds like Mozilla works as follows:
> - The locale maps (I'm guessing) to 8859-1
> - 8859 maps to, say Helvetica.
> - The dingbats and Hindi appear as boxes or question marks.
> This would be pretty lame, so I hope I misunderstand you!!

Sorry, I've been abbreviating quite a bit, so I left out a lot. Yes,
you've misunderstood me, but only because I abbreviated so much. Sorry.
Let me try again, with more feeling this time.

Using the example above:

- The locale maps to "x-western" (ja_JP would map to "ja", so I've
prepended "x-" for the "language groups" that don't exist in RFC 1766)

- x-western and CSS' sans-serif map to Arial

- The dingbats appear as dingbats if they are in Unicode and at least
one of the dingbat fonts on the system has a Unicode cmap subtable
(WingDings is a "symbol" font, so it doesn't have such a table), while
the Hindi might display OK on some Windows systems if they have Hindi
support (Mozilla itself does not support any Indic languages yet).

We could support the WingDings font if we add an entry for WingDings to
the following table:

We just haven't done that yet.

Basically, Mozilla will look at all the fonts on the system to find one
that contains a glyph for the current character.

The language group and user locale stuff that I mentioned earlier is
only one part of the process -- the part that deals with the user's font
preferences. I'll explain more of the rest of the process:

Mozilla implements CSS2's font matching algorithm:

This states that *for each character* in the element, the implementation
is supposed to go down the list of fonts in the font-family property, to
find a font that exists and that contains a glyph for the current
character. Mozilla implements this algorithm to the letter, which means
that fonts are chosen for each character without regard for neighboring
characters (unlike MSIE). This may actually have been a bad decision,
since we sometimes end up with text that looks odd due to font changes.

Anyway, Mozilla's algorithm has the following steps:

1. "User-Defined" font
2. CSS font-family property
3. CSS generic font (e.g. serif)
4. list of all fonts on system
5. transliteration
6. question mark

You can see these steps in the following pieces of code:

1. "User-Defined" font (FindUserDefinedFont)

We decided to include the User-Defined font functionality in Netscape 6
again. It is similar to the old Netscape 4.X. Basically, if the user
selects this encoding from the View menu, then the browser passes the
bytes through to the font, untouched. This is for charsets that we don't
already support. This step needs to be the first step, since it
overrides everything else.

2. CSS font-family property (FindLocalFont)

If the user hasn't selected User-Defined, we invoke this routine. It
simply goes down the font-family list to find a font that exists and
that contains a glyph for the current character. E.g.:

  font-family: Arial, "MS Gothic", sans-serif;

3. CSS generic font (FindGenericFont)

If the above fails, this routine tries to find a font for the CSS
generic (e.g. sans-serif) named in the font-family property. If the
property names no generic, it falls back to the user's default (serif or
sans-serif). This is where the font preferences come in, so this is
where we try to determine the language group of the element. We use, in
order: the LANG attribute of this element or of a parent element, if
any; otherwise the language group of the document's charset, if it is
not Unicode-based; otherwise the language group of the user's locale.
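
The order of that language group lookup can be sketched in Python as
follows. The three tables are tiny hypothetical stand-ins (only the
de_DE and ja_JP entries come from this thread), and falling through to
the next source when a lookup misses is my assumption, not something
the description above spells out.

```python
# Hypothetical lookup tables, for illustration only.
LANG_TO_GROUP = {"de": "x-western", "ja": "ja"}
CHARSET_TO_GROUP = {"ISO-8859-1": "x-western", "SHIFT_JIS": "ja"}
LOCALE_TO_GROUP = {"de_DE": "x-western", "ja_JP": "ja"}
UNICODE_CHARSETS = {"UTF-8", "UTF-16"}

def language_group(lang_attr, charset, user_locale):
    """Pick the language group used to look up the user's font preferences.

    lang_attr is the LANG value of the element or its nearest ancestor,
    or None if there is none (the caller is assumed to have resolved it).
    """
    # 1. LANG attribute, if present and recognised.
    if lang_attr is not None:
        group = LANG_TO_GROUP.get(lang_attr.split("-")[0].lower())
        if group is not None:
            return group
    # 2. The document's charset, unless it is Unicode-based: a Unicode
    #    charset says nothing about the language of the text.
    if charset.upper() not in UNICODE_CHARSETS:
        group = CHARSET_TO_GROUP.get(charset.upper())
        if group is not None:
            return group
    # 3. The user's locale.
    return LOCALE_TO_GROUP.get(user_locale)

print(language_group(None, "UTF-8", "de_DE"))
print(language_group(None, "Shift_JIS", "de_DE"))
```

For the UTF-8 page in the example, with no LANG attribute and a de_DE
locale, this lands on "x-western"; a Shift_JIS page viewed under the
same locale would get "ja" instead.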

4. list of all fonts on system (FindGlobalFont)

If the above fails, this routine goes through all fonts on the system,
trying to find one that contains a glyph for the current character.

5. transliteration (FindSubstituteFont)

If we still can't find a font for this character, we try a
transliteration table. For example, the euro is mapped to the three
ASCII characters "EUR", which is useful on some Unix systems that don't
have the euro glyph yet. Actually, this transliteration step isn't even
implemented on Windows yet.
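
A transliteration table of this kind might look like the following
Python sketch. Only the euro entry comes from the description above;
the function name and the has_glyph callback are hypothetical, and the
real table is not reproduced here.

```python
# Character -> ASCII replacement. Only the euro entry is from the text.
TRANSLITERATIONS = {
    "\u20ac": "EUR",  # EURO SIGN
}

def transliterate_text(text, has_glyph):
    """Replace characters that no font can draw with their ASCII
    transliteration, when the table has one. Characters with neither a
    glyph nor a transliteration are left alone for a later step."""
    out = []
    for char in text:
        if has_glyph(char):
            out.append(char)
        else:
            out.append(TRANSLITERATIONS.get(char, char))
    return "".join(out)

# Pretend the system can only draw ASCII characters.
print(transliterate_text("5 \u20ac", lambda ch: ch.isascii()))
```

On such an ASCII-only system the euro amount comes out as "5 EUR".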

6. question mark (FindSubstituteFont)

If we can't find a transliteration, we fall back to the last resort --
the good ol' question mark.
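
Putting the six steps together, the whole fallback chain can be
sketched in Python like this. It is a simplified model, not the actual
Mozilla routines: FontSystem, find_font, the font names and the
coverage sets are all hypothetical, and the User-Defined step is
reduced to a flag.

```python
from dataclasses import dataclass, field

@dataclass
class FontSystem:
    installed: dict      # font name -> set of characters it covers
    generic_prefs: dict  # (language group, CSS generic) -> preferred font
    transliterations: dict = field(default_factory=dict)

def find_font(char, fs, font_family, generic, lang_group, user_defined=False):
    """Return (step, font name or None, text to draw) for one character."""
    # 1. User-Defined: overrides everything else.
    if user_defined:
        return ("user-defined", "user-defined font", char)
    # 2. CSS font-family list, in order.
    for name in font_family:
        if char in fs.installed.get(name, ()):
            return ("font-family", name, char)
    # 3. The user's preferred font for this language group and generic.
    pref = fs.generic_prefs.get((lang_group, generic))
    if pref is not None and char in fs.installed.get(pref, ()):
        return ("generic", pref, char)
    # 4. Any font on the system that has the glyph.
    for name, coverage in fs.installed.items():
        if char in coverage:
            return ("global", name, char)
    # 5. Transliteration table.
    substitute = fs.transliterations.get(char)
    if substitute is not None:
        return ("transliteration", None, substitute)
    # 6. Last resort.
    return ("question mark", None, "?")

fs = FontSystem(
    installed={"Arial": set("abc"), "SomeDevanagariFont": {"\u0915"}},
    generic_prefs={("x-western", "sans-serif"): "Arial"},
    transliterations={"\u20ac": "EUR"},
)
for ch in ["a", "\u0915", "\u20ac", "\u2603"]:
    print(find_font(ch, fs, ["Helvetica"], "sans-serif", "x-western"))
```

With this made-up setup, "a" is resolved at step 3 (Helvetica is not
installed, so the preference for x-western sans-serif applies), the
Devanagari letter at step 4, the euro at step 5, and the snowman falls
through to the question mark.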

That's it. I hope I didn't abbreviate too much this time!


