Re: Clarification

From: Mark E. Davis
Date: Tue Sep 28 1999 - 11:23:40 EDT

You might suggest that the interested parties join the Unicode discussion list
(instructions are at so that they can get more extensive
feedback on their comments, but here are some replies for now.


"Dr. Josef W. Meri" wrote:

> Dear Mr. Davis,
> Thank you once again for your help in the past. As I mentioned previously,
> I have discussed Unicode in an essay on developments in my field. The
> Middle East Studies Association Bulletin is also keen on reviewing new
> software and fonts and we have been in touch with font publishers and
> Microsoft. The review will also discuss a number of problematic issues for
> scholars who use transliteration fonts and Middle Eastern Languages in
> wordprocessing, e-mail and publishing, etc. I was recently made aware of a
> number of important issues which may or may not have been addressed by Unicode
> (Please see below). Any insight would be greatly appreciated. Thank you.
> Yours,
> Josef
> Dr. Josef W. Meri, D.Phil.
> Department of Near Eastern Studies
> 250 Barrows Hall
> U.C. Berkeley
> Berkeley, CA 94720-1940
> Unicode has adopted modern Hebrew cantillation marks, not the ones used in
> the Biblia Hebraica Stuttgartensia or the Leningrad Codex. My comparison
> found that Unicode is not complete.

It is known that Unicode is not yet complete; we welcome information from
scholars as to new characters that should be encoded. There is a proposal form
in the Unicode book and on the Unicode website.

> Only if Unicode is willing to expand its basic principles to include
> standard input of conjuncts, etc. will it be a solution for many of the
> world's languages. This willingness is not guaranteed and would require a
> major reworking of Unicode. To tout Unicode as the solution before the
> controllers of Unicode express willingness to solve the key problems is to
> undermine incentive for them to solve them.

No specifics were mentioned here, so it is difficult to know what the perceived
problem is. We have seen no evidence that the Unicode architecture is
insufficient for any of the world's scripts. We often find that people have
misconceptions about the Unicode standard because they have never read it, and
are depending on scattered material on the web or in publications. Copies of the
book should be available at any good library.

> > Although Unicode was designed to support zero-width overstriking
> > accents, Office 2000 apparently demands full-width-bearing accents and
> > always centers them over letters, undermining proper positioning of
> > non-centered accents (like vowels under Hebrew resh and dalet or any
> > combination of vowels and accents under the same letter). This and the
> > previous item pose a great dilemma for Biblical scholars who want all
> > vowel points and all vowel point-accent combinations properly positioned.

While this is directed at Office 2000, not at the Unicode standard, in my
experience this is not the case. I suspect it depends primarily on the OpenType
information in the particular font; that font may not be sufficiently well
developed. Bear in mind, however, that we are seeing an evolution in the
capabilities of products; it takes considerable time and energy to add any new
features to a product, and they will not show up all at once.
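The zero-width combining behavior at issue here is visible in the character properties themselves. A small illustration using Python's standard unicodedata module (chosen here purely for demonstration; it is not part of the original discussion):

```python
import unicodedata

# Hebrew resh (U+05E8) followed by the vowel point qamats (U+05B8).
# The vowel point is encoded as a separate zero-width combining character;
# where it is drawn beneath the base letter is left to the font and renderer.
resh_qamats = "\u05E8\u05B8"

for ch in resh_qamats:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"category={unicodedata.category(ch)}, "
          f"combining class={unicodedata.combining(ch)}")
# The qamats reports general category Mn (nonspacing mark), i.e. it carries
# no advance width of its own; precise placement is the font's responsibility.
```
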

> > Unicode deliberately by principle does not include conjunct forms of
> > letters (although it is not always consistent on this point). With the
> > possible exception of points where they have been inconsistent with this
> > principle, this requires additional software to be made to call up
> > conjunct forms, and custom versions of this software will probably have to
> > be made for different applications after they come out with Unicode
> > support. This increases the burden of font support to an enormous degree.
> > In my opinion, this fundamental aspect of linguistic display should have
> > been supported by Unicode from the start. It would have been far more
> > logical to have support for all the graphic forms required by the world's
> > languages rather than just some of them and then (as Unicode has done)
> > leave it up to font makers to assign them inevitably non-standard private
> > use area positions for most of these.

There are good reasons for not encoding presentation forms for every possible
combination of characters that could conceivably have a different glyph (or
series of glyphs) in some font. Having large numbers of presentation forms is a
significant burden on all non-visual processing: comparison, sorting, character
conversion, analysis, tokenization, spell-checking, grammar-checking, etc. It may
appear to make rendering easier, but often generates problems of its own. The
existence of the huge number of compatibility Arabic ligature presentation forms
in Unicode doesn't help programmers in the rendering of Arabic (and is a waste
of space on the BMP). The set of glyphs supported by a given font design
generally does not include all those presentation forms, and often includes
glyphs that are not in that list. If those code points are interchanged you have
no guarantee that the glyphs will be available on the other side, so you get
clumps of black boxes.
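The relationship between those compatibility presentation forms and the nominal characters can be demonstrated with normalization. A brief Python sketch (added here for illustration): NFKC folds the Arabic presentation forms back to the nominal letters they duplicate.

```python
import unicodedata

# U+FE8D ARABIC LETTER ALEF ISOLATED FORM and U+FEFB ARABIC LIGATURE LAM
# WITH ALEF ISOLATED FORM are compatibility presentation forms from the
# blocks discussed above. NFKC maps them to the nominal characters.
alef_isolated = "\uFE8D"
lam_alef_ligature = "\uFEFB"

print(unicodedata.normalize("NFKC", alef_isolated))      # nominal alef U+0627
print(unicodedata.normalize("NFKC", lam_alef_ligature))  # U+0644 U+0627
# Text stored with the nominal characters compares and sorts uniformly;
# text stored with presentation forms depends on every receiving system
# having those extra code points mapped to glyphs.
```
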

Moreover, if those presentation forms are used, it requires that they be entered
into the text. Simply having a huge keyboard mapping or typing hex numbers is
impractical. One might answer that algorithms can be used to combine characters
on type-in, so that the presentation form code points are automatically
generated; but this technology can just as easily be incorporated into the
rendering side instead: generating ephemeral glyph codes based on data in the
font. That way, the set of glyphs being drawn is guaranteed to consist of glyphs
that the font supports, and the font can support arbitrarily complex glyphs.
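A minimal sketch of that rendering-side approach, with invented glyph IDs and ligature data standing in for the substitution information a real font would carry (in OpenType, the GSUB tables): the backing store keeps nominal characters, and glyph codes exist only transiently at draw time.

```python
# Illustrative only: these tables and glyph IDs are hypothetical stand-ins
# for font-supplied substitution data; they are not from any real font.
FONT_LIGATURES = {
    ("f", "i"): 1001,  # hypothetical glyph ID for an "fi" ligature
    ("f", "l"): 1002,  # hypothetical glyph ID for an "fl" ligature
}
BASE_GLYPHS = {"f": 101, "i": 102, "l": 103, "n": 104, "e": 105}

def shape(text: str) -> list[int]:
    """Map characters to font glyph IDs, applying ligature substitutions
    at render time; the text itself is never rewritten."""
    glyphs, i = [], 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if pair in FONT_LIGATURES:
            glyphs.append(FONT_LIGATURES[pair])
            i += 2
        else:
            glyphs.append(BASE_GLYPHS[text[i]])
            i += 1
    return glyphs

print(shape("fine"))  # [1001, 104, 105]: the "fi" ligature, then n, e
```

Because the glyph codes are generated from the font's own data, every glyph drawn is one the font actually contains, which is the guarantee the paragraph above describes.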

That is the choice that we made in the design of Unicode, based on considerable
experience developing systems for international text handling in many different
computing environments. While some degree of impatience is understandable, the
technology (OpenType, Apple Type Technology, etc.) is available for handling
arbitrary new presentation forms automatically in fonts without encoding them as
characters, and such fonts are being produced.

I have a new paper at that your sources
may find useful. While focused on discussing the different encoding forms for
Unicode (UTF-8, UTF-16, etc.), it also touches on the distinctions between
character, glyph, code point, and code unit.
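The code point versus code unit distinction mentioned there can be shown in a few lines of Python (the characters chosen here are arbitrary examples, not from the paper):

```python
alef = "\u05D0"  # HEBREW LETTER ALEF: a single code point
print(len(alef.encode("utf-8")))           # 2 -- two 8-bit code units
print(len(alef.encode("utf-16-be")) // 2)  # 1 -- one 16-bit code unit

# A character beyond the BMP takes four UTF-8 code units, and in UTF-16
# a surrogate pair, i.e. two 16-bit code units for one code point.
gothic = "\U00010330"  # GOTHIC LETTER AHSA
print(len(gothic.encode("utf-8")))            # 4
print(len(gothic.encode("utf-16-be")) // 2)   # 2
```
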


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:53 EDT